Conversation
Update all actions to latest major versions: - actions/checkout: v1/v2/v3 -> v4 - actions/setup-java: v3 -> v5 - android-actions/setup-android: v2 -> v3 - microsoft/setup-msbuild: v1.1 -> v2 - actions/upload-artifact: v2 -> v4 Fix Android CI: - Use sdkmanager from PATH (setup-android v3 adds it automatically) instead of hardcoded cmdline-tools/7.0/bin directory - Replace hardcoded C:\Users\runneradmin with \C:\Users\bhamehta Fix Windows CI: - Update vs-version from [16,) to [17,) to match VS 2022 runners Fix iOS/macOS CI: - Replace deprecated macos-13 runner with macos-14 - Update iOS deployment target to 14.0 - Update simulator matrix for macos-14/macos-15 runners General: - Add concurrency groups to cancel superseded workflow runs - Updates apply to both active and disabled (.off) workflows Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
xcodebuild on macos-14 runners lists duplicate simulator entries, causing it to pick the 'Any iOS Device' placeholder as the first match. This results in: 'Tests must be run on a concrete device'. Fix by resolving the simulator UUID via xcrun simctl and passing -destination id=<UUID> which is completely unambiguous. This is the recommended approach per xcodebuild docs and actions/runner-images#8621. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use per-instance ephemeral NSURLSession to prevent stale connection reuse and ensure clean teardown without use-after-destroy crashes - Move session invalidation to CancelAllRequests for safer lifecycle - Fix response body leak in HttpResponseDecoder - Relax flaky test tolerances for CI simulator environments Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…hread - Move m_runningLatency assignment inside LOCKGUARD to prevent concurrent read/write race condition - Clean up remaining task queues on WorkerThread shutdown instead of just logging, preventing memory leak Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove LOG_TRACE from ~Logger() to prevent crash when the logging subsystem is torn down before Logger instances during static destruction (observed on iOS simulator). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use standard __ANDROID__ macro instead of non-standard ANDROID in HttpClientFactory.cpp - Mark unused log tag with __attribute__((unused)) in HttpClient_Android.cpp - Add #ifndef guards in config-default.h to prevent duplicate macro definitions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace 'localhost' with '127.0.0.1' in functional tests to avoid IPv6 resolution failures on some CI environments - Add synchronization loop for storage-full callback in APITest to avoid race condition in assertions - Increase timing tolerances in PalTests for slower CI runners Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The queue cleanup must not run after detach() — the detached thread still needs the shutdown item to exit its event loop. Cleaning up after detach deletes the shutdown signal, causing the thread to hang and then access destroyed members (use-after-free). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
m_runningLatency is read without locks at lines 219 and 376, while written from other threads. Making it std::atomic eliminates the undefined behavior without requiring additional locks (which could deadlock through upload callbacks). The existing locks remain for other shared state they protect. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sary) Now that m_runningLatency is std::atomic, the write doesn't need to be inside the lock scope. Restore the original position from main to minimize the diff. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move m_runningLatency and m_scheduledUploadTime back to before the LOCKGUARD scope, matching the original code on main. The .cpp now has zero diff from main — all threading fixes are in the .hpp (std::atomic declaration). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2c28a4b to
cc259a3
Compare
MSVC rejects passing std::atomic<T> to variadic functions (deleted copy constructor). Use explicit .load() to pass the underlying EventLatency value. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-only - Split WARN_FLAGS into GCC and Clang/AppleClang branches so each only gets flags it supports (-Wno-unknown-warning-option is Clang-only) - Add -Wno-reorder as CXX_EXTRA_WARN_FLAGS to suppress member-init order warnings in submodule code - Add -fno-finite-math-only to all non-MSVC REL_FLAGS to restore IEEE 754 compliance (INFINITY macro) needed by sqlite3 and tests when -ffast-math is enabled Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
fd030ce to
2513dbe
Compare
The iOS simulator's network stack on CI runners has higher latency than native builds. The SO_CONNECTION_IDLE socket option is not available, causing NSURLSession to retry connections with delays. - Replace std::this_thread::yield() with PAL::sleep(10) in waitForEvents() to reduce CPU contention on single-core runners - Increase all waitForEvents timeouts from 1-3s to 5-10s - Replace PAL::sleep(2000) with waitForEvents where possible - Increase killSwitchWorks upload wait from 2s to 5s Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
0f8459a to
700ad68
Compare
- Make RequestHandler::m_count atomic to fix data race between server thread (writing) and test thread (reading). - Increase wait timeouts from 10s to 20s to accommodate slow CI simulators where NSURLSession connection setup can be delayed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Increase PAL::sleep from 10s to 20s to give the production collector upload more time on slow iOS simulator CI environments. Keeps full test coverage including network upload assertions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The second half of LogManager_Initialize_DebugEventListener uploads to a real production collector endpoint, which is unreliable on CI simulators. Gate it with TARGET_OS_SIMULATOR so the storage-full and event-counting assertions still run, while the network upload assertions only run on real devices and non-Apple platforms. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f4b3bd9 to
1872937
Compare
These changes cause deterministic test failures where meta-stats start events are not delivered. The WorkerThread task deletion on join is the most likely cause — it deletes pending upload tasks (including start events) during FlushAndTeardown. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert all changes after 7c89cf6 that caused test regressions. Restore HttpClientManager lock fix, HttpResponseDecoder leak fix, and WorkerThread cleanup to their original working versions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| void ensureSession(); | ||
| std::mutex m_requestsMtx; | ||
| std::map<std::string, IHttpRequest*> m_requests; | ||
| void* m_session = nullptr; // NSURLSession*, opaque in header |
There was a problem hiding this comment.
m_session is now an instance field that is created/read in ensureSession() / GetOrCreateSession() and released/nullled in CancelAllRequests() / the destructor, but those accesses are not synchronized. That means a request-start path can race with teardown and observe a session pointer while another thread is invalidating/transferring it. The old code had different lifecycle issues, but not this per-instance unsynchronized session race. I think we need to guard m_session lifetime with synchronization here, ideally with a very small critical section so we don’t hold a lock across the rest of request setup. Though I am still not comfortable introducing locks in the request processing path,
There was a problem hiding this comment.
I have removed setting m_session to nullptr in CancelAllRequests and now let only the destructor handle it. I think this should be thread-safe, since after we call CancelAllRequests in FlushAndTeardown, we set m_httpClient = nullptr;, which means no other code in HttpClient should trigger and ensureSession() / GetOrCreateSession()) will not create a new session. So, the destructor for HttpClient will then run when the reference count to httpClient hits 0, and set m_session to nullptr without any race conditions in HttpClient.
There was a problem hiding this comment.
This assumes a user correctly calls FlushAndTeardown, which should
- call
m_system = nullptrdestroyed the TelemetrySystem, including HttpClientManager which ultimately calls HttpClient_Apple::CancelAllRequests() (draining all requests), and the task dispatcher should now be idle and no longer trying to set requests - then set
m_httpClient = nullptr;which should call the destructor for HttpClient safely, without any races
But since this is assumed in proper usage of the SDK, I think it is safe
There was a problem hiding this comment.
I think we can fix the original cancellation bug with a smaller change: keep the existing shared NSURLSession (defaultSessionConfiguration + dispatch_once) and just remove
the [session getTasksWithCompletionHandler:] block from HttpRequestApple::Cancel(), so we only cancel m_dataTask.
That async block is the source of the cross-instance cancel: it uses the static session after SDK teardown and can end up cancelling tasks from a newly initialized HttpClient_Apple. Scoping cancellation to m_dataTask should fix that without changing session ownership.
The per-instance session and ephemeralSessionConfiguration changes feel broader than needed and introduce the m_session lifetime concern discussed above. If there is a separate stale-connection/test flake being addressed by ephemeralSessionConfiguration, I’d prefer to handle that separately rather than changing production session behavior in this PR.
| { | ||
| if (!m_session) { | ||
| @autoreleasepool { | ||
| NSURLSessionConfiguration* sessionConfig = [NSURLSessionConfiguration ephemeralSessionConfiguration]; |
There was a problem hiding this comment.
Switching from defaultSessionConfiguration to ephemeralSessionConfiguration is a real behavior change for Apple customers, not just an internal cleanup. Even if that is intended, it can change OS managed session semantics such as caching/cookie/persistence behavior across iOS/macOS versions. Given how broadly this SDK is used, I think we should call out and validate that compatibility impact explicitly.
There was a problem hiding this comment.
call out and validate that compatibility impact explicitly
Should I add a note about this to the docs and test it on more iPhone / iPad versions in the iOS Simulator?
Replace PAL::sleep(500) workaround with unique-per-invocation DB paths for phase 2. The SDK closes SQLite via sqlite3_close_v2(), which defers file-descriptor cleanup when prepared statements linger. Reusing a fixed .phase2 path and std::remove()-ing the old files while deferred fds are still open caused iOS 'vnode unlinked while in use' errors that could invalidate the new DB's descriptors, leading to numCached == 0. A fresh, never-before-seen path (atomic counter suffix) eliminates the collision entirely — no stale files, no vnode warnings, no sleep needed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Set TransmitProfile_RealTime before UploadNow() in tests that expect events from multiple latency queues. Without this, UploadNow() triggers one batch but subsequent queues wait for timer ticks, causing timeouts on loaded CI runners. Affected tests: restartRecoversEventsFromStorage, sendMetaStatsOnStart, DiagLevelRequiredOnly_* (2 tests). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- BasicFuncTests::Initialize(): Reset CFG_INT_CACHE_FILE_SIZE, CFG_INT_STORAGE_FULL_PCT, and CFG_INT_STORAGE_FULL_CHECK_TIME to defaults. These were left contaminated by the DebugEventListener test which modifies the global LogManager config via reference. - BasicFuncTests::Initialize(): Set TransmitProfile_RealTime so all timer delays are 0, preventing multi-batch upload timing issues. - Add SetTransmitProfile(RealTime) before UploadNow in 5 more tests and revert their inflated timeouts back to originals. - APITest DebugEventListener Phase 3: Add UploadNow + sleep for reliable upload verification before FlushAndTeardown. - PalTests: Tighten timing upper bounds (1000ms/1500ms vs 2000ms). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On iOS simulator, SO_CONNECTION_IDLE is unavailable, causing NSURLSession to reuse stale connections and hit -1005 errors. The default 3s retry backoff leaves insufficient time for retries within the 5s test timeout. Setting 500ms initial backoff allows quick recovery from transient connection failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 2 no longer calls UploadNow + sleep(10000) before pausing. On iOS simulator, the 10s production upload window hits -1005 (stale connection) errors that corrupt SDK state, preventing Phase 3 from caching events (numCached=0). Phase 3 still verifies upload functionality via ResumeTransmission + UploadNow. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
117493b to
710a531
Compare
Declare m_scheduledUploadTime as std::atomic<uint64_t> and move its update back outside the LOCKGUARD, consistent with m_runningLatency. All accesses were already under LOCKGUARD(m_scheduledUploadMutex), but atomic makes the thread-safe intent self-documenting and guards against future accesses that might omit the lock. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
7f4e119 to
79342fe
Compare
The iOS simulator tests can deadlock in CancelAllRequests spin loops when an NSURLSession completion handler never fires. Without a timeout the job runs for 6 hours until GitHub Actions kills it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A rapid second push to main would cancel the first build, potentially letting a broken commit go undetected until the next trigger. Only cancel in-progress runs for pull_request events where superseded builds are redundant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
38ff440 to
b6d247a
Compare
The global -Wno-reorder suppressed member-init-order warnings across the entire build, which could hide real bugs. The only file that triggers the warning is Sanitizer.cpp (from the lib/modules submodule) where Sanitizer.hpp declares m_sanitizerprovider before m_notificationEventName but the constructor initializes them in reverse order. Scope the suppression to that single file via set_source_files_properties until the submodule fixes the declaration order. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1d83157 to
ebdf617
Compare
Restores visibility into dropped work that was lost when the old LOG_WARN on non-empty queues was removed. Logs queue sizes before the join/detach decision so operators can see pending tasks in both code paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ebdf617 to
c917965
Compare
m_requests.empty() was read without holding m_requestsMtx while Erase and Add mutate it under the lock. Read the map under the lock to avoid undefined behavior, matching the fix already applied to HttpClientManager::cancelAllRequests(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| xcodebuild test -scheme iOSUnitTests -destination "platform=iOS Simulator,name=$SIMULATOR" | ||
| # Resolve simulator UUID to avoid ambiguous destination matching. | ||
| # Grab the last (newest OS) available device matching the requested name. | ||
| SIM_ID=$(xcrun simctl list devices available | grep "$SIMULATOR" | grep -oE '[A-F0-9-]{36}' | tail -1) |
There was a problem hiding this comment.
Can we resolve the simulator from xcrun simctl list devices available --json and require device.name == "$SIMULATOR" before taking the UDID?
grep "$SIMULATOR" is more of substring match, and can resolve to multiple devices.
| if [[ "${{ matrix.os }}" == "macos-13" ]]; then | ||
| export IOS_DEPLOYMENT_TARGET=13.0; | ||
| if [[ "${{ matrix.os }}" == "macos-14" ]]; then | ||
| export IOS_DEPLOYMENT_TARGET=14.0; |
There was a problem hiding this comment.
The runner update from macos-13 to macos-14 makes sense, but this also raises the CI deployment target from IOS_DEPLOYMENT_TARGET=13.0 to 14.0. Is that intentional?
The current macos-14 log shows the job is using Xcode 15.4, which still supports lower iOS deployment targets, and README.md still documents iOS 12+ as supported. Since runner OS and deployment target are independent, I think we should keep this leg on macos-14 but continue exporting IOS_DEPLOYMENT_TARGET=13.0 unless we are intentionally dropping that lower-iOS coverage
Modernize all GitHub Actions workflows to use latest action versions and fix pre-existing bugs across iOS, Android, and Windows that cause CI failures, crashes, data races, and flaky tests.
Several CI pipelines were failing or logging extensive warnings due to:
Changes
xcodebuild on macos-14 lists duplicate simulator entries, causing it to pick the "Any iOS Device" placeholder instead of a concrete simulator (actions/runner-images#8621). Fix by resolving the simulator UUID via xcrun simctl and using -destination "id=".
FlushAndTeardown() remains required before shutdown, as documented.
Remove LOG_TRACE from ~Logger() to prevent crash when logging subsystem is destroyed before Logger instances during static destruction. Order of namespace destruction can change depending on system, but the original implementation was causing issues for the simulator (where debugLogMutex may be destroyed before the Logger instance). If this is an issue, I can revert it, as I am not sure if it would be an actual issue in a real iOS deployment scenario.
__ANDROID__macro instead of non-standardANDROIDTesting
All changes have been validated through: