
feat: WithCapture helper + capture watchdog (T2.1a, T4.1) #96

Merged
dndungu merged 5 commits into main from wave-4a-integration
Apr 16, 2026


@dndungu dndungu commented Apr 16, 2026

Summary

Wave 4a of the GB10 CUDA graph capture fix (docs/plan.md E2+E4). Partially closes #93: adds the two foundational primitives that the E2 fix and the E4 fallback build on.

  • T2.1a: (*GPUEngine[T]).WithCapture(fn func() error) (GraphHandle, error) — safe one-call API for entering a CUDA graph capture region through the engine. Correctly engages CaptureAwareAllocator for the duration of fn. Returns a replayable GraphHandle. 6 CPU-mock unit tests (nil stream, fn error propagation, begin/end error propagation, error precedence, valid handle return).
  • T4.1: captureWatchdog in graph/cuda_graph.go, a watchdog goroutine with a 30s timeout that samples StreamCaptureStatus every second during capture. Detects both the Invalidated status and probe stalls. Sentinel errors ErrCaptureTimeout and ErrCaptureInvalidated. Wired into captureAndRun. 4 CPU-mock unit tests (nil-stream no-op, cancel stops goroutine, sentinel identity, default timeout).

These unblock Wave 4b (T2.2 capture-aware allocWeight routing, T2.3 workspace pre-allocation) and Wave 5 (T4.2 CaptureSafe helper).

Refs #93.

Verification

  • Build: go build ./... PASS
  • Test: go test ./compute/... ./graph/... -race -timeout 120s PASS (10 new tests total)
  • Vet: delta zero vs origin/main
  • Merge safety (M0-M5): PASS, zero file overlap between branches
  • Stub audit: zero hits in production diff

Test plan

  • go build ./...
  • go test ./compute/... ./graph/... -race -timeout 120s
  • go vet ./... (delta zero)
  • CI green (auto)

dndungu added 5 commits April 16, 2026 09:11
…mpling

Add a captureWatchdog goroutine that monitors CUDA graph capture health
during stream capture. The watchdog:

- Polls cuda.StreamCaptureStatus every 1 second
- Detects CaptureStatusInvalidated and force-ends capture
- Enforces a 30-second total timeout via context.WithTimeout
- Treats probe stalls (>5s) as hang signals
- Is a no-op when stream is nil (CPU-only builds)
- Cleans up via cancel() when capture completes normally

The watchdog is wired into captureAndRun between StreamBeginCapture
and StreamEndCapture. On error, capture falls back to uncaptured
execution via the existing failure path.

Tests in capture_watchdog_test.go cover nil-stream no-op, cancel
stops goroutine, sentinel error identity, and default timeout value.
All tests run without CUDA.
…ifecycle

WithCapture(fn) wraps BeginCapture/EndCapture into a single call that
ensures the CaptureAwareAllocator is engaged for the duration of fn.
Returns the GraphHandle on success so callers can replay the captured
graph. fn error takes precedence over EndCapture error; the graph is
destroyed on fn failure.

Also introduces test-swappable indirection for StreamBeginCapture,
StreamEndCapture, GraphInstantiate, and GraphDestroy — following the
existing captureStatusFn pattern — so WithCapture can be unit-tested
without real CUDA hardware.
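
A test-swappable indirection of this kind is usually a package-level function variable that tests replace and then restore. The sketch below shows the general pattern only; `streamBeginCaptureFn`, `beginCapture`, and `swapBeginCapture` are hypothetical names, and the default stub stands in for the real CUDA binding.

```go
package main

import (
	"errors"
	"fmt"
)

// streamBeginCaptureFn is a hypothetical package-level hook. In production
// it would point at the real CUDA call; here the default is a failing stub.
var streamBeginCaptureFn = func() error {
	return errors.New("no CUDA device in this sketch")
}

// beginCapture always goes through the hook, so tests can swap it.
func beginCapture() error {
	if err := streamBeginCaptureFn(); err != nil {
		return fmt.Errorf("begin capture: %w", err)
	}
	return nil
}

// swapBeginCapture installs fake and returns a func that restores the
// previous hook, suitable for defer or t.Cleanup in a test.
func swapBeginCapture(fake func() error) (restore func()) {
	orig := streamBeginCaptureFn
	streamBeginCaptureFn = fake
	return func() { streamBeginCaptureFn = orig }
}

func main() {
	restore := swapBeginCapture(func() error { return nil })
	fmt.Println("with fake:", beginCapture())
	restore()
	fmt.Println("restored:", beginCapture())
}
```

Because the hook is shared package state, tests that swap it should not run in parallel with each other.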
@dndungu dndungu merged commit 6efe00c into main Apr 16, 2026
1 check passed
@dndungu dndungu deleted the wave-4a-integration branch April 16, 2026 16:17

Development

Successfully merging this pull request may close these issues.

GB10 CUDA graph capture silently hangs during multi-tensor weight upload (CrossAsset training)