feat: GB10 CUDA graph capture probes (T1.1, T1.2) #94

Merged
dndungu merged 6 commits into main from wave-1-integration
Apr 16, 2026
@dndungu dndungu commented Apr 16, 2026

Summary

Wave 1 of the GB10 CUDA graph capture hang fix (docs/plan.md E1).

  • T1.1: Add cuda.StreamCaptureStatus purego binding wrapping cudaStreamGetCaptureInfo. Three-valued enum (None/Active/Invalidated). Safe no-op on CPU-only runtimes.
  • T1.2: Add ensureNotCapturing() guard on *GPUEngine[T] that trips when a weight alloc/upload is attempted during active capture. Returns new sentinel compute.ErrCaptureIncompatibleAllocation. Wired into allocWeight and uploadBytes. This makes the silent GB10 hang observable from the call site so we can fail fast (or fall back in future waves).

This lands the probes only. Follow-on work: T1.3 (reproduction test under //go:build dgxgb10), T1.4 (Spark manifest + hardware run), then E2 fixes (T2.1a WithCapture helper, T2.2 capture-aware allocWeight routing).

Verification report

  • Merge safety protocol (M0-M5): PASS. T1.2 was rebased atop T1.1; merged both via --no-ff onto wave-1-integration. Silent-revert check: every non-context line from each branch's M1 patch is reflected in git diff main...HEAD.
  • Build: go build ./... PASS.
  • Test: go test ./... -race -timeout 180s PASS (all packages, including compute 2.8s with -race).
  • Vet: delta from origin/main is zero (28 pre-existing possible misuse of unsafe.Pointer warnings in internal GPU shim packages, none in files this PR touches).
  • Lint (golangci-lint run): zero new issues. Pre-existing findings in compute/gpu_fp8.go, compute/cpu_engine_quant_test.go, compute/gpu_paged_gqa_test.go, compute/ternary_gemv.go (none touched by this PR).
  • Stub audit: zero TODO/FIXME/Stub/Mock/Fake/Placeholder/NotImplemented in production diff.
  • Use case: UC-003 (fail-fast on capture-incompatible alloc) exercised by compute/capture_guard_test.go (nil-stream safe path + errors.Is sanity).

Files touched

  • internal/cuda/purego.go (+8 −6) — register cudaStreamGetCaptureInfo symbol
  • internal/cuda/runtime_purego.go (+35) — StreamCaptureStatus + CaptureStatus* constants
  • internal/cuda/runtime_purego_test.go (+58, new) — binding smoke tests
  • compute/errors.go (+11, new) — ErrCaptureIncompatibleAllocation sentinel
  • compute/gpu_engine.go (+35 −2) — ensureNotCapturing method; wire into allocWeight + uploadBytes
  • compute/capture_guard_test.go (+40, new) — sentinel + nil-stream tests
  • docs/plan.md — mark T1.1/T1.2 complete

Test plan

  • go build ./...
  • go test ./... -race -timeout 180s
  • go vet ./... (delta zero vs origin/main)
  • golangci-lint run on touched packages (zero new findings)
  • CI green on this PR (auto)

dndungu added 6 commits April 15, 2026 21:00
Replaces the closed Issue-79 investigation plan with a comprehensive
6-epic execution plan for resolving the silent hang on NVIDIA DGX Spark
GB10 (arm64 Grace Blackwell) during multi-tensor weight uploads with
CUDA graph capture active.

Waves 1-8 define parallel agent counts, deliverables (E1 reproduction,
E2 capture-aware alloc, E3 conditional Mmap investigation, E4 fail-fast
fallback, E5 downstream rollout, E6 release), milestones M1-M5, risk
register, and Spark operational notes.
Wraps cudaStreamGetCaptureInfo so callers can detect stream capture
state before recording incompatible operations. Returns None without
error when the runtime is unavailable (CPU-only builds).

Used by T1.2's ensureNotCapturing guard in compute/gpu_engine.go to
block sync allocations during graph capture.
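
A rough sketch of the no-op behaviour this commit message describes, under stated assumptions: `rawCaptureInfo` is a hypothetical function value standing in for the registered `cudaStreamGetCaptureInfo` symbol (nil when the CUDA runtime was not loaded), and its simplified two-value return is not the real C signature. Treating a zero stream handle as "not capturing" follows the nil-stream safe path mentioned in the PR description.

```go
package main

import "fmt"

// CaptureStatus mirrors cudaStreamCaptureStatus (None, Active, Invalidated).
type CaptureStatus int

const (
	CaptureStatusNone CaptureStatus = iota
	CaptureStatusActive
	CaptureStatusInvalidated
)

// rawCaptureInfo is a hypothetical stand-in for the bound runtime symbol.
// It returns the raw status and a CUDA error code (0 means success).
var rawCaptureInfo func(stream uintptr) (status int32, code int32)

// StreamCaptureStatus reports the capture state of stream. Without a loaded
// runtime, or with a nil stream, it reports None and no error.
func StreamCaptureStatus(stream uintptr) (CaptureStatus, error) {
	if rawCaptureInfo == nil || stream == 0 {
		return CaptureStatusNone, nil
	}
	raw, code := rawCaptureInfo(stream)
	if code != 0 {
		return CaptureStatusNone, fmt.Errorf("cudaStreamGetCaptureInfo failed with code %d", code)
	}
	switch s := CaptureStatus(raw); s {
	case CaptureStatusNone, CaptureStatusActive, CaptureStatusInvalidated:
		return s, nil
	default:
		return CaptureStatusNone, fmt.Errorf("unknown capture status %d", raw)
	}
}

func main() {
	// With no runtime bound, the probe is a safe no-op.
	status, err := StreamCaptureStatus(1)
	fmt.Println(status == CaptureStatusNone, err == nil)
}
```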
…atibleAllocation

Weight allocation and host-to-device uploads now fail fast with a
typed error when invoked while a CUDA graph capture is active on the
engine's stream. On GB10 the legacy path silently hangs because
cudaMalloc / MallocManaged are not capturable; this guard surfaces
that condition to callers via errors.Is(err, ErrCaptureIncompatibleAllocation).

The guard queries cuda.StreamCaptureStatus (T1.1). On CPU-only
runtimes or nil streams the probe reports no active capture, the guard
returns nil, and the code path is unchanged. Probe failures are
propagated rather than swallowed.

Refs E1 T1.2.
@dndungu dndungu merged commit b183ff8 into main Apr 16, 2026
1 check passed
@dndungu dndungu deleted the wave-1-integration branch April 16, 2026 04:27