
xchplot2

GPU plotter for Chia v2 proofs of space (CHIP-48). Produces farmable .plot2 files byte-identical to the pos2-chip CPU reference.

Status — work in progress. The plotter produces correct, spec-compliant .plot2 output: per-phase parity tests verify byte-identical agreement with pos2-chip's CPU reference at every stage, the CUB and SYCL backends produce bit-identical files, and determinism holds across runs. The project is under active development; performance, cross-vendor support (AMD / Intel), and the install / CI story are still evolving. Expect rough edges, and use the cuda-only branch if you want the most-tested code path.

Branches: main carries the SYCL/AdaptiveCpp port that lets the plotter run on AMD and Intel GPUs (with an opt-out CUB sort path preserved for NVIDIA). The original CUDA-only implementation, which is ~1.5× faster on NVIDIA than the SYCL fallback at k=28, lives on the cuda-only branch — use it if you only ever target NVIDIA and want the last bit of throughput.

Hardware compatibility

  • GPU: NVIDIA, compute capability ≥ 6.1 (Pascal / GTX 10-series and newer). Builds auto-detect the installed GPU's compute_cap via nvidia-smi; override with $CUDA_ARCHITECTURES for fat or cross-target builds (see Build).
  • VRAM: 8 GB minimum. Cards with less than ~17 GB free transparently use the streaming pipeline; 18 GB+ cards reliably use the persistent buffer pool for faster steady-state. Both paths produce byte-identical plots. Detailed breakdown in VRAM.
  • PCIe: Gen4 x16 recommended. A narrower link (e.g. Gen4 x4) adds ~240 ms per plot to the final fragment D2H copy; check cat /sys/bus/pci/devices/*/current_link_width under load if throughput looks off.
  • Host RAM: ≥ 16 GB recommended; batch mode pins ~4 GB of host memory for D2H double-buffering (pool or streaming).
  • CUDA Toolkit: 12+ required to build (tested on 13.x). Runtime users on RTX 50-series (Blackwell, sm_120) need a driver bundle that ships Toolkit 12.8+; earlier toolkits lack Blackwell codegen.
  • OS: Linux (tested on modern glibc distributions). Windows and macOS are not currently tested.

Build

Three ways to get the dependencies in place, easiest first:

1. Container (podman compose or docker compose)

Easiest path — let the wrapper detect your GPU and pick the right compose service automatically:

./scripts/build-container.sh    # auto: nvidia-smi → cuda, rocminfo → rocm
podman compose run --rm cuda plot -k 28 -n 10 -f <farmer-pk> -c <pool-contract> -o /out

compose.yaml defines three vendor-specific services sharing one Containerfile; the script just runs compose build against whichever matches your hardware. Override manually if you prefer:

# NVIDIA (default sm_89; override via $CUDA_ARCH=120 etc.)
podman compose build cuda

# AMD ROCm — set $ACPP_GFX from `rocminfo | grep gfx`.
ACPP_GFX=gfx1031 podman compose build rocm    # Navi 22
ACPP_GFX=gfx1100 podman compose build rocm    # Navi 31 (default)

# Intel oneAPI (experimental, untested).
podman compose build intel

Plot files land in ./plots/ on the host. The container also bundles the parity tests (sycl_sort_parity, sycl_g_x_parity, etc.) under /usr/local/bin/ for quick first-port validation on a new GPU:

podman compose run --rm --entrypoint /usr/local/bin/sycl_sort_parity rocm

First build is ~15-30 min (AdaptiveCpp + LLVM 18 compile from source); subsequent rebuilds reuse the cached layers. GPU performance inside the container is identical to native (devices pass through via CDI on NVIDIA, /dev/kfd+/dev/dri on AMD; kernels run on real hardware).

2. Native install via scripts/install-deps.sh

./scripts/install-deps.sh        # auto-detects distro + GPU vendor

Installs the toolchain via the system package manager (Arch, Ubuntu / Debian, Fedora) plus AdaptiveCpp from source into /opt/adaptivecpp. Pass --gpu amd to force the AMD path (CUDA Toolkit headers only, plus ROCm). Pass --no-acpp to skip the AdaptiveCpp build and let CMake fall back to FetchContent.

3. Manual / FetchContent fallback

If you'd rather install dependencies yourself, the toolchain is:

  • AdaptiveCpp 25.10+ — SYCL implementation. CMake auto-fetches it via FetchContent if find_package(AdaptiveCpp) fails; the first build adds ~15-30 min. Disable with -DXCHPLOT2_FETCH_ADAPTIVECPP=OFF if you want a hard error instead.
  • CUDA Toolkit 12+ (headers) — required on every build path because AdaptiveCpp's half.hpp includes cuda_fp16.h. nvcc itself only runs when XCHPLOT2_BUILD_CUDA=ON (the default; pass OFF for AMD/Intel).
  • LLVM / Clang ≥ 18 — clang + libclang dev packages.
  • C++20 compiler — clang ≥ 18 or gcc ≥ 13.
  • CMake ≥ 3.24, Ninja, Python 3 — build tools.
  • Boost.Context, libnuma, libomp — AdaptiveCpp runtime deps.
  • Rust toolchain (stable) — for keygen-rs and cargo install.

pos2-chip and FSE are auto-fetched at CMake configure time (FetchContent); override -DPOS2_CHIP_DIR=/abs/path for a local checkout.

For non-NVIDIA targets, the build also probes:

  • ROCm 6+ (rocminfo): if found, sets ACPP_TARGETS=hip:gfxXXXX.
  • Intel oneAPI (Level Zero / compute-runtime): manual ACPP_TARGETS.

cargo install

cargo install --git https://github.com/Jsewill/xchplot2

build.rs auto-detects the local GPU's compute capability by querying nvidia-smi --query-gpu=compute_cap and builds for only that architecture. That keeps the binary small and the build fast when the build host is also the machine that will run the plotter.

If auto-detection fails (no nvidia-smi in PATH, or nvidia-smi can't see a GPU — common when building inside a container or on a headless build host that lacks the CUDA driver), the build falls back to sm_89.
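The selection logic described above can be sketched in shell. This is an illustration, not the actual build.rs: query nvidia-smi for compute_cap, strip the dot ("8.9" becomes "89"), honor a $CUDA_ARCHITECTURES override, and fall back to sm_89 when no GPU is visible.

```shell
# Illustrative sketch of the arch selection (not the real build.rs).
# When nvidia-smi is absent or sees no GPU, cap stays empty and the
# fallback value 89 is used; $CUDA_ARCHITECTURES always wins.
cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null \
      | head -n1 | tr -d '. ')
arch="${CUDA_ARCHITECTURES:-${cap:-89}}"
echo "CUDA architecture(s): ${arch}"
```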

If you need to target a GPU that isn't the one doing the build — or if you want a single "fat build" binary that covers multiple architectures — override with $CUDA_ARCHITECTURES:

# Fat build for Ada (4090) and Blackwell (5090):
CUDA_ARCHITECTURES="89;120" cargo install --git https://github.com/Jsewill/xchplot2

# Single target (e.g. Turing 2080 Ti):
CUDA_ARCHITECTURES=75 cargo install --git https://github.com/Jsewill/xchplot2

Common values: 61 GTX 10-series, 70 Volta, 75 Turing, 80 A100, 86 RTX 30-series, 89 RTX 40-series, 90 H100, 120 RTX 50-series.

CMake (also builds the parity tests)

cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

pos2-chip is auto-fetched via FetchContent; override with -DPOS2_CHIP_DIR=/abs/path/to/pos2-chip to point at a local checkout.

Outputs:

  • build/tools/xchplot2/xchplot2
  • build/tools/parity/{aes,xs,t1,t2,t3}_parity — bit-exact CPU/GPU tests

Use

Standalone (farmable plots)

xchplot2 plot -k 28 -n 10 \
    -f <farmer-pk> \
    -c <pool-contract-address> \
    -o <output-dir>

Pool variants: -p <pool-pk> or --pool-ph <pool-ph>. Other common flags: -s <strength>, -T testnet, -S <seed> for reproducible runs, -v verbose. Full help: xchplot2 -h.

Grouping plots: -i <plot-index> and -g <meta-group>

Both are v2 PoS fields and default to 0. <plot-index> (u16) is the within-group identifier; plot -n N uses it as the base and increments per plot (so -i 0 -n 1000 produces plots with plot_index 0..999). <meta-group> (u8) is a challenge-isolation boundary — plots with different meta_group values are guaranteed never to pass the same challenge.
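The increment semantics can be illustrated with a throwaway loop (only the index arithmetic below is from the text above; the echo format is invented for display):

```shell
# Illustration of -i/-n increment semantics: base index 5, three plots
# -> plot_index 5, 6, 7; meta_group is not incremented.
base=5; n=3; g=0
for ((j = 0; j < n; j++)); do
  echo "plot $((j + 1)): plot_index=$((base + j)) meta_group=${g}"
done
```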

The PoS2 spec defines a grouped-plot file layout (multiple plots interleaved into one container per storage device, to amortise harvester seeks), but the on-disk format is not yet defined upstream in pos2-chip / chia-rs. xchplot2 therefore produces one .plot2 file per plot for now, pending those upstream decisions. When the grouped layout lands, the auto-incrementing <plot-index> above is the per-plot within-group identifier it will expect.

Lower-level subcommands

xchplot2 test  <k> <plot-id-hex> [strength] ...   # single plot, raw inputs
xchplot2 batch <manifest.tsv> [-v]                # batched, raw inputs

Testing farming on a testnet

v2 (CHIP-48) farming in stock chia-blockchain is presently unfinished upstream: services aren't wired into the farmer group, a message handler's signature doesn't match its decorator, ProofOfSpace.challenge is computed from the wrong input, and the dependency pin on chia_rs excludes the 0.42 release where compute_plot_id_v2 lives. contrib/testnet-farming.patch is a minimal, self-contained fix-up that gets a private testnet running end-to-end:

git clone https://github.com/Chia-Network/chia-blockchain
cd chia-blockchain
git checkout 39f8bec88   # 2.7.0 Checkpoint Merge
git apply /path/to/xchplot2/contrib/testnet-farming.patch

The patch's header comment describes each hunk. None of the changes are xchplot2-specific — they're the farmer / harvester / daemon pieces any v2 plot needs for farming, regardless of who produced it.

Architecture

src/gpu/                 CUDA kernels — AES, Xs, T1, T2, T3
src/host/
├── GpuPipeline          Xs → T1 → T2 → T3 device orchestration;
│                          pool + streaming (low-VRAM) variants
├── GpuBufferPool        persistent device + 2× pinned host pool
├── BatchPlotter         producer / consumer batch driver
└── PlotFileWriterParallel  sole TU touching pos2-chip headers
tools/xchplot2/          CLI: plot / test / batch
tools/parity/            CPU↔GPU bit-exactness tests
keygen-rs/               Rust staticlib: plot_id_v2, BLS HD, bech32m

VRAM

PoS2 plots are k=28 by spec. Two code paths, dispatched automatically based on available VRAM:

  • Pool path (~16 GB device + ~6 GB pinned host; selected reliably on 18 GB+ cards). The persistent buffer pool is sized for the worst case and reused across plots in batch mode, amortising allocator cost and enabling double-buffered D2H. Steady-state targets: RTX 4090 / 5090, A6000, H100, etc. An RTX 4080 (16 GB) may transparently fall back to streaming once driver overhead is accounted for.
  • Streaming path (~8 GB). Allocates per-phase and frees between phases; T1/T2 sorts are tiled (N=2 and N=4 respectively) and the merge-with-gather is split into three passes so the live set stays under 8 GB. Targets 8 GB cards (GTX 1070 class and up). Slower per plot (~3.7 s vs ~2.4 s at k=28 on a 4090) because it pays per-phase cudaMalloc/cudaFree instead of amortising.

xchplot2 queries cudaMemGetInfo at pool construction; if the pool doesn't fit, it transparently falls back to the streaming pipeline with no flag needed. Force streaming on any card with XCHPLOT2_STREAMING=1, useful for testing or for users who want the smaller peak regardless.

Plot output is bit-identical between the two paths — the streaming code reorganises memory, not algorithms.
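The dispatch rule amounts to a simple predicate. An illustrative sketch in shell (the real check is C++ via cudaMemGetInfo, and the 16 GB pool figure is the approximate requirement from the pool-path notes above; FREE_VRAM_GIB is a stand-in variable, not a real xchplot2 knob):

```shell
# Illustrative sketch of pipeline dispatch (not the actual C++ source).
free_gib=${FREE_VRAM_GIB:-24}   # stand-in for the cudaMemGetInfo query
pool_gib=16                     # approximate persistent-pool requirement
if [ -n "${XCHPLOT2_STREAMING:-}" ] || [ "${free_gib}" -lt "${pool_gib}" ]; then
  echo "pipeline: streaming"    # forced, or pool doesn't fit
else
  echo "pipeline: pool"
fi
```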

Performance

k=28, strength=2, RTX 4090 (sm_89), PCIe Gen4 x16. Steady-state per-plot wall from xchplot2 batch (10-plot manifest, mean):

Build                                                    Per plot   Notes
pos2-chip CPU baseline                                   ~50 s      reference
cuda-only branch                                         2.15 s     original CUDA-only path
main, XCHPLOT2_BUILD_CUDA=ON (CUB sort)                  2.41 s     NVIDIA fast path on the SYCL/AdaptiveCpp port
main, XCHPLOT2_BUILD_CUDA=OFF (hand-rolled SYCL radix)   3.79 s     cross-vendor fallback (AMD/Intel) on AdaptiveCpp
streaming path, ≤8 GB cards                              ~3.7 s     pool path preferred when VRAM allows

The main/CUB row is ~12% slower than cuda-only, due to extra AdaptiveCpp scheduling overhead. The SYCL row is ~57% slower than CUB on the same NVIDIA hardware; ~88% of GPU compute is identical between the two paths (per nsys per-kernel breakdown), so the gap is dominated by host-side runtime overhead in AdaptiveCpp's DAG manager rather than kernel performance. AMD and Intel runtimes are untested; expect roughly the SYCL-row latency scaled by relative GPU throughput.

License

MIT — see LICENSE and NOTICE for third-party attributions. Built collaboratively with Claude.
