
xchplot2

GPU plotter for Chia v2 proofs of space (CHIP-48). Produces farmable .plot2 files byte-identical to the pos2-chip CPU reference.

Status — work in progress. The plotter produces correct, spec-compliant .plot2 output: per-phase parity tests verify byte-identical agreement with pos2-chip's CPU reference at every stage, the CUB and SYCL backends produce bit-identical files, and determinism holds across runs. The project is under active development; performance, cross-vendor support (AMD / Intel), and the install / CI story are still evolving. Expect rough edges, and use the cuda-only branch if you want the most-tested code path.

Branches: main carries the SYCL/AdaptiveCpp port that lets the plotter run on AMD and Intel GPUs (with an opt-out CUB sort path preserved for NVIDIA). The original CUDA-only implementation, which is ~1.5× faster on NVIDIA than the SYCL fallback at k=28, lives on the cuda-only branch — use it if you only ever target NVIDIA and want the last bit of throughput.

Hardware compatibility

  • GPU: NVIDIA, compute capability ≥ 6.1 (Pascal / GTX 10-series and newer). Builds auto-detect the installed GPU's compute_cap via nvidia-smi; override with $CUDA_ARCHITECTURES for fat or cross-target builds (see Build).
  • VRAM: 8 GB minimum. Cards with less than ~17 GB free transparently use the streaming pipeline; 18 GB+ cards reliably use the persistent buffer pool for faster steady-state. Both paths produce byte-identical plots. Detailed breakdown in VRAM.
  • PCIe: Gen4 x16 recommended. A narrower link (e.g. Gen4 x4) adds ~240 ms per plot to the final fragment D2H copy; check cat /sys/bus/pci/devices/*/current_link_width under load if throughput looks off.
  • Host RAM: ≥ 16 GB recommended; batch mode pins ~4 GB of host memory for D2H double-buffering (pool or streaming).
  • CUDA Toolkit: 12+ required to build (tested on 13.x). Runtime users on RTX 50-series (Blackwell, sm_120) need a driver bundle that ships Toolkit 12.8+; earlier toolkits lack Blackwell codegen.
  • OS: Linux (tested on modern glibc distributions). Windows and macOS are not currently tested.

Build

Three ways to get the dependencies in place, easiest first:

1. Container (podman compose or docker compose)

Easiest path — let the wrapper detect your GPU and pick the right compose service automatically:

./scripts/build-container.sh    # auto: nvidia-smi → cuda, rocminfo → rocm
podman compose run --rm cuda plot -k 28 -n 10 -f <farmer-pk> -c <pool-contract> -o /out

compose.yaml defines three vendor-specific services sharing one Containerfile; the script just runs compose build against whichever matches your hardware. Override manually if you prefer:

# NVIDIA (default sm_89; override via $CUDA_ARCH=120 etc.)
podman compose build cuda

# AMD ROCm — set $ACPP_GFX from `rocminfo | grep gfx`.
ACPP_GFX=gfx1031 podman compose build rocm    # Navi 22
ACPP_GFX=gfx1100 podman compose build rocm    # Navi 31 (default)

# Intel oneAPI (experimental, untested).
podman compose build intel

Plot files land in ./plots/ on the host. The container also bundles the parity tests (sycl_sort_parity, sycl_g_x_parity, etc.) under /usr/local/bin/ for quick first-port validation on a new GPU:

podman compose run --rm --entrypoint /usr/local/bin/sycl_sort_parity rocm

First build is ~15-30 min (AdaptiveCpp + LLVM 18 compile from source); subsequent rebuilds reuse the cached layers. GPU performance inside the container is identical to native (devices pass through via CDI on NVIDIA, /dev/kfd+/dev/dri on AMD; kernels run on real hardware).

2. Native install via scripts/install-deps.sh

./scripts/install-deps.sh        # auto-detects distro + GPU vendor

Installs the toolchain via the system package manager (Arch, Ubuntu / Debian, Fedora) plus AdaptiveCpp from source into /opt/adaptivecpp. Pass --gpu amd to force the AMD path (CUDA Toolkit headers only, plus ROCm). Pass --no-acpp to skip the AdaptiveCpp build and let CMake fall back to FetchContent.

3. Manual / FetchContent fallback

If you'd rather install dependencies yourself, the toolchain is:

  • AdaptiveCpp 25.10+ — SYCL implementation. CMake auto-fetches it via FetchContent if find_package(AdaptiveCpp) fails; the first build adds ~15-30 min. Disable with -DXCHPLOT2_FETCH_ADAPTIVECPP=OFF if you want a hard error instead.
  • CUDA Toolkit 12+ (headers) — required on every build path because AdaptiveCpp's half.hpp includes cuda_fp16.h. nvcc itself only runs when XCHPLOT2_BUILD_CUDA=ON (the default; pass OFF for AMD/Intel).
  • LLVM / Clang ≥ 18 — clang + libclang dev packages.
  • C++20 compiler — clang ≥ 18 or gcc ≥ 13.
  • CMake ≥ 3.24, Ninja, Python 3 — build tools.
  • Boost.Context, libnuma, libomp — AdaptiveCpp runtime deps.
  • Rust toolchain (stable) — for keygen-rs and cargo install.

pos2-chip and FSE are auto-fetched at CMake configure time (FetchContent); override -DPOS2_CHIP_DIR=/abs/path for a local checkout.

For non-NVIDIA targets, the build also probes:

  • ROCm 6+ (rocminfo): if found, sets ACPP_TARGETS=hip:gfxXXXX.
  • Intel oneAPI (Level Zero / compute-runtime): manual ACPP_TARGETS.

cargo install

cargo install --git https://github.com/Jsewill/xchplot2

build.rs auto-detects the local GPU's compute capability by querying nvidia-smi --query-gpu=compute_cap and builds for only that architecture. That keeps the binary small and the build fast when the build host is also the machine that will run the plotter.

If auto-detection fails (no nvidia-smi in PATH, or nvidia-smi can't see a GPU — common when building inside a container or on a headless build host that lacks the CUDA driver), the build falls back to sm_89.
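The selection logic described above can be sketched in shell. This is an illustration, not the actual build.rs: query nvidia-smi for compute_cap, strip the dot ("8.9" becomes "89"), honor a $CUDA_ARCHITECTURES override, and fall back to sm_89 when no GPU is visible.

```shell
# Illustrative sketch of the arch selection (not the real build.rs).
# When nvidia-smi is absent or sees no GPU, cap stays empty and the
# fallback value 89 is used; $CUDA_ARCHITECTURES always wins.
cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null \
      | head -n1 | tr -d '. ')
arch="${CUDA_ARCHITECTURES:-${cap:-89}}"
echo "CUDA architecture(s): ${arch}"
```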

If you need to target a GPU that isn't the one doing the build — or if you want a single "fat build" binary that covers multiple architectures — override with $CUDA_ARCHITECTURES:

# Fat build for Ada (4090) and Blackwell (5090):
CUDA_ARCHITECTURES="89;120" cargo install --git https://github.com/Jsewill/xchplot2

# Single target (e.g. Turing 2080 Ti):
CUDA_ARCHITECTURES=75 cargo install --git https://github.com/Jsewill/xchplot2

Common values: 61 GTX 10-series, 70 Volta, 75 Turing, 80 A100, 86 RTX 30-series, 89 RTX 40-series, 90 H100, 120 RTX 50-series.

CMake (also builds the parity tests)

cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

pos2-chip is auto-fetched via FetchContent; override with -DPOS2_CHIP_DIR=/abs/path/to/pos2-chip to point at a local checkout.

Outputs:

  • build/tools/xchplot2/xchplot2
  • build/tools/parity/{aes,xs,t1,t2,t3}_parity — bit-exact CPU/GPU tests

Use

Standalone (farmable plots)

xchplot2 plot -k 28 -n 10 \
    -f <farmer-pk> \
    -c <pool-contract-address> \
    -o <output-dir>

Pool variants: -p <pool-pk> or --pool-ph <pool-ph>. Other common flags: -s <strength>, -T testnet, -S <seed> for reproducible runs, -v verbose. Full help: xchplot2 -h.

Grouping plots: -i <plot-index> and -g <meta-group>

Both are v2 PoS fields and default to 0. <plot-index> (u16) is the within-group identifier; plot -n N uses it as the base and increments per plot (so -i 0 -n 1000 produces plots with plot_index 0..999). <meta-group> (u8) is a challenge-isolation boundary — plots with different meta_group values are guaranteed never to pass the same challenge.
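The increment semantics can be illustrated with a throwaway loop (only the index arithmetic below is from the text above; the echo format is invented for display):

```shell
# Illustration of -i/-n increment semantics: base index 5, three plots
# -> plot_index 5, 6, 7; meta_group is not incremented.
base=5; n=3; g=0
for ((j = 0; j < n; j++)); do
  echo "plot $((j + 1)): plot_index=$((base + j)) meta_group=${g}"
done
```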

The PoS2 spec defines a grouped-plot file layout (multiple plots interleaved into one container per storage device, to amortise harvester seeks), but the on-disk format is not yet defined upstream in pos2-chip / chia-rs. xchplot2 therefore produces one .plot2 file per plot for now, pending those upstream decisions. When the grouped layout lands, the auto-incrementing <plot-index> above is the per-plot within-group identifier it will expect.

Lower-level subcommands

xchplot2 test  <k> <plot-id-hex> [strength] ...   # single plot, raw inputs
xchplot2 batch <manifest.tsv> [-v]                # batched, raw inputs

Testing farming on a testnet

v2 (CHIP-48) farming in stock chia-blockchain is presently unfinished upstream: services aren't wired into the farmer group, a message handler's signature doesn't match its decorator, ProofOfSpace.challenge is computed from the wrong input, and the dependency pin on chia_rs excludes the 0.42 release where compute_plot_id_v2 lives. contrib/testnet-farming.patch is a minimal, self-contained fix-up that gets a private testnet running end-to-end:

git clone https://github.com/Chia-Network/chia-blockchain
cd chia-blockchain
git checkout 39f8bec88   # 2.7.0 Checkpoint Merge
git apply /path/to/xchplot2/contrib/testnet-farming.patch

The patch's header comment describes each hunk. None of the changes are xchplot2-specific — they're the farmer / harvester / daemon pieces any v2 plot needs for farming, regardless of who produced it.

Architecture

src/gpu/                 CUDA kernels — AES, Xs, T1, T2, T3
src/host/
├── GpuPipeline          Xs → T1 → T2 → T3 device orchestration;
│                          pool + streaming (low-VRAM) variants
├── GpuBufferPool        persistent device + 2× pinned host pool
├── BatchPlotter         producer / consumer batch driver
└── PlotFileWriterParallel  sole TU touching pos2-chip headers
tools/xchplot2/          CLI: plot / test / batch
tools/parity/            CPU↔GPU bit-exactness tests
keygen-rs/               Rust staticlib: plot_id_v2, BLS HD, bech32m

VRAM

PoS2 plots are k=28 by spec. Two code paths, dispatched automatically based on available VRAM:

  • Pool path (~16 GB device + ~6 GB pinned host; selected reliably on 18 GB+ cards). The persistent buffer pool is sized for the worst case and reused across plots in batch mode, amortising allocator cost and enabling double-buffered D2H. Steady-state targets: RTX 4090 / 5090, A6000, H100, etc. An RTX 4080 (16 GB) may transparently fall back to streaming once driver overhead is accounted for.
  • Streaming path (~8 GB). Allocates per-phase and frees between phases; T1/T2 sorts are tiled (N=2 and N=4 respectively) and the merge-with-gather is split into three passes so the live set stays under 8 GB. Targets 8 GB cards (GTX 1070 class and up). Slower per plot (~3.7 s vs ~2.4 s at k=28 on a 4090) because it pays per-phase cudaMalloc/cudaFree instead of amortising.

xchplot2 queries cudaMemGetInfo at pool construction; if the pool doesn't fit, it transparently falls back to the streaming pipeline with no flag needed. Force streaming on any card with XCHPLOT2_STREAMING=1, useful for testing or for users who want the smaller peak regardless.

Plot output is bit-identical between the two paths — the streaming code reorganises memory, not algorithms.
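The dispatch rule amounts to a simple predicate. An illustrative sketch in shell (the real check is C++ via cudaMemGetInfo, and the 16 GB pool figure is the approximate requirement from the pool-path notes above; FREE_VRAM_GIB is a stand-in variable, not a real xchplot2 knob):

```shell
# Illustrative sketch of pipeline dispatch (not the actual C++ source).
free_gib=${FREE_VRAM_GIB:-24}   # stand-in for the cudaMemGetInfo query
pool_gib=16                     # approximate persistent-pool requirement
if [ -n "${XCHPLOT2_STREAMING:-}" ] || [ "${free_gib}" -lt "${pool_gib}" ]; then
  echo "pipeline: streaming"    # forced, or pool doesn't fit
else
  echo "pipeline: pool"
fi
```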

Performance

k=28, strength=2, RTX 4090 (sm_89), PCIe Gen4 x16. Steady-state per-plot wall from xchplot2 batch (10-plot manifest, mean):

Build                                                    Per plot   Notes
pos2-chip CPU baseline                                   ~50 s      reference
cuda-only branch                                         2.15 s     original CUDA-only path
main, XCHPLOT2_BUILD_CUDA=ON (CUB sort)                  2.41 s     NVIDIA fast path on the SYCL/AdaptiveCpp port
main, XCHPLOT2_BUILD_CUDA=OFF (hand-rolled SYCL radix)   3.79 s     cross-vendor fallback (AMD/Intel) on AdaptiveCpp
streaming path, ≤8 GB cards                              ~3.7 s     pool path preferred when VRAM allows

The main/CUB row is ~12% slower than cuda-only, due to extra AdaptiveCpp scheduling overhead. The SYCL row is ~57% slower than CUB on the same NVIDIA hardware; ~88% of GPU compute is identical between the two paths (per nsys per-kernel breakdown), so the gap is dominated by host-side runtime overhead in AdaptiveCpp's DAG manager rather than kernel performance. AMD and Intel runtimes are untested; expect roughly the SYCL-row latency scaled by relative GPU throughput.

License

MIT — see LICENSE and NOTICE for third-party attributions. Built collaboratively with Claude.
