feat(avm): dynamic work distribution in AVM sumcheck prover round #22643
jeanmon merged 2 commits into merge-train/avm
Conversation
49f5125 to 211e033
fcarreiro left a comment
Looks good! I'd suggest you run and benchmark with other CPU counts as well, e.g., smaller and maybe non powers of 2.
// Accumulate the contribution from each sub-relation across each edge of the hyper-cube
- parallel_for(num_threads, [&](size_t thread_idx) {
+ parallel_for(total_chunks, [&](size_t chunk_id) {
I always forget that parallel_for will always have a pool of get_num_cpus() threads, not of the number of iterations passed.
Yes, this is something I realized while working on this! The first argument to parallel_for() is the number of tasks you distribute to the thread pool. The variable name "num_threads" was a bit misleading.
// Accumulate the contribution from each sub-relation across each edge of the hyper-cube
- parallel_for(num_threads, [&](size_t thread_idx) {
+ parallel_for(total_chunks, [&](size_t chunk_id) {
+     thread_local size_t thread_idx = thread_counter.fetch_add(1);
This seems ok. I wonder if it could be a problem if multiple sumcheck rounds are run. Maybe ask @ludamad. Worst case, parallel_for could be modified to provide the thread_idx on top of the chunk_id if the lambda accepts it. It would be provided here: https://github.com/AztecProtocol/aztec-packages/blob/next/barretenberg/cpp/src/barretenberg/common/parallel_for_mutex_pool.cpp#L87 and would need to be plumbed from here https://github.com/AztecProtocol/aztec-packages/blob/next/barretenberg/cpp/src/barretenberg/common/parallel_for_mutex_pool.cpp#L134 via the unused size_t.
In https://github.com/AztecProtocol/aztec-packages/blob/next/barretenberg/cpp/src/barretenberg/common/parallel_for_mutex_pool.cpp#L157 there is a safeguard preventing a parallel_for from being nested.
@fcarreiro I guess this should be fine.
It's not about nesting. It's about running one sumcheck, then another. A static thread_local will keep the same index a thread used in the previous sumcheck round. I think it will be ok, but in general it's conceptually "dirty". This class has state, and that state should in principle only be kept on the same class instance. However, the thread index will carry across class instances.
For example, if the number of CPUs changed in between (which it shouldn't), things would break.
@fcarreiro Ah, I got it now. Thanks for the clarifications. It is a very good catch.
I think the static keyword for the atomic thread_counter should not be required. The main thread needs to wait until each thread is done with its task in parallel_for(); otherwise we have another, bigger problem. In other words, thread_counter as a plain local variable would still be accessible to every thread of the current run before the main thread continues to the next computation (for (auto& accumulators : thread_univariate_accumulators) {).
WDYT @fcarreiro?
Sorry, it seems there is an issue if we simply remove the static keyword. Namely, if a thread in the second call runs for the first time, it will be assigned thread_idx = 0, which would collide with a thread of the previous run, and we would get a race condition.
That's right. In some sense the "static std::atomic<size_t> thread_counter{ 0 };" could probably be moved to the line just before the "thread_local size_t thread_idx = thread_counter.fetch_add(1);"; in any case it will be static.
That's the downside of this solution: those two statics (thread_local implies static storage duration) are difficult to reason about. I think the cleanest solution would be to be able to get the thread index from the pool, as I mentioned before.
@fcarreiro @ledwards2225 I have a change in progress which would solve this without touching the thread.hpp interfaces.
We allocate a number of tasks equal to the number of threads, and when fetching a new chunk each thread uses an atomic counter to get the next available chunk. According to Claude it is even faster than the mutex in the pool.
@fcarreiro @ledwards2225 I pushed a second commit implementing the new variant I mentioned above. Benchmarks showed no regression (even a 2% speedup on average, but that is too small to say for sure it is faster).
Switch compute_univariate_avm from static per-thread assignment to dynamic chunk distribution via the atomic thread pool. Each work item processes all relations for rows_per_chunk=16 consecutive rows, and threads atomically pick up the next chunk as they finish. On a large AVM proof (173706 edges, 32 threads), this improves the best-of-5 sumcheck time from 2440ms to 2010ms (-17.6%).
Replace the static atomic counter + thread_local slot assignment with a work-stealing loop: parallel_for now dispatches get_num_cpus() outer tasks, each owning a fixed accumulator slot and pulling chunks from a local atomic counter. This removes the program-scoped state that carried across SumcheckProverRound instances. No functional change. Small median improvement (~1-2%) observed on /tmp/bb-RxZhBt (10 runs, 32 threads).
211e033 to 28b569a
Nice! Question: would bumping the chunk size from 2 to 16 with the old static approach have given about the same improvement? I was surprised to see that the previous (contiguous) chunk was only size 2, which seems rough for cache. I assume 2 was chosen to balance thread work across the inhomogeneous trace, but I would have guessed a much larger chunk (e.g. 16-64) would still be sufficient for that purpose. The number of expected cache misses is a function of this contiguous chunk size and is independent of thread count; could that explain why the benefit is roughly flat across different thread counts? In any case I like this change: simple and robust.
fcarreiro left a comment
Interesting that this is faster than the baseline approach, since it's so similar :) Notable that letting threads "pick up" chunks makes so much of a difference. Cool!
PS: I hate while(true)s, but that's life!
I would hope that that wouldn't be enough. Cache will not help here, I think, especially for the AVM. As Jean mentioned, the data is laid out column-wise, and sumcheck reads row-wise. I would never expect reading one row to keep anything of the next row in cache. I think the gain comes from not prescribing chunks for each thread and allocating them dynamically.
But that would suggest that a chunk size of 2 is not sufficient to achieve a basically uniform distribution of work; how could that be? Wouldn't it suggest that there are at least two single rows that are ~30% more expensive to execute than any other two rows?
For the former approach, I remember that we benchmarked different chunk sizes and increasing it did not perform better. I can try a higher value. @ledwards2225 I cannot definitively say what the real reason for the improvement is, but a static work allocation to the threads is very likely never the best for most traces, as it is very inflexible. The AVM trace is so inhomogeneous that I can imagine the difference between the easiest and the hardest task (over 32 cores) easily varying by 15%.
@ledwards2225 You are right that the static version with 16 rows provides better performance. However, the dynamic variant (this PR) is the clear winner.
Interesting! Thanks for running these. Yeah, to be clear, I agree dynamic is best regardless; I'm just always looking to better understand the tradeoffs in sumcheck because intuition has not always been a good guide. I set up the same approach for Honk and found 64 to be the ideal chunk size. We see about a 15% improvement. I was hoping for even more, since we were previously doing something more naive than the AVM (just splitting the trace into num_threads contiguous chunks), but we also have substantially less regional variation than you guys.
Switch compute_univariate_avm from static per-thread assignment to dynamic chunk distribution via the atomic thread pool. Each work item processes all relations for rows_per_chunk=16 consecutive rows, and threads atomically pick up the next chunk as they finish.
Here are the results of benchmarks for 32, 15, and 7 CPU cores over 20 runs.
Best-of-5 mean (sumcheck ms) across thread counts:
Median:
The value rows_per_chunk = 16 was determined experimentally; values of 8 and 32 lead to slightly worse performance.