Skip to content

debug: config-gated assertion for columnar-to-row transition invariant [do not merge]#3966

Draft
andygrove wants to merge 4 commits intoapache:mainfrom
andygrove:debug-columnar-to-row-assertion
Draft

debug: config-gated assertion for columnar-to-row transition invariant [do not merge]#3966
andygrove wants to merge 4 commits intoapache:mainfrom
andygrove:debug-columnar-to-row-assertion

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented Apr 16, 2026

Which issue does this PR close?

Diagnostic probe for a customer-reported issue following #3879. No issue filed yet.

Rationale for this change

A user reports a plan in which Spark's `ColumnarToRowExec` ends up wrapping a child whose `supportsColumnar == false`. The initial plan (before AQE) is clean; the bad shape emerges after AQE re-planning. Not reproduced locally across several configuration sweeps.

This adds an off-by-default diagnostic that surfaces the bad plan as a loud driver-side failure with the full offending subtree, instead of a runtime symptom that is hard to trace back. Leaving it in as a gated config means it stays available for future transition regressions at zero production cost.

What changes are included in this PR?

  • New config `spark.comet.assertValidPlanTransitions.enabled` (default: `false`) under the `exec_explain` category.

  • When enabled, `EliminateRedundantTransitions.apply` walks the post-rule plan and throws `IllegalStateException` if any of the following wraps a non-columnar child:

    • Spark's `ColumnarToRowExec`
    • `CometColumnarToRowExec`
    • `CometNativeColumnarToRowExec`

    The exception message includes the offending child's class name and the full `treeString`.

How are these changes tested?

Built against `main` and exercised a local DPP + AQE sweep (V1/V2 sources, skew/coalesce, chained joins) with the config enabled. The assertion does not fire on clean plans. The intent is for the customer to set the flag and re-run their failing query so we can capture the offending plan shape.

Diagnostic-only assertion in EliminateRedundantTransitions to surface an
intermittent bad-plan shape reported after apache#3879 (DPP shuffle fallback).
Every ColumnarToRowExec / CometColumnarToRowExec / CometNativeColumnarToRowExec
in the post-rule plan must have child.supportsColumnar == true.

Not intended for merge. Remove once the root cause is identified.
…nabled

Converts the columnar-to-row invariant check in EliminateRedundantTransitions
from an always-on assertion into an off-by-default diagnostic config, so the
check stays available for future transition regressions without runtime cost
in production.
@andygrove andygrove changed the title debug: assert columnar-to-row transitions have columnar children debug: config-gated assertion for columnar-to-row transition invariant Apr 16, 2026
@andygrove andygrove changed the title debug: config-gated assertion for columnar-to-row transition invariant debug: config-gated assertion for columnar-to-row transition invariant [do not merge] Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant