Add ChaisemartinDHaultfoeuille (dCDH) DID_M estimator (Phase 1) #290
Phase 1 of the de Chaisemartin-D'Haultfoeuille estimator family — the only modern staggered DiD estimator in the library that handles non-absorbing (reversible) treatments. Treatment can switch on AND off over time (marketing campaigns, seasonal promotions, on/off policy cycles). Implements DID_M from de Chaisemartin & D'Haultfoeuille (2020) AER, equivalently DID_1 (horizon l=1) of the dynamic companion paper (NBER WP 29873).

Ships:
- Headline DID_M point estimate with cohort-recentered analytical SE from Web Appendix Section 3.7.3 of the dynamic companion paper
- Joiners-only (DID_+) and leavers-only (DID_-) decompositions
- Single-lag placebo DID_M^pl (Theorem 4 of AER 2020); analytical placebo SE deferred to Phase 2 (NaN with explicit warning)
- Optional multiplier bootstrap clustered at the group level (Rademacher / Mammen / Webb)
- TWFE decomposition diagnostic from Theorem 1 of AER 2020 (per-cell weights, fraction negative, sigma_fe, beta_fe), exposed both as an estimator field and as the standalone twowayfeweights() helper
- Multi-switch group filtering (drop_larger_lower=True default, matches R DIDmultiplegtDYN); singleton-baseline filter (footnote 15 of the dynamic paper); never-switching-groups filter from the variance — all with explicit warnings (no silent failures)
- Forward-compatible fit() signature: Phase 2 (multi-horizon event study, aggregate, L_max) and Phase 3 (covariate adjustment via controls, group-specific linear trends, HonestDiD) parameters raise NotImplementedError with phase pointers
- generate_reversible_did_data() generator with 7 patterns (single_switch, joiners_only, leavers_only, mixed_single_switch, random, cycles, marketing)
- Validated against R DIDmultiplegtDYN v2.3.3 at horizon l=1 via tests/test_chaisemartin_dhaultfoeuille_parity.py (5 scenarios)

Methodology notes documented in REGISTRY.md:
- Conservative CI under Assumption 8 (independent groups), exact only under iid sampling
- A11 zero-retention convention (paper-faithful, with consolidated warning)
- drop_larger_lower default and inconsistency warning when False
- Singleton-baseline filter (footnote 15)
- Phase 1 placebo SE intentionally NaN (deferred to Phase 2)
- Note (deviation from R DIDmultiplegtDYN): Python uses period-based stable-control sets matching AER 2020 Theorem 3 literally; R uses cohort-based controls from the dynamic paper. The two agree exactly on pure-direction panels and the worked example; they differ by ~1% on mixed-direction panels with both joiners and leavers.

Tests: 95 passing locally (44 API, 8 fast methodology + 2 slow Tier 2, 38 generator, 5 R parity). The hand-calculable 4-group worked example produces DID_M = 2.5, exactly matching R.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
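The headline estimand can be made concrete with a hand-built toy panel. The sketch below is a hypothetical illustration of the DID_M idea (joiner and leaver first-difference comparisons, weighted by switching-cell counts), not the library's implementation and not the PR's own 4-group worked example; the panel, column names, and numbers are all made up here.

```python
import pandas as pd

# Toy 4-group, 2-period panel: g1 joins (0 -> 1), g3 leaves (1 -> 0),
# g2 stays untreated, g4 stays treated.
panel = pd.DataFrame({
    "group": ["g1", "g1", "g2", "g2", "g3", "g3", "g4", "g4"],
    "time":  [0, 1] * 4,
    "d":     [0, 1,  0, 0,  1, 0,  1, 1],
    "y":     [1.0, 6.0, 2.0, 3.0, 4.0, 6.0, 1.0, 5.0],
})

wide_y = panel.pivot(index="group", columns="time", values="y")
wide_d = panel.pivot(index="group", columns="time", values="d")
dy = wide_y[1] - wide_y[0]                      # first differences per group

joiners  = (wide_d[0] == 0) & (wide_d[1] == 1)
leavers  = (wide_d[0] == 1) & (wide_d[1] == 0)
stable_0 = (wide_d[0] == 0) & (wide_d[1] == 0)
stable_1 = (wide_d[0] == 1) & (wide_d[1] == 1)

did_plus  = dy[joiners].mean() - dy[stable_0].mean()   # joiners vs stable-at-0
did_minus = dy[stable_1].mean() - dy[leavers].mean()   # stable-at-1 vs leavers
n_plus, n_minus = joiners.sum(), leavers.sum()
did_m = (n_plus * did_plus + n_minus * did_minus) / (n_plus + n_minus)  # 3.0
```

With one joiner cell and one leaver cell, DID_M is just the equal-weight average of DID_+ = 4 and DID_- = 2, i.e. 3.0.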
…s API

Methodology / inference:
- Bootstrap-updated overall/joiners/leavers inference now goes through safe_inference() instead of inline t_stat = effect / se. Matches the project anti-pattern rule and keeps NaN behavior consistent with the rest of the library.

Placebo doc/code alignment:
- Phase 1 placebo bootstrap is deferred to Phase 2 (the dynamic companion paper Section 3.7.3 derives the cohort-recentered analytical variance for DID_l only; the placebo's influence-function representation is not yet implemented). The bootstrap path covers DID_M, DID_+, and DID_- only.
- Updated the source warning text, the placebo_inputs comment, README, CHANGELOG, REGISTRY, choosing_estimator.rst, and llms-full.txt to drop the misleading "set n_bootstrap > 0 for a placebo SE today" promise and state clearly that placebo inference fields stay NaN even with bootstrap enabled.
- Added test_placebo_bootstrap_unavailable_in_phase_1 to pin this contract so future contributors do not silently widen the bootstrap surface without wiring up the documented Phase 2 derivation.

twowayfeweights() public API:
- Added ValueError validation for missing outcome/group/time/treatment columns (was treatment-only).
- Updated README, docs/api/chaisemartin_dhaultfoeuille.rst, and docs/llms-full.txt examples to pass outcome="outcome" so the documented usage actually runs.

set_params():
- Re-runs the same __init__ validation rules so invalid parameter combinations cannot slip in after construction.

Tests:
- 96 passing locally (was 95). The new placebo test runs alongside TestBootstrap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Final doc inconsistency caught on re-review: practitioner_decision_tree.rst still said "set n_bootstrap > 0 for a bootstrap-based placebo SE today" while README, REGISTRY, choosing_estimator, and llms-full had been updated to reflect that placebo bootstrap is deferred to Phase 2 and placebo inference fields stay NaN even when n_bootstrap > 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
This PR adds a new estimator and touches methodology-critical code paths, but I found three unmitigated

Executive Summary
Methodology
Code Quality
No additional material findings beyond the methodology issues above.

Performance
No material findings in the changed performance-sensitive paths beyond the correctness issues above.

Maintainability
No additional material findings beyond the documentation/test gaps below.

Tech Debt
No mitigating

Security
No material security findings in the changed files.

Documentation/Tests
Path to Approval
Addresses three methodology issues in the dCDH Phase 1 implementation:
1. Cell-count weighting (P1): the per-period DID and placebo formulas
were computing N_{a,b,t} as the sum of within-cell observation counts
instead of the count of (g, t) cells. This silently changed the
estimand on individual-level inputs with uneven cell sizes. Fixed by
replacing n_curr[mask].sum() with int(mask.sum()) and using simple
.mean() over cells instead of n_curr-weighted averages. The fix
matches the AER 2020 Theorem 3 cell-count notation literally and
matches R DIDmultiplegtDYN's default behavior on cell-aggregated
input. Renamed the results field n_switcher_obs -> n_switcher_cells
for accuracy. Test data uses balanced 1-obs-per-cell panels so the
regression test_cell_count_weighting_unbalanced_input is needed to
guard the unbalanced-input path.
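The weighting bug described in item 1 can be reproduced in a few lines. This is a hypothetical sketch, not the estimator's code: `is_joiner`, `n_curr`, and `cell_mean` are made-up stand-ins for a boolean cell mask, within-cell observation counts, and cell-level outcome means.

```python
import numpy as np

# Each entry is one (g, t) cell; n_curr holds within-cell observation counts.
is_joiner = np.array([True, True, False, False])   # two joiner cells
n_curr    = np.array([10, 1, 5, 5])                # uneven cell sizes
cell_mean = np.array([3.0, 7.0, 1.0, 2.0])

# Buggy: N counted as the SUM of observation counts, and an n-weighted
# average -> the estimand drifts on individual-level inputs with uneven cells.
n_wrong   = n_curr[is_joiner].sum()                                   # 11
avg_wrong = np.average(cell_mean[is_joiner], weights=n_curr[is_joiner])

# Fixed: N is the COUNT of (g, t) cells, and a simple mean over cells,
# matching the cell-count notation of the target formula.
n_right   = int(is_joiner.sum())                                      # 2
avg_right = cell_mean[is_joiner].mean()                               # 5.0
```

On a balanced one-observation-per-cell panel the two agree, which is exactly why an unbalanced-input regression test is needed.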
2. Full Lambda^G_{g,l=1} influence function (P1): the analytical SE
path was using a "first-switch only" shortcut where each switching
group's U^G_g came from its own switch period only. This omitted the
periods where switching groups serve as stable controls for other
cohorts. Replaced with _compute_full_per_group_contributions, which
sums role-weighted outcome differences (joiner +1, stable_0
-n_10/n_00, leaver -1, stable_1 +n_01/n_11) across all periods. The
bootstrap path inherits the corrected vectors automatically. SE
parity vs R DIDmultiplegtDYN on pure-direction scenarios narrowed
from ~18% to ~3%; pure-direction parity tests now assert SE within
5% rtol.
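The role-weighting scheme in item 2 can be sketched abstractly. Everything below is hypothetical (function name, data shapes, and the example numbers); it only illustrates the idea that a group's contribution sums role-weighted first-differences over ALL periods where it plays any role, not just its own switch period.

```python
# Role weights per period: switchers enter with +/-1; stable groups enter with
# the (negative of the) control weight implied by the switcher/control counts.
ROLE_WEIGHTS = {
    "joiner":   lambda n: +1.0,
    "stable_0": lambda n: -n["n_10"] / n["n_00"],   # control for joiners
    "leaver":   lambda n: -1.0,
    "stable_1": lambda n: +n["n_01"] / n["n_11"],   # control for leavers
}

def group_contribution(roles_by_period, dy_by_period, counts_by_period):
    """Sum role-weighted outcome first-differences across all periods."""
    total = 0.0
    for t, role in roles_by_period.items():
        if role is not None:
            total += ROLE_WEIGHTS[role](counts_by_period[t]) * dy_by_period[t]
    return total

# A group that is a stable-at-0 control in period 1 and then joins in period 2:
roles  = {1: "stable_0", 2: "joiner"}
dy     = {1: 0.5, 2: 3.0}
counts = {1: {"n_10": 1, "n_00": 2}, 2: {}}
u_g = group_contribution(roles, dy, counts)   # -0.5 * 0.5 + 3.0 = 2.75
```

A "first-switch only" shortcut would have returned 3.0 here, dropping the period-1 control contribution, which is the class of omission the fix addresses.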
3. Singleton-baseline filter at variance stage only (P1): footnote 15
of the dynamic paper says singleton-baseline groups have no cohort
peer for the cohort-recentered variance. The previous code applied
the filter as a point-estimate filter (dropping the groups before
the per-period DIDs), which under Python's documented period-based
stable-control interpretation could change DID_M when a unique-
baseline always-treated or never-treated group would otherwise serve
as a valid stable control. Fixed by computing the singleton-baseline
group set in fit() and passing it to the variance helper, which
excludes those groups from the cohort enumeration and U vector
only. The cell DataFrame retains them. Updated
TestSingletonBaselineFilter to assert the variance-only semantics
and the warning text.
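The variance-only scoping in item 3 amounts to computing a set of groups and threading it to the variance helper. A hypothetical sketch, with the cohort key simplified to baseline treatment only (the real cohort key is richer):

```python
import pandas as pd

# Baseline treatment per group; a singleton-baseline group is the only member
# of its baseline cohort, so it has no peer for cohort recentering.
baseline = pd.Series({"g1": 0, "g2": 0, "g3": 1})

cohort_sizes = baseline.value_counts()
singletons = {g for g, d in baseline.items() if cohort_sizes[d] == 1}

# Exclude singletons from the VARIANCE enumeration only; the point-estimate
# cell data keeps them, e.g. as period-based stable controls.
variance_groups = sorted(set(baseline.index) - singletons)   # ["g1", "g2"]
```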
Documentation:
- doc-deps.yaml and ROADMAP.md no longer link to gitignored
paper review files (REGISTRY.md is the canonical source).
- REGISTRY.md updated with cell-count notation, the full IF formula,
the variance-only singleton-filter scope, the new SE parity gap
numbers (3% vs the prior 18%), and a clearer period-vs-cohort
deviation note.
Tests:
- test_cohort_recentering_not_grand_mean now actually computes both
formulas on the same data and asserts they differ materially. The
prior version only checked overall_se > 0.
- test_cell_count_weighting_unbalanced_input pins the cell-count
contract on a deliberately unbalanced individual-level panel.
- test_parity_joiners_only and test_parity_leavers_only now assert
SE parity at 5% rtol (was previously skipped).
- 97 dCDH tests passing (96 + the new cell-count regression).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
Static review only: this shell is missing

Executive Summary
Methodology
Affected methods: dCDH
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Addresses two new P1s, two P2s, and two P3s from CI re-review:

P1: ragged panel handling
- Add fit() Step 5a validation: reject groups missing the first global period with a clear ValueError listing offenders, and drop groups with interior period gaps with an explicit UserWarning. Eliminates the silent NaN propagation in _compute_cohort_recentered_inputs that could crash the cohort enumeration with int(np.nan) or misclassify late-entry groups' baselines.
- Fix singleton-baseline computation in fit() Step 7 to read the validated first global period explicitly instead of using groupby.first(), which returned the first OBSERVED row per group.
- Add defensive presence-gating in _compute_cohort_recentered_inputs: the helper now refuses to run if N_mat[:, 0] has any zero entries (a fit() validation regression tripwire), and the first-switch detection loop only counts transitions between adjacent OBSERVED periods. Even with fit() validation in place, the helper is now safe to call directly.
- Add 2 regression tests: test_missing_baseline_period_raises_value_error and test_interior_gap_drops_group_with_warning.

P1: twowayfeweights validation refactor
- Extract _validate_and_aggregate_to_cells helper enforcing the dCDH validation contract (NaN treatment / NaN outcome / non-binary treatment all raise ValueError; within-cell varying treatment emits UserWarning before majority rounding).
- Both fit() and twowayfeweights() now call the helper. Single source of truth for the validation rules, no drift between the two public entry points.
- Add 4 regression tests for twowayfeweights() validation: test_twowayfeweights_rejects_nan_treatment, test_twowayfeweights_rejects_nan_outcome, test_twowayfeweights_rejects_non_binary_treatment, test_twowayfeweights_warns_on_within_cell_rounding.

P2: joiner/leaver metadata
- Fix n_joiner_cells = int(n_10_t_arr.sum()) (was count_nonzero, counting PERIODS, not cells). Same for n_leaver_cells.
- Compute n_joiner_obs and n_leaver_obs as actual observation counts (sum of n_gt over the joiner/leaver cells across periods), not as cell totals. For balanced one-obs-per-cell panels they equal n_*_cells; for individual-level inputs with multiple obs per cell they can be larger. Update results dataclass docstrings.

P2: parity tests run on JSON without R
- Decouple golden_values fixture from require_r_dcdh. Tests now run whenever the JSON file exists. R is only needed to regenerate the JSON via benchmarks/R/generate_dcdh_dynr_test_values.R.
- Verified by running DIFF_DIFF_R=skip pytest tests/test_chaisemartin_dhaultfoeuille_parity.py — all 5 parity scenarios PASS (was previously skipping entirely).

P3: summary() label rename
- Rename "Groups dropped before estimation:" to "Group filter / metadata counts:". Label never-switching as "(reported, not dropped)". Reflects the Round 2 change where never-switching groups participate in the variance via stable-control roles.

P3: CHANGELOG/ROADMAP consistency
- Remove the CHANGELOG bullet that claimed three paper review files were committed under docs/methodology/papers/. Replace with a reference to REGISTRY.md as the canonical methodology surface (matching the ROADMAP wording).

Tests: 103 dCDH passing (97 + 6 new). Worked example DID_M = 2.5 still exact. Pure-direction R parity tests still pass at 1e-4 / 5% rtol.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
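The ragged-panel contract above (missing first period raises, interior gaps drop with a warning, right-censoring is tolerated) can be sketched standalone. Hypothetical function and column names; this is not the library's fit() code.

```python
import warnings
import pandas as pd

def validate_panel(df, group_col="group", time_col="time"):
    periods = sorted(df[time_col].unique())
    first = periods[0]
    obs_by_group = {g: set(ts) for g, ts in df.groupby(group_col)[time_col]}

    missing_first = [g for g, obs in obs_by_group.items() if first not in obs]
    if missing_first:
        raise ValueError(f"groups missing first global period {first}: {missing_first}")

    keep = []
    for g, obs in obs_by_group.items():
        expected = [p for p in periods if p <= max(obs)]   # up to last observed period
        if any(p not in obs for p in expected):            # interior gap
            warnings.warn(f"dropping group {g}: interior period gap", UserWarning)
        else:                                              # balanced or right-censored
            keep.append(g)
    return df[df[group_col].isin(keep)]

panel = pd.DataFrame({
    "group": ["a"] * 3 + ["b"] * 2 + ["c"] * 2,
    "time":  [0, 1, 2,  0, 2,  0, 1],   # b has an interior gap; c is right-censored
})
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    out = validate_panel(panel)        # keeps groups a and c, drops b
```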
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
✅ Looks good
Static review only: I could not run the new dCDH suite in this environment because

Executive Summary
Methodology
Methodology cross-check: the PR’s Phase 1 framing is consistent with the cited source material and with the registry’s current-scope definition for
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
…trings)

P2: split joiners_leavers DataFrame into n_cells + n_obs columns
- to_dataframe(level="joiners_leavers") previously had a single n_obs column with mixed semantics by row (DID_M used switcher cell count; DID_+/DID_- used raw observation counts). Two columns with consistent units across all rows: n_cells (count of switching (g, t) cells) and n_obs (sum of n_gt over the same cells). The DID_M row uses the union of joiner + leaver cells. Updated test_to_dataframe_joiners_leavers to pin the new contract.

P3: stale docstrings on results object
- DCDHBootstrapResults class docstring now states explicitly that placebo bootstrap fields ALWAYS remain None in Phase 1 (the previous wording said they were "populated when available"). Per-field docstrings for placebo_se / placebo_ci / placebo_p_value now point back to the class-level note.
- n_groups_dropped_never_switching docstring now reflects the Round 2 full-IF fix: never-switching groups participate in the variance via stable-control roles and the field is reported for backwards compatibility only — no actual exclusion happens.
- n_groups_dropped_singleton_baseline docstring clarifies the variance-only filter scope (the cell DataFrame retains them as period-based stable controls).

P3: misleading R-script + prep_dgp comments
- benchmarks/R/generate_dcdh_dynr_test_values.R: clarified that the Python and R generators mirror each other STRUCTURALLY (same pattern logic, same FE/effect/noise model), not at the RNG level. R's set.seed and NumPy's default_rng use different RNGs. Parity tests load the R script's golden-value JSON so both sides operate on byte-identical input regardless of how it was originally generated.
- prep_dgp.py generate_reversible_did_data: clarified that the default single_switch pattern is A5-safe by construction (every group has at most one transition). Other patterns (random/cycles/marketing) ARE allowed to violate A5 and exist primarily as stress tests for the drop_larger_lower=True filter. The cohort-recentered variance formula is derived under A5, which is why drop_larger_lower defaults to True.

Tests: 103 dCDH passing (no new tests; the existing test_to_dataframe_joiners_leavers was strengthened to assert the new n_cells / n_obs contract).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
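The n_cells / n_obs distinction above is easy to demonstrate on a toy cell table. A hypothetical sketch (column names invented here): n_cells counts switching (g, t) cells, while n_obs sums the per-cell observation counts n_gt over those same cells.

```python
import pandas as pd

cells = pd.DataFrame({
    "group": ["g1", "g2", "g3"],
    "time":  [1, 1, 2],
    "role":  ["joiner", "joiner", "leaver"],
    "n_gt":  [10, 1, 4],          # uneven cell sizes from individual-level input
})

joiner_cells = cells[cells["role"] == "joiner"]
n_cells = len(joiner_cells)                  # 2 cells
n_obs = int(joiner_cells["n_gt"].sum())      # 11 underlying observations
```

On a balanced one-observation-per-cell panel the two columns coincide, which is why the mixed-semantics single column went unnoticed.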
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
Needs changes
Static review only:

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Addresses two new P1s + a P2 from CI re-review:
P1: degenerate-cohort SE collapses to 0.0
- _plugin_se now returns NaN when sum(U_centered**2) <= 0 (in addition
to the existing n==0 and divisor<=0 guards). This catches the
case where every variance-eligible group forms its own
(D_{g,1}, F_g, S_g) cohort, so cohort recentering produces an
identically-zero centered IF vector. Returning NaN rather than 0.0
prevents the silently-implies-infinite-precision failure mode.
- fit() now detects this case after the analytical SE call and emits
a UserWarning explaining that the variance is unidentified, the
point estimate is still valid, and the bootstrap path inherits the
same degeneracy on the same data.
- Worked example test now asserts overall_se is NaN with the warning,
alongside the existing DID_M = 2.5 exact assertion.
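The degenerate-cohort guard can be sketched as a small plug-in SE helper. Hypothetical function name and formula shape (sum of squared centered contributions, scaled by a divisor); the point is only the NaN-instead-of-0.0 contract when the centered vector is identically zero.

```python
import numpy as np

def plugin_se(u_centered, divisor):
    """Plug-in SE from centered per-group contributions; NaN when unidentified."""
    u = np.asarray(u_centered, dtype=float)
    if u.size == 0 or divisor <= 0:
        return float("nan")
    ssq = float(np.sum(u ** 2))
    if ssq <= 0.0:
        # Every variance-eligible group is its own cohort: recentering zeroed
        # the vector, so the variance is unidentified (not zero).
        return float("nan")
    return float(np.sqrt(ssq) / divisor)
```

With this guard, `plugin_se(np.zeros(4), 2.0)` is NaN rather than a 0.0 SE that would falsely imply infinite precision.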
P1: placebo A11 silent zeroing
- _compute_placebo now mirrors the main path's A11 distinction:
"no joiners" stays a natural zero (no flag), while "joiners but no
3-period stable_0 controls" sets the placebo period contribution
to zero AND appends a warning string to placebo_a11_warnings.
Symmetric for leavers/stable_1.
- _compute_placebo's return signature widened from
Optional[Tuple[float, bool]] to Optional[Tuple[float, bool, List[str]]]
to expose the A11 warnings to the caller.
- fit() unpacks the new tuple and surfaces a consolidated
"Placebo (DID_M^pl) Assumption 11 violations" warning when any
placebo period was zeroed by the A11 contract. Mirrors the main
DID path's warning format.
- New test_placebo_a11_violation_emits_warning constructs a panel
where placebo joiners exist but no placebo stable_0 controls do
(4-group T=3 with 2 joiners + 2 always-treated controls), asserts
the placebo A11 warning fires.
REGISTRY.md: two new Note blocks
- The degenerate-cohort SE contract (NaN + warning, deviation from R)
- The placebo A11 zero-retention contract (mirrors main DID path)
Tests: 105 dCDH passing (was 103, added 2 new). The existing
test_hand_calculable_4group_3period_joiners_and_leavers wraps fit()
in a warnings.catch_warnings(simplefilter("ignore")) block since the
degenerate-cohort warning now fires on this panel; the dedicated
test_worked_example_se_is_unidentified_with_warning asserts the
warning explicitly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
Static review only. I could not execute the new dCDH suite here because

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Documents the dCDH Phase 1 ragged-panel handling contract that was implemented in Round 3 but not surfaced in any user-facing doc:
- Balanced baseline required: groups missing the first global period raise ValueError with offending IDs
- Interior period gaps: groups dropped with UserWarning
- Terminal missingness (early exit / right-censoring): retained, contributes from observed periods only via the per-period present guard at three sites in the variance computation

This is a Phase 1 limitation relative to R DIDmultiplegtDYN, which supports unbalanced panels with documented missing-treatment-before-first-switch handling. Added a new **Note (deviation from R)** block in REGISTRY.md, mirrored in the API rst, README, and CHANGELOG so users do not expect R-like behavior on late-entry or interior-gap panels.

Also fixes doc drift on never-switching groups: after the Round 2 full-IF fix they participate in the variance via stable-control roles and are no longer filtered, but the CHANGELOG/README/REGISTRY checklist and a methodology test comment still described them as filtered. The n_groups_dropped_never_switching field stays as backwards-compat metadata only (already documented correctly in the dataclass docstring since Round 2).

Adds test_terminal_missingness_retained as a regression test pinning the contract: a group with periods [0, 1, 2] in a 5-period panel is retained in results.groups, the fit completes without error, and the per-period DIDs are populated.

Files modified:
- docs/methodology/REGISTRY.md (assumption check + new deviation Note + checklist)
- docs/api/chaisemartin_dhaultfoeuille.rst (Phase 1 panel requirements section)
- README.md (Note block + n_groups_dropped_never_switching field row split)
- CHANGELOG.md (panel requirement sub-bullet + line 17 never-switching fix)
- tests/test_chaisemartin_dhaultfoeuille.py (test_terminal_missingness_retained)
- tests/test_methodology_chaisemartin_dhaultfoeuille.py (fix test comment)

Test count: 105 -> 106 (one new regression test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
Static review only. I could not execute the added dCDH tests in this runner because

Executive Summary
Methodology
The previous ragged-panel methodology blocker appears resolved; I do not see a remaining finding there. docs/methodology/REGISTRY.md:L477-L477, docs/api/chaisemartin_dhaultfoeuille.rst:L52-L72
Code Quality
No additional findings beyond the public

Performance
No material findings in the changed paths I reviewed.

Maintainability
Tech Debt
Security No material security findings in the changed files reviewed. Documentation/Tests
Path to Approval
…nguage

Resolves three findings from the prior CI review:

P1 — cluster parameter was a public no-op. The constructor exposed `cluster: Optional[str] = None` and stored it on `self.cluster`, but neither `fit()` nor `_compute_dcdh_bootstrap()` ever read it, so `cluster="state"` silently produced the same group-level inference as `cluster=None`. The Phase 1 contract is now: dCDH always clusters at the group level (via the cohort-recentered influence function and the multiplier bootstrap), and any non-None cluster value raises `NotImplementedError` at construction time. The same gate fires from `set_params`. Added `test_cluster_parameter_raises_not_implemented` covering all four entry points (`__init__`, `set_params`, the `cluster=None` happy path, and the convenience function).

P3 — TWFE diagnostic was running on the post-Step-5a sample but the inline comment claimed "FULL pre-filter cell dataset", and the standalone `twowayfeweights()` function used the truly pre-filter cell. On ragged panels with interior gaps, the two paths diverged. Fixed by reordering: the TWFE diagnostic now runs as Step 5a (was 5b) BEFORE the ragged-panel validation, which is now Step 5b (was 5a). The blocks are independent — the diagnostic just reads `cell` while the ragged-panel block mutates it. Both API surfaces now operate on the same `_validate_and_aggregate_to_cells()` output.

P3 — API rst step 3 said "Filters singleton-baseline groups", which read as a point-estimate sample drop. After Round 3, singleton-baseline groups are excluded from the variance computation only (they remain in the point-estimate sample as period-based stable controls). Fixed the language to match.

Documentation:
- REGISTRY.md gets a new `**Note (Phase 1 cluster contract):**` block in the dCDH section explaining the always-group-level semantics and the construction-time NotImplementedError gate.
- API rst step 3 reflects the variance-only singleton-baseline scope.
- The dCDH `cluster` parameter docstring now describes the Phase 1 contract instead of "Currently unused — analytical SEs are always at the group level via the cohort-recentered plug-in."

Test counts: 106 -> 107 (one new cluster gate regression test).

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille.py (cluster gate in __init__ and set_params, docstring update, TWFE diagnostic block reorder with renumbered Step 5a/5b labels)
- docs/methodology/REGISTRY.md (cluster contract Note + rename Step 5a -> Step 5b in the ragged-panel deviation Note)
- docs/api/chaisemartin_dhaultfoeuille.rst (singleton-baseline language fix in step 3 of the overview)
- tests/test_chaisemartin_dhaultfoeuille.py (test_cluster_parameter_raises_not_implemented in TestForwardCompatGates; rename Step 5a -> Step 5b in three ragged-panel test docstrings)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
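The construction-time gate pattern described above can be sketched generically. Hypothetical class and message text; the point is the shared validator called from both `__init__` and `set_params` so the invariant cannot be bypassed after construction.

```python
from typing import Optional

class Estimator:
    def __init__(self, cluster: Optional[str] = None):
        self._validate_cluster(cluster)
        self.cluster = cluster

    @staticmethod
    def _validate_cluster(cluster):
        if cluster is not None:
            raise NotImplementedError(
                "Phase 1 always clusters at the group level; "
                "custom cluster columns are deferred to a later phase."
            )

    def set_params(self, **params):
        if "cluster" in params:
            self._validate_cluster(params["cluster"])   # same gate as __init__
        for key, value in params.items():
            setattr(self, key, value)
        return self
```

Failing loudly at construction time replaces the silent no-op: `Estimator(cluster="state")` and `Estimator().set_params(cluster="state")` both raise.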
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
✅ Looks good
This re-review clears the prior P1 blocker:

Executive Summary
Methodology
Affected methods: Phase 1 dCDH
No unmitigated P0/P1 findings. The changed implementation matches the registry on the core Theorem 3 / Theorem 4 formulas, the Section 3.7.3 cohort-recentered variance, and the explicitly documented deviations from R, so I am not treating those documented differences as defects (docs/methodology/REGISTRY.md:L480, docs/methodology/REGISTRY.md:L517, docs/methodology/REGISTRY.md:L565, docs/methodology/REGISTRY.md:L569, diff_diff/chaisemartin_dhaultfoeuille.py:L785, diff_diff/chaisemartin_dhaultfoeuille.py:L1506).

Code Quality
No material findings. The changed dCDH inference path uses

Performance
No material findings from static review. I did not benchmark in this runner.

Maintainability
Tech Debt
Security
No security findings in the changed files reviewed.

Documentation/Tests
Static only here: I could not run the added tests because
Two P3 doc-drift cleanups from the prior CI review.

P3 — Bootstrap helper docstring drift. The dCDH bootstrap mixin documented the re-aggregation divisor as a "count of contributing groups" with the formula `W @ u_centered / n_groups_target`, but the live code (and the registry) use switching-cell denominators (`N_S`, joiner cells, leaver cells) per Theorem 3 of AER 2020. The numbers were correct; only the docstring drifted. Renamed the public mixin parameter from `n_groups_overall` to `divisor_overall` and updated the docstring to describe the divisor as the switching-cell count from the Theorem 3 weighting formula. The joiners/leavers tuple description in the docstring also says `(u_centered, divisor, original_effect)` instead of `(u_centered, n_groups, original_effect)` to match. The fit() call site at line 1041 passes `divisor_overall=N_S` under the new name.

P3 — `docs/llms-full.txt` line 238 said the dCDH `cluster` parameter was "Reserved for future use," which is stale after Round 7's NotImplementedError gate. Updated to "Phase 1: must be None; non-None raises NotImplementedError" to match REGISTRY.md and the constructor contract.

No behavior change; both fixes are purely textual / parameter naming. 107 tests still pass; black, ruff clean.

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py (rename n_groups_overall -> divisor_overall, expanded docstring describing the cell-count divisor and the formula `W @ u_centered / divisor`)
- diff_diff/chaisemartin_dhaultfoeuille.py (call site at line 1041 updates the keyword argument to match)
- docs/llms-full.txt (cluster comment matches the Phase 1 contract)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
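The `W @ u_centered / divisor` re-aggregation can be sketched for a single multiplier-bootstrap loop. Everything here is a hypothetical toy (values, seed, draw count): Rademacher weights perturb the centered per-group contributions, and each draw is re-scaled by the same switching-cell divisor used for the point estimate rather than by a group count.

```python
import numpy as np

rng = np.random.default_rng(0)
u_centered = np.array([1.2, -0.4, 0.7, -1.5])   # centered per-group contributions
divisor = 3.0                                    # switching-cell count (illustrative)

draws = []
for _ in range(999):
    w = rng.choice([-1.0, 1.0], size=u_centered.size)   # Rademacher weights
    draws.append(w @ u_centered / divisor)               # W @ u_centered / divisor
boot_se = float(np.std(draws, ddof=1))
```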
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment

Executive Summary
Methodology
Affected methodology: the dCDH
Beyond that sample-contract issue, I did not find another unmitigated methodology defect in the changed

Code Quality
No findings. I did not see a new inline-inference or partial-NaN regression; the changed estimator path routes inference through diff_diff/utils.py:L152.

Performance
No separate findings from static review. The blocker above also does unnecessary TWFE work on rows later dropped, but the correctness/contract problem is the part that needs action.

Maintainability
No separate findings.

Tech Debt
No separate deferrable finding. The P1 above is not tracked under the

Security
No findings.

Documentation/Tests
Path to Approval
Documents the dCDH TWFE diagnostic sample contract that Round 7's
swap left implicit. The fitted results.twfe_* values are computed on
the FULL pre-filter cell sample (matching the standalone
twowayfeweights() function), NOT on the post-filter estimation
sample used by overall_att / results.groups / inference fields. The
existing user-facing wording said "TWFE on the same data" /
"diagnostic from the same fit" — phrases that naturally read as
"same data as overall_att" — which contradicted the post-Round-7
behavior. This commit:
1. Adds a new `**Note (TWFE diagnostic sample contract):**` block in
REGISTRY.md enumerating all three sample-shaping filters
(interior-gap, multi-switch, singleton-baseline) and explicitly
carving singleton-baseline as variance-only (no fitted-vs-overall_att
mismatch, so no warning).
2. Rewrites the `twfe_diagnostic` parameter docstring in
chaisemartin_dhaultfoeuille.py to describe the pre-filter contract
and the divergence warning.
3. Rewrites the twfe_weights / twfe_fraction_negative / twfe_sigma_fe
/ twfe_beta_fe field docstrings in the results dataclass to clarify
they describe the FULL pre-filter cell sample, with a pointer to
the REGISTRY contract Note.
4. Adds a `UserWarning` from `fit()` whenever the user requested the
TWFE diagnostic AND any of the interior-gap or multi-switch filters
dropped groups. The warning explains the divergence with explicit
counts and points at REGISTRY for the rationale. The warning fires
regardless of whether the diagnostic itself succeeded or hit the
rank-deficient fallback (the plan-review correctly flagged that the
`twfe_diagnostic_payload is not None` guard would swallow the rare
rank-deficient + filtered-panel intersection — dropped that guard).
5. Updates docs/api/chaisemartin_dhaultfoeuille.rst and
docs/choosing_estimator.rst to replace "from the same fit" with
"computed on the data you pass in (pre-filter)".
6. Adds three regression tests in TestTwowayFeweightsHelper:
- test_twfe_pre_filter_contract_with_interior_gap_drop: panel with
a dropped interior-gap group, asserts fitted twfe_* matches
standalone, estimation sample is smaller, and the divergence
warning fires with the expected counts.
- test_twfe_pre_filter_contract_with_multi_switch_drop: panel with
an injected multi-switch crosser, similar assertions.
- test_twfe_no_divergence_warning_on_clean_panel: negative test
asserting NO divergence warning fires on a clean panel
(hard-codes pattern="single_switch" to close a future footgun).
7. Fixes the stale "Step 5a guarantees..." comment at line 712 to
"Step 5b guarantees..." (post-Round-7 the ragged-panel validation
is Step 5b, not Step 5a). Independent cleanup; bundled because
it's in the same file and the same topic.
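The divergence warning in point 4 reduces to a small conditional helper. A hypothetical sketch (function name, message wording, and counts invented here): fire whenever the diagnostic was requested AND any sample-shaping filter dropped groups, independent of whether the diagnostic itself succeeded.

```python
import warnings

def maybe_warn_twfe_divergence(twfe_requested, n_dropped_interior_gap, n_dropped_multi_switch):
    """Warn when the pre-filter TWFE diagnostic sample differs from the estimation sample."""
    n_dropped = n_dropped_interior_gap + n_dropped_multi_switch
    if twfe_requested and n_dropped > 0:
        warnings.warn(
            f"TWFE diagnostic uses the FULL pre-filter sample, but {n_dropped} group(s) "
            f"were dropped from the estimation sample "
            f"({n_dropped_interior_gap} interior-gap, {n_dropped_multi_switch} multi-switch).",
            UserWarning,
        )
```

Note there is no guard on the diagnostic payload itself, matching the review point that such a guard would swallow the rare rank-deficient plus filtered-panel intersection.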
This resolution preserves Round 7's standalone-vs-fitted parity
(both APIs use the pre-filter cell sample) and addresses Round 9's
P1 about the documentation contract. Both reviewers' concerns are
now satisfied: the standalone and fitted produce identical numbers
on the same input, AND users see an explicit warning when filters
make the fitted sample diverge from the dCDH estimation sample.
Test counts: 107 -> 110 (three new sample-contract regression
tests). Black, ruff clean.
Files modified:
- docs/methodology/REGISTRY.md
(new TWFE sample contract Note enumerating all three filters)
- diff_diff/chaisemartin_dhaultfoeuille.py
(twfe_diagnostic param docstring, n_groups_dropped_interior_gap
tracking, divergence warning at Step 6b, stale comment fix)
- diff_diff/chaisemartin_dhaultfoeuille_results.py
(twfe_weights / twfe_fraction_negative / twfe_sigma_fe /
twfe_beta_fe field docstrings)
- docs/api/chaisemartin_dhaultfoeuille.rst (wording fix)
- docs/choosing_estimator.rst (wording fix)
- tests/test_chaisemartin_dhaultfoeuille.py (3 new tests + 1
parity test comment update)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
Needs changes. One unmitigated P1 finding.

Executive Summary
Methodology
Code Quality
No unmitigated findings beyond the bootstrap inference-surface issue above.

Performance
No separate performance findings from static review.

Maintainability
No separate maintainability findings from the changed code.

Tech Debt
No separate deferrable finding. The bootstrap inference-surface mismatch is not tracked in

Security
No findings.

Documentation/Tests
Path to Approval
Fixes a P1 where the dCDH bootstrap branch silently replaced the multiplier-bootstrap percentile p-values and CIs with normal-theory recomputations from the bootstrap SE. The bootstrap helper computes the per-target SE / CI / p-value triples correctly on the DCDHBootstrapResults object, but fit() was only copying the SE and then calling safe_inference() which returns normal-theory p-values and CIs. The user requested multiplier bootstrap inference and got a hybrid (bootstrap SE + normal-theory p/CI) — the public surface fields, event_study_effects[1], summary(), to_dataframe(), is_significant, and significance_stars all reflected the normal-theory recomputations instead of the bootstrap inference. The fix: in the fit() bootstrap branch, propagate br.overall_p_value and br.overall_ci directly to the top-level overall_p / overall_ci (and joiners/leavers analogues). Keep the t-stat from safe_inference()[0] since percentile bootstrap doesn't define an alternative t-stat semantic, and that satisfies the project anti-pattern rule (never compute t = effect/se inline). Library precedent: imputation.py:790-805, two_stage.py:778-787, and efficient_did.py:1009-1013 all propagate bootstrap p/CI to the public surface while keeping a SE-derived t-stat. dCDH was the only modern bootstrap-enabled estimator that didn't follow this pattern. Documentation updates: - New `**Note (bootstrap inference surface):**` block in REGISTRY.md documenting the propagation contract, the rationale for the SE-based t-stat, and the placebo carve-out (placebo bootstrap remains deferred to Phase 2). - Inference-method switch paragraph added to the ChaisemartinDHaultfoeuilleResults class docstring `Notes` section. - README.md row updated to clarify "cohort-recentered analytical SE by default; multiplier-bootstrap percentile inference when n_bootstrap > 0". 
- API .rst: the multiplier-bootstrap example now shows that results.overall_p_value and results.overall_conf_int reflect the bootstrap inference (not just bootstrap_results.overall_*).

Regression test added: test_bootstrap_p_value_and_ci_propagated_to_top_level asserts results.overall_p_value == bootstrap_results.overall_p_value, overall_conf_int == bootstrap_results.overall_ci, the joiners/leavers analogues, that event_study_effects[1] reflects the bootstrap, that the t-stat is the SE-derived value, that summary() doesn't crash, and that to_dataframe() returns the bootstrap-derived numbers. This pins the contract so the silent inference-method swap can't regress.

Test counts: 110 -> 111 (one new bootstrap propagation regression test). The existing TestBootstrap tests at lines 855-955 only asserted that bootstrap_results was populated and the SE was finite, which is why the silent swap passed CI for nine commits.

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille.py (fit() bootstrap branch propagates p/CI from br instead of recomputing via safe_inference)
- diff_diff/chaisemartin_dhaultfoeuille_results.py (class docstring `Notes` section gains the inference-method switch paragraph)
- docs/methodology/REGISTRY.md (new bootstrap inference surface Note)
- README.md (field-description row clarification)
- docs/api/chaisemartin_dhaultfoeuille.rst (multiplier-bootstrap example reflects the new propagation)
- tests/test_chaisemartin_dhaultfoeuille.py (test_bootstrap_p_value_and_ci_propagated_to_top_level)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
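The propagation contract described above can be sketched as follows. This is a minimal, self-contained illustration, not the diff_diff implementation: `BootRes` and `fit_surface` are hypothetical names standing in for the library's bootstrap-results object and the fit() bootstrap branch; the key point is that the p-value and CI come from the bootstrap distribution while only the t-stat is SE-derived.

```python
# Sketch of the bootstrap inference-surface contract (illustrative names,
# not the diff_diff API): percentile p/CI are propagated from the bootstrap
# object; only the t-stat is computed from the bootstrap SE.
from dataclasses import dataclass

import numpy as np


@dataclass
class BootRes:
    draws: np.ndarray  # bootstrap draws of the estimate
    estimate: float    # point estimate

    @property
    def se(self) -> float:
        return float(np.std(self.draws, ddof=1))

    @property
    def ci(self) -> tuple:
        # percentile CI straight from the bootstrap distribution
        lo, hi = np.percentile(self.draws, [2.5, 97.5])
        return (float(lo), float(hi))

    @property
    def p_value(self) -> float:
        # symmetric percentile p-value: share of centered draws at least
        # as extreme as the observed estimate
        centered = self.draws - self.draws.mean()
        return float(np.mean(np.abs(centered) >= abs(self.estimate)))


def fit_surface(br: BootRes) -> dict:
    # The fix: copy p-value and CI from the bootstrap object instead of
    # recomputing them from the SE under normality; keep the t-stat
    # SE-derived, since percentile bootstrap defines no alternative t-stat.
    return {
        "overall_se": br.se,
        "overall_t": br.estimate / br.se,  # SE-derived t-stat
        "overall_p": br.p_value,           # bootstrap percentile p
        "overall_ci": br.ci,               # bootstrap percentile CI
    }


rng = np.random.default_rng(0)
br = BootRes(draws=rng.normal(2.0, 0.5, size=999), estimate=2.0)
surface = fit_surface(br)
```

The anti-pattern this replaces would instead call a normal-theory helper with `br.se` and overwrite `overall_p` / `overall_ci` with `2 * (1 - Phi(|t|))`-style recomputations, producing the hybrid inference surface described in the commit message.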
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good. No unmitigated P0/P1 findings remain in the changed dCDH code. Static review only; I could not run the tests because this runner is missing
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
P2 fix: summary() was unconditionally printing "dCDH analytical CI is conservative under Assumption 8" even when n_bootstrap > 0 and the displayed p_value / conf_int were bootstrap-based. The footer now branches on bootstrap_results: when bootstrap is active, it prints "p-value and CI are multiplier-bootstrap percentile inference (N iterations, weight_type weights)"; when analytical, it keeps the original conservative-CI note.

Added test assertions pinning both paths:
- test_summary_formats_without_error (analytical mode): asserts "analytical CI is conservative" is present and "multiplier-bootstrap" is absent
- test_bootstrap_p_value_and_ci_propagated_to_top_level (bootstrap mode): asserts "multiplier-bootstrap percentile inference" is present and "analytical CI is conservative" is absent

111 tests still pass; black and ruff are clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
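The footer branch described above amounts to a small conditional; a sketch under assumed names (`render_footer` and its parameters are illustrative, not the actual summary() internals):

```python
# Illustrative sketch of the summary() footer branch: the footer text
# follows the inference method actually used for the displayed p/CI.
# render_footer and its parameters are hypothetical names.
def render_footer(bootstrap_results, n_bootstrap=0, weight_type="rademacher"):
    if bootstrap_results is not None:
        # bootstrap mode: the displayed p-value / CI are percentile-based
        return (
            f"p-value and CI are multiplier-bootstrap percentile "
            f"inference ({n_bootstrap} iterations, {weight_type} weights)"
        )
    # analytical mode: keep the original conservative-CI note
    return "dCDH analytical CI is conservative under Assumption 8"
```

The two test assertions above pin exactly this branch: each mode's footer string must be present and the other mode's must be absent.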
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Highest unmitigated severity is
Executive Summary
Methodology
Previous methodology findings from the last review cycle appear addressed: the bootstrap inference surface and
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Summary
`ChaisemartinDHaultfoeuille` (alias `DCDH`) — Phase 1 of the de Chaisemartin-D'Haultfœuille estimator family. The only modern staggered DiD estimator in the library that handles non-absorbing (reversible) treatments — treatment can switch on AND off over time (marketing campaigns, seasonal promotions, on/off policy cycles, binary fuzzy designs). Implements `DID_M` from de Chaisemartin & D'Haultfœuille (2020) AER, equivalently `DID_1` (horizon `l = 1`) of the dynamic companion paper (NBER WP 29873). Phase 2 (`aggregate` / `L_max`) and Phase 3 (covariate adjustment via `controls`, group-specific linear trends, HonestDiD) parameters are present from day one with `NotImplementedError` gates pointing at the relevant phase. Ships joiners-only (`DID_+`) and leavers-only (`DID_-`) decompositions, a single-lag placebo (`DID_M^pl`) point estimate, optional multiplier bootstrap (Rademacher / Mammen / Webb), the TWFE decomposition diagnostic from Theorem 1 of AER 2020 (per-cell weights, fraction negative, `sigma_fe`, `beta_fe`), plus a standalone `twowayfeweights()` helper. Also adds a `generate_reversible_did_data()` generator with 7 patterns (`single_switch` default, `joiners_only`, `leavers_only`, `mixed_single_switch`, `random`, `cycles`, `marketing`).

Methodology references
- `DID_M` estimator (= `DID_1` at horizon `l = 1` of the dynamic paper)
- `docs/methodology/papers/chaisemartin-dhaultfoeuille-{2020-review,2020-appendix-review,dynamic-review}.md` (committed in earlier work; the dynamic-paper review is the source for the cohort-recentered analytical variance from Web Appendix Section 3.7.3)
- R `DIDmultiplegtDYN` v2.3.3 (CRAN, maintained by the paper authors)

Intentional deviations from R `DIDmultiplegtDYN` (documented as `**Note (deviation from R):**` in REGISTRY.md)

- Stable-control set definition. Python uses period-based stable definitions (`stable_0(t)` = any cell with `D_{g,t-1} = D_{g,t} = 0`, regardless of baseline `D_{g,1}`), matching the AER 2020 Theorem 3 cell-count notation `N_{0,0,t}` and `N_{1,1,t}` literally. R uses cohort-based definitions that additionally require `D_{g,1}` to match the side, matching the dynamic companion paper's cohort `(D_{g,1}, F_g, S_g)` framework. The two definitions agree exactly under any of: (a) the panel is pure-direction (joiners-only or leavers-only), (b) no joiner's post-switch state ever overlaps a period when leavers are switching (and vice versa), or (c) the worked example. They diverge by O(1%) on mixed-direction panels with both joiners and leavers when some joiners' post-switch cells could serve as leavers' controls (or vice versa). Parity tests use 1e-4 tolerance for pure-direction scenarios and 2.5% tolerance for mixed-direction scenarios, with the deviation explicitly documented in `docs/methodology/REGISTRY.md`.
- Phase 1 placebo SE. Intentionally `NaN` with a `UserWarning`. The dynamic companion paper Section 3.7.3 derives the cohort-recentered analytical variance for `DID_l` only — not for the placebo `DID_M^pl`. Phase 2 will add multiplier-bootstrap support for the placebo via the dynamic paper's machinery; until then, `placebo_se`, `placebo_t_stat`, `placebo_p_value`, and `placebo_conf_int` remain `NaN` even when `n_bootstrap > 0`. The placebo point estimate is still computed and exposed via `results.placebo_effect`. A regression test (`test_placebo_bootstrap_unavailable_in_phase_1`) pins this contract.
- Assumption 11 zero-retention convention. When some period `t` has joiners but no `stable_0` controls (or leavers but no `stable_1` controls), Python sets that period's `DID_{+,t}` (or `DID_{-,t}`) to zero but retains the switcher count in the `N_S` denominator, matching the AER 2020 Theorem 3 worked-example arithmetic. R drops the cohort entirely. Per-period flags (`did_plus_t_a11_zeroed`, `did_minus_t_a11_zeroed`) and a consolidated `UserWarning` make this visible to users. The R parity scenarios add never-treated and always-treated control cohorts so neither implementation triggers the convention, eliminating this as a source of disagreement in the parity tests.

Documented matches with R (worth noting because they're substantive choices)
- `drop_larger_lower=True` default — drops multi-switch groups before estimation (matches R; required for the analytical variance to be consistent with the point estimate)
- `D_{g,1}` is unique (`**Note:**` in REGISTRY.md)

No silent failures
Every drop / round / fallback emits an explicit `warnings.warn()` or `ValueError`. Five distinct warnings/errors fire on the relevant code paths: NaN treatment validation, NaN outcome validation, placebo not-computable, TWFE diagnostic `denom == 0`, and the never-switching-groups filter from variance, plus the multi-switch and singleton-baseline filters and the consolidated A11 warning.

Validation
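As background for the parity scenarios inventoried below, the period-based stable-control definition from the deviations note (`stable_0(t)` = any cell with `D_{g,t-1} = D_{g,t} = 0`) can be sketched as follows; `stable_cells` and the toy panel are illustrative, not part of diff_diff:

```python
# Hedged sketch of the period-based stable-control definition from the
# deviation note: stable_0(t) is any (g, t) cell with
# D_{g,t-1} = D_{g,t} = 0, regardless of the baseline D_{g,1}.
# stable_cells() is an illustrative helper, not the library API.
def stable_cells(treatment, t, side):
    """Groups whose treatment equals `side` in both t-1 and t (0-indexed)."""
    return {g for g, d in treatment.items() if d[t - 1] == side and d[t] == side}


# Toy panel: A is a joiner (switches 0 -> 1 at t=2), B is never-treated,
# C is a leaver (switches 1 -> 0 at t=3). Under the period-based definition,
# A's pre-switch cells still count as stable_0 controls even though A is
# not never-treated — this is where the cohort-based R definition can differ.
panel = {"A": [0, 0, 1, 1], "B": [0, 0, 0, 0], "C": [1, 1, 1, 0]}
```

On this panel, `stable_cells(panel, 1, 0)` includes the not-yet-switched joiner A alongside the never-treated B, while `stable_cells(panel, 2, 0)` contains only B.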
Test files added
- `tests/test_chaisemartin_dhaultfoeuille.py` — 44 tests covering basic API, forward-compat gates (one per Phase 2/3 parameter), the `drop_larger_lower` filter, A11 zero-retention, NaN handling, bootstrap (Rademacher / Mammen / Webb), the results dataclass, and the new `test_placebo_bootstrap_unavailable_in_phase_1` regression test
- `tests/test_methodology_chaisemartin_dhaultfoeuille.py` — 8 fast + 2 slow Tier 2 methodology tests including the hand-calculable 4-group worked example (asserts `DID_M = 2.5`, `DID_+ = 2.0`, `DID_- = 3.0` exactly), `test_cohort_recentering_not_grand_mean` (the load-bearing variance correctness check that catches the "Add initial diff-diff library implementation #1" implementation bug), TWFE diagnostic correctness, and large-N recovery (slow)
- `tests/test_chaisemartin_dhaultfoeuille_parity.py` — 5 R `DIDmultiplegtDYN` parity scenarios at horizon `l = 1`: `single_switch_mixed`, `joiners_only`, `leavers_only`, `mixed_single_switch`, and the hand-calculable worked example
- `tests/test_prep_dgp_reversible.py` — 38 tests for the new `generate_reversible_did_data()` generator (shape, patterns, reproducibility, true-effect column, switcher-type column, validation errors)

Test files modified
- `tests/conftest.py` — added `_check_r_dcdh_available()`, an `r_dcdh_available` session fixture, and a `require_r_dcdh` per-test fixture parallel to the existing `r_available` / `require_r` pattern, with a `DIFF_DIFF_R=skip` env-var override

R parity reference
- `benchmarks/R/generate_dcdh_dynr_test_values.R` — mirror generator + 5 scenarios; generates golden values via `did_multiplegt_dyn(..., effects=1)` from R `DIDmultiplegtDYN` v2.3.3
- `benchmarks/data/dcdh_dynr_golden_values.json` — committed golden values (~73 KB)
- `polars` r-universe backend that `DIDmultiplegtDYN >= 2.x` requires
- `DIDmultiplegtDYN` is unavailable, so the test suite still runs on dev machines without R installed

Documentation surface
Per `docs/doc-deps.yaml` for the new `chaisemartin_dhaultfoeuille` group:

- `docs/methodology/REGISTRY.md` — full `## ChaisemartinDHaultfoeuille` section with assumption checks, estimator equations, the four `**Note:**` blocks (conservative CI, `drop_larger_lower`, A11 zero-retention, singleton-baseline filter), the `**Note (deviation from R DIDmultiplegtDYN):**` block, edge cases, requirements checklist
- `docs/api/chaisemartin_dhaultfoeuille.rst` — autoclass directives + autofunction for `chaisemartin_dhaultfoeuille()` and `twowayfeweights()`, with usage examples
- `docs/api/index.rst` — toctree entry; `docs/api/prep.rst` — `generate_reversible_did_data` autofunction
- `docs/choosing_estimator.rst` — Quick Reference row, decision-flowchart question, Survey Compatibility Matrix row (`--` in all four columns), Detailed Guidance subsection
- `docs/practitioner_decision_tree.rst` — new "Reversible Treatment (On/Off Cycles)" section with business-language framing
- `docs/llms.txt` and `docs/llms-full.txt` — bullet + ~100-line section
- `README.md` — features bullet, alias table entry, full usage section
- `CHANGELOG.md` — `### Added` entry under `## [Unreleased]`
- `ROADMAP.md` — Phase 1 items (1a–1i) flipped from "Not started" to "Shipped"; Phase 2 / Phase 3 retained

Test count
96 dCDH tests passing locally (44 API + 8 methodology fast + 2 methodology slow [deselected by default] + 5 R parity + 38 generator – overlap = 95 deselectable + 1 placebo regression = 96 total). The hand-calculable 4-group worked example produces `DID_M = 2.5`, exactly matching R `DIDmultiplegtDYN`.

No tutorial in this PR
Per the multi-phase plan, a single comprehensive dCDH tutorial ships in Phase 3 after covariate support is added. Phase 1 deliberately ships no tutorial.
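For intuition on the worked-example numbers above, `DID_M` aggregates the joiners-only and leavers-only components weighted by switcher counts, as in AER 2020. The sketch below is illustrative arithmetic, not the library implementation; the equal switcher counts are an assumption chosen to reproduce `DID_M = 2.5` from `DID_+ = 2.0` and `DID_- = 3.0`, not a fact taken from the test data.

```python
# Sketch of the DID_M aggregation: a switcher-count-weighted average of the
# joiners-only (DID_+) and leavers-only (DID_-) components. did_m() and the
# equal counts below are illustrative assumptions, not the diff_diff code.
def did_m(did_plus, did_minus, n_plus, n_minus):
    n_s = n_plus + n_minus  # total switcher count N_S
    return (n_plus / n_s) * did_plus + (n_minus / n_s) * did_minus


# With DID_+ = 2.0, DID_- = 3.0 and equal joiner/leaver counts,
# the weighted average lands exactly between the two components.
headline = did_m(2.0, 3.0, 1, 1)
```

Unequal counts tilt the headline toward the larger side, e.g. `did_m(2.0, 3.0, 3, 1)` weights the joiners component three-to-one.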
Security / privacy