Add ChaisemartinDHaultfoeuille (dCDH) DID_M estimator (Phase 1) #290
Phase 1 of the de Chaisemartin-D'Haultfoeuille estimator family — the only modern staggered DiD estimator in the library that handles non-absorbing (reversible) treatments. Treatment can switch on AND off over time (marketing campaigns, seasonal promotions, on/off policy cycles). Implements DID_M from de Chaisemartin & D'Haultfoeuille (2020) AER, equivalently DID_1 (horizon l=1) of the dynamic companion paper (NBER WP 29873).

Ships:
- Headline DID_M point estimate with cohort-recentered analytical SE from Web Appendix Section 3.7.3 of the dynamic companion paper
- Joiners-only (DID_+) and leavers-only (DID_-) decompositions
- Single-lag placebo DID_M^pl (Theorem 4 of AER 2020); analytical placebo SE deferred to Phase 2 (NaN with explicit warning)
- Optional multiplier bootstrap clustered at the group level (Rademacher / Mammen / Webb)
- TWFE decomposition diagnostic from Theorem 1 of AER 2020 (per-cell weights, fraction negative, sigma_fe, beta_fe), exposed both as an estimator field and as the standalone twowayfeweights() helper
- Multi-switch group filtering (drop_larger_lower=True default, matches R DIDmultiplegtDYN); singleton-baseline filter (footnote 15 of the dynamic paper); never-switching-groups filter from the variance — all with explicit warnings (no silent failures)
- Forward-compatible fit() signature: Phase 2 (multi-horizon event study, aggregate, L_max) and Phase 3 (covariate adjustment via controls, group-specific linear trends, HonestDiD) parameters raise NotImplementedError with phase pointers
- generate_reversible_did_data() generator with 7 patterns (single_switch, joiners_only, leavers_only, mixed_single_switch, random, cycles, marketing)
- Validated against R DIDmultiplegtDYN v2.3.3 at horizon l=1 via tests/test_chaisemartin_dhaultfoeuille_parity.py (5 scenarios)

Methodology notes documented in REGISTRY.md:
- Conservative CI under Assumption 8 (independent groups), exact only under iid sampling
- A11 zero-retention convention (paper-faithful, with consolidated warning)
- drop_larger_lower default and inconsistency warning when False
- Singleton-baseline filter (footnote 15)
- Phase 1 placebo SE intentionally NaN (deferred to Phase 2)
- Note (deviation from R DIDmultiplegtDYN): Python uses period-based stable-control sets matching AER 2020 Theorem 3 literally; R uses cohort-based controls from the dynamic paper. The two agree exactly on pure-direction panels and the worked example; they differ by ~1% on mixed-direction panels with both joiners and leavers.

Tests: 95 passing locally (44 API, 8 fast methodology + 2 slow Tier 2, 38 generator, 5 R parity). The hand-calculable 4-group worked example produces DID_M = 2.5, exactly matching R.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
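The headline estimand can be made concrete with a hand-built toy panel. The sketch below is a hypothetical illustration of the DID_M idea (joiner and leaver first-difference comparisons, weighted by switching-cell counts), not the library's implementation and not the PR's own 4-group worked example; the panel, column names, and numbers are all made up here.

```python
import pandas as pd

# Toy 4-group, 2-period panel: g1 joins (0 -> 1), g3 leaves (1 -> 0),
# g2 stays untreated, g4 stays treated.
panel = pd.DataFrame({
    "group": ["g1", "g1", "g2", "g2", "g3", "g3", "g4", "g4"],
    "time":  [0, 1] * 4,
    "d":     [0, 1,  0, 0,  1, 0,  1, 1],
    "y":     [1.0, 6.0, 2.0, 3.0, 4.0, 6.0, 1.0, 5.0],
})

wide_y = panel.pivot(index="group", columns="time", values="y")
wide_d = panel.pivot(index="group", columns="time", values="d")
dy = wide_y[1] - wide_y[0]                      # first differences per group

joiners  = (wide_d[0] == 0) & (wide_d[1] == 1)
leavers  = (wide_d[0] == 1) & (wide_d[1] == 0)
stable_0 = (wide_d[0] == 0) & (wide_d[1] == 0)
stable_1 = (wide_d[0] == 1) & (wide_d[1] == 1)

did_plus  = dy[joiners].mean() - dy[stable_0].mean()   # joiners vs stable-at-0
did_minus = dy[stable_1].mean() - dy[leavers].mean()   # stable-at-1 vs leavers
n_plus, n_minus = joiners.sum(), leavers.sum()
did_m = (n_plus * did_plus + n_minus * did_minus) / (n_plus + n_minus)  # 3.0
```

With one joiner cell and one leaver cell, DID_M is just the equal-weight average of DID_+ = 4 and DID_- = 2, i.e. 3.0.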
…s API

Methodology / inference:
- Bootstrap-updated overall/joiners/leavers inference now goes through safe_inference() instead of inline t_stat = effect / se. Matches the project anti-pattern rule and keeps NaN behavior consistent with the rest of the library.

Placebo doc/code alignment:
- Phase 1 placebo bootstrap is deferred to Phase 2 (the dynamic companion paper Section 3.7.3 derives the cohort-recentered analytical variance for DID_l only; the placebo's influence-function representation is not yet implemented). The bootstrap path covers DID_M, DID_+, and DID_- only.
- Updated the source warning text, the placebo_inputs comment, README, CHANGELOG, REGISTRY, choosing_estimator.rst, and llms-full.txt to drop the misleading "set n_bootstrap > 0 for a placebo SE today" promise and state clearly that placebo inference fields stay NaN even with bootstrap enabled.
- Added test_placebo_bootstrap_unavailable_in_phase_1 to pin this contract so future contributors do not silently widen the bootstrap surface without wiring up the documented Phase 2 derivation.

twowayfeweights() public API:
- Added ValueError validation for missing outcome/group/time/treatment columns (was treatment-only).
- Updated README, docs/api/chaisemartin_dhaultfoeuille.rst, and docs/llms-full.txt examples to pass outcome="outcome" so the documented usage actually runs.

set_params():
- Re-runs the same __init__ validation rules so invalid parameter combinations cannot slip in after construction.

Tests:
- 96 passing locally (was 95). The new placebo test runs alongside TestBootstrap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Final doc inconsistency caught on re-review: practitioner_decision_tree.rst still said "set n_bootstrap > 0 for a bootstrap-based placebo SE today" while README, REGISTRY, choosing_estimator, and llms-full had been updated to reflect that placebo bootstrap is deferred to Phase 2 and placebo inference fields stay NaN even when n_bootstrap > 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
This PR adds a new estimator and touches methodology-critical code paths, but I found three unmitigated

Executive Summary
Methodology
Code Quality
No additional material findings beyond the methodology issues above.

Performance
No material findings in the changed performance-sensitive paths beyond the correctness issues above.

Maintainability
No additional material findings beyond the documentation/test gaps below.

Tech Debt
No mitigating

Security
No material security findings in the changed files.

Documentation/Tests
Path to Approval
Addresses three methodology issues in the dCDH Phase 1 implementation:
1. Cell-count weighting (P1): the per-period DID and placebo formulas
were computing N_{a,b,t} as the sum of within-cell observation counts
instead of the count of (g, t) cells. This silently changed the
estimand on individual-level inputs with uneven cell sizes. Fixed by
replacing n_curr[mask].sum() with int(mask.sum()) and using simple
.mean() over cells instead of n_curr-weighted averages. The fix
matches the AER 2020 Theorem 3 cell-count notation literally and
matches R DIDmultiplegtDYN's default behavior on cell-aggregated
input. Renamed the results field n_switcher_obs -> n_switcher_cells
for accuracy. Test data uses balanced 1-obs-per-cell panels so the
regression test_cell_count_weighting_unbalanced_input is needed to
guard the unbalanced-input path.
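The weighting bug described in item 1 can be reproduced in a few lines. This is a hypothetical sketch, not the estimator's code: `is_joiner`, `n_curr`, and `cell_mean` are made-up stand-ins for a boolean cell mask, within-cell observation counts, and cell-level outcome means.

```python
import numpy as np

# Each entry is one (g, t) cell; n_curr holds within-cell observation counts.
is_joiner = np.array([True, True, False, False])   # two joiner cells
n_curr    = np.array([10, 1, 5, 5])                # uneven cell sizes
cell_mean = np.array([3.0, 7.0, 1.0, 2.0])

# Buggy: N counted as the SUM of observation counts, and an n-weighted
# average -> the estimand drifts on individual-level inputs with uneven cells.
n_wrong   = n_curr[is_joiner].sum()                                   # 11
avg_wrong = np.average(cell_mean[is_joiner], weights=n_curr[is_joiner])

# Fixed: N is the COUNT of (g, t) cells, and a simple mean over cells,
# matching the cell-count notation of the target formula.
n_right   = int(is_joiner.sum())                                      # 2
avg_right = cell_mean[is_joiner].mean()                               # 5.0
```

On a balanced one-observation-per-cell panel the two agree, which is exactly why an unbalanced-input regression test is needed.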
2. Full Lambda^G_{g,l=1} influence function (P1): the analytical SE
path was using a "first-switch only" shortcut where each switching
group's U^G_g came from its own switch period only. This omitted the
periods where switching groups serve as stable controls for other
cohorts. Replaced with _compute_full_per_group_contributions, which
sums role-weighted outcome differences (joiner +1, stable_0
-n_10/n_00, leaver -1, stable_1 +n_01/n_11) across all periods. The
bootstrap path inherits the corrected vectors automatically. SE
parity vs R DIDmultiplegtDYN on pure-direction scenarios narrowed
from ~18% to ~3%; pure-direction parity tests now assert SE within
5% rtol.
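The role-weighting scheme in item 2 can be sketched abstractly. Everything below is hypothetical (function name, data shapes, and the example numbers); it only illustrates the idea that a group's contribution sums role-weighted first-differences over ALL periods where it plays any role, not just its own switch period.

```python
# Role weights per period: switchers enter with +/-1; stable groups enter with
# the (negative of the) control weight implied by the switcher/control counts.
ROLE_WEIGHTS = {
    "joiner":   lambda n: +1.0,
    "stable_0": lambda n: -n["n_10"] / n["n_00"],   # control for joiners
    "leaver":   lambda n: -1.0,
    "stable_1": lambda n: +n["n_01"] / n["n_11"],   # control for leavers
}

def group_contribution(roles_by_period, dy_by_period, counts_by_period):
    """Sum role-weighted outcome first-differences across all periods."""
    total = 0.0
    for t, role in roles_by_period.items():
        if role is not None:
            total += ROLE_WEIGHTS[role](counts_by_period[t]) * dy_by_period[t]
    return total

# A group that is a stable-at-0 control in period 1 and then joins in period 2:
roles  = {1: "stable_0", 2: "joiner"}
dy     = {1: 0.5, 2: 3.0}
counts = {1: {"n_10": 1, "n_00": 2}, 2: {}}
u_g = group_contribution(roles, dy, counts)   # -0.5 * 0.5 + 3.0 = 2.75
```

A "first-switch only" shortcut would have returned 3.0 here, dropping the period-1 control contribution, which is the class of omission the fix addresses.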
3. Singleton-baseline filter at variance stage only (P1): footnote 15
of the dynamic paper says singleton-baseline groups have no cohort
peer for the cohort-recentered variance. The previous code applied
the filter as a point-estimate filter (dropping the groups before
the per-period DIDs), which under Python's documented period-based
stable-control interpretation could change DID_M when a unique-
baseline always-treated or never-treated group would otherwise serve
as a valid stable control. Fixed by computing the singleton-baseline
group set in fit() and passing it to the variance helper, which
excludes those groups from the cohort enumeration and U vector
only. The cell DataFrame retains them. Updated
TestSingletonBaselineFilter to assert the variance-only semantics
and the warning text.
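The variance-only scoping in item 3 amounts to computing a set of groups and threading it to the variance helper. A hypothetical sketch, with the cohort key simplified to baseline treatment only (the real cohort key is richer):

```python
import pandas as pd

# Baseline treatment per group; a singleton-baseline group is the only member
# of its baseline cohort, so it has no peer for cohort recentering.
baseline = pd.Series({"g1": 0, "g2": 0, "g3": 1})

cohort_sizes = baseline.value_counts()
singletons = {g for g, d in baseline.items() if cohort_sizes[d] == 1}

# Exclude singletons from the VARIANCE enumeration only; the point-estimate
# cell data keeps them, e.g. as period-based stable controls.
variance_groups = sorted(set(baseline.index) - singletons)   # ["g1", "g2"]
```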
Documentation:
- doc-deps.yaml and ROADMAP.md no longer link to gitignored
paper review files (REGISTRY.md is the canonical source).
- REGISTRY.md updated with cell-count notation, the full IF formula,
the variance-only singleton-filter scope, the new SE parity gap
numbers (3% vs the prior 18%), and a clearer period-vs-cohort
deviation note.
Tests:
- test_cohort_recentering_not_grand_mean now actually computes both
formulas on the same data and asserts they differ materially. The
prior version only checked overall_se > 0.
- test_cell_count_weighting_unbalanced_input pins the cell-count
contract on a deliberately unbalanced individual-level panel.
- test_parity_joiners_only and test_parity_leavers_only now assert
SE parity at 5% rtol (was previously skipped).
- 97 dCDH tests passing (96 + the new cell-count regression).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
Static review only: this shell is missing

Executive Summary
Methodology
Affected methods: dCDH
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Addresses two new P1s, two P2s, and two P3s from CI re-review:

P1: ragged panel handling
- Add fit() Step 5a validation: reject groups missing the first global period with a clear ValueError listing offenders, and drop groups with interior period gaps with an explicit UserWarning. Eliminates the silent NaN propagation in _compute_cohort_recentered_inputs that could crash the cohort enumeration with int(np.nan) or misclassify late-entry groups' baselines.
- Fix singleton-baseline computation in fit() Step 7 to read the validated first global period explicitly instead of using groupby.first(), which returned the first OBSERVED row per group.
- Add defensive presence-gating in _compute_cohort_recentered_inputs: the helper now refuses to run if N_mat[:, 0] has any zero entries (a fit() validation regression tripwire), and the first-switch detection loop only counts transitions between adjacent OBSERVED periods. Even with fit() validation in place, the helper is now safe to call directly.
- Add 2 regression tests: test_missing_baseline_period_raises_value_error and test_interior_gap_drops_group_with_warning.

P1: twowayfeweights validation refactor
- Extract _validate_and_aggregate_to_cells helper enforcing the dCDH validation contract (NaN treatment / NaN outcome / non-binary treatment all raise ValueError; within-cell varying treatment emits UserWarning before majority rounding).
- Both fit() and twowayfeweights() now call the helper. Single source of truth for the validation rules, no drift between the two public entry points.
- Add 4 regression tests for twowayfeweights() validation: test_twowayfeweights_rejects_nan_treatment, test_twowayfeweights_rejects_nan_outcome, test_twowayfeweights_rejects_non_binary_treatment, test_twowayfeweights_warns_on_within_cell_rounding.

P2: joiner/leaver metadata
- Fix n_joiner_cells = int(n_10_t_arr.sum()) (was count_nonzero, counting PERIODS, not cells). Same for n_leaver_cells.
- Compute n_joiner_obs and n_leaver_obs as actual observation counts (sum of n_gt over the joiner/leaver cells across periods), not as cell totals. For balanced one-obs-per-cell panels they equal n_*_cells; for individual-level inputs with multiple obs per cell they can be larger. Update results dataclass docstrings.

P2: parity tests run on JSON without R
- Decouple golden_values fixture from require_r_dcdh. Tests now run whenever the JSON file exists. R is only needed to regenerate the JSON via benchmarks/R/generate_dcdh_dynr_test_values.R.
- Verified by running DIFF_DIFF_R=skip pytest tests/test_chaisemartin_dhaultfoeuille_parity.py — all 5 parity scenarios PASS (was previously skipping entirely).

P3: summary() label rename
- Rename "Groups dropped before estimation:" to "Group filter / metadata counts:". Label never-switching as "(reported, not dropped)". Reflects the Round 2 change where never-switching groups participate in the variance via stable-control roles.

P3: CHANGELOG/ROADMAP consistency
- Remove the CHANGELOG bullet that claimed three paper review files were committed under docs/methodology/papers/. Replace with a reference to REGISTRY.md as the canonical methodology surface (matching the ROADMAP wording).

Tests: 103 dCDH passing (97 + 6 new). Worked example DID_M = 2.5 still exact. Pure-direction R parity tests still pass at 1e-4 / 5% rtol.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
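The ragged-panel contract above (missing first period raises, interior gaps drop with a warning, right-censoring is tolerated) can be sketched standalone. Hypothetical function and column names; this is not the library's fit() code.

```python
import warnings
import pandas as pd

def validate_panel(df, group_col="group", time_col="time"):
    periods = sorted(df[time_col].unique())
    first = periods[0]
    obs_by_group = {g: set(ts) for g, ts in df.groupby(group_col)[time_col]}

    missing_first = [g for g, obs in obs_by_group.items() if first not in obs]
    if missing_first:
        raise ValueError(f"groups missing first global period {first}: {missing_first}")

    keep = []
    for g, obs in obs_by_group.items():
        expected = [p for p in periods if p <= max(obs)]   # up to last observed period
        if any(p not in obs for p in expected):            # interior gap
            warnings.warn(f"dropping group {g}: interior period gap", UserWarning)
        else:                                              # balanced or right-censored
            keep.append(g)
    return df[df[group_col].isin(keep)]

panel = pd.DataFrame({
    "group": ["a"] * 3 + ["b"] * 2 + ["c"] * 2,
    "time":  [0, 1, 2,  0, 2,  0, 1],   # b has an interior gap; c is right-censored
})
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    out = validate_panel(panel)        # keeps groups a and c, drops b
```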
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
✅ Looks good
Static review only: I could not run the new dCDH suite in this environment because

Executive Summary
Methodology
Methodology cross-check: the PR’s Phase 1 framing is consistent with the cited source material and with the registry’s current-scope definition for
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
…trings)

P2: split joiners_leavers DataFrame into n_cells + n_obs columns
- to_dataframe(level="joiners_leavers") previously had a single n_obs column with mixed semantics by row (DID_M used switcher cell count; DID_+/DID_- used raw observation counts). Two columns with consistent units across all rows: n_cells (count of switching (g, t) cells) and n_obs (sum of n_gt over the same cells). The DID_M row uses the union of joiner + leaver cells. Updated test_to_dataframe_joiners_leavers to pin the new contract.

P3: stale docstrings on results object
- DCDHBootstrapResults class docstring now states explicitly that placebo bootstrap fields ALWAYS remain None in Phase 1 (the previous wording said they were "populated when available"). Per-field docstrings for placebo_se / placebo_ci / placebo_p_value now point back to the class-level note.
- n_groups_dropped_never_switching docstring now reflects the Round 2 full-IF fix: never-switching groups participate in the variance via stable-control roles and the field is reported for backwards compatibility only — no actual exclusion happens.
- n_groups_dropped_singleton_baseline docstring clarifies the variance-only filter scope (the cell DataFrame retains them as period-based stable controls).

P3: misleading R-script + prep_dgp comments
- benchmarks/R/generate_dcdh_dynr_test_values.R: clarified that the Python and R generators mirror each other STRUCTURALLY (same pattern logic, same FE/effect/noise model), not at the RNG level. R's set.seed and NumPy's default_rng use different RNGs. Parity tests load the R script's golden-value JSON so both sides operate on byte-identical input regardless of how it was originally generated.
- prep_dgp.py generate_reversible_did_data: clarified that the default single_switch pattern is A5-safe by construction (every group has at most one transition). Other patterns (random/cycles/marketing) ARE allowed to violate A5 and exist primarily as stress tests for the drop_larger_lower=True filter. The cohort-recentered variance formula is derived under A5, which is why drop_larger_lower defaults to True.

Tests: 103 dCDH passing (no new tests; the existing test_to_dataframe_joiners_leavers was strengthened to assert the new n_cells / n_obs contract).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
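The n_cells / n_obs distinction above is easy to demonstrate on a toy cell table. A hypothetical sketch (column names invented here): n_cells counts switching (g, t) cells, while n_obs sums the per-cell observation counts n_gt over those same cells.

```python
import pandas as pd

cells = pd.DataFrame({
    "group": ["g1", "g2", "g3"],
    "time":  [1, 1, 2],
    "role":  ["joiner", "joiner", "leaver"],
    "n_gt":  [10, 1, 4],          # uneven cell sizes from individual-level input
})

joiner_cells = cells[cells["role"] == "joiner"]
n_cells = len(joiner_cells)                  # 2 cells
n_obs = int(joiner_cells["n_gt"].sum())      # 11 underlying observations
```

On a balanced one-observation-per-cell panel the two columns coincide, which is why the mixed-semantics single column went unnoticed.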
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
Needs changes
Static review only:

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Addresses two new P1s + a P2 from CI re-review:
P1: degenerate-cohort SE collapses to 0.0
- _plugin_se now returns NaN when sum(U_centered**2) <= 0 (in addition
to the existing n==0 and divisor<=0 guards). This catches the
case where every variance-eligible group forms its own
(D_{g,1}, F_g, S_g) cohort, so cohort recentering produces an
identically-zero centered IF vector. Returning NaN rather than 0.0
prevents the silently-implies-infinite-precision failure mode.
- fit() now detects this case after the analytical SE call and emits
a UserWarning explaining that the variance is unidentified, the
point estimate is still valid, and the bootstrap path inherits the
same degeneracy on the same data.
- Worked example test now asserts overall_se is NaN with the warning,
alongside the existing DID_M = 2.5 exact assertion.
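The degenerate-cohort guard can be sketched as a small plug-in SE helper. Hypothetical function name and formula shape (sum of squared centered contributions, scaled by a divisor); the point is only the NaN-instead-of-0.0 contract when the centered vector is identically zero.

```python
import numpy as np

def plugin_se(u_centered, divisor):
    """Plug-in SE from centered per-group contributions; NaN when unidentified."""
    u = np.asarray(u_centered, dtype=float)
    if u.size == 0 or divisor <= 0:
        return float("nan")
    ssq = float(np.sum(u ** 2))
    if ssq <= 0.0:
        # Every variance-eligible group is its own cohort: recentering zeroed
        # the vector, so the variance is unidentified (not zero).
        return float("nan")
    return float(np.sqrt(ssq) / divisor)
```

With this guard, `plugin_se(np.zeros(4), 2.0)` is NaN rather than a 0.0 SE that would falsely imply infinite precision.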
P1: placebo A11 silent zeroing
- _compute_placebo now mirrors the main path's A11 distinction:
"no joiners" stays a natural zero (no flag), while "joiners but no
3-period stable_0 controls" sets the placebo period contribution
to zero AND appends a warning string to placebo_a11_warnings.
Symmetric for leavers/stable_1.
- _compute_placebo's return signature widened from
Optional[Tuple[float, bool]] to Optional[Tuple[float, bool, List[str]]]
to expose the A11 warnings to the caller.
- fit() unpacks the new tuple and surfaces a consolidated
"Placebo (DID_M^pl) Assumption 11 violations" warning when any
placebo period was zeroed by the A11 contract. Mirrors the main
DID path's warning format.
- New test_placebo_a11_violation_emits_warning constructs a panel
where placebo joiners exist but no placebo stable_0 controls do
(4-group T=3 with 2 joiners + 2 always-treated controls), asserts
the placebo A11 warning fires.
REGISTRY.md: two new Note blocks
- The degenerate-cohort SE contract (NaN + warning, deviation from R)
- The placebo A11 zero-retention contract (mirrors main DID path)
Tests: 105 dCDH passing (was 103, added 2 new). The existing
test_hand_calculable_4group_3period_joiners_and_leavers wraps fit()
in a warnings.catch_warnings(simplefilter("ignore")) block since the
degenerate-cohort warning now fires on this panel; the dedicated
test_worked_example_se_is_unidentified_with_warning asserts the
warning explicitly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
Static review only. I could not execute the new dCDH suite here because

Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Documents the dCDH Phase 1 ragged-panel handling contract that was implemented in Round 3 but not surfaced in any user-facing doc:
- Balanced baseline required: groups missing the first global period raise ValueError with offending IDs
- Interior period gaps: groups dropped with UserWarning
- Terminal missingness (early exit / right-censoring): retained, contributes from observed periods only via the per-period present guard at three sites in the variance computation

This is a Phase 1 limitation relative to R DIDmultiplegtDYN, which supports unbalanced panels with documented missing-treatment-before-first-switch handling. Added a new **Note (deviation from R)** block in REGISTRY.md, mirrored in the API rst, README, and CHANGELOG so users do not expect R-like behavior on late-entry or interior-gap panels.

Also fixes doc drift on never-switching groups: after the Round 2 full-IF fix they participate in the variance via stable-control roles and are no longer filtered, but the CHANGELOG/README/REGISTRY checklist and a methodology test comment still described them as filtered. The n_groups_dropped_never_switching field stays as backwards-compat metadata only (already documented correctly in the dataclass docstring since Round 2).

Adds test_terminal_missingness_retained as a regression test pinning the contract: a group with periods [0, 1, 2] in a 5-period panel is retained in results.groups, the fit completes without error, and the per-period DIDs are populated.

Files modified:
- docs/methodology/REGISTRY.md (assumption check + new deviation Note + checklist)
- docs/api/chaisemartin_dhaultfoeuille.rst (Phase 1 panel requirements section)
- README.md (Note block + n_groups_dropped_never_switching field row split)
- CHANGELOG.md (panel requirement sub-bullet + line 17 never-switching fix)
- tests/test_chaisemartin_dhaultfoeuille.py (test_terminal_missingness_retained)
- tests/test_methodology_chaisemartin_dhaultfoeuille.py (fix test comment)

Test count: 105 -> 106 (one new regression test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
Static review only. I could not execute the added dCDH tests in this runner because

Executive Summary
Methodology
The previous ragged-panel methodology blocker appears resolved; I do not see a remaining finding there. docs/methodology/REGISTRY.md:L477-L477, docs/api/chaisemartin_dhaultfoeuille.rst:L52-L72
Code Quality
No additional findings beyond the public

Performance
No material findings in the changed paths I reviewed.

Maintainability
Tech Debt
Security No material security findings in the changed files reviewed. Documentation/Tests
Path to Approval
…nguage

Resolves three findings from the prior CI review:

P1 — cluster parameter was a public no-op. The constructor exposed `cluster: Optional[str] = None` and stored it on `self.cluster`, but neither `fit()` nor `_compute_dcdh_bootstrap()` ever read it, so `cluster="state"` silently produced the same group-level inference as `cluster=None`. The Phase 1 contract is now: dCDH always clusters at the group level (via the cohort-recentered influence function and the multiplier bootstrap), and any non-None cluster value raises `NotImplementedError` at construction time. The same gate fires from `set_params`. Added `test_cluster_parameter_raises_not_implemented` covering all four entry points (`__init__`, `set_params`, the `cluster=None` happy path, and the convenience function).

P3 — TWFE diagnostic was running on the post-Step-5a sample but the inline comment claimed "FULL pre-filter cell dataset", and the standalone `twowayfeweights()` function used the truly pre-filter cell. On ragged panels with interior gaps, the two paths diverged. Fixed by reordering: the TWFE diagnostic now runs as Step 5a (was 5b) BEFORE the ragged-panel validation, which is now Step 5b (was 5a). The blocks are independent — the diagnostic just reads `cell` while the ragged-panel block mutates it. Both API surfaces now operate on the same `_validate_and_aggregate_to_cells()` output.

P3 — API rst step 3 said "Filters singleton-baseline groups", which read as a point-estimate sample drop. After Round 3, singleton-baseline groups are excluded from the variance computation only (they remain in the point-estimate sample as period-based stable controls). Fixed the language to match.

Documentation:
- REGISTRY.md gets a new `**Note (Phase 1 cluster contract):**` block in the dCDH section explaining the always-group-level semantics and the construction-time NotImplementedError gate.
- API rst step 3 reflects the variance-only singleton-baseline scope.
- The dCDH `cluster` parameter docstring now describes the Phase 1 contract instead of "Currently unused — analytical SEs are always at the group level via the cohort-recentered plug-in."

Test counts: 106 -> 107 (one new cluster gate regression test).

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille.py (cluster gate in __init__ and set_params, docstring update, TWFE diagnostic block reorder with renumbered Step 5a/5b labels)
- docs/methodology/REGISTRY.md (cluster contract Note + rename Step 5a -> Step 5b in the ragged-panel deviation Note)
- docs/api/chaisemartin_dhaultfoeuille.rst (singleton-baseline language fix in step 3 of the overview)
- tests/test_chaisemartin_dhaultfoeuille.py (test_cluster_parameter_raises_not_implemented in TestForwardCompatGates; rename Step 5a -> Step 5b in three ragged-panel test docstrings)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
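The construction-time gate pattern described above can be sketched generically. Hypothetical class and message text; the point is the shared validator called from both `__init__` and `set_params` so the invariant cannot be bypassed after construction.

```python
from typing import Optional

class Estimator:
    def __init__(self, cluster: Optional[str] = None):
        self._validate_cluster(cluster)
        self.cluster = cluster

    @staticmethod
    def _validate_cluster(cluster):
        if cluster is not None:
            raise NotImplementedError(
                "Phase 1 always clusters at the group level; "
                "custom cluster columns are deferred to a later phase."
            )

    def set_params(self, **params):
        if "cluster" in params:
            self._validate_cluster(params["cluster"])   # same gate as __init__
        for key, value in params.items():
            setattr(self, key, value)
        return self
```

Failing loudly at construction time replaces the silent no-op: `Estimator(cluster="state")` and `Estimator().set_params(cluster="state")` both raise.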
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
✅ Looks good
This re-review clears the prior P1 blocker:

Executive Summary
Methodology
Affected methods: Phase 1 dCDH
No unmitigated P0/P1 findings. The changed implementation matches the registry on the core Theorem 3 / Theorem 4 formulas, the Section 3.7.3 cohort-recentered variance, and the explicitly documented deviations from R, so I am not treating those documented differences as defects (docs/methodology/REGISTRY.md:L480, docs/methodology/REGISTRY.md:L517, docs/methodology/REGISTRY.md:L565, docs/methodology/REGISTRY.md:L569, diff_diff/chaisemartin_dhaultfoeuille.py:L785, diff_diff/chaisemartin_dhaultfoeuille.py:L1506).

Code Quality
No material findings. The changed dCDH inference path uses

Performance
No material findings from static review. I did not benchmark in this runner.

Maintainability
Tech Debt
Security
No security findings in the changed files reviewed.

Documentation/Tests
Static only here: I could not run the added tests because
Two P3 doc-drift cleanups from the prior CI review.

P3 — Bootstrap helper docstring drift. The dCDH bootstrap mixin documented the re-aggregation divisor as a "count of contributing groups" with the formula `W @ u_centered / n_groups_target`, but the live code (and the registry) use switching-cell denominators (`N_S`, joiner cells, leaver cells) per Theorem 3 of AER 2020. The numbers were correct; only the docstring drifted. Renamed the public mixin parameter from `n_groups_overall` to `divisor_overall` and updated the docstring to describe the divisor as the switching-cell count from the Theorem 3 weighting formula. The joiners/leavers tuple description in the docstring also says `(u_centered, divisor, original_effect)` instead of `(u_centered, n_groups, original_effect)` to match. The fit() call site at line 1041 passes `divisor_overall=N_S` under the new name.

P3 — `docs/llms-full.txt` line 238 said the dCDH `cluster` parameter was "Reserved for future use," which is stale after Round 7's NotImplementedError gate. Updated to "Phase 1: must be None; non-None raises NotImplementedError" to match REGISTRY.md and the constructor contract.

No behavior change; both fixes are purely textual / parameter naming. 107 tests still pass; black, ruff clean.

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py (rename n_groups_overall -> divisor_overall, expanded docstring describing the cell-count divisor and the formula `W @ u_centered / divisor`)
- diff_diff/chaisemartin_dhaultfoeuille.py (call site at line 1041 updates the keyword argument to match)
- docs/llms-full.txt (cluster comment matches the Phase 1 contract)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
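The `W @ u_centered / divisor` re-aggregation can be sketched for a single multiplier-bootstrap loop. Everything here is a hypothetical toy (values, seed, draw count): Rademacher weights perturb the centered per-group contributions, and each draw is re-scaled by the same switching-cell divisor used for the point estimate rather than by a group count.

```python
import numpy as np

rng = np.random.default_rng(0)
u_centered = np.array([1.2, -0.4, 0.7, -1.5])   # centered per-group contributions
divisor = 3.0                                    # switching-cell count (illustrative)

draws = []
for _ in range(999):
    w = rng.choice([-1.0, 1.0], size=u_centered.size)   # Rademacher weights
    draws.append(w @ u_centered / divisor)               # W @ u_centered / divisor
boot_se = float(np.std(draws, ddof=1))
```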
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment

Executive Summary
Methodology
Affected methodology: the dCDH
Beyond that sample-contract issue, I did not find another unmitigated methodology defect in the changed

Code Quality
No findings. I did not see a new inline-inference or partial-NaN regression; the changed estimator path routes inference through diff_diff/utils.py:L152.

Performance
No separate findings from static review. The blocker above also does unnecessary TWFE work on rows later dropped, but the correctness/contract problem is the part that needs action.

Maintainability
No separate findings.

Tech Debt
No separate deferrable finding. The P1 above is not tracked under the

Security
No findings.

Documentation/Tests
Path to Approval
Documents the dCDH TWFE diagnostic sample contract that Round 7's
swap left implicit. The fitted results.twfe_* values are computed on
the FULL pre-filter cell sample (matching the standalone
twowayfeweights() function), NOT on the post-filter estimation
sample used by overall_att / results.groups / inference fields. The
existing user-facing wording said "TWFE on the same data" /
"diagnostic from the same fit" — phrases that naturally read as
"same data as overall_att" — which contradicted the post-Round-7
behavior. This commit:
1. Adds a new `**Note (TWFE diagnostic sample contract):**` block in
REGISTRY.md enumerating all three sample-shaping filters
(interior-gap, multi-switch, singleton-baseline) and explicitly
carving singleton-baseline as variance-only (no fitted-vs-overall_att
mismatch, so no warning).
2. Rewrites the `twfe_diagnostic` parameter docstring in
chaisemartin_dhaultfoeuille.py to describe the pre-filter contract
and the divergence warning.
3. Rewrites the twfe_weights / twfe_fraction_negative / twfe_sigma_fe
/ twfe_beta_fe field docstrings in the results dataclass to clarify
they describe the FULL pre-filter cell sample, with a pointer to
the REGISTRY contract Note.
4. Adds a `UserWarning` from `fit()` whenever the user requested the
TWFE diagnostic AND any of the interior-gap or multi-switch filters
dropped groups. The warning explains the divergence with explicit
counts and points at REGISTRY for the rationale. The warning fires
regardless of whether the diagnostic itself succeeded or hit the
rank-deficient fallback (the plan-review correctly flagged that the
`twfe_diagnostic_payload is not None` guard would swallow the rare
rank-deficient + filtered-panel intersection — dropped that guard).
5. Updates docs/api/chaisemartin_dhaultfoeuille.rst and
docs/choosing_estimator.rst to replace "from the same fit" with
"computed on the data you pass in (pre-filter)".
6. Adds three regression tests in TestTwowayFeweightsHelper:
- test_twfe_pre_filter_contract_with_interior_gap_drop: panel with
a dropped interior-gap group, asserts fitted twfe_* matches
standalone, estimation sample is smaller, and the divergence
warning fires with the expected counts.
- test_twfe_pre_filter_contract_with_multi_switch_drop: panel with
an injected multi-switch crosser, similar assertions.
- test_twfe_no_divergence_warning_on_clean_panel: negative test
asserting NO divergence warning fires on a clean panel
(hard-codes pattern="single_switch" to close a future footgun).
7. Fixes the stale "Step 5a guarantees..." comment at line 712 to
"Step 5b guarantees..." (post-Round-7 the ragged-panel validation
is Step 5b, not Step 5a). Independent cleanup; bundled because
it's in the same file and the same topic.
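The divergence warning in point 4 reduces to a small conditional helper. A hypothetical sketch (function name, message wording, and counts invented here): fire whenever the diagnostic was requested AND any sample-shaping filter dropped groups, independent of whether the diagnostic itself succeeded.

```python
import warnings

def maybe_warn_twfe_divergence(twfe_requested, n_dropped_interior_gap, n_dropped_multi_switch):
    """Warn when the pre-filter TWFE diagnostic sample differs from the estimation sample."""
    n_dropped = n_dropped_interior_gap + n_dropped_multi_switch
    if twfe_requested and n_dropped > 0:
        warnings.warn(
            f"TWFE diagnostic uses the FULL pre-filter sample, but {n_dropped} group(s) "
            f"were dropped from the estimation sample "
            f"({n_dropped_interior_gap} interior-gap, {n_dropped_multi_switch} multi-switch).",
            UserWarning,
        )
```

Note there is no guard on the diagnostic payload itself, matching the review point that such a guard would swallow the rare rank-deficient plus filtered-panel intersection.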
This resolution preserves Round 7's standalone-vs-fitted parity
(both APIs use the pre-filter cell sample) and addresses Round 9's
P1 about the documentation contract. Both reviewers' concerns are
now satisfied: the standalone and fitted produce identical numbers
on the same input, AND users see an explicit warning when filters
make the fitted sample diverge from the dCDH estimation sample.
Test counts: 107 -> 110 (three new sample-contract regression
tests). Black, ruff clean.
Files modified:
- docs/methodology/REGISTRY.md
(new TWFE sample contract Note enumerating all three filters)
- diff_diff/chaisemartin_dhaultfoeuille.py
(twfe_diagnostic param docstring, n_groups_dropped_interior_gap
tracking, divergence warning at Step 6b, stale comment fix)
- diff_diff/chaisemartin_dhaultfoeuille_results.py
(twfe_weights / twfe_fraction_negative / twfe_sigma_fe /
twfe_beta_fe field docstrings)
- docs/api/chaisemartin_dhaultfoeuille.rst (wording fix)
- docs/choosing_estimator.rst (wording fix)
- tests/test_chaisemartin_dhaultfoeuille.py (3 new tests + 1
parity test comment update)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
Needs changes. One unmitigated P1 finding.

Executive Summary
Methodology
Code Quality
No unmitigated findings beyond the bootstrap inference-surface issue above.

Performance
No separate performance findings from static review.

Maintainability
No separate maintainability findings from the changed code.

Tech Debt
No separate deferrable finding. The bootstrap inference-surface mismatch is not tracked in

Security
No findings.

Documentation/Tests
Path to Approval
Fixes a P1 where the dCDH bootstrap branch silently replaced the multiplier-bootstrap percentile p-values and CIs with normal-theory recomputations from the bootstrap SE. The bootstrap helper computes the per-target SE / CI / p-value triples correctly on the DCDHBootstrapResults object, but fit() was only copying the SE and then calling safe_inference() which returns normal-theory p-values and CIs. The user requested multiplier bootstrap inference and got a hybrid (bootstrap SE + normal-theory p/CI) — the public surface fields, event_study_effects[1], summary(), to_dataframe(), is_significant, and significance_stars all reflected the normal-theory recomputations instead of the bootstrap inference. The fix: in the fit() bootstrap branch, propagate br.overall_p_value and br.overall_ci directly to the top-level overall_p / overall_ci (and joiners/leavers analogues). Keep the t-stat from safe_inference()[0] since percentile bootstrap doesn't define an alternative t-stat semantic, and that satisfies the project anti-pattern rule (never compute t = effect/se inline). Library precedent: imputation.py:790-805, two_stage.py:778-787, and efficient_did.py:1009-1013 all propagate bootstrap p/CI to the public surface while keeping a SE-derived t-stat. dCDH was the only modern bootstrap-enabled estimator that didn't follow this pattern. Documentation updates: - New `**Note (bootstrap inference surface):**` block in REGISTRY.md documenting the propagation contract, the rationale for the SE-based t-stat, and the placebo carve-out (placebo bootstrap remains deferred to Phase 2). - Inference-method switch paragraph added to the ChaisemartinDHaultfoeuilleResults class docstring `Notes` section. - README.md row updated to clarify "cohort-recentered analytical SE by default; multiplier-bootstrap percentile inference when n_bootstrap > 0". 
- API .rst: the multiplier-bootstrap example now shows that results.overall_p_value and results.overall_conf_int reflect the bootstrap inference (not just bootstrap_results.overall_*).

Regression test added: test_bootstrap_p_value_and_ci_propagated_to_top_level asserts results.overall_p_value == bootstrap_results.overall_p_value, overall_conf_int == bootstrap_results.overall_ci, the joiners/leavers analogues, that event_study_effects[1] reflects the bootstrap, that the t-stat is the SE-derived value, that summary() doesn't crash, and that to_dataframe() returns the bootstrap-derived numbers. This pins the contract so the silent inference-method swap can't regress.

Test counts: 110 -> 111 (one new bootstrap propagation regression test). The existing TestBootstrap tests at lines 855-955 only asserted that bootstrap_results was populated and the SE was finite, which is why the silent swap passed CI for nine commits.

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille.py (fit() bootstrap branch propagates p/CI from br instead of recomputing via safe_inference)
- diff_diff/chaisemartin_dhaultfoeuille_results.py (class docstring `Notes` section gains the inference-method switch paragraph)
- docs/methodology/REGISTRY.md (new bootstrap inference surface Note)
- README.md (field-description row clarification)
- docs/api/chaisemartin_dhaultfoeuille.rst (multiplier-bootstrap example reflects the new propagation)
- tests/test_chaisemartin_dhaultfoeuille.py (test_bootstrap_p_value_and_ci_propagated_to_top_level)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
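The propagation contract described above can be sketched as follows. This is a minimal, self-contained illustration, not the diff_diff implementation: `BootRes` and `fit_surface` are hypothetical names standing in for the library's bootstrap-results object and the fit() bootstrap branch; the key point is that the p-value and CI come from the bootstrap distribution while only the t-stat is SE-derived.

```python
# Sketch of the bootstrap inference-surface contract (illustrative names,
# not the diff_diff API): percentile p/CI are propagated from the bootstrap
# object; only the t-stat is computed from the bootstrap SE.
from dataclasses import dataclass

import numpy as np


@dataclass
class BootRes:
    draws: np.ndarray  # bootstrap draws of the estimate
    estimate: float    # point estimate

    @property
    def se(self) -> float:
        return float(np.std(self.draws, ddof=1))

    @property
    def ci(self) -> tuple:
        # percentile CI straight from the bootstrap distribution
        lo, hi = np.percentile(self.draws, [2.5, 97.5])
        return (float(lo), float(hi))

    @property
    def p_value(self) -> float:
        # symmetric percentile p-value: share of centered draws at least
        # as extreme as the observed estimate
        centered = self.draws - self.draws.mean()
        return float(np.mean(np.abs(centered) >= abs(self.estimate)))


def fit_surface(br: BootRes) -> dict:
    # The fix: copy p-value and CI from the bootstrap object instead of
    # recomputing them from the SE under normality; keep the t-stat
    # SE-derived, since percentile bootstrap defines no alternative t-stat.
    return {
        "overall_se": br.se,
        "overall_t": br.estimate / br.se,  # SE-derived t-stat
        "overall_p": br.p_value,           # bootstrap percentile p
        "overall_ci": br.ci,               # bootstrap percentile CI
    }


rng = np.random.default_rng(0)
br = BootRes(draws=rng.normal(2.0, 0.5, size=999), estimate=2.0)
surface = fit_surface(br)
```

The anti-pattern this replaces would instead call a normal-theory helper with `br.se` and overwrite `overall_p` / `overall_ci` with `2 * (1 - Phi(|t|))`-style recomputations, producing the hybrid inference surface described in the commit message.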
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good. No unmitigated P0/P1 findings remain in the changed dCDH code. Static review only; I could not run the tests because this runner is missing
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
P2 fix: summary() was unconditionally printing "dCDH analytical CI is conservative under Assumption 8" even when n_bootstrap > 0 and the displayed p_value / conf_int were bootstrap-based. The footer now branches on bootstrap_results: when bootstrap is active, it prints "p-value and CI are multiplier-bootstrap percentile inference (N iterations, weight_type weights)"; when analytical, it keeps the original conservative-CI note.

Added test assertions pinning both paths:
- test_summary_formats_without_error (analytical mode): asserts "analytical CI is conservative" is present and "multiplier-bootstrap" is absent
- test_bootstrap_p_value_and_ci_propagated_to_top_level (bootstrap mode): asserts "multiplier-bootstrap percentile inference" is present and "analytical CI is conservative" is absent

111 tests still pass; black and ruff are clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
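The footer branch described above amounts to a small conditional; a sketch under assumed names (`render_footer` and its parameters are illustrative, not the actual summary() internals):

```python
# Illustrative sketch of the summary() footer branch: the footer text
# follows the inference method actually used for the displayed p/CI.
# render_footer and its parameters are hypothetical names.
def render_footer(bootstrap_results, n_bootstrap=0, weight_type="rademacher"):
    if bootstrap_results is not None:
        # bootstrap mode: the displayed p-value / CI are percentile-based
        return (
            f"p-value and CI are multiplier-bootstrap percentile "
            f"inference ({n_bootstrap} iterations, {weight_type} weights)"
        )
    # analytical mode: keep the original conservative-CI note
    return "dCDH analytical CI is conservative under Assumption 8"
```

The two test assertions above pin exactly this branch: each mode's footer string must be present and the other mode's must be absent.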
/ai-review
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Highest unmitigated severity is
Executive Summary
Methodology
Previous methodology findings from the last review cycle appear addressed: the bootstrap inference surface and
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Summary
`ChaisemartinDHaultfoeuille` (alias `DCDH`) — Phase 1 of the de Chaisemartin-D'Haultfœuille estimator family. The only modern staggered DiD estimator in the library that handles non-absorbing (reversible) treatments — treatment can switch on AND off over time (marketing campaigns, seasonal promotions, on/off policy cycles, binary fuzzy designs). Implements `DID_M` from de Chaisemartin & D'Haultfœuille (2020) AER, equivalently `DID_1` (horizon `l = 1`) of the dynamic companion paper (NBER WP 29873). Phase 2 (`aggregate` / `L_max`) and Phase 3 (covariate adjustment via `controls`, group-specific linear trends, HonestDiD) parameters are present from day one with `NotImplementedError` gates pointing at the relevant phase. Ships joiners-only (`DID_+`) and leavers-only (`DID_-`) decompositions, a single-lag placebo (`DID_M^pl`) point estimate, optional multiplier bootstrap (Rademacher / Mammen / Webb), the TWFE decomposition diagnostic from Theorem 1 of AER 2020 (per-cell weights, fraction negative, `sigma_fe`, `beta_fe`), plus a standalone `twowayfeweights()` helper. Also adds a `generate_reversible_did_data()` generator with 7 patterns (`single_switch` default, `joiners_only`, `leavers_only`, `mixed_single_switch`, `random`, `cycles`, `marketing`).

Methodology references
- `DID_M` estimator (= `DID_1` at horizon `l = 1` of the dynamic paper)
- `docs/methodology/papers/chaisemartin-dhaultfoeuille-{2020-review,2020-appendix-review,dynamic-review}.md` (committed in earlier work; the dynamic-paper review is the source for the cohort-recentered analytical variance from Web Appendix Section 3.7.3)
- R `DIDmultiplegtDYN` v2.3.3 (CRAN, maintained by the paper authors)

Intentional deviations from R `DIDmultiplegtDYN` (documented as `**Note (deviation from R):**` in REGISTRY.md)

- Stable-control set definition. Python uses period-based stable definitions (`stable_0(t)` = any cell with `D_{g,t-1} = D_{g,t} = 0`, regardless of baseline `D_{g,1}`), matching the AER 2020 Theorem 3 cell-count notation `N_{0,0,t}` and `N_{1,1,t}` literally. R uses cohort-based definitions that additionally require `D_{g,1}` to match the side, matching the dynamic companion paper's cohort `(D_{g,1}, F_g, S_g)` framework. The two definitions agree exactly under any of: (a) the panel is pure-direction (joiners-only or leavers-only), (b) no joiner's post-switch state ever overlaps a period when leavers are switching (and vice versa), or (c) the worked example. They diverge by O(1%) on mixed-direction panels with both joiners and leavers when some joiners' post-switch cells could serve as leavers' controls (or vice versa). Parity tests use 1e-4 tolerance for pure-direction scenarios and 2.5% tolerance for mixed-direction scenarios, with the deviation explicitly documented in `docs/methodology/REGISTRY.md`.
- Phase 1 placebo SE. Intentionally `NaN` with a `UserWarning`. The dynamic companion paper Section 3.7.3 derives the cohort-recentered analytical variance for `DID_l` only — not for the placebo `DID_M^pl`. Phase 2 will add multiplier-bootstrap support for the placebo via the dynamic paper's machinery; until then, `placebo_se`, `placebo_t_stat`, `placebo_p_value`, and `placebo_conf_int` remain `NaN` even when `n_bootstrap > 0`. The placebo point estimate is still computed and exposed via `results.placebo_effect`. A regression test (`test_placebo_bootstrap_unavailable_in_phase_1`) pins this contract.
- Assumption 11 zero-retention convention. When some period `t` has joiners but no `stable_0` controls (or leavers but no `stable_1` controls), Python sets that period's `DID_{+,t}` (or `DID_{-,t}`) to zero but retains the switcher count in the `N_S` denominator, matching the AER 2020 Theorem 3 worked-example arithmetic. R drops the cohort entirely. Per-period flags (`did_plus_t_a11_zeroed`, `did_minus_t_a11_zeroed`) and a consolidated `UserWarning` make this visible to users. The R parity scenarios add never-treated and always-treated control cohorts so neither implementation triggers the convention, eliminating this as a source of disagreement in the parity tests.

Documented matches with R (worth noting because they're substantive choices)
- `drop_larger_lower=True` default — drops multi-switch groups before estimation (matches R; required for the analytical variance to be consistent with the point estimate)
- `D_{g,1}` is unique (`**Note:**` in REGISTRY.md)

No silent failures
Every drop / round / fallback emits an explicit `warnings.warn()` or `ValueError`. Five distinct warnings/errors fire on the relevant code paths: NaN treatment validation, NaN outcome validation, placebo not-computable, TWFE diagnostic `denom == 0`, and the never-switching-groups filter from variance, plus the multi-switch and singleton-baseline filters and the consolidated A11 warning.

Validation
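As background for the parity scenarios inventoried below, the period-based stable-control definition from the deviations note (`stable_0(t)` = any cell with `D_{g,t-1} = D_{g,t} = 0`) can be sketched as follows; `stable_cells` and the toy panel are illustrative, not part of diff_diff:

```python
# Hedged sketch of the period-based stable-control definition from the
# deviation note: stable_0(t) is any (g, t) cell with
# D_{g,t-1} = D_{g,t} = 0, regardless of the baseline D_{g,1}.
# stable_cells() is an illustrative helper, not the library API.
def stable_cells(treatment, t, side):
    """Groups whose treatment equals `side` in both t-1 and t (0-indexed)."""
    return {g for g, d in treatment.items() if d[t - 1] == side and d[t] == side}


# Toy panel: A is a joiner (switches 0 -> 1 at t=2), B is never-treated,
# C is a leaver (switches 1 -> 0 at t=3). Under the period-based definition,
# A's pre-switch cells still count as stable_0 controls even though A is
# not never-treated — this is where the cohort-based R definition can differ.
panel = {"A": [0, 0, 1, 1], "B": [0, 0, 0, 0], "C": [1, 1, 1, 0]}
```

On this panel, `stable_cells(panel, 1, 0)` includes the not-yet-switched joiner A alongside the never-treated B, while `stable_cells(panel, 2, 0)` contains only B.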
Test files added
- `tests/test_chaisemartin_dhaultfoeuille.py` — 44 tests covering basic API, forward-compat gates (one per Phase 2/3 parameter), the `drop_larger_lower` filter, A11 zero-retention, NaN handling, bootstrap (Rademacher / Mammen / Webb), the results dataclass, and the new `test_placebo_bootstrap_unavailable_in_phase_1` regression test
- `tests/test_methodology_chaisemartin_dhaultfoeuille.py` — 8 fast + 2 slow Tier 2 methodology tests including the hand-calculable 4-group worked example (asserts `DID_M = 2.5`, `DID_+ = 2.0`, `DID_- = 3.0` exactly), `test_cohort_recentering_not_grand_mean` (the load-bearing variance correctness check that catches the "Add initial diff-diff library implementation #1" implementation bug), TWFE diagnostic correctness, and large-N recovery (slow)
- `tests/test_chaisemartin_dhaultfoeuille_parity.py` — 5 R `DIDmultiplegtDYN` parity scenarios at horizon `l = 1`: `single_switch_mixed`, `joiners_only`, `leavers_only`, `mixed_single_switch`, and the hand-calculable worked example
- `tests/test_prep_dgp_reversible.py` — 38 tests for the new `generate_reversible_did_data()` generator (shape, patterns, reproducibility, true-effect column, switcher-type column, validation errors)

Test files modified
- `tests/conftest.py` — added `_check_r_dcdh_available()`, an `r_dcdh_available` session fixture, and a `require_r_dcdh` per-test fixture parallel to the existing `r_available` / `require_r` pattern, with a `DIFF_DIFF_R=skip` env-var override

R parity reference
- `benchmarks/R/generate_dcdh_dynr_test_values.R` — mirror generator + 5 scenarios; generates golden values via `did_multiplegt_dyn(..., effects=1)` from R `DIDmultiplegtDYN` v2.3.3
- `benchmarks/data/dcdh_dynr_golden_values.json` — committed golden values (~73 KB)
- `polars` r-universe backend that `DIDmultiplegtDYN >= 2.x` requires
- `DIDmultiplegtDYN` is unavailable, so the test suite still runs on dev machines without R installed

Documentation surface
Per `docs/doc-deps.yaml` for the new `chaisemartin_dhaultfoeuille` group:

- `docs/methodology/REGISTRY.md` — full `## ChaisemartinDHaultfoeuille` section with assumption checks, estimator equations, the four `**Note:**` blocks (conservative CI, `drop_larger_lower`, A11 zero-retention, singleton-baseline filter), the `**Note (deviation from R DIDmultiplegtDYN):**` block, edge cases, requirements checklist
- `docs/api/chaisemartin_dhaultfoeuille.rst` — autoclass directives + autofunction for `chaisemartin_dhaultfoeuille()` and `twowayfeweights()`, with usage examples
- `docs/api/index.rst` — toctree entry; `docs/api/prep.rst` — `generate_reversible_did_data` autofunction
- `docs/choosing_estimator.rst` — Quick Reference row, decision-flowchart question, Survey Compatibility Matrix row (`--` in all four columns), Detailed Guidance subsection
- `docs/practitioner_decision_tree.rst` — new "Reversible Treatment (On/Off Cycles)" section with business-language framing
- `docs/llms.txt` and `docs/llms-full.txt` — bullet + ~100-line section
- `README.md` — features bullet, alias table entry, full usage section
- `CHANGELOG.md` — `### Added` entry under `## [Unreleased]`
- `ROADMAP.md` — Phase 1 items (1a–1i) flipped from "Not started" to "Shipped"; Phase 2 / Phase 3 retained

Test count
96 dCDH tests passing locally (44 API + 8 methodology fast + 2 methodology slow [deselected by default] + 5 R parity + 38 generator – overlap = 95 deselectable + 1 placebo regression = 96 total). The hand-calculable 4-group worked example produces `DID_M = 2.5`, exactly matching R `DIDmultiplegtDYN`.

No tutorial in this PR
Per the multi-phase plan, a single comprehensive dCDH tutorial ships in Phase 3 after covariate support is added. Phase 1 deliberately ships no tutorial.
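For intuition on the worked-example numbers above, `DID_M` aggregates the joiners-only and leavers-only components weighted by switcher counts, as in AER 2020. The sketch below is illustrative arithmetic, not the library implementation; the equal switcher counts are an assumption chosen to reproduce `DID_M = 2.5` from `DID_+ = 2.0` and `DID_- = 3.0`, not a fact taken from the test data.

```python
# Sketch of the DID_M aggregation: a switcher-count-weighted average of the
# joiners-only (DID_+) and leavers-only (DID_-) components. did_m() and the
# equal counts below are illustrative assumptions, not the diff_diff code.
def did_m(did_plus, did_minus, n_plus, n_minus):
    n_s = n_plus + n_minus  # total switcher count N_S
    return (n_plus / n_s) * did_plus + (n_minus / n_s) * did_minus


# With DID_+ = 2.0, DID_- = 3.0 and equal joiner/leaver counts,
# the weighted average lands exactly between the two components.
headline = did_m(2.0, 3.0, 1, 1)
```

Unequal counts tilt the headline toward the larger side, e.g. `did_m(2.0, 3.0, 3, 1)` weights the joiners component three-to-one.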
Security / privacy