Tests for Instrumental Variable Validity • ivcheck

Introduction

A technical working paper for this package can be found here.

ivcheck is an R package that tests the identifying assumptions behind instrumental variable (IV) estimation. It provides three published falsification tests as named R functions, with S3 methods for fitted fixest and ivreg models plus a one-shot wrapper that runs every applicable test in a single call.

Every applied IV paper rests on two assumptions about the instrument Z: the exclusion restriction (Z affects the outcome Y only through the endogenous treatment D) and monotonicity (no defiers). Under these assumptions plus independence, the IV estimand identifies the local average treatment effect (LATE) for compliers (Imbens and Angrist 1994). Neither assumption can be tested directly, but the methodological literature has derived testable implications for the joint distribution of (Y, D, Z): Kitagawa (2015), Mourifie-Wan (2017), Frandsen-Lefgren-Leslie (2023). Rejection by any of these tests is evidence that at least one of exclusion or monotonicity has failed. Non-rejection means only that no violation was detected at the chosen level; it does not establish validity.

Applied IV research has not adopted these tests widely. Most empirical IV papers still argue identification by narrative (“my instrument is random-looking because X”), and referees are increasingly frustrated with this. The limiting factor has been tooling rather than conviction: Kitagawa’s test ships as supplementary Matlab code, Mourifie-Wan relies on the Stata clrtest module, and Frandsen-Lefgren-Leslie ships a Stata SSC module called testjfe. None is in R. ivcheck closes that gap: two added lines to a fixest::feols call and you have a published falsification test ready for your paper’s appendix.

The current landscape

The R ecosystem for IV estimation is mature. fixest is the dominant package for fast fixed-effects IV estimation via feols(y ~ x | d ~ z). ivreg provides classical 2SLS with Wu-Hausman, Sargan, and weak-IV F tests. ivmodel covers k-class estimators and weak-IV-robust confidence intervals. ivDiag (Lal, Lockhart, Xu, and Zu 2024, Political Analysis) implements effective-F and Anderson-Rubin diagnostics, valid-t and local-to-zero tests, plus sensitivity analysis.

None of these packages implements the LATE-validity family of falsification tests. Applied researchers who want their IV design formally tested have had to choose between writing a one-off replication script from the original paper’s methodology section, switching to Stata for the test and back to R for the rest of the analysis, or not running the test at all. The third option has dominated.

ivcheck is the first R-native implementation of the LATE-validity family. The implementations are faithful to the published statistics: Kitagawa’s variance-weighted interval-sup Kolmogorov-Smirnov form (equation 2.1 of the paper), the full Chernozhukov-Lee-Rosen intersection-bounds inference with Andrews-Soares adaptive moment selection for Mourifie-Wan with covariates, and the asymptotic chi-squared form of Frandsen-Lefgren-Leslie, with a direct generalisation to multivalued treatment (the binary case is the published result). All three are designed to slot into existing fixest and ivreg workflows without friction.

Installation

install.packages("ivcheck")

# Development version from GitHub
# install.packages("devtools")
devtools::install_github("charlescoverdale/ivcheck")

Quick start

library(fixest)
library(ivcheck)

data(card1995)
m <- feols(lwage ~ 1 | college ~ near_college, data = card1995)
iv_check(m, n_boot = 500)
#> IV validity diagnostic
#>   Kitagawa (2015):     stat = 5.25, p = 0.00, reject
#>   Mourifie-Wan (2017): stat = 5.25, p = 0.00, reject
#> Overall: at least one test rejects IV validity at 0.05.

Two added lines yield a falsification test that a referee is likely to ask about, with citation-ready output. The unconditional rejection above is the correct reading: Card’s IV is plausible only conditional on demographic controls. Add a control and the conditional Mourifie-Wan test passes (see the end-to-end example below).

Walkthrough

Output lines prefixed with #> show what the console prints.

A single test on raw vectors

library(ivcheck)

set.seed(1)
n <- 500
z <- sample(0:1, n, replace = TRUE)
d <- rbinom(n, 1, 0.3 + 0.4 * z)
y <- rnorm(n, mean = d)

k <- iv_kitagawa(y, d, z, n_boot = 500)
print(k)
#>
#> -- Kitagawa (2015) -----------------------------------------------------------
#> Sample size: 500
#> Statistic: 0.04, p-value: 0.91
#> Verdict: cannot reject IV validity at 0.05

The bootstrap p-value comes from the multiplier resampling procedure of Kitagawa (2015) section 3.2. With parallel = TRUE (the default) replications run across cores on POSIX systems.
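For intuition about what the statistic measures: with binary D and Z and a first stage in which Z = 1 raises treatment, LATE validity implies P(Y in B, D = 1 | Z = 1) >= P(Y in B, D = 1 | Z = 0) and P(Y in B, D = 0 | Z = 0) >= P(Y in B, D = 0 | Z = 1) for every interval B. The base-R sketch below is not the package's implementation: it computes the largest unweighted sample violation over a coarse grid of intervals, with no variance weighting and no bootstrap. The names toy, sub_prob, and the decile grid are illustrative assumptions.

```r
set.seed(1)
toy <- data.frame(z = sample(0:1, 500, replace = TRUE))
toy$d <- rbinom(nrow(toy), 1, 0.3 + 0.4 * toy$z)
toy$y <- rnorm(nrow(toy), mean = toy$d)

# Estimate P(Y in [lo, hi], D = d_val | Z = z_val) by a within-cell mean.
sub_prob <- function(lo, hi, d_val, z_val) {
  cell <- toy[toy$z == z_val, ]
  mean(cell$y >= lo & cell$y <= hi & cell$d == d_val)
}

# Coarse grid of interval endpoints (assumption: sample deciles of Y).
cuts <- as.numeric(quantile(toy$y, probs = seq(0, 1, by = 0.1)))
intervals <- t(combn(cuts, 2))  # one row per interval [lo, hi] with lo < hi

violation <- apply(intervals, 1, function(b) {
  # Under validity both differences are <= 0 in the population.
  treated   <- sub_prob(b[1], b[2], 1, 0) - sub_prob(b[1], b[2], 1, 1)
  untreated <- sub_prob(b[1], b[2], 0, 1) - sub_prob(b[1], b[2], 0, 0)
  max(treated, untreated)
})

max_violation <- max(violation)
max_violation
```

A small positive value can arise from sampling noise alone, which is why the actual test studentises the differences and calibrates the supremum by resampling.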

With covariates (Mourifie-Wan)

x <- rnorm(n)
mw <- iv_mw(y, d, z, x = x, n_boot = 500)
print(mw)

iv_mw() with covariates estimates F(y, d | X = x, Z = z) by cubic-polynomial series regression, computes heteroscedasticity-robust standard errors, and takes the sup of the studentised positive-part violation over a grid of (y, x) points. Critical values use adaptive moment selection with Andrews-Soares kappa_n = sqrt(log(log(n))). Without covariates it reduces exactly to the variance-weighted Kitagawa test.
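To illustrate the moment-selection step in isolation, here is a minimal sketch of the general Andrews-Soares idea, not the package's internal code. Grid points whose studentised violation sits far below zero are treated as slack and dropped before critical values are computed; only points within kappa_n of binding are kept. The vector t_stat holds made-up values chosen purely for illustration.

```r
n <- 500
kappa_n <- sqrt(log(log(n)))  # tuning sequence quoted above

# Hypothetical studentised violations at five grid points
# (positive = inequality violated in sample, negative = slack).
t_stat <- c(-3.2, -0.8, 0.4, -1.5, -0.1)

# Keep only the points that are close to binding.
keep <- t_stat >= -kappa_n
which(keep)
#> [1] 2 3 5
```

Dropping clearly slack points keeps the critical value from being inflated by inequalities that are far from binding.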

Judge designs (Frandsen-Lefgren-Leslie)

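set.seed(1)
# The _j suffix keeps n, d, and y from the earlier blocks intact,
# so the later examples still see vectors that match z and x in length.
n_j <- 2000
judge <- sample.int(20, n_j, replace = TRUE)
d_j <- rbinom(n_j, 1, 0.3 + 0.02 * judge)
y_j <- rnorm(n_j, mean = d_j)

jfe <- iv_testjfe(y_j, d_j, judge, n_boot = 500)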

Designs where the instrument is a set of mutually exclusive dummies (judge, caseworker, examiner) need a purpose-built test. iv_testjfe() fits a weighted-LS regression of per-judge mu_j on per-judge p_j and tests the implied linearity via chi-squared with K - 2 degrees of freedom (default) or multiplier bootstrap (method = "bootstrap"). Multivalued treatment is supported by a direct generalisation of the Frandsen-Lefgren-Leslie statistic; the binary case is the published result.
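The linearity check described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's implementation: it treats the per-judge treatment rates p_j as known (ignoring their estimation error) and weights by the inverse sampling variance of each per-judge outcome mean. The data frame sim and all variable names are illustrative.

```r
set.seed(1)
sim <- data.frame(judge = sample.int(20, 2000, replace = TRUE))
sim$d <- rbinom(nrow(sim), 1, 0.3 + 0.02 * sim$judge)
sim$y <- rnorm(nrow(sim), mean = sim$d)

# Per-judge outcome mean, treatment rate, and variance of the outcome mean.
mu_j  <- as.numeric(tapply(sim$y, sim$judge, mean))
p_j   <- as.numeric(tapply(sim$d, sim$judge, mean))
n_per <- as.numeric(tapply(sim$y, sim$judge, length))
var_j <- as.numeric(tapply(sim$y, sim$judge, var)) / n_per

# Weighted least squares of mu_j on p_j.
fit <- lm(mu_j ~ p_j, weights = 1 / var_j)

# Weighted residual sum of squares, referred to chi-squared with K - 2 df.
K     <- length(mu_j)
stat  <- sum(resid(fit)^2 / var_j)
p_val <- pchisq(stat, df = K - 2, lower.tail = FALSE)
c(stat = stat, df = K - 2, p = p_val)
```

The two degrees of freedom subtracted from K correspond to the fitted intercept and slope; a large statistic indicates that the per-judge means do not lie on a single line in p_j.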

One-shot diagnostic on a fitted model

library(fixest)

df <- data.frame(z = z, d = d, y = y, x = x)
m  <- feols(y ~ x | d ~ z, data = df)

iv_check(m, n_boot = 500)

iv_check() detects which tests are applicable from the model structure (binary versus multivalued D, discrete versus judge-style Z, presence and dimensionality of exogenous controls) and runs only the applicable ones. iv_kitagawa() is the unconditional test, so it is skipped when the model carries any exogenous control; iv_mw() is the conditional test and runs with up to one covariate via the Chernozhukov-Lee-Rosen series-regression path (multivariate support is planned for v0.2.0). iv_check() works identically on ivreg::ivreg() objects.

Power planning

pw <- iv_power(y, d, z, method = "kitagawa", n_sims = 200)

iv_power() simulates data under a parametric exclusion violation and reports the rejection probability at a grid of deviation sizes. It is useful when choosing between candidate tests on the same design, or when planning a minimum sample size for a study.

Example: end-to-end with Card (1995)

The unconditional test rejects; the conditional one does not. That contrast is the right reading of Card’s design.

library(ivcheck)
library(fixest)

data(card1995)   # bundled

# Unconditional: Kitagawa and Mourifie-Wan both reject.
m_uncond <- feols(lwage ~ 1 | college ~ near_college, data = card1995)
iv_check(m_uncond, n_boot = 500)
#> IV validity diagnostic
#>   Kitagawa (2015):     stat = 5.25, p = 0.00, reject
#>   Mourifie-Wan (2017): stat = 5.25, p = 0.00, reject
#> Overall: at least one test rejects IV validity at 0.05.

# Conditional on age: the conditional Mourifie-Wan test does not reject.
m_cond <- feols(lwage ~ age | college ~ near_college, data = card1995)
iv_check(m_cond, n_boot = 200)
#> i Kitagawa test skipped: fitted model has exogenous controls and
#>   iv_kitagawa() is unconditional.
#> i The conditional Mourifie-Wan test is the right object here.
#>
#> IV validity diagnostic
#>   Mourifie-Wan (2017): stat = 79.5, p = 0.71, pass
#> Overall: cannot reject IV validity at 0.05.

Card’s identification strategy is “proximity-to-college is plausible only conditional on demographic controls”. The unconditional test catches this and refuses to validate the IV. Once a single demographic control (age) is included, the conditional Mourifie-Wan test reads the design as compatible with LATE-validity at the 5% level. Multivariate controls (Card’s canonical specification uses age plus race and region) are planned for v0.2.0 via a tensor-product series basis; in v0.1.2 the workaround is to reduce additional controls to a single propensity index.

This is a test of the binary college = (educ >= 16) discretisation, not Card’s original continuous-schooling IV. Inspect result$binding to see which outcome interval carries the violation when a rejection occurs.

Functions

Function	Purpose
`iv_kitagawa()`	Kitagawa (2015) variance-weighted KS test. Extends to multivalued D via Sun (2023).
`iv_mw()`	Mourifie-Wan (2017) conditional-inequality test. Full CLR intersection-bounds with adaptive moment selection under covariates.
`iv_testjfe()`	Frandsen-Lefgren-Leslie (2023) test for judge / group IV designs. Supports multivalued treatment.
`iv_check()`	Wrapper that auto-detects applicable tests and runs them on a fitted IV model.
`iv_power()`	Monte Carlo power curve for sample-size planning.

Limitations

Read before using in published work.

Scope

Continuous instruments. All three tests require a discrete Z. For continuous instruments, discretise into quantile bins (quartiles or quintiles) before passing to iv_kitagawa or iv_mw. A formal nonparametric continuous-Z extension is on the v0.2.0 roadmap.
Fuzzy regression discontinuity. FRD has its own testable implications at the cutoff (Arai, Hsu, Kitagawa, Mourifie, and Wan 2022). Handling them requires different infrastructure (running variable, bandwidth selection, bias correction) that does not fit the current fitted-IV-model spine; a dedicated iv_frd() function is planned for v0.2.0.
iv_mw with covariates under weights. The weights argument is fully implemented for iv_kitagawa, iv_testjfe, and the no-covariate path of iv_mw. The CLR series-regression path for iv_mw with covariates does not yet propagate weights; planned for v0.2.1.
Fixed-effects IV models. iv_kitagawa, iv_mw, and iv_testjfe abort with a clear error when dispatched on a fixest model with | FE |. The discrete-Z tests operate on the raw (Y, D, Z) joint distribution; within-FE residualisation destroys the discrete structure of Z. Workaround: pre-demean Y and D inside each FE cell and pass them as raw vectors to the default method. A proper stratified-by-FE variant is on the v0.2.0 roadmap.
Multivariate conditioning in iv_mw. The conditional path supports a single covariate. A tensor-product basis for multivariate x is planned for v0.2.0. Multivariate x aborts rather than silently dropping additional columns.
Sun (2023) unordered multivalued D. Supported via treatment_order = "unordered" plus a user-supplied monotonicity_set (a data frame with columns d, z_from, z_to encoding the direction of the monotonicity restriction per Sun’s Assumption 2.4(iii)). See ?iv_kitagawa for an example.
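The quantile-binning advice for continuous instruments can be carried out in base R before calling any of the tests. A minimal sketch, assuming a simulated instrument z_cont and quartile bins; both are illustrative choices:

```r
set.seed(1)
z_cont <- rnorm(500)  # hypothetical continuous instrument

# Quartile break points, then integer bin labels 1..4.
breaks <- quantile(z_cont, probs = seq(0, 1, by = 0.25))
z_bin  <- cut(z_cont, breaks = breaks, include.lowest = TRUE, labels = FALSE)

table(z_bin)
```

If the instrument has heavy ties, quantile() can return duplicated break points and cut() will then stop with an error; passing unique(breaks) is one way around that, at the cost of fewer bins.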

Interpretation

Non-rejection is not proof of validity. The tests have power against violations in the observable conditional distributions but are silent on violations that cancel out across subgroups.
Kitagawa vs Mourifie-Wan with covariates. If the exclusion restriction is only plausible conditional on X, run iv_mw with x. iv_kitagawa() is strictly the unconditional test: dispatched on a fitted model with exogenous controls, it errors with a pointer to iv_mw(), rather than silently dropping the controls (which earlier versions did and which produced misleading non-rejections; fixed in v0.1.2).
Many-instrument / judge regimes. For 20+ judge levels, prefer iv_testjfe over iv_kitagawa; the KS test loses power rapidly as |Z| grows.
Bootstrap size. n_boot = 1000 (default) is fine for publication-grade p-values. Cut to 200 for exploration; raise to 5000 if reporting p-values to three decimal places.
The se_floor trimming constant (Kitagawa’s ξ) has a material impact on finite-sample size. The default is 0.15, raised from the paper’s informally-recommended 0.05-0.1 range after Monte Carlo showed that smaller floors produce anti-conservative size under skewed Z-cell distributions with weak first stages. At 0.15 empirical size is at or below nominal 5% in all 24 Monte Carlo configurations tested. Users reproducing Kitagawa’s published examples can set se_floor = 0.1.
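The bootstrap-size guidance above can be checked with a rough calculation. Treating the bootstrap p-value as a binomial proportion over n_boot replications (an approximation, not something the package reports), its Monte Carlo standard error near p = 0.05 is:

```r
p_hat  <- 0.05
n_boot <- c(200, 1000, 5000)

# Binomial standard error of an estimated proportion.
mc_se <- sqrt(p_hat * (1 - p_hat) / n_boot)
round(mc_se, 4)
#> [1] 0.0154 0.0069 0.0031
```

At 200 replications the second decimal place of the p-value is already uncertain, whereas at 5000 the noise falls to roughly 0.003.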

Related packages

Package	Description
`predictset`	Conformal prediction intervals (uncertainty around treatment effects)
`nowcast`	Economic nowcasting
`mpshock`	Monetary policy shock series (commonly used as instruments)
`inequality`	Inequality measurement (distributional treatment effects)
`fixest`	Fast IV estimation via `feols(y ~ x | d ~ z)` (upstream from `ivcheck`)
`ivreg`	2SLS with Wu-Hausman, Sargan, weak-IV F (upstream from `ivcheck`)
`ivmodel`	k-class estimators, weak-IV robust CIs, sensitivity analysis
`ivDiag`	Effective F, Anderson-Rubin, valid-t, local-to-zero tests

ivcheck complements rather than competes with these. fixest or ivreg does the estimation, ivDiag does weak-IV post-estimation diagnostics, and ivcheck does LATE-assumption falsification.

Issues and requests

Report bugs or request additional tests at GitHub Issues. Pull requests implementing additional IV-validity tests from the literature are welcome; please include a reference to the original paper and a reproduction test against its empirical example.

References

Cite both the package and the paper for the test you use; citation("ivcheck") prints the full set.

iv_kitagawa(): Kitagawa (2015), Econometrica 83(5): 2043-2063 (multivalued D via Sun 2023, Journal of Econometrics 237(2): 105523).
iv_mw(): Mourifié and Wan (2017), Review of Economics and Statistics 99(2): 305-313.
iv_testjfe(): Frandsen, Lefgren and Leslie (2023), American Economic Review 113(1): 253-277.

Keywords

instrumental variables, LATE, causal inference, exclusion restriction, monotonicity, specification testing, falsification, judge IV, Kitagawa test, Mourifie-Wan test, FLL test, econometrics.
