#statme - Mastodon

FunctionalProgramming @FunctionalProgramming@activitypub.awakari.com

#stat.ME

Origin | Interest | Match

arXiv.org

EFECT: A Method to Quantify the Reproducibility of Stochastic Simulations

Reproducibility is a fundamental requirement for validating scientific claims in computational research. Stochastic computational models are widely used in fields such as systems biology, financial modeling and environmental sciences. However, achieving reproducibility in stochastic simulations remains challenging, as each run can produce different outcomes. Existing infrastructure and software tools do not address independent reproduction of simulation results. Without independent reproducibility, results and conclusions lack credibility, as it remains unclear whether observed findings reflect model behavior or are artifacts of stochastic variation or an underpowered study. To bridge this gap, we introduce the Empirical Characteristic Function Equality Convergence Test (EFECT), a data-driven method to quantify the reproducibility of stochastic simulation results. EFECT employs empirical characteristic functions to compare reported results with those independently generated by assessing distributional inequality, termed EFECT error. Additionally, we establish the EFECT convergence point, a quantitative metric for determining the required number of simulation runs to achieve an EFECT error value of a priori significance. EFECT is applicable to all bounded, real-valued outputs, regardless of the model type or simulation method that produced them. We tested EFECT with over 40 use cases to demonstrate its broad applicability and effectiveness. EFECT standardizes stochastic simulation reproducibility, establishing a workflow that guarantees reliable results, supporting a wide range of stakeholders, and thereby enhancing validation of stochastic simulation studies, across a model's lifecycle. To promote standardization, we are developing the open-source software library libSSR in multiple programming languages for easy integration of EFECT.
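
To make the idea of comparing two sets of simulation outputs through empirical characteristic functions more concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the EFECT error from the paper and not the libSSR API: the function names `ecf` and `ecf_distance`, the frequency grid, and the use of a maximum absolute gap as the distance are all hypothetical choices made for this example.

```python
import numpy as np

def ecf(samples, t):
    """Empirical characteristic function of 1-D samples at frequencies t."""
    samples = np.asarray(samples, dtype=float)
    t = np.asarray(t, dtype=float)
    # Average exp(i * t * x) over the sample, giving one complex value per frequency.
    return np.exp(1j * np.outer(t, samples)).mean(axis=1)

def ecf_distance(a, b, t):
    """Largest absolute gap between two ECFs on the grid t (illustrative metric only)."""
    return float(np.max(np.abs(ecf(a, t) - ecf(b, t))))

rng = np.random.default_rng(0)
t_grid = np.linspace(-5.0, 5.0, 101)            # hypothetical frequency grid

reported = rng.uniform(0.0, 1.0, size=2000)     # stand-in for published runs
reproduced = rng.uniform(0.0, 1.0, size=2000)   # independent runs, same distribution
shifted = rng.uniform(0.5, 1.5, size=2000)      # runs from a different distribution

same = ecf_distance(reported, reproduced, t_grid)
diff = ecf_distance(reported, shifted, t_grid)
print(f"same distribution: {same:.3f}, shifted distribution: {diff:.3f}")
```

With these inputs the gap for the two matching samples should be much smaller than the gap against the shifted sample, which is the qualitative behavior a reproducibility metric of this kind relies on.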

FunctionalProgramming @FunctionalProgramming@activitypub.awakari.com

#stat.ME #econ.EM

Origin | Interest | Match

arXiv.org

Partial identification via conditional linear programs: estimation and policy learning

Many important quantities of interest are only partially identified from observable data: the data can limit them to a set of plausible values, but not uniquely determine them. This paper develops a unified framework for covariate-assisted estimation, inference, and decision making in partial identification problems where the parameter of interest satisfies a series of linear constraints, conditional on covariates. In such settings, bounds on the parameter can be written as expectations of solutions to conditional linear programs that optimize a linear function subject to linear constraints, where both the objective function and the constraints may depend on covariates and need to be estimated from data. Examples include estimands involving the joint distributions of potential outcomes, policy learning with inequality-aware value functions, and instrumental variable settings. We propose two de-biased estimators for bounds defined by conditional linear programs. The first directly solves the conditional linear programs with plugin estimates and uses output from standard LP solvers to de-bias the plugin estimate, avoiding the need for computationally demanding vertex enumeration of all possible solutions for symbolic bounds. The second uses entropic regularization to create smooth approximations to the conditional linear programs, trading a small amount of approximation error for improved estimation and computational efficiency. We establish conditions for asymptotic normality of both estimators, show that both estimators are robust to first-order errors in estimating the conditional constraints and objectives, and construct Wald-type confidence intervals for the partially identified parameters. These results also extend to policy learning problems where the value of a decision policy is only partially identified. We apply our methods to a study on the effects of Medicaid enrollment.

#statme #econem

2rZiKKbOU3nTafniR2qMMSE0gwZ @2rZiKKbOU3nTafniR2qMMSE0gwZ@activitypub.awakari.com

#stat.ME #math.OC #math.ST #stat.CO #stat.TH

Origin | Interest | Match

arXiv.org

Decision Theory For Large Scale Outlier Detection Using Aleatoric Uncertainty: With a Note on Bayesian FDR

Aleatoric and epistemic uncertainty have received recent attention in the literature as different sources from which uncertainty can emerge in stochastic modeling. Epistemic uncertainty refers to intrinsic or model-based notions of uncertainty, and aleatoric uncertainty to the uncertainty inherent in the data. We propose a novel decision-theoretic framework for outlier detection in the context of aleatoric uncertainty, within Bayesian modeling. The model incorporates Bayesian false discovery rate control for multiplicity adjustment, and a new generalization of Bayesian FDR is introduced. The model is applied to simulations based on temporally fluctuating outlier detection, where fixing thresholds often results in poor performance due to nonstationarity, and a case study is outlined on a novel cybersecurity detection. Cyberthreat signals are highly nonstationary, giving a credible stress test of the model.

#statme #mathoc #mathst

LLMs @LLMs@activitypub.awakari.com

#cs.LG #cs.AI #cs.MA #stat.AP #stat.ME

Origin | Interest | Match

arXiv.org

LLM-based Agents for Automated Confounder Discovery and Subgroup Analysis in Causal Inference

Estimating individualized treatment effects from observational data presents a persistent challenge due to unmeasured confounding and structural bias. Causal Machine Learning (causal ML) methods, such as causal trees and doubly robust estimators, provide tools for estimating conditional average treatment effects. These methods have limited effectiveness in complex real-world environments due to the presence of latent confounders or those described in unstructured formats. Moreover, reliance on domain experts for confounder identification and rule interpretation introduces high annotation cost and scalability concerns. In this work, we propose Large Language Model-based agents for automated confounder discovery and subgroup analysis that integrate agents into the causal ML pipeline to simulate domain expertise. Our framework systematically performs subgroup identification and confounding structure discovery by leveraging the reasoning capabilities of LLM-based agents, which reduces human dependency while preserving interpretability. Experiments on real-world medical datasets show that our proposed approach enhances treatment effect estimation robustness by narrowing confidence intervals and uncovering unrecognized confounding biases. Our findings suggest that LLM-based agents offer a promising path toward scalable, trustworthy, and semantically aware causal inference.

#cslg #csai #csma

2rZiKKbOU3nTafniR2qMMSE0gwZ @2rZiKKbOU3nTafniR2qMMSE0gwZ@activitypub.awakari.com

#stat.ME #math.ST #stat.TH

Origin | Interest | Match

arXiv.org

A test statistic, $h^*$, for outlier analysis

Outlier analysis is a critical tool across diverse domains, from clinical decision-making to cybersecurity and talent identification. Traditional statistical outlier detection methods, such as Grubbs' test and Dixon's Q, are predicated on the assumption of normality and often fail to reckon the meaningfulness of exceptional values within non-normal datasets. In this paper, we introduce the h* statistic, a novel parametric, frequentist approach for evaluating global outliers without the normality assumption. Unlike conventional techniques that primarily remove outliers to preserve statistical 'integrity', h* assesses the distinctiveness as phenomena worthy of investigation by quantifying a data point's extremity relative to its group as a measure of statistical significance analogous to the role of Student's t in comparing means. We detail the mathematical formulation of h* with tabulated confidence intervals of significance levels and extensions to Bayesian inference and paired analysis. The capacity of h* to discern between stable extraordinary deviations and values that merely appear extreme under conventional criteria is demonstrated using empirical data from a mood intervention study. A generalisation of h* is subsequently proposed, with individual weights assigned to differences for nuanced contextual description, and a variable sensitivity exponent for objective inference optimisation and subjective inference specification. The physical significance of an h*-recognised outlier is linked to the signature of unique occurrences. Our findings suggest that h* offers a robust alternative for outlier evaluation, enriching the analytical repertoire for researchers and practitioners by foregrounding the interpretative value of outliers within complex, real-world datasets. This paper is also a statement against the dominance of normality in celebration of the luminary and the lunatic alike.

#statme #mathst #statth

LLMs @LLMs@activitypub.awakari.com

#cs.LG #cs.AI #cs.CL #stat.ME

Origin | Interest | Match

arXiv.org

How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy will require trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only is this helpful to keep us from disclosing the ground truth, but this approach also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.
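
The accuracy ceiling described in this abstract can be illustrated with a small Python sketch. It assumes each question has a known probability distribution over its logically correct answers, from which the published answer was drawn; the helper names `bayes_accuracy` and `exceeds_ceiling`, and the `margin` parameter standing in for a proper statistical tolerance, are hypothetical and not taken from the paper.

```python
def bayes_accuracy(answer_distributions):
    """Best achievable expected accuracy when one correct answer is published at random.

    Each entry lists the probabilities with which each logically correct answer
    to a question is chosen as the published solution. The best any model can do
    is always guess the most likely one.
    """
    return sum(max(p) for p in answer_distributions) / len(answer_distributions)

def exceeds_ceiling(observed_accuracy, answer_distributions, margin=0.0):
    """Flag a score above the ceiling; `margin` is a hypothetical tolerance."""
    return observed_accuracy > bayes_accuracy(answer_distributions) + margin

# Three questions with 2, 2, and 4 equally likely correct answers.
benchmark = [[0.5, 0.5], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
ceiling = bayes_accuracy(benchmark)          # (0.5 + 0.5 + 0.25) / 3
print(f"Bayes accuracy ceiling: {ceiling:.4f}")
print(exceeds_ceiling(0.90, benchmark, margin=0.05))
```

In practice a score would be compared against the ceiling with a statistical test that accounts for the number of questions; the fixed margin here only keeps the sketch short.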

#cslg #csai #cscl

LLMs @LLMs@activitypub.awakari.com

#cs.CL #cs.AI #math.ST #stat.ME #stat.TH

Origin | Interest | Match

arXiv.org

Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.

#cscl #csai #mathst

DarkMatter @DarkMatter@activitypub.awakari.com

#hep-ph #hep-ex #physics.data-an #stat.AP #stat.ME

Origin | Interest | Match

arXiv.org

On Focusing Statistical Power for Searches and Measurements in Particle Physics

Particle physics experiments rely on the (generalised) likelihood ratio test (LRT) for searches and measurements, which consist of composite hypothesis tests. However, this test is not guaranteed to be optimal, as the Neyman-Pearson lemma pertains only to simple hypothesis tests. Any choice of test statistic thus implicitly determines how statistical power varies across the parameter space. An improvement in the core statistical testing methodology for general settings with composite tests would have widespread ramifications across experiments. We discuss an alternate test statistic that provides the data analyzer an ability to focus the power of the test on physics-motivated regions of the parameter space. We demonstrate the improvement from this technique compared to the LRT on a Higgs $\rightarrow \tau\tau$ dataset simulated by the ATLAS experiment and a dark matter dataset inspired by the LZ experiment. We also employ machine learning to efficiently perform the Neyman construction, which is essential to ensure statistically valid confidence intervals.

#hepph #hepex #physicsdataan

LLMs @LLMs@activitypub.awakari.com

#cs.LG #cs.AI #cs.SI #stat.ME

Origin | Interest | Match

arXiv.org

LLM Web Dynamics: Tracing Model Collapse in a Network of LLMs

The increasing use of synthetic data from the public Internet has enhanced data usage efficiency in large language model (LLM) training. However, the potential threat of model collapse remains insufficiently explored. Existing studies primarily examine model collapse in a single model setting or rely solely on statistical surrogates. In this work, we introduce LLM Web Dynamics (LWD), an efficient framework for investigating model collapse at the network level. By simulating the Internet with a retrieval-augmented generation (RAG) database, we analyze the convergence pattern of model outputs. Furthermore, we provide theoretical guarantees for this convergence by drawing an analogy to interacting Gaussian Mixture Models.

#cslg #csai #cssi

FunctionalProgramming @FunctionalProgramming@activitypub.awakari.com

#stat.ME #stat.AP

Origin | Interest | Match

arXiv.org

Density Prediction of Income Distribution Based on Mixed Frequency Data

Modeling large dependent datasets in modern time series analysis is a crucial research area. One effective approach to handle such datasets is to transform the observations into density functions and apply statistical methods for further analysis. Income distribution forecasting, a common application scenario, benefits from predicting density functions as it accounts for uncertainty around point estimates, leading to more informed policy formulation. However, predictive modeling becomes challenging when dealing with mixed-frequency data. To address this challenge, this paper introduces a mixed data sampling regression model for probability density functions (PDF-MIDAS). To mitigate variance inflation caused by high-frequency prediction variables, we utilize exponential Almon polynomials with fewer parameters to regularize the coefficient structure. Additionally, we propose an iterative estimation method based on quadratic programming and the BFGS algorithm. Simulation analyses demonstrate that as the sample size for estimating density functions and observation length increase, the estimator approaches the true value. Real data analysis reveals that compared to single-sequence prediction models, PDF-MIDAS incorporating high-frequency exogenous variables offers a wider range of application scenarios with superior fitting and prediction performance.
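
The exponential Almon polynomial mentioned in this abstract is commonly written in the MIDAS literature as a two-parameter lag weighting, w_k = exp(θ1·k + θ2·k²) / Σ_j exp(θ1·j + θ2·j²). The Python sketch below computes these weights under that standard form; whether PDF-MIDAS uses exactly this parameterization is an assumption, and the function name `exp_almon_weights` is hypothetical.

```python
import numpy as np

def exp_almon_weights(theta1, theta2, n_lags):
    """Normalized exponential Almon lag weights for lags k = 1, ..., n_lags."""
    k = np.arange(1, n_lags + 1, dtype=float)
    z = theta1 * k + theta2 * k**2
    z -= z.max()              # shift for numerical stability; cancels after normalizing
    w = np.exp(z)
    return w / w.sum()

# Two parameters shape all twelve high-frequency lag coefficients.
weights = exp_almon_weights(0.0, -0.05, 12)
print(np.round(weights, 4))
```

Because only two parameters control the whole lag profile, the number of coefficients to estimate does not grow with the number of high-frequency lags, which is the regularizing effect the abstract refers to.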

#statme #statap

FunctionalProgramming @FunctionalProgramming@activitypub.awakari.com

#stat.ME

Origin | Interest | Match

arXiv.org

Statistical inference of heterogeneous treatment effects using semiparametric single-index model

In recent years, with the rapid development of science and technology, heterogeneous treatment effects (HTE) have emerged as a focal research topic in statistics, econometrics, and sociology. This paper investigates HTE through semiparametric single-index models based on doubly robust estimation. Departing from conventional approaches, we neither impose boundedness constraints on the link function in single-index models nor restrict its support range. By employing the sieve method to approximate the link function, we achieve simultaneous estimation of both the link function and index parameters. Our study not only establishes the asymptotic properties of the proposed estimator but also systematically evaluates its finite-sample performance through comprehensive simulation studies. Numerical results demonstrate that our method significantly outperforms other commonly used competing estimators. Furthermore, we apply the proposed approach to the National Health and Nutrition Examination Survey dataset to assess the impact of participation in school lunch programs on body mass index.

LLMs @LLMs@activitypub.awakari.com

#cs.CL #stat.ME

Origin | Interest | Match

arXiv.org

Transforming Sensitive Documents into Quantitative Data: An AI-Based Preprocessing Toolchain for Structured and Privacy-Conscious Analysis

Unstructured text from legal, medical, and administrative sources offers a rich but underutilized resource for research in public health and the social sciences. However, large-scale analysis is hampered by two key challenges: the presence of sensitive, personally identifiable information, and significant heterogeneity in structure and language. We present a modular toolchain that prepares such text data for embedding-based analysis, relying entirely on open-weight models that run on local hardware, requiring only a workstation-level GPU and supporting privacy-sensitive research. The toolchain employs large language model (LLM) prompting to standardize, summarize, and, when needed, translate texts to English for greater comparability. Anonymization is achieved via LLM-based redaction, supplemented with named entity recognition and rule-based methods to minimize the risk of disclosure. We demonstrate the toolchain on a corpus of 10,842 Swedish court decisions under the Care of Abusers Act (LVM), comprising over 56,000 pages. Each document is processed into an anonymized, standardized summary and transformed into a document-level embedding. Validation, including manual review, automated scanning, and predictive evaluation, shows the toolchain effectively removes identifying information while retaining semantic content. As an illustrative application, we train a predictive model using embedding vectors derived from a small set of manually labeled summaries, demonstrating the toolchain's capacity for semi-automated content analysis at scale. By enabling structured, privacy-conscious analysis of sensitive documents, our toolchain opens new possibilities for large-scale research in domains where textual data was previously inaccessible due to privacy and heterogeneity constraints.

FunctionalProgramming @FunctionalProgramming@activitypub.awakari.com

#stat.ME

Origin | Interest | Match

arXiv.org

The Multiplicative Instrumental Variable Model

The instrumental variable (IV) design is a common approach to address hidden confounding bias. For validity, an IV must impact the outcome only through its association with the treatment. In addition, IV identification has required a homogeneity condition such as monotonicity or no unmeasured common effect modifier between the additive effect of the treatment on the outcome, and that of the IV on the treatment. In this work, we introduce a novel identifying condition of no multiplicative interaction between the instrument and the unmeasured confounder in the treatment model, which we establish nonparametrically identifies the average treatment effect on the treated (ATT). For inference, we propose an estimator that is multiply robust and semiparametric efficient, while allowing for the use of machine learning to adaptively estimate required nuisance functions via cross-fitting. Finally, we illustrate the methods in extended simulations and an application on the causal impact of a job training program on subsequent earnings.
