Optimal testing for calibration of predictive models

Speaker: 
Edgar Dobriban, University of Pennsylvania
Event time: 
Wednesday, March 2, 2022 - 12:00pm
Location: 
https://yale.zoom.us/j/97458245891
Event description: 

Abstract: The prediction accuracy of machine learning methods is steadily increasing, but the calibration of their uncertainty predictions poses a significant challenge.
Numerous works focus on obtaining well-calibrated predictive models, but less is known about reliably assessing model calibration. This limits our ability to know when algorithms for improving calibration have a real effect, and when their improvements are merely artifacts due to random noise in finite datasets. In this work, we consider the problem of detecting mis-calibration of predictive models using a finite validation dataset. Due to the randomness in the data, plug-in measures of calibration need to be compared against a proper background distribution to reliably assess calibration. Thus, detecting mis-calibration in a classification setting can be formulated as a statistical hypothesis testing problem. The null hypothesis is that the model is perfectly calibrated, while the alternative hypothesis is that the deviation from calibration is sufficiently large. We find that detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. When the conditional class probabilities are Hölder continuous, we propose a minimax optimal test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE). We further propose a version that is adaptive to unknown smoothness. We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. Our algorithm is a general-purpose tool which, combined with classical tests for calibration of discrete-valued predictors, can be used to test the calibration of virtually any classification method.
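To make the hypothesis-testing formulation above concrete, here is a minimal Python sketch for the binary case. It is an illustration under stated assumptions, not the speaker's procedure: the function names `debiased_l2_ece` and `calibration_test`, the equal-width binning, the normalization of the statistic, and the Monte Carlo calibration of the null distribution are all choices made for this example. The sketch relies on one fact: under the null of perfect calibration, each label is Bernoulli with success probability equal to the prediction, so the background distribution of the statistic can be simulated directly. Debiasing here means dropping the diagonal terms of the squared within-bin residual sum, so that the statistic has mean zero under the null.

```python
import numpy as np

def debiased_l2_ece(probs, labels, n_bins=10):
    """Binned, debiased plug-in estimate of the squared l2 calibration error.

    Assumption for this sketch: binary labels, predicted probabilities in
    [0, 1], and equal-width bins. Within each bin we sum r_i * r_j over
    pairs i != j, where r_i = y_i - p_i, which removes the diagonal terms
    that bias the naive plug-in estimate upward.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    resid = labels - probs
    # Assign each prediction to a bin; p = 1.0 goes into the last bin.
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    s1 = np.bincount(bins, weights=resid, minlength=n_bins)
    s2 = np.bincount(bins, weights=resid ** 2, minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    ok = counts >= 2  # a bin needs at least two points to form a pair
    cross = s1[ok] ** 2 - s2[ok]  # equals the sum over i != j of r_i * r_j
    return float(np.sum(cross / counts[ok]) / len(probs))

def calibration_test(probs, labels, n_bins=10, n_sim=1000, seed=0):
    """Monte Carlo test of H0: the model is perfectly calibrated.

    Under H0, y_i ~ Bernoulli(p_i), so the null distribution of the
    statistic is simulated by redrawing labels from the predictions.
    Returns the observed statistic and a one-sided p-value.
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    observed = debiased_l2_ece(probs, labels, n_bins)
    null_stats = np.empty(n_sim)
    for k in range(n_sim):
        sim_labels = rng.random(probs.shape[0]) < probs
        null_stats[k] = debiased_l2_ece(probs, sim_labels, n_bins)
    p_value = (1 + np.sum(null_stats >= observed)) / (n_sim + 1)
    return observed, float(p_value)
```

A small p-value is evidence against calibration. Because of the debiasing, the statistic can be negative on well-calibrated data, which is expected for an estimator centered at zero under the null. The fixed number of bins is the simplest possible choice in this sketch; the abstract indicates that the proposed test ties its construction to the Hölder smoothness of the conditional class probabilities and includes a version adaptive to unknown smoothness, neither of which is attempted here.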

Event Type: 
Applied Mathematics