
Calibration analysis of probabilistic models in Julia

David Widmann (@devmotion)

Uppsala University, Sweden

JuliaCon, July 2021


Probabilistic predictive models

A probabilistic predictive model predicts a probability distribution over a set of targets for a given feature.

Such a model can express the uncertainty in the prediction, which might be inherent to the prediction task or caused by insufficient knowledge of the underlying relation between feature and target.
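As a minimal illustration (not part of the notebook's own code), a probabilistic prediction for a three-class problem can be represented with Distributions.jl; the probabilities below are taken from the example prediction shown later in the notebook:

```julia
using Distributions

# a probabilistic prediction: a full distribution over the three penguin species
# (Adelie, Chinstrap, Gentoo) instead of a single predicted label
prediction = Categorical([0.216, 0.784, 0.0])

probs(prediction)  # predicted class probabilities
mode(prediction)   # most likely class (here: 2, i.e., Chinstrap)
```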


Example: Prediction of penguin species

Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081

penguins

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | String | String | Float64 | Float64 | Int64 | Int64 | String |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 6 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 7 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 8 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| 9 | Adelie | Torgersen | 38.6 | 21.2 | 191 | 3800 | male |
| 10 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| ⋮ | | | | | | | |
| 333 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female |

We split the Palmer penguin dataset randomly into a training (70%) and validation (30%) dataset.


We train a Gaussian naive Bayes classifier that predicts the probability of each penguin species from the bill length and the flipper length.

We denote the features by X and the target by Y.
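A minimal sketch of this step (the notebook's own cells are not shown; the `UnivariateFinite` predictions below suggest an MLJ-based workflow, but the model name `GaussianNBClassifier` from NaiveBayes.jl and the variable names `penguins_train`/`penguins_validation` are assumptions):

```julia
using MLJ, DataFrames

# Gaussian naive Bayes classifier from NaiveBayes.jl via MLJ (assumed model name)
GaussianNB = @load GaussianNBClassifier pkg=NaiveBayes

# features: bill length and flipper length (coerced to Continuous); target: species
features(df) = coerce(select(df, [:bill_length_mm, :flipper_length_mm]), Count => Continuous)
X = features(penguins_train)                       # assumed name of the training split
y = coerce(penguins_train.species, Multiclass)

# fit the model and predict distributions over the species on the validation data
mach = machine(GaussianNB(), X, y)
fit!(mach)
predictions = predict(mach, features(penguins_validation))
```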


The predictions are distributions over the penguin species.

We let $P_X$ denote the distribution predicted by a model $P$ for features $X$.

UnivariateFinite{ScientificTypesBase.Multiclass{3}}(Adelie=>0.216, Chinstrap=>0.784, Gentoo=>9.44e-39)

There exist many different probabilistic predictive models for this task.

Ideally, we would like that

$$P_X = \mathrm{law}(Y \mid X)$$

almost surely.

Of course, usually it is not possible to achieve this in practice.


Calibration

Motivation

Predictions should express the involved uncertainties "correctly" and not be arbitrary probability distributions.

In particular, predictions should be consistent: if forecasts predict an 80% probability of rain for an infinite sequence of days, then ideally on 80% of the days it rains.

A probabilistic predictive model $P_X$ of the conditional distributions $\mathrm{law}(Y \mid X)$ is calibrated if

$$P_X = \mathrm{law}(Y \mid P_X)$$

almost surely.

It is not relevant how the model was obtained. In particular, it does not matter whether one uses a maximum likelihood approach or performs Bayesian inference.


Often weaker notions of calibration are investigated, corresponding to less informative models.

E.g., "confidence calibration" considers only the most confident prediction: this corresponds to a binary classification model that predicts whether the most confident class is correct.
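A plain-Julia illustration of this reduction with made-up predictions:

```julia
# multi-class predictions (probability vectors) and observed classes (dummy data)
predictions = [[0.8, 0.15, 0.05], [0.3, 0.45, 0.25], [0.2, 0.1, 0.7]]
targets = [1, 3, 3]

# confidence calibration keeps only the most confident class:
# the "prediction" is the maximal probability and the "outcome" is
# whether the most confident class was the correct one
confidences = map(maximum, predictions)
outcomes = map((p, y) -> argmax(p) == y, predictions, targets)
```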


Other target spaces can be considered as well.


Calibrated models

Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Q.J.R. Meteorol. Soc., 135: 1512-1519.

Any model of the form

$$P_X := \mathrm{law}(Y \mid \phi(X)) \quad \text{almost surely},$$

where $\phi$ is some measurable function, is calibrated.

The function $\phi$ controls the amount of information about $X$ that is retained or lost:

  • The identity function yields the ideal model $P_X := \mathrm{law}(Y \mid X)$

  • Every constant function yields the baseline model $P_X := \mathrm{law}(Y)$ ("climatology"); see the sketch below
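A tiny plain-Julia sketch (dummy data, made-up names) contrasting the two extremes; both are of the form above and hence calibrated, but they differ vastly in how informative they are:

```julia
using Statistics

# dummy data: feature X ~ U(0, 1) and binary target with law(Y = 1 | X) = X
n = 100_000
X = rand(n)
Y = rand(n) .< X

# ideal model (ϕ = identity): predicts law(Y | X), i.e., probability X for class 1
ideal_predictions = X

# climatology (ϕ constant): always predicts the marginal probability law(Y = 1)
climatology_prediction = mean(Y)

# both are calibrated: among samples that receive prediction ≈ p,
# the fraction of positive targets is ≈ p; e.g. for the ideal model
sel = 0.7 .<= ideal_predictions .<= 0.8
mean(Y[sel])  # ≈ 0.75
```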


Reliability diagrams

Murphy, A., & Winkler, R. (1977). Reliability of Subjective Probability Forecasts of Precipitation and Temperature. Journal of the Royal Statistical Society. Series C (Applied Statistics), 26(1), 41-47.

Bröcker, J., & Smith, L. A. (2007). Increasing the reliability of reliability diagrams. Weather and forecasting, 22(3), 651-661.

Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. & Schön, T. B. (2019). Evaluating model calibration in classification. Proceedings of Machine Learning Research, in Proceedings of Machine Learning Research 89:3459-3467 (AISTATS 2019).

Reliability diagrams are used to visually inspect calibration of binary classification models. They show binned averages of empirical frequencies versus confidence.
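The quantities shown in a reliability diagram can be computed by hand; the following plain-Julia sketch (dummy data, uniform bins assumed non-empty) illustrates what the package below visualizes:

```julia
using Statistics

# dummy binary classification problem: predicted probabilities and outcomes
probabilities = rand(1_000)
outcomes = rand(1_000) .< probabilities  # perfectly calibrated by construction

# group the predictions into 10 bins of uniform size
nbins = 10
binned_probs = [Float64[] for _ in 1:nbins]
binned_outcomes = [Bool[] for _ in 1:nbins]
for (p, o) in zip(probabilities, outcomes)
    i = clamp(ceil(Int, p * nbins), 1, nbins)
    push!(binned_probs[i], p)
    push!(binned_outcomes[i], o)
end

# a reliability diagram plots the empirical frequency against the mean confidence per bin
confidence = map(mean, binned_probs)
frequency = map(mean, binned_outcomes)
```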


Visualizations with ReliabilityDiagrams.jl

  • Supports Plots.jl and Makie.jl

  • Allows plotting the deviation, i.e., empirical frequency minus confidence.

  • Includes consistency bars

[Interactive reliability diagram of the penguin classifier; controls: number of bins (10), deviation, consistency bars.]

Expected calibration error

The expected calibration error (ECE)

$$\mathrm{ECE}_d := \mathbb{E}_{P_X}\, d\big(P_X, \mathrm{law}(Y \mid P_X)\big)$$

measures the expected deviation between the predictions $P_X$ and the empirical frequencies $\mathrm{law}(Y \mid P_X)$ with respect to a distance measure $d$.

  • For (multi-class) classification models, common choices for $d$ are (semi-)metrics on the probability simplex such as the Euclidean or squared Euclidean distance

  • For general probabilistic models, statistical divergences can be chosen for $d$, e.g.,

    • f-divergences such as the Kullback-Leibler divergence,

    • Wasserstein distance,

    • maximum mean discrepancy (MMD).


⚠️ Challenges

For general models, the distribution $\mathrm{law}(Y \mid P_X)$ can be arbitrarily complex.

The empirical frequencies $\mathrm{law}(Y \mid P_X)$ are difficult to estimate.

Common histogram-binning approaches lead to biased and inconsistent estimators.


Estimation with CalibrationErrors.jl

Supports

  • different distance measures $d$ (default: total variation distance),

  • different binning algorithms (bins of uniform size and bins that minimize variance within bins)
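A minimal sketch of estimating the ECE with dummy data (the notebook's own cells are not shown; `ECE`, `UniformBinning`, and `TotalVariation` are the constructor names I assume CalibrationErrors.jl and Distances.jl expose, so double-check against the package documentation):

```julia
using CalibrationErrors, Distances

# dummy multi-class predictions (probability vectors) and observed classes
predictions = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4], [0.1, 0.7, 0.2]]
targets = [1, 2, 3, 2]

# ECE with 10 bins of uniform size and the total variation distance (assumed names)
ece = ECE(UniformBinning(10), TotalVariation())
ece(predictions, targets)
```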


[Interactive controls: distance measure and number of bins (10).]

ECE estimate: 0.08596703393630331

Why scoring rules are not sufficient

Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Q.J.R. Meteorol. Soc., 135: 1512-1519.

Probabilistic predictive models can be evaluated by the expected score

$$\mathbb{E}_{P_X, Y}\, s(P_X, Y),$$

where the scoring rule $s(P, \omega)$ is the reward of prediction $P$ if the true outcome is $\omega$.

Examples: Brier score, logarithmic score

Brier score: -0.07008093228992143

Logarithmic score: -0.1530897228467764
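For a single categorical prediction and observed class, the two example scores can be computed directly; a plain-Julia sketch with made-up values (note that sign conventions differ across references, here both are written as rewards, i.e., larger is better):

```julia
# prediction: probability vector over the classes; y: observed class (dummy values)
p = [0.216, 0.784, 0.0]
y = 2

# logarithmic score (reward): log-probability assigned to the observed class
logarithmic_score = log(p[y])

# Brier score (reward): negative squared Euclidean distance to the one-hot outcome
onehot = [c == y for c in eachindex(p)]
brier_score = -sum(abs2, p .- onehot)
```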

Proper scoring rules can be decomposed as

$$\mathbb{E}_{P_X, Y}\, s(P_X, Y) = \underbrace{\mathbb{E}_{P_X}\, d\big(\mathrm{law}(Y), \mathrm{law}(Y \mid P_X)\big)}_{\text{resolution}} - \underbrace{\mathbb{E}_{P_X}\, d\big(P_X, \mathrm{law}(Y \mid P_X)\big)}_{\text{ECE}} + \underbrace{S\big(\mathrm{law}(Y), \mathrm{law}(Y)\big)}_{\text{uncertainty of } Y},$$

where $S(P, Q) := \int_\Omega s(P, \omega)\, Q(\mathrm{d}\omega)$ is the expected score of $P$ under $Q$ and $d(P, Q) := S(Q, Q) - S(P, Q)$ is the score divergence.

  • resolution: minimized for uninformative models with $\mathrm{law}(Y \mid P_X) = \mathrm{law}(Y)$, such as constant models

  • ECE: minimized for calibrated models

  • uncertainty of $Y$: does not depend on the model

Info

Models can trade off calibration for resolution!


Alternatives to ECE

Widmann, D., Lindsten, F., & Zachariah, D. (2021). Calibration tests beyond classification. ICLR 2021.

A probabilistic predictive model is calibrated if

$$(P_X, Y) \stackrel{d}{=} (P_X, Z_X),$$

where $Z_X \mid P_X \sim P_X$ and $\stackrel{d}{=}$ denotes equality in distribution.

  • No explicit conditional distributions $\mathrm{law}(Y \mid P_X)$ involved

  • Suggests the discrepancy between $\mathrm{law}(P_X, Y)$ and $\mathrm{law}(P_X, Z_X)$ as a calibration measure


Kernel calibration error (KCE)

Widmann, D., Lindsten, F., & Zachariah, D. (2019). Calibration tests in multi-class classification: A unifying framework. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (pp. 12257–12267).

Widmann, D., Lindsten, F., & Zachariah, D. (2021). Calibration tests beyond classification. ICLR 2021.

Integral probability metrics can be used to define a general class of calibration errors with minimal assumptions about the involved distributions. The kernel calibration error (KCE) is an example of this class:

The kernel calibration error (KCE) with respect to a real-valued kernel $k$ is defined as

$$\mathrm{KCE}_k := \mathrm{MMD}_k\big(\mathrm{law}(P_X, Y), \mathrm{law}(P_X, Z_X)\big),$$

where $\mathrm{MMD}_k$ is the maximum mean discrepancy with respect to $k$.

  • Applies to all probabilistic predictive models

  • Existing (un)biased and consistent estimators of the MMD can be used, without the challenging estimation of $\mathrm{law}(Y \mid P_X)$

  • The variance of the estimators can be reduced by marginalizing out $Z_X$


Estimation with CalibrationErrors.jl

Supports

  • biased and unbiased estimators

  • estimators with quadratic and subquadratic sample complexity

  • kernels from KernelFunctions.jl


Here we choose a tensor product kernel of the form

$$k\big((p, y), (\tilde{p}, \tilde{y})\big) := k_1(p, \tilde{p})\, \delta(y - \tilde{y}).$$

[Interactive controls: kernel $k_1$, length scale (1), and estimator.]

SKCE estimate: -0.00026486100537947895
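A minimal Julia sketch of such an estimate with dummy binary predictions; the constructor names mirror the pycalibration and rcalibration snippets later in this notebook (`UnbiasedSKCE` from CalibrationErrors.jl, kernels from KernelFunctions.jl):

```julia
using CalibrationErrors, KernelFunctions

# tensor product kernel: exponential kernel on the predictions,
# White (delta) kernel on the targets
kernel = ExponentialKernel() ⊗ WhiteKernel()

# unbiased estimator of the squared KCE
skce = UnbiasedSKCE(kernel)

# dummy binary predictions (probability of `true`) and observed outcomes
predictions = rand(100)
outcomes = rand(Bool, 100)
skce(predictions, outcomes)
```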

Info

One possible approach for selecting the length scale is to maximize the KCE on a held-out dataset (cf. approach for MMD proposed by Fukumizu et al. in Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions (2009)).


Estimation with CalibrationErrorsDistributions.jl

Closed-form expressions for predictions that are Gaussian and Laplace distributions from Distributions.jl


Example: Gaussian process

  1. We sample from a Gaussian process (GP) with zero mean and a squared exponential kernel at 40 random locations in the interval [0,10].

  2. We split the data randomly into a training (75%) and validation (25%) dataset and compute the GP posterior from the training data.

  3. We compute the predicted normal distributions on the validation data.

  4. We estimate the squared KCE with a tensor product kernel of the form

    $k\big((\mu, y), (\tilde{\mu}, \tilde{y})\big) := \exp\big(-W_2(\mu, \tilde{\mu})\big)\, \exp\big(-(y - \tilde{y})^2 / (2\ell^2)\big),$

    where $W_2(\mu, \tilde{\mu})$ is the 2-Wasserstein distance between the Gaussian distributions $\mu$ and $\tilde{\mu}$ and $\ell$ is a length scale (see the sketch below).
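A heavily hedged sketch of step 4 with dummy data: I assume here that CalibrationErrorsDistributions.jl provides a `Wasserstein` metric that can be combined with KernelFunctions.jl kernels and that `Normal` predictions from Distributions.jl can be passed directly to the estimator; please consult the package documentation for the exact constructor names:

```julia
using CalibrationErrorsDistributions, Distributions, KernelFunctions

# dummy predicted normal distributions and observed targets
predicted_normals = [Normal(randn(), 0.1 + rand()) for _ in 1:25]
targets = randn(25)

# tensor product kernel: exponential kernel w.r.t. the 2-Wasserstein distance between
# the predicted distributions, squared exponential kernel (length scale 1) on the targets
# (assumption: `Wasserstein` is usable as a KernelFunctions metric)
kernel = ExponentialKernel(; metric=Wasserstein()) ⊗ with_lengthscale(SqExponentialKernel(), 1.0)

skce = UnbiasedSKCE(kernel)
skce(predicted_normals, targets)
```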


[Interactive controls: length scale (1) and estimator.]

SKCE estimate: 6.988050107219258e-6

Calibration tests

Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. & Schön, T. B. (2019). Evaluating model calibration in classification. Proceedings of Machine Learning Research, in Proceedings of Machine Learning Research 89:3459-3467 (AISTATS 2019).

Widmann, D., Lindsten, F., & Zachariah, D. (2019). Calibration tests in multi-class classification: A unifying framework. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (pp. 12257–12267).

Widmann, D., Lindsten, F., & Zachariah, D. (2021). Calibration tests beyond classification. ICLR 2021.

⚠️ Problem

It is difficult to interpret an estimated non-zero calibration error.

  • Calibration errors have no meaningful unit or scale.

  • Different calibration errors rank models differently.

  • Estimators of calibration errors are random variables.


Perform a statistical test of the null hypothesis $H_0$: "the model is calibrated".

  • Hypothesis testing of calibration is a special two-sample problem

  • Applies to all probabilistic predictive models

  • Existing two-sample tests based on the MMD can be improved by marginalizing out $Z_X$


Calibration tests with CalibrationTests.jl

Supports

  • tests based on consistency resampling

  • tests based on distribution-free bounds and asymptotic properties of KCE

  • tests with quadratic and subquadratic sample complexity

  • interface of HypothesisTests.jl
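A minimal Julia sketch with dummy binary predictions, mirroring the pycalibration and rcalibration examples below (`AsymptoticSKCETest` and `pvalue` follow the HypothesisTests.jl interface):

```julia
using CalibrationTests, HypothesisTests, KernelFunctions

# dummy binary predictions (probability of `true`) and observed outcomes
predictions = rand(100)
outcomes = rand(Bool, 100)

# kernel on (prediction, outcome) pairs and asymptotic test of the
# null hypothesis "the model is calibrated"
kernel = ExponentialKernel() ⊗ WhiteKernel()
test = AsymptoticSKCETest(kernel, predictions, outcomes)
pvalue(test)
```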


[Interactive control: choice of calibration test.]
Asymptotic SKCE test
--------------------
Population details:
    parameter of interest:   SKCE
    value under h_0:         0.0
    point estimate:          -0.000264861

Test summary:
    outcome with 95% confidence: fail to reject h_0
    one-sided p-value:           0.6640

Details:
    test statistic: -0.0007061332966437105
p-value: 0.64

pycalibration

  • Python interface for CalibrationErrors, CalibrationErrorsDistributions, and CalibrationTests

  • Inspired by diffeqpy

  • Uses PyJulia interface


Usage

  • Load package and install Julia dependencies:

    >>> import pycalibration
    >>> pycalibration.install()
  • Define an estimator of the SKCE with kernel

    $k\big((\mu, y), (\hat{\mu}, \hat{y})\big) = \exp(-|\mu - \hat{\mu}|)\, \delta(y - \hat{y})$:

    >>> from pycalibration import calerrors as ce
    >>> skce = ce.UnbiasedSKCE(ce.tensor(ce.ExponentialKernel(), ce.WhiteKernel()))
  • Estimate the SKCE for some random predictions and outcomes:

    >>> import numpy as np
    >>> from pycalibration import calerrors as ce
    >>> rng = np.random.default_rng(1234)
    >>> predictions = rng.random(100)
    >>> outcomes = rng.choice([True, False], 100)
    >>> skce(predictions, outcomes)
    0.03320398246523166
  • Perform a calibration test with some random predictions and outcomes:

    >>> from pycalibration import caltests as ct
    >>> import numpy as np
    >>> rng = np.random.default_rng(1234)
    >>> predictions = rng.dirichlet((3, 2, 5), 100)
    >>> outcomes = rng.integers(low=1, high=4, size=100)
    >>> kernel = ct.tensor(ct.ExponentialKernel(metric=ct.TotalVariation()), ct.WhiteKernel())
    >>> test = ct.AsymptoticSKCETest(kernel, predictions, outcomes)
    >>> print(test)
    <PyCall.jlwrap Asymptotic SKCE test
    --------------------
    Population details:
        parameter of interest:   SKCE
        value under h_0:         0.0
        point estimate:          6.07887e-5
    
    Test summary:
        outcome with 95% confidence: fail to reject h_0
        one-sided p-value:           0.4330
    
    Details:
        test statistic: -4.955380469272125
    >>> ct.pvalue(test)
    0.435

More examples can be found in the documentation.


rcalibration

  • R interface for CalibrationErrors, CalibrationErrorsDistributions, and CalibrationTests

  • Inspired by diffeqr

  • Based on JuliaCall


Usage

  • Load package and install Julia dependencies:

    > library(rcalibration)
    > rcalibration::install()
  • Define an estimator of the SKCE with kernel

    $k\big((\mu, y), (\hat{\mu}, \hat{y})\big) = \exp(-|\mu - \hat{\mu}|)\, \delta(y - \hat{y})$:

    > ce <- calerrors()
    > skce <- ce$UnbiasedSKCE(ce$tensor(ce$ExponentialKernel(), ce$WhiteKernel()))
  • Estimate the SKCE for some random predictions and outcomes:

    > ce <- calerrors()
    > set.seed(1234)
    > predictions <- runif(100)
    > outcomes <- sample(c(TRUE, FALSE), 100, replace=TRUE)
    > skce$.(predictions, outcomes)
    [1] 0.01518318
  • Perform a calibration test with some random predictions and outcomes:

    > library(extraDistr)
    > ct <- caltests()
    > set.seed(1234)
    > predictions <- rdirichlet(100, c(3, 2, 5))
    > outcomes <- sample(1:3, 100, replace=TRUE)
    > kernel <- ct$tensor(ct$ExponentialKernel(metric=ct$TotalVariation()), ct$WhiteKernel())
    > test <- ct$AsymptoticSKCETest(kernel, ce$RowVecs(predictions), outcomes)
    > print(test)
    Julia Object of type AsymptoticSKCETest{KernelTensorProduct{Tuple{ExponentialKernel{TotalVariation}, WhiteKernel}}, Float64, Float64, Matrix{Float64}}.
    Asymptotic SKCE test
    --------------------
    Population details:
        parameter of interest:   SKCE
        value under h_0:         0.0
        point estimate:          0.0259434
    
    Test summary:
        outcome with 95% confidence: reject h_0
        one-sided p-value:           0.0100
    
    Details:
        test statistic: -0.007291403994633658
    > ct$pvalue(test)
    [1] 0.004

More examples can be found in the documentation.


Take home messages

  • Calibration is an important aspect of probabilistic predictive models

  • Reliability diagrams with consistency bars help to visually inspect calibration

  • There exist alternatives to the ECE such as the KCE with favourable theoretical properties

  • Calibration tests can be used to deal with the randomness of calibration error estimates

  • Python and R interfaces if you do not use Julia


See you at JuliaCon!
