
Calibration analysis of probabilistic models in Julia

David Widmann (@devmotion)

Uppsala University, Sweden

JuliaCon, July 2021


Probabilistic predictive models

A probabilistic predictive model predicts a probability distribution over a set of targets for a given feature.

Such a model can express the uncertainty in the prediction, which might be inherent to the prediction task or caused by insufficient knowledge of the underlying relation between feature and target.
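As a minimal illustration (not part of the notebook's own code), a probabilistic prediction for a three-class problem can be represented with Distributions.jl; the probabilities below are taken from the example prediction shown later in the notebook:

```julia
using Distributions

# a probabilistic prediction: a full distribution over the three penguin species
# (Adelie, Chinstrap, Gentoo) instead of a single predicted label
prediction = Categorical([0.216, 0.784, 0.0])

probs(prediction)  # predicted class probabilities
mode(prediction)   # most likely class (here: 2, i.e., Chinstrap)
```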


Example: Prediction of penguin species

Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081

penguins

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | String | String | Float64 | Float64 | Int64 | Int64 | String |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 6 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 7 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 8 | Adelie | Torgersen | 41.1 | 17.6 | 182 | 3200 | female |
| 9 | Adelie | Torgersen | 38.6 | 21.2 | 191 | 3800 | male |
| 10 | Adelie | Torgersen | 34.6 | 21.1 | 198 | 4400 | male |
| ⋮ | | | | | | | |
| 333 | Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female |

We split the Palmer penguin dataset randomly into a training (70%) and validation (30%) dataset.


We train a Gaussian naive Bayes classifier that predicts the probability of each penguin species from the bill length and the flipper length.

We denote the features by X and the target by Y.
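A minimal sketch of this step (the notebook's own cells are not shown; the `UnivariateFinite` predictions below suggest an MLJ-based workflow, but the model name `GaussianNBClassifier` from NaiveBayes.jl and the variable names `penguins_train`/`penguins_validation` are assumptions):

```julia
using MLJ, DataFrames

# Gaussian naive Bayes classifier from NaiveBayes.jl via MLJ (assumed model name)
GaussianNB = @load GaussianNBClassifier pkg=NaiveBayes

# features: bill length and flipper length (coerced to Continuous); target: species
features(df) = coerce(select(df, [:bill_length_mm, :flipper_length_mm]), Count => Continuous)
X = features(penguins_train)                       # assumed name of the training split
y = coerce(penguins_train.species, Multiclass)

# fit the model and predict distributions over the species on the validation data
mach = machine(GaussianNB(), X, y)
fit!(mach)
predictions = predict(mach, features(penguins_validation))
```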


The predictions are distributions over the penguin species.

We let $P_X$ denote the distribution predicted by a model $P$ for features $X$.

UnivariateFinite{ScientificTypesBase.Multiclass{3}}(Adelie=>0.216, Chinstrap=>0.784, Gentoo=>9.44e-39)

There exist many different probabilistic predictive models for this task.

Ideally, we would like that

$$P_X = \mathrm{law}(Y \mid X)$$

almost surely.

Of course, usually it is not possible to achieve this in practice.


Calibration

Motivation

Predictions should express the involved uncertainties "correctly" and not be arbitrary probability distributions.

In particular, predictions should be consistent: if forecasts predict an 80% probability of rain for an infinite sequence of days, then ideally on 80% of the days it rains.

A probabilistic predictive model $P_X$ of the conditional distributions $\mathrm{law}(Y \mid X)$ is calibrated if

$$P_X = \mathrm{law}(Y \mid P_X)$$

almost surely.

It is not relevant how the model was obtained. In particular, it does not matter whether one uses a maximum likelihood approach or performs Bayesian inference.


Often weaker notions of calibration are investigated, corresponding to less informative models.

E.g., "confidence calibration" considers only the most confident prediction: this corresponds to a binary classification model that predicts whether the most confident class is correct.
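A plain-Julia illustration of this reduction with made-up predictions:

```julia
# multi-class predictions (probability vectors) and observed classes (dummy data)
predictions = [[0.8, 0.15, 0.05], [0.3, 0.45, 0.25], [0.2, 0.1, 0.7]]
targets = [1, 3, 3]

# confidence calibration keeps only the most confident class:
# the "prediction" is the maximal probability and the "outcome" is
# whether the most confident class was the correct one
confidences = map(maximum, predictions)
outcomes = map((p, y) -> argmax(p) == y, predictions, targets)
```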


Other target spaces can be considered as well.


Calibrated models

Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Q.J.R. Meteorol. Soc., 135: 1512-1519.

Any model of the form

$$P_X := \mathrm{law}(Y \mid \phi(X)) \quad \text{almost surely},$$

where $\phi$ is some measurable function, is calibrated.

The function $\phi$ controls the amount of information about $X$ that is retained or lost:

  • The identity function yields the ideal model $P_X := \mathrm{law}(Y \mid X)$

  • Every constant function yields the baseline model $P_X := \mathrm{law}(Y)$ ("climatology"); see the sketch below
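A tiny plain-Julia sketch (dummy data, made-up names) contrasting the two extremes; both are of the form above and hence calibrated, but they differ vastly in how informative they are:

```julia
using Statistics

# dummy data: feature X ~ U(0, 1) and binary target with law(Y = 1 | X) = X
n = 100_000
X = rand(n)
Y = rand(n) .< X

# ideal model (ϕ = identity): predicts law(Y | X), i.e., probability X for class 1
ideal_predictions = X

# climatology (ϕ constant): always predicts the marginal probability law(Y = 1)
climatology_prediction = mean(Y)

# both are calibrated: among samples that receive prediction ≈ p,
# the fraction of positive targets is ≈ p; e.g. for the ideal model
sel = 0.7 .<= ideal_predictions .<= 0.8
mean(Y[sel])  # ≈ 0.75
```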


Reliability diagrams

Murphy, A., & Winkler, R. (1977). Reliability of Subjective Probability Forecasts of Precipitation and Temperature. Journal of the Royal Statistical Society. Series C (Applied Statistics), 26(1), 41-47.

Bröcker, J., & Smith, L. A. (2007). Increasing the reliability of reliability diagrams. Weather and forecasting, 22(3), 651-661.

Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. & Schön, T. B. (2019). Evaluating model calibration in classification. Proceedings of Machine Learning Research, in Proceedings of Machine Learning Research 89:3459-3467 (AISTATS 2019).

Reliability diagrams are used to visually inspect calibration of binary classification models. They show binned averages of empirical frequencies versus confidence.
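The quantities shown in a reliability diagram can be computed by hand; the following plain-Julia sketch (dummy data, uniform bins assumed non-empty) illustrates what the package below visualizes:

```julia
using Statistics

# dummy binary classification problem: predicted probabilities and outcomes
probabilities = rand(1_000)
outcomes = rand(1_000) .< probabilities  # perfectly calibrated by construction

# group the predictions into 10 bins of uniform size
nbins = 10
binned_probs = [Float64[] for _ in 1:nbins]
binned_outcomes = [Bool[] for _ in 1:nbins]
for (p, o) in zip(probabilities, outcomes)
    i = clamp(ceil(Int, p * nbins), 1, nbins)
    push!(binned_probs[i], p)
    push!(binned_outcomes[i], o)
end

# a reliability diagram plots the empirical frequency against the mean confidence per bin
confidence = map(mean, binned_probs)
frequency = map(mean, binned_outcomes)
```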


Visualizations with ReliabilityDiagrams.jl

  • Supports Plots.jl and Makie.jl

  • Allows plotting the deviation, i.e., empirical frequency minus confidence.

  • Includes consistency bars

[Interactive reliability diagram of the penguin classifier; controls: number of bins (10), deviation, consistency bars.]

Expected calibration error

The expected calibration error (ECE)

$$\mathrm{ECE}_d := \mathbb{E}_{P_X}\, d\big(P_X, \mathrm{law}(Y \mid P_X)\big)$$

measures the expected deviation between the predictions $P_X$ and the empirical frequencies $\mathrm{law}(Y \mid P_X)$ with respect to a distance measure $d$.

  • For (multi-class) classification models, common choices for $d$ are (semi-)metrics on the probability simplex such as the Euclidean or squared Euclidean distance

  • For general probabilistic models, statistical divergences can be chosen for $d$, e.g.,

    • f-divergences such as the Kullback-Leibler divergence,

    • Wasserstein distance,

    • maximum mean discrepancy (MMD).


⚠️ Challenges

For general models, the distribution $\mathrm{law}(Y \mid P_X)$ can be arbitrarily complex.

The empirical frequencies $\mathrm{law}(Y \mid P_X)$ are difficult to estimate.

Common histogram-binning approaches lead to biased and inconsistent estimators.


Estimation with CalibrationErrors.jl

Supports

  • different distance measures $d$ (default: total variation distance),

  • different binning algorithms (bins of uniform size and bins that minimize variance within bins)
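A minimal sketch of estimating the ECE with dummy data (the notebook's own cells are not shown; `ECE`, `UniformBinning`, and `TotalVariation` are the constructor names I assume CalibrationErrors.jl and Distances.jl expose, so double-check against the package documentation):

```julia
using CalibrationErrors, Distances

# dummy multi-class predictions (probability vectors) and observed classes
predictions = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4], [0.1, 0.7, 0.2]]
targets = [1, 2, 3, 2]

# ECE with 10 bins of uniform size and the total variation distance (assumed names)
ece = ECE(UniformBinning(10), TotalVariation())
ece(predictions, targets)
```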


[Interactive controls: distance measure and number of bins (10).]

ECE estimate: 0.08596703393630331

Why scoring rules are not sufficient

Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Q.J.R. Meteorol. Soc., 135: 1512-1519.

Probabilistic predictive models can be evaluated by the expected score

$$\mathbb{E}_{P_X, Y}\, s(P_X, Y),$$

where the scoring rule $s(P, \omega)$ is the reward of prediction $P$ if the true outcome is $\omega$.

Examples: Brier score, logarithmic score

Brier score: -0.07008093228992143

Logarithmic score: -0.1530897228467764
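For a single categorical prediction and observed class, the two example scores can be computed directly; a plain-Julia sketch with made-up values (note that sign conventions differ across references, here both are written as rewards, i.e., larger is better):

```julia
# prediction: probability vector over the classes; y: observed class (dummy values)
p = [0.216, 0.784, 0.0]
y = 2

# logarithmic score (reward): log-probability assigned to the observed class
logarithmic_score = log(p[y])

# Brier score (reward): negative squared Euclidean distance to the one-hot outcome
onehot = [c == y for c in eachindex(p)]
brier_score = -sum(abs2, p .- onehot)
```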

Proper scoring rules can be decomposed as

$$\mathbb{E}_{P_X, Y}\, s(P_X, Y) = \underbrace{\mathbb{E}_{P_X}\, d\big(\mathrm{law}(Y), \mathrm{law}(Y \mid P_X)\big)}_{\text{resolution}} - \underbrace{\mathbb{E}_{P_X}\, d\big(P_X, \mathrm{law}(Y \mid P_X)\big)}_{\text{ECE}} + \underbrace{S\big(\mathrm{law}(Y), \mathrm{law}(Y)\big)}_{\text{uncertainty of } Y},$$

where $S(P, Q) := \int_\Omega s(P, \omega)\, Q(\mathrm{d}\omega)$ is the expected score of $P$ under $Q$ and $d(P, Q) := S(Q, Q) - S(P, Q)$ is the score divergence.

  • resolution: minimized for uninformative models with $\mathrm{law}(Y \mid P_X) = \mathrm{law}(Y)$, such as constant models

  • ECE: minimized for calibrated models

  • uncertainty of $Y$: does not depend on the model

Info

Models can trade off calibration for resolution!


Alternatives to ECE

Widmann, D., Lindsten, F., & Zachariah, D. (2021). Calibration tests beyond classification. ICLR 2021.

A probabilistic predictive model is calibrated if

$$(P_X, Y) \stackrel{d}{=} (P_X, Z_X),$$

where $Z_X \mid P_X \sim P_X$ and $\stackrel{d}{=}$ denotes equality in distribution.

  • No explicit conditional distributions $\mathrm{law}(Y \mid P_X)$ involved

  • Suggests the discrepancy between $\mathrm{law}(P_X, Y)$ and $\mathrm{law}(P_X, Z_X)$ as a calibration measure


Kernel calibration error (KCE)

Widmann, D., Lindsten, F., & Zachariah, D. (2019). Calibration tests in multi-class classification: A unifying framework. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (pp. 12257–12267).

Widmann, D., Lindsten, F., & Zachariah, D. (2021). Calibration tests beyond classification. ICLR 2021.

Integral probability metrics can be used to define a general class of calibration errors with minimal assumptions about the involved distributions. The kernel calibration error (KCE) is an example of this class:

The kernel calibration error (KCE) with respect to a real-valued kernel $k$ is defined as

$$\mathrm{KCE}_k := \mathrm{MMD}_k\big(\mathrm{law}(P_X, Y), \mathrm{law}(P_X, Z_X)\big),$$

where $\mathrm{MMD}_k$ is the maximum mean discrepancy with respect to $k$.

  • Applies to all probabilistic predictive models

  • Existing (un)biased and consistent estimators of the MMD can be used, without the challenging estimation of $\mathrm{law}(Y \mid P_X)$

  • The variance of the estimators can be reduced by marginalizing out $Z_X$


Estimation with CalibrationErrors.jl

Supports

  • biased and unbiased estimators

  • estimators with quadratic and subquadratic sample complexity

  • kernels from KernelFunctions.jl


Here we choose a tensor product kernel of the form

$$k\big((p, y), (\tilde{p}, \tilde{y})\big) := k_1(p, \tilde{p})\, \delta(y - \tilde{y}).$$

[Interactive controls: kernel $k_1$, length scale (1), and estimator.]

SKCE estimate: -0.00026486100537947895
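A minimal Julia sketch of such an estimate with dummy binary predictions; the constructor names mirror the pycalibration and rcalibration snippets later in this notebook (`UnbiasedSKCE` from CalibrationErrors.jl, kernels from KernelFunctions.jl):

```julia
using CalibrationErrors, KernelFunctions

# tensor product kernel: exponential kernel on the predictions,
# White (delta) kernel on the targets
kernel = ExponentialKernel() ⊗ WhiteKernel()

# unbiased estimator of the squared KCE
skce = UnbiasedSKCE(kernel)

# dummy binary predictions (probability of `true`) and observed outcomes
predictions = rand(100)
outcomes = rand(Bool, 100)
skce(predictions, outcomes)
```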

Info

One possible approach for selecting the length scale is to maximize the KCE on a held-out dataset (cf. approach for MMD proposed by Fukumizu et al. in Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions (2009)).


Estimation with CalibrationErrorsDistributions.jl

Closed-form expressions for predictions that are Gaussian and Laplace distributions from Distributions.jl


Example: Gaussian process

  1. We sample from a Gaussian process (GP) with zero mean and a squared exponential kernel at 40 random locations in the interval [0,10].

  2. We split the data randomly into a training (75%) and validation (25%) dataset and compute the GP posterior from the training data.

  3. We compute the predicted normal distributions on the validation data.

  4. We estimate the squared KCE with a tensor product kernel of the form

    $k\big((\mu, y), (\tilde{\mu}, \tilde{y})\big) := \exp\big(-W_2(\mu, \tilde{\mu})\big)\, \exp\big(-(y - \tilde{y})^2 / (2\ell^2)\big),$

    where $W_2(\mu, \tilde{\mu})$ is the 2-Wasserstein distance between the Gaussian distributions $\mu$ and $\tilde{\mu}$ and $\ell$ is a length scale (see the sketch below).
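A heavily hedged sketch of step 4 with dummy data: I assume here that CalibrationErrorsDistributions.jl provides a `Wasserstein` metric that can be combined with KernelFunctions.jl kernels and that `Normal` predictions from Distributions.jl can be passed directly to the estimator; please consult the package documentation for the exact constructor names:

```julia
using CalibrationErrorsDistributions, Distributions, KernelFunctions

# dummy predicted normal distributions and observed targets
predicted_normals = [Normal(randn(), 0.1 + rand()) for _ in 1:25]
targets = randn(25)

# tensor product kernel: exponential kernel w.r.t. the 2-Wasserstein distance between
# the predicted distributions, squared exponential kernel (length scale 1) on the targets
# (assumption: `Wasserstein` is usable as a KernelFunctions metric)
kernel = ExponentialKernel(; metric=Wasserstein()) ⊗ with_lengthscale(SqExponentialKernel(), 1.0)

skce = UnbiasedSKCE(kernel)
skce(predicted_normals, targets)
```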


[Interactive controls: length scale (1) and estimator.]

SKCE estimate: 6.988050107219258e-6

Calibration tests

Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. & Schön, T. B. (2019). Evaluating model calibration in classification. Proceedings of Machine Learning Research, in Proceedings of Machine Learning Research 89:3459-3467 (AISTATS 2019).

Widmann, D., Lindsten, F., & Zachariah, D. (2019). Calibration tests in multi-class classification: A unifying framework. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (pp. 12257–12267).

Widmann, D., Lindsten, F., & Zachariah, D. (2021). Calibration tests beyond classification. ICLR 2021.

⚠️ Problem

It is difficult to interpret an estimated non-zero calibration error.

  • Calibration errors have no meaningful unit or scale.

  • Different calibration errors rank models differently.

  • Estimators of calibration errors are random variables.


Perform a statistical test of the null hypothesis $H_0$: "the model is calibrated".

  • Hypothesis testing of calibration is a special two-sample problem

  • Applies to all probabilistic predictive models

  • Existing two-sample tests based on the MMD can be improved by marginalizing out $Z_X$


Calibration tests with CalibrationTests.jl

Supports

  • tests based on consistency resampling

  • tests based on distribution-free bounds and asymptotic properties of KCE

  • tests with quadratic and subquadratic sample complexity

  • interface of HypothesisTests.jl
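A minimal Julia sketch with dummy binary predictions, mirroring the pycalibration and rcalibration examples below (`AsymptoticSKCETest` and `pvalue` follow the HypothesisTests.jl interface):

```julia
using CalibrationTests, HypothesisTests, KernelFunctions

# dummy binary predictions (probability of `true`) and observed outcomes
predictions = rand(100)
outcomes = rand(Bool, 100)

# kernel on (prediction, outcome) pairs and asymptotic test of the
# null hypothesis "the model is calibrated"
kernel = ExponentialKernel() ⊗ WhiteKernel()
test = AsymptoticSKCETest(kernel, predictions, outcomes)
pvalue(test)
```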


[Interactive control: choice of calibration test.]
Asymptotic SKCE test
--------------------
Population details:
    parameter of interest:   SKCE
    value under h_0:         0.0
    point estimate:          -0.000264861

Test summary:
    outcome with 95% confidence: fail to reject h_0
    one-sided p-value:           0.6640

Details:
    test statistic: -0.0007061332966437105
p-value: 0.64

pycalibration

  • Python interface for CalibrationErrors, CalibrationErrorsDistributions, and CalibrationTests

  • Inspired by diffeqpy

  • Uses PyJulia interface


Usage

  • Load package and install Julia dependencies:

    >>> import pycalibration
    >>> pycalibration.install()
  • Define an estimator of the SKCE with kernel

    $k\big((\mu, y), (\hat{\mu}, \hat{y})\big) = \exp(-|\mu - \hat{\mu}|)\, \delta(y - \hat{y})$:

    >>> from pycalibration import calerrors as ce
    >>> skce = ce.UnbiasedSKCE(ce.tensor(ce.ExponentialKernel(), ce.WhiteKernel()))
  • Estimate the SKCE for some random predictions and outcomes:

    >>> import numpy as np
    >>> from pycalibration import calerrors as ce
    >>> rng = np.random.default_rng(1234)
    >>> predictions = rng.random(100)
    >>> outcomes = rng.choice([True, False], 100)
    >>> skce(predictions, outcomes)
    0.03320398246523166
  • Perform a calibration test with some random predictions and outcomes:

    >>> from pycalibration import caltests as ct
    >>> import numpy as np
    >>> rng = np.random.default_rng(1234)
    >>> predictions = rng.dirichlet((3, 2, 5), 100)
    >>> outcomes = rng.integers(low=1, high=4, size=100)
    >>> kernel = ct.tensor(ct.ExponentialKernel(metric=ct.TotalVariation()), ct.WhiteKernel())
    >>> test = ct.AsymptoticSKCETest(kernel, predictions, outcomes)
    >>> print(test)
    <PyCall.jlwrap Asymptotic SKCE test
    --------------------
    Population details:
        parameter of interest:   SKCE
        value under h_0:         0.0
        point estimate:          6.07887e-5
    
    Test summary:
        outcome with 95% confidence: fail to reject h_0
        one-sided p-value:           0.4330
    
    Details:
        test statistic: -4.955380469272125
    >>> ct.pvalue(test)
    0.435

More examples can be found in the documentation.


rcalibration

  • R interface for CalibrationErrors, CalibrationErrorsDistributions, and CalibrationTests

  • Inspired by diffeqr

  • Based on JuliaCall


Usage

  • Load package and install Julia dependencies:

    > library(rcalibration)
    > rcalibration::install()
  • Define an estimator of the SKCE with kernel

    $k\big((\mu, y), (\hat{\mu}, \hat{y})\big) = \exp(-|\mu - \hat{\mu}|)\, \delta(y - \hat{y})$:

    > ce <- calerrors()
    > skce <- ce$UnbiasedSKCE(ce$tensor(ce$ExponentialKernel(), ce$WhiteKernel()))
  • Estimate the SKCE for some random predictions and outcomes:

    > ce <- calerrors()
    > set.seed(1234)
    > predictions <- runif(100)
    > outcomes <- sample(c(TRUE, FALSE), 100, replace=TRUE)
    > skce$.(predictions, outcomes)
    [1] 0.01518318
  • Perform a calibration test with some random predictions and outcomes:

    > library(extraDistr)
    > ct <- caltests()
    > set.seed(1234)
    > predictions <- rdirichlet(100, c(3, 2, 5))
    > outcomes <- sample(1:3, 100, replace=TRUE)
    > kernel <- ct$tensor(ct$ExponentialKernel(metric=ct$TotalVariation()), ct$WhiteKernel())
    > test <- ct$AsymptoticSKCETest(kernel, ce$RowVecs(predictions), outcomes)
    > print(test)
    Julia Object of type AsymptoticSKCETest{KernelTensorProduct{Tuple{ExponentialKernel{TotalVariation}, WhiteKernel}}, Float64, Float64, Matrix{Float64}}.
    Asymptotic SKCE test
    --------------------
    Population details:
        parameter of interest:   SKCE
        value under h_0:         0.0
        point estimate:          0.0259434
    
    Test summary:
        outcome with 95% confidence: reject h_0
        one-sided p-value:           0.0100
    
    Details:
        test statistic: -0.007291403994633658
    > ct$pvalue(test)
    [1] 0.004

More examples can be found in the documentation.


Take home messages

  • Calibration is an important aspect of probabilistic predictive models

  • Reliability diagrams with consistency bars help to visually inspect calibration

  • There exist alternatives to the ECE such as the KCE with favourable theoretical properties

  • Calibration tests can be used to deal with the randomness of calibration error estimates

  • Python and R interfaces if you do not use Julia


See you at JuliaCon!
