TL;DR: When dealing with rare events (e.g., surgical fatalities) where no failures have been observed (so far), traditional statistical methods collapse to an implausible estimate of zero. This post explores practical approaches to estimating these probabilities.

Introduction

Estimating the probability \(p\) of a rare event (e.g., surgical fatalities) when no failures have been observed (\(m = 0\) in \(n\) trials) is a classic statistical challenge. Following Taleb’s example from his original post, consider a surgeon with 0 fatalities in 60 procedures. Traditional methods like the empirical estimator \(\hat{p}_s = \frac{m}{n}\) collapse to \(\hat{p}_s = 0\), which is both implausible and unhelpful for rare events: it leads to overconfidence and underestimation of risk. This limitation calls for alternative estimation methods for small samples and rare events, particularly in critical applications like medical procedures or safety engineering. But what is the correct probability? There is no single answer to this question, and this post explores practical approaches to the problem. We will mainly focus on four aspects:

  1. Taleb’s Maximum Ignorance Estimator
  2. Bayesian approaches
  3. Confidence intervals
  4. A simple approximation for Taleb’s estimator

The Frequentist Approach

The frequentist approach typically uses the empirical estimator

\[\hat{p}_s = \frac{m}{n},\]

where \(m\) is the number of observed failures and \(n\) is the total number of trials. This maximum likelihood estimator is simple, intuitive, and unbiased. However, it fails dramatically in the case of zero observed failures (\(m = 0\)), yielding \(\hat{p}_s = 0\) as we have seen above.
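
For illustration, a minimal sketch (the function name is ours) showing how the estimator collapses in the zero-failure case:

```python
def empirical_estimator(n, m):
    # Maximum likelihood estimate: observed failures divided by trials.
    return m / n

# Example: 0 fatalities in 60 procedures yields an estimate of exactly zero,
# implying the procedure is perfectly safe, which is implausible.
print(empirical_estimator(60, 0))  # 0.0
```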

Taleb’s Maximum Ignorance Estimator

Taleb’s estimator solves for \(p\) such that the binomial cumulative distribution function (CDF) equals \(q = 0.5\) at \(m = 0\). This essentially asks: “What failure rate \(p\) would make \(0\) events in \(n\) trials exactly as likely as having some events?” More generally, the estimator \(\hat{p}_t\) is given by:

\[\hat{p}_t = 1 - I^{-1}_{0.5}(n - m, m + 1) = I^{-1}_{0.5}(m + 1, n - m)\]

where \(I^{-1}\) is the inverse regularized beta function. We later obtain an approximation for this estimator via

\[\hat{p}_t \approx \frac{m + \frac{2}{3}}{n + \frac{1}{3}}.\]

Moreover, for the special case of \(m = 0\), we can obtain an exact formula \(\hat{p}_t = 1 - 2^{-1/n}\).

Taleb’s approach maximizes entropy, or in other words, the “maximum ignorance” about \(p\), as we will now see: The entropy function \(H(q)\) with \(0 \leq q \leq 1\) is given by:

\[H(q) = -(q \log(q) + (1-q)\log(1-q)),\]

and this function attains its maximum at \(q = 0.5\). This can be seen either by taking the derivative or by observing that the function is symmetric under \(q \mapsto 1-q\), so that \(H(q) = H(1-q)\) and \(q = 0.5\) must be an extremal point; since the function is concave, it is a maximum. This justifies setting the binomial CDF equal to 0.5 in Taleb’s estimator, making \(0\) events in \(n\) trials exactly as likely as having some events. Precisely for this reason, we refer to the probability \(\hat{p}_t\) as the maximum entropy probability. The notion extends naturally to \(m > 0\) with the same reasoning, asking: “What failure rate \(p\) would make at most \(m\) events in \(n\) trials equally likely as more than \(m\) events?”; some obvious caveats apply here, which we will discuss, at least partially, later.
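
To make this concrete, here is a small sketch (not from the original post) that evaluates \(H(q)\) on a grid and locates its maximum:

```python
import numpy as np

def entropy(q):
    # Binary entropy H(q) = -(q log q + (1 - q) log(1 - q)),
    # with H(0) = H(1) = 0 by continuity.
    q = np.asarray(q, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(q * np.log(q) + (1 - q) * np.log(1 - q))
    return np.nan_to_num(h)

qs = np.linspace(0, 1, 1001)
print(qs[np.argmax(entropy(qs))])  # 0.5
```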

In code, we can compute Taleb’s estimator as follows:

from scipy.stats import beta

def taleb_estimator(n, m=0, q=0.5):
    # Solve BinomCDF(m, n, p) = q for p via the inverse regularized beta function.
    return 1 - beta.ppf(q, n - m, m + 1)

# Example: 0 fatalities in 60 procedures
p_hat = taleb_estimator(60)
print(f"Taleb's estimate: {p_hat:.4f}")

Example (continued). For a surgeon with 0 fatalities in 60 procedures, we obtain

\[\hat{p}_t \approx 0.01148 \, (1.15\%)\]

This contrasts sharply with the empirical \(\hat{p}_s = 0\). The following figure shows the maximum entropy probability for various \(n\) and \(m\) vs the empirical estimator. For small \(n\), the maximum entropy probability is far from the empirical estimator; for larger \(n\), it converges to it.


Figure 1. Maximum entropy probability for various \(n\) and \(m\) vs the empirical estimator.

Bayesian Approaches

The other obvious contender for estimating the probability of a rare event is the family of Bayesian methods. Bayesian inference provides a natural framework for estimating rare event probabilities by incorporating prior beliefs and updating them with observed data. Unlike frequentist approaches that treat \(p\) as a fixed unknown constant, Bayesian methods model \(p\) as a random variable with a prior distribution reflecting our initial assumptions.

This approach combines:

  • Prior belief: Encoded in a probability distribution (e.g., uniform, beta)
  • Likelihood function: Binomial distribution for success/failure outcomes
  • Posterior distribution: Updated belief about \(p\) via Bayes’ theorem

The choice of the prior significantly impacts results in low-data regimes. We will examine two approaches: the historical Laplace’s method using a uniform prior, and a more flexible beta-binomial model that allows tuning of prior assumptions through hyperparameters.

Laplace’s Rule of Succession

Laplace’s Rule of Succession (see Wikipedia) is a fundamental Bayesian approach for estimating probabilities, particularly useful in cases with limited data. Developed by Pierre-Simon Laplace in 1814, it addresses the problem of estimating the probability of an event that has not been observed yet, based on past observations. Laplace famously used the rule of succession to calculate the probability that the Sun will rise tomorrow, given that it has risen every day for the past 5000 years. By considering approximately 5000 × 365.25 days, he obtained odds of about 1,826,200 to 1 in favor of the Sun rising tomorrow.

The rule emerges from using a uniform \(\text{Beta}(1,1)\) prior, which represents maximum uncertainty about the true probability. When combined with binomial likelihood through Bayes’ theorem, this yields a posterior distribution that naturally handles the case of zero observed events. The resulting estimator \(\hat{p}_l\) adds one “pseudo-observation” to both the success and failure counts:

\[\hat{p}_l = \frac{m + 1}{n + 2},\]

preventing the estimate from collapsing to zero when no failures are observed. This rather old approach comes with the following properties for our surgical example:

  1. It provides a non-zero estimate even when no failures are observed
  2. It smoothly transitions between the zero-failure and non-zero failure cases
  3. It remains computationally simple while incorporating Bayesian principles

Derivation of Laplace’s Rule of Succession

We begin by assuming a uniform prior distribution for \(p\), specifically \(p \sim \text{Beta}(1, 1)\), which reflects our initial state of maximum uncertainty. The prior density function is constant, \(f(p) = 1\), for all \(0 < p < 1\) and \(0\) otherwise.

Next, we define the likelihood function, which describes the probability of observing \(m\) events in \(n\) trials given \(p\). This is expressed as:

\[P(m \mid p,n) = \binom{n}{m}p^m(1-p)^{n-m}.\]

Applying Bayes’ theorem, which in its generic form is:

\[P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)},\]

we update our prior beliefs with the observed data to obtain the posterior distribution:

\[f(p \mid m,n) \propto 1 \cdot p^m(1-p)^{n-m} = p^m(1-p)^{n-m}.\]

This posterior distribution is again a Beta distribution:

\[p \mid m,n \sim \text{Beta}(m+1, n-m+1).\]

With this, we obtain the estimate for \(p\) as the expected value of this Beta distribution:

\[\mathbb{E}[p \mid m,n] = \frac{m+1}{m+1 + (n-m+1)} = \frac{m + 1}{n + 2},\]

and in particular, Laplace’s Rule of Succession as:

\[\hat{p}_l = \frac{m + 1}{n + 2}.\]

Example (continued). For our surgeon, we get

\[\hat{p}_l = \frac{0 + 1}{60 + 2} \approx 0.0161 \, (1.61\%).\]
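
In code, Laplace’s rule is a one-liner (the function name is ours):

```python
def laplace_estimator(n, m=0):
    # Laplace's Rule of Succession: add one pseudo-success and one pseudo-failure.
    return (m + 1) / (n + 2)

# Example: 0 fatalities in 60 procedures
print(f"Laplace's estimate: {laplace_estimator(60):.4f}")  # 0.0161
```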

Beta Prior Approach

Generalizing from Laplace’s Rule of Succession and its derivation above, we can assume a prior \(p \sim \text{Beta}(\alpha, \beta)\); since the Beta distribution is a conjugate prior for the binomial likelihood, the same derivation yields the analogous estimate:

\[\hat{p}_b = \frac{m + \alpha}{n + \alpha + \beta},\]

with appropriate choices of \(\alpha\) and \(\beta\). For the case of Laplace’s Rule of Succession, we have \(\alpha = \beta = 1.\)
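
A sketch of the general estimator (the helper name and the Jeffreys-prior example are our additions, not from the original post):

```python
def beta_posterior_mean(n, m, alpha=1.0, beta_=1.0):
    # Posterior mean under a Beta(alpha, beta_) prior: alpha acts as a
    # pseudo-count of events, beta_ as a pseudo-count of non-events.
    return (m + alpha) / (n + alpha + beta_)

# alpha = beta_ = 1 recovers Laplace's Rule of Succession:
print(beta_posterior_mean(60, 0))
# The Jeffreys prior Beta(1/2, 1/2) yields a smaller zero-failure estimate:
print(beta_posterior_mean(60, 0, alpha=0.5, beta_=0.5))
```

Smaller \(\alpha\) pulls the zero-failure estimate toward zero, while larger \(\beta\) encodes a stronger prior belief that failures are rare.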

Interpretation of \(\alpha\) and \(\beta\) in the Beta Distribution

In the context of the Beta distribution, the parameters \(\alpha\) and \(\beta\) have intuitive interpretations:

Alpha (\(\alpha\)). This parameter can be thought of as the number of “successes” prior to observing any data, i.e., a pseudo-count of successes that we assume before seeing the actual data. A higher value of \(\alpha\) indicates a stronger prior belief in the probability of success.

Beta (\(\beta\)). Similarly, this parameter represents the number of “failures” prior to observing any data, i.e., a pseudo-count of failures that we assume before seeing the actual data. A higher value of \(\beta\) indicates a stronger prior belief in the probability of failure.

Together, \(\alpha\) and \(\beta\) shape the prior distribution of the probability \(p\). The mean of the Beta distribution, which serves as the point estimate for \(p\), is given by:

\[\mathbb{E}[p] = \frac{\alpha}{\alpha + \beta}\]

This mean represents our prior belief about the probability of success before observing any (new) data. The larger the values of \(\alpha\) and \(\beta\), the more confident we are in our prior belief. For example, if \(\alpha = 5\) and \(\beta = 5\), we have a prior belief that the probability of success is 0.5, but with moderate confidence. If \(\alpha = 50\) and \(\beta = 50\), we still believe the probability of success is 0.5, but with much higher confidence. In the figure below, we depict the Beta distributions for different parameters.


Figure 2. Beta distributions with different parameters.
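
The concentration effect of larger \(\alpha + \beta\) can also be checked via the moments of the Beta distribution (a quick sketch using scipy):

```python
from scipy.stats import beta

# Same mean (0.5), but larger alpha + beta concentrates the prior.
for a, b in [(5, 5), (50, 50)]:
    d = beta(a, b)
    print(f"Beta({a},{b}): mean={d.mean():.2f}, std={d.std():.3f}")
```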

Comparison of Taleb’s Estimator to Laplace’s Rule of Succession

So how does Taleb’s estimator compare to Laplace’s Rule of Succession? The following figure shows the log ratio of Laplace’s Rule of Succession to the maximum entropy probability for various \(n\) and \(m\). For small \(m\), the maximum entropy probability yields smaller estimates than Laplace’s Rule of Succession; for larger \(m\), the situation reverses and the maximum entropy probability yields larger estimates, an effect that becomes less pronounced as \(n\) grows. Later, when we examine the approximation of the maximum entropy probability, we will see why this is the case.


Figure 3. Log ratio of Laplace’s Rule of Succession to Maximum Entropy Probability.
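
This comparison can be reproduced numerically; a sketch (restating both estimators so the snippet is self-contained):

```python
import numpy as np
from scipy.stats import beta

def taleb_estimator(n, m=0, q=0.5):
    return 1 - beta.ppf(q, n - m, m + 1)

def laplace_estimator(n, m=0):
    return (m + 1) / (n + 2)

# A positive log ratio means Laplace exceeds the maximum entropy probability.
n = 60
for m in [0, 1, 5, 20]:
    log_ratio = np.log(laplace_estimator(n, m) / taleb_estimator(n, m))
    print(f"m={m}: log ratio = {log_ratio:+.3f}")
```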

Confidence Intervals

No exposition would be complete without interval estimates, particularly since point estimates can be “dangerous” in that they convey no information about uncertainty. In high-stakes scenarios in particular, intervals are often more informative. We can compute such intervals for \(p\) for varying \(n\) and \(m\) using the Beta distribution, as shown in the following code; strictly speaking, this yields an equal-tailed Bayesian credible interval based on the posterior \(\text{Beta}(m+1, n-m+1)\) from above.

from scipy.stats import beta

def confidence_interval(n, m, alpha=0.05):
    # Equal-tailed interval from the posterior Beta(m + 1, n - m + 1).
    lower = beta.ppf(alpha / 2, m + 1, n - m + 1)
    upper = beta.ppf(1 - alpha / 2, m + 1, n - m + 1)
    return (lower, upper)

# Example
ci = confidence_interval(60, 0)
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")

For our surgeon, we get a 95% confidence interval as a function of \(n\) that looks as follows. We added the maximum entropy probability as a reference as well.


Figure 4. 95% confidence intervals for various \(n\) (left) and in log scale (right).
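
For comparison, a standard frequentist alternative is the exact Clopper-Pearson interval, which also uses Beta quantiles but with slightly different parameters (a sketch, not part of the original post):

```python
from scipy.stats import beta

def clopper_pearson(n, m, alpha=0.05):
    # Exact (Clopper-Pearson) confidence interval for a binomial proportion;
    # the bounds degenerate to 0 and 1 when m = 0 or m = n, respectively.
    lower = beta.ppf(alpha / 2, m, n - m + 1) if m > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, m + 1, n - m) if m < n else 1.0
    return (lower, upper)

# Example: 0 fatalities in 60 procedures
lo, hi = clopper_pearson(60, 0)
print(f"95% CI: ({lo:.4f}, {hi:.4f})")
```

For \(m = 0\) the upper bound reduces to \(1 - (\alpha/2)^{1/n}\), a close relative of the well-known “rule of three” upper bound of roughly \(3/n\).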

Approximating the Binomial Median

Now we come to the interesting part: how can we approximate the maximum entropy probability? We will see that \(1 - \hat{p}_t\) is the median of a Beta distribution (with the right parameters), so we can use an approximation of the Beta median to approximate the maximum entropy probability. Recall that, given \(n\) trials and \(m\) successes, the maximum entropy probability is the value \(p\) such that the probability of observing up to \(m\) successes in \(n\) trials equals 0.5. As such, we want to solve:

\[\text{BinomCDF}(m, n, p) = 0.5.\]

As used above, we have the identity:

\[\text{BinomCDF}(m, n, p) = I_{1-p}(n - m, m + 1)\]

where \(I_{1-p}(\alpha, \beta)\) is the regularized incomplete beta function. Setting the right-hand side equal to 0.5, it follows that \(1-p\) is the median of a Beta distribution (see Wikipedia) with parameters \(\alpha = n - m\) and \(\beta = m + 1\). While a closed-form expression for the median of the Beta distribution is not known, it turns out that the median can be approximated by (see paper):

\[\text{Median} \approx \frac{\alpha - \frac{1}{3}}{\alpha + \beta - \frac{2}{3}}\]

Substituting \(\alpha = n - m\) and \(\beta = m + 1\) yields:

\[1 - \hat{p}_{ta} \approx \frac{n - m - \frac{1}{3}}{n - m + m + 1 - \frac{2}{3}} = \frac{n - m - \frac{1}{3}}{n + \frac{1}{3}},\]

so that we obtain the approximation:

\[\hat{p}_{ta} \approx 1 - \frac{n - m - \frac{1}{3}}{n + \frac{1}{3}} = \frac{m + \frac{2}{3}}{n + \frac{1}{3}}.\]

So how good is this approximation? According to the paper, we should have both \(\alpha, \beta > 1\) for a good approximation, which is not the case for us when \(m = 0\) or \(m = n\). However, the approximation is still quite good: the following figure shows the log ratio of the approximation from above vs the exact maximum entropy probability for various \(n\) and \(m\). The approximation holds up even for small \(n\) and \(m\), with a relative error of less than \(4\%\).


Figure 5. Log ratio of approximated to exact median probability.
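
The error claim can be verified directly over a small grid (a sketch; \(m = n\) is excluded since the CDF equation has no solution there):

```python
from scipy.stats import beta

def taleb_estimator(n, m=0, q=0.5):
    return 1 - beta.ppf(q, n - m, m + 1)

def taleb_approx(n, m=0):
    return (m + 2 / 3) / (n + 1 / 3)

# Maximum relative error of the approximation on 1 <= n <= 60, 0 <= m < n.
errors = [abs(taleb_approx(n, m) / taleb_estimator(n, m) - 1)
          for n in range(1, 61) for m in range(n)]
print(f"max relative error: {max(errors):.4f}")
```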

Additionally, in certain special cases, we can derive exact formulas for the median of the Beta distribution. Specifically, when one of the parameters equals 1, the median can be calculated in closed form: for \(\beta = 1\), we have \(\text{median}(\alpha, 1) = 2^{-1/\alpha}\), and for \(\alpha = 1\), we have \(\text{median}(1, \beta) = 1 - 2^{-1/\beta}\); moreover, when the parameters are equal (\(\alpha = \beta\)), the median is exactly \(\frac{1}{2}\). Our case \(m = 0\) corresponds to \(\alpha = n\) and \(\beta = 1\), so the median is \(2^{-1/n}\) and we obtain:

\[\hat{p}_{t} = 1 - 2^{-1/n}.\]
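
A quick sketch verifying the closed-form special case against the numerical inversion:

```python
from scipy.stats import beta

def taleb_estimator(n, m=0, q=0.5):
    return 1 - beta.ppf(q, n - m, m + 1)

for n in [10, 60, 1000]:
    closed_form = 1 - 2 ** (-1 / n)
    print(f"n={n}: closed form {closed_form:.6f}, numeric {taleb_estimator(n):.6f}")
```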

The following figure verifies this formula empirically and compares it to the general approximation from above.


Figure 6. Comparison of the maximum entropy probability for various \(n\) and \(m=0\) vs the general approximation \(\hat{p}_{ta} \approx \frac{m + \frac{2}{3}}{n + \frac{1}{3}}\) and the exact special-case formula \(\hat{p}_{t} = 1 - 2^{-1/n}\). Again, we depict log ratios.

Final Remarks

Note that the Bayesian variants as well as the maximum entropy probability are biased estimators of \(p\). For the maximum entropy probability we even have \(\hat{p}_{t}(n,m) + \hat{p}_{t}(n,n-m) > 1\), whereas Laplace’s rule, like the empirical estimator, is symmetric: \(\hat{p}_{l}(n,m) + \hat{p}_{l}(n,n-m) = \frac{m+1}{n+2} + \frac{n-m+1}{n+2} = 1\), just as \(\hat{p}_s(n,m) + \hat{p}_s(n,n-m) = 1\). Both estimators converge to the empirical estimator \(\hat{p}_s = \frac{m}{n}\) for large \(n\), though.
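
These sums are easy to check numerically (a sketch; estimators as defined earlier):

```python
from scipy.stats import beta

def taleb_estimator(n, m=0, q=0.5):
    return 1 - beta.ppf(q, n - m, m + 1)

def laplace_estimator(n, m=0):
    return (m + 1) / (n + 2)

n = 20
for m in [1, 5, 10]:
    t_sum = taleb_estimator(n, m) + taleb_estimator(n, n - m)
    l_sum = laplace_estimator(n, m) + laplace_estimator(n, n - m)
    print(f"m={m}: maxent sum = {t_sum:.4f}, Laplace sum = {l_sum:.4f}")
```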

Also, “technically” we could obtain the approximation of the maximum entropy probability from above via the Bayesian approach, as the mean of a \(\text{Beta}(m + \frac{2}{3}, n - m - \frac{1}{3})\) distribution: \(\mathbb{E}[p] = \frac{\alpha}{\alpha + \beta} = \frac{m + \frac{2}{3}}{n + \frac{1}{3}}\). Whether this is “sensible” in the Bayesian estimation sense is debatable, as it would imply an initial prior \(p \sim \text{Beta}(2/3, -1/3)\), which is not a valid distribution since the Beta distribution requires \(\alpha, \beta > 0\).

References

  1. Taleb, N.N. Maximum Ignorance Probability, With Applications to Surgery’s Error Rates Link.
  2. Critical Analysis of Taleb’s Maximum Ignorance Estimator Link.
  3. Kerman, J. A closed-form approximation for the median of the beta distribution Link.