Lecture 5: Distributions#

Note

This lecture introduces common distributions including discrete value distributions such as the Binomial and Poisson distributions, as well as continuous value distributions such as Uniform, Normal, and Exponential distributions.

Caution

This lecture uses Python to facilitate dynamic online visualization. If you are not comfortable with Python, you can ignore the code cells, focusing only on the visualizations. However, if you are an experienced/enthusiastic Python user, feel free to play around with the code cells.


Binomial Distribution#

Given a Bernoulli trial (a single experiment with exactly two possible outcomes: success and failure), the Binomial Distribution is a discrete probability distribution that models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. For instance, consider a box of tickets wherein a proportion \(p\) of tickets are labeled as 1 while \(1-p\) are labeled as 0. If we were to draw \(n\) tickets (with replacement), the probability of drawing exactly \(k\) tickets labeled as 1 follows a Binomial Distribution. Specifically, since the draws are independent and identically distributed, this probability can be evaluated as \(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\).

Test Yourself

Compute the probability of landing exactly 1/2/3/4/5/6 head(s) in 6 flips.
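As a quick check on your hand computation, the same probabilities can be evaluated with `scipy.stats.binom` (a minimal sketch, assuming a fair coin, i.e. \(p = 0.5\)):

```python
from scipy.stats import binom

# Fair coin: n = 6 flips, probability of heads p = 0.5
n, p = 6, 0.5

# P(X = k) for k = 1, ..., 6 heads
for k in range(1, 7):
    print(f"P(X = {k}) = {binom.pmf(k, n, p):.4f}")
```

Each value equals \(\binom{6}{k} / 2^6\); for example, \(P(X = 3) = 20/64 = 0.3125\).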

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
import ipywidgets as widgets
from IPython.display import display

# Function to plot the binomial histogram
def plot_binomial(n, p):
    k = np.arange(0, n + 1)
    probs = binom.pmf(k, n, p)

    # Theoretical statistics
    mean = n * p
    median = np.floor(n * p) if p != 0.5 else n / 2  # approximation; exact median lies in [floor(np), ceil(np)]
    mode = np.floor((n + 1) * p)
    var = n * p * (1 - p)
    std_dev = np.sqrt(var)
    iqr = binom.ppf(0.75, n, p) - binom.ppf(0.25, n, p)
    value_range = n  # Range = max - min = n - 0
    skewness = (1 - 2 * p) / std_dev
    kurtosis = (1 - 6 * p * (1 - p)) / var

    # Create plot
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.bar(k, probs, color='skyblue', edgecolor='black')
    ax.set_title(f'Binomial PMF (n = {n}, p = {p})')
    ax.set_xlabel('Number of Successes (k)')
    ax.set_ylabel('Probability')
    ax.grid(axis='y', linestyle='--', alpha=0.7)

    # Text boxes for each category of statistics
    location_stats = (
        f"Measures of Location\n"
        f"- Mean: {mean:.2f}\n"
        f"- Median: {median:.2f}\n"
        f"- Mode: {mode:.0f}"
    )

    dispersion_stats = (
        f"Measures of Dispersion\n"
        f"- Range: {value_range}\n"
        f"- Q1: {binom.ppf(0.25, n, p):.2f}\n"
        f"- Q3: {binom.ppf(0.75, n, p):.2f}\n"
        f"- IQR: {iqr:.2f}\n"
        f"- Std Dev: {std_dev:.2f}"
    )

    shape_stats = (
        f"Measures of Shape\n"
        f"- Skewness: {skewness:.2f}\n"
        f"- Kurtosis: {kurtosis:.2f}"
    )

    # Add each stats box in a different location
    ax.text(1.02, 0.95, location_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='whitesmoke'))

    ax.text(1.02, 0.665, dispersion_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='honeydew'))

    ax.text(1.02, 0.30, shape_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='lavender'))

    plt.tight_layout()
    plt.show()

# Interactive widgets
n_slider = widgets.IntSlider(value=50, min=1, max=100, step=1, description='n (trials):')
p_slider = widgets.FloatSlider(value=0.5, min=0.01, max=0.99, step=0.01, description='p (success):')

ui = widgets.VBox([n_slider, p_slider])
out = widgets.interactive_output(plot_binomial, {'n': n_slider, 'p': p_slider})

# Display the widget
display(ui, out)

Poisson Distribution#

Given a Bernoulli trial (a single experiment with exactly two possible outcomes: success and failure), the Poisson Distribution is a discrete probability distribution that models the number of successes in a large number of independent Bernoulli trials, each with the same but rare probability of success. For instance, consider a box of tickets wherein a very small proportion (\(p\)) of tickets are labeled as 1 while \(1-p\) are labeled as 0. If we were to draw \(n\) tickets (with replacement) such that the expected number of 1s remains fixed at \(\lambda = np\), then the probability of drawing exactly \(k\) tickets labeled as 1 follows a Binomial Distribution \(P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}\) that converges to the Poisson form \(P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\) as \(n \to \infty\).

Proof

\[P(X = k) = \lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n - k} = \frac{\lambda^k e^{-\lambda}}{k!}\]

Expanding \(\binom{n}{k}\), which behaves like \(\frac{n^k}{k!}\) for large \(n\) since each factor in the product below tends to 1,

\[\begin{split} \begin{aligned} \binom{n}{k} & = \frac{n!}{k!(n-k)!} \\ & = \frac{\prod_{i=1}^k (n-k+i)}{k!} \\ & = \frac{n^k \prod_{i=1}^k (1-k/n+i/n)}{k!} \\ & \sim \frac{n^k}{k!} \quad (n \to \infty) \end{aligned} \end{split}\]

Using the following Taylor Expansion, \(\ln{(1+x)} = x - \frac{x^2}{2} + \frac{x^3}{3} - \dots\), expanding \(\left(1 - \frac{\lambda}{n}\right)^{n - k}\)

\[\begin{split} \begin{aligned} \left(1 - \frac{\lambda}{n}\right)^{n - k} &= y \\ \ln{y} &= (n-k)\ln\left(1 - \frac{\lambda}{n}\right) \\ \ln{y} &= -(n-k)\left(\frac{\lambda}{n} - \frac{\lambda^2}{2n^2} + \frac{\lambda^3}{3n^3} - \dots \right) \\ \ln{y} &= -\left(\lambda - \frac{\lambda^2}{2n} + \frac{\lambda^3}{3n^2} - \dots \right) + k\left(\frac{\lambda}{n} - \frac{\lambda^2}{2n^2} + \frac{\lambda^3}{3n^3} - \dots \right)\\ \lim_{n \to \infty} \ln{y} &= -\lambda \\ \lim_{n \to \infty} y &= e^{-\lambda} \end{aligned} \end{split}\]

Combining the two,

\[\begin{split} \begin{aligned} P(X = k) &= \lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n - k} \\ &= \lim_{n \to \infty} \frac{n^k}{k!} \left(\frac{\lambda}{n}\right)^k e^{-\lambda} \\ &= \frac{\lambda^k e^{-\lambda}}{k!} \end{aligned} \end{split}\]
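The convergence above can also be checked numerically: holding \(\lambda = np\) fixed while increasing \(n\), the Binomial PMF approaches the Poisson PMF. A small sketch (the values of \(\lambda\) and \(k\) below are illustrative):

```python
from scipy.stats import binom, poisson

lam, k = 3.0, 2  # fixed expected count λ = np; evaluate at k successes

# Binomial PMF with p = λ/n for increasing n
for n in (10, 100, 1000, 10000):
    print(f"n = {n:>5}: Binomial P(X = {k}) = {binom.pmf(k, n, lam / n):.5f}")

# Poisson limit for comparison
print(f"Poisson limit:  P(X = {k}) = {poisson.pmf(k, lam):.5f}")
```

By \(n = 10000\) the two PMFs agree to several decimal places.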

Test Yourself

Based on historical data, it is known that Chennai observes extreme flooding on average once every 10 years. Compute the probability that Chennai will observe exactly 1/2/3/4 extreme flooding event(s) in the next 20 years.
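A hedged sketch of this computation, assuming the flooding events are well modeled as a Poisson process: one event per 10 years over a 20-year window gives a rate of \(\lambda = 20/10 = 2\).

```python
from scipy.stats import poisson

lam = 20 / 10  # 1 event per 10 years → expected 2 events in 20 years

# P(X = k) for k = 1, ..., 4 flooding events
for k in range(1, 5):
    print(f"P(X = {k}) = {poisson.pmf(k, lam):.4f}")
```

Note that \(P(X = 1) = P(X = 2) = 2e^{-2}\) here, since \(\lambda = 2\) makes the PMF equal at \(k = 1\) and \(k = 2\).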

import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt
from scipy.stats import poisson
from IPython.display import display

# Function to plot Poisson PMF with statistics
def plot_poisson(lambd):
    x = np.arange(0, max(20, int(lambd + 4 * np.sqrt(lambd))))  # Extend range for higher λ
    pmf = poisson.pmf(x, lambd)

    # Theoretical statistics
    mean = lambd
    median = np.floor(lambd + 1/3 - 0.02/lambd)  # Approximation
    mode = np.floor(lambd)
    var = lambd
    std_dev = np.sqrt(var)
    q1 = poisson.ppf(0.25, lambd)
    q3 = poisson.ppf(0.75, lambd)
    iqr = q3 - q1
    value_range = x[-1] - x[0]  # range of the plotted support (theoretical support is unbounded)
    skewness = 1 / np.sqrt(lambd)
    kurtosis = 1 / lambd

    # Create plot
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.bar(x, pmf, color='lightcoral', edgecolor='black')
    ax.set_title(f'Poisson PMF (λ = {lambd})')
    ax.set_xlabel('Number of Events (k)')
    ax.set_ylabel('Probability')
    ax.grid(axis='y', linestyle='--', alpha=0.7)

    # Text boxes for each category of statistics
    location_stats = (
        f"Measures of Location\n"
        f"- Mean: {mean:.2f}\n"
        f"- Median: {median:.2f}\n"
        f"- Mode: {mode:.0f}"
    )

    dispersion_stats = (
        f"Measures of Dispersion\n"
        f"- Range: {value_range}\n"
        f"- Q1: {q1:.2f}\n"
        f"- Q3: {q3:.2f}\n"
        f"- IQR: {iqr:.2f}\n"
        f"- Std Dev: {std_dev:.2f}"
    )

    shape_stats = (
        f"Measures of Shape\n"
        f"- Skewness: {skewness:.2f}\n"
        f"- Kurtosis: {kurtosis:.2f}"
    )

    # Add stats boxes
    ax.text(1.02, 0.95, location_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='whitesmoke'))

    ax.text(1.02, 0.665, dispersion_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='honeydew'))

    ax.text(1.02, 0.30, shape_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='lavender'))

    plt.tight_layout()
    plt.show()

# Interactive widget
lambda_slider = widgets.IntSlider(value=3, min=1, max=15, description='λ (rate):')
ui = widgets.VBox([lambda_slider])
out = widgets.interactive_output(plot_poisson, {'lambd': lambda_slider})

# Display
display(ui, out)

Exponential Distribution#

Given a Bernoulli trial — a single experiment with exactly two possible outcomes: success and failure — the Exponential Distribution is a continuous probability distribution that models the time to first success in a sequence of independent Bernoulli trials, each with the same but rare probability of success. For instance, consider a box of tickets wherein a very small proportion \((p)\) of tickets are labeled as 1, while \(1 - p\) are labeled as 0. If we were to draw \(n\) tickets (with replacement), each at a regular interval of \(\Delta t\) such that the expected number of 1s remains fixed at \(\lambda = np\), then the time to the first ticket labeled 1 is described by the probability density function \(f(t) = \lambda e^{-\lambda t}\) for \(t \ge 0\), where \(\lambda\) is the constant rate of occurrence.

Proof

\[f(t) = \lambda e^{-\lambda t}\]

To begin with, the probability of the first success (ticket labeled as 1) occurring at the \(k^{th}\) trial is given by \(P(X = k) = (1 - p)^{k-1}p = (1-p)^k \frac{p}{1-p}\).

Discretizing time such that \(t = k \Delta{t}\), we have \(P(T = t) = (1 - p)^{\frac{t}{\Delta{t}}}\frac{p}{1-p}\).

Further, since \(p = \lambda \Delta{t}\), we have \(P(T = t) = (1 - \lambda \Delta{t})^{\frac{t}{\Delta{t}}} \frac{\lambda \Delta{t}}{1-\lambda \Delta{t}}\).

Now then, using the following Taylor Expansion, \(\ln{(1+x)} = x - \frac{x^2}{2} + \frac{x^3}{3} - \dots\), we expand \((1 - \lambda \Delta{t})^{\frac{t}{\Delta{t}}}\),

\[\begin{split} \begin{aligned} (1 - \lambda \Delta{t})^{\frac{t}{\Delta{t}}} &= y \\ \ln{y} & = \frac{t}{\Delta{t}} \ln{(1 - \lambda \Delta{t})} \\ \ln{y} & = -\frac{t}{\Delta{t}} (\lambda \Delta{t} - \frac{\lambda^2\Delta{t}^2}{2} + ...) \\ \lim_{\Delta{t} \to 0} \ln{y} & = -\lambda t \\ \lim_{\Delta{t} \to 0} y & = e^{-\lambda t} \end{aligned} \end{split}\]

Further, to first order in \(\Delta{t}\),

\[\frac{\lambda \Delta{t}}{1-\lambda \Delta{t}} \approx \lambda \Delta{t}\]

Combining the two renders,

\[P(T = t) = \lambda e^{-\lambda t}\Delta{t}\]

Consequently, probability density is given by,

\[f(t) = \frac{P(T = t)}{\Delta{t}} = \lambda e^{-\lambda t}\]
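The derivation above can be checked by simulation: drawing tickets every \(\Delta t\) with success probability \(p = \lambda \Delta t\) produces geometric waiting times whose scaled values approach the exponential distribution, which has mean and standard deviation both equal to \(1/\lambda\). A minimal sketch (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, dt = 2.0, 1e-3     # rate λ and small draw interval Δt
p = lam * dt            # per-draw success probability p = λΔt
n_sims = 100_000

# Geometric number of draws until the first success, scaled to time units
waits = rng.geometric(p, size=n_sims) * dt

print(f"Sample mean: {waits.mean():.3f} (theory: {1 / lam:.3f})")
print(f"Sample std:  {waits.std():.3f}  (theory: {1 / lam:.3f})")
```

Both sample statistics land close to \(1/\lambda = 0.5\), as the exponential limit predicts.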
import numpy as np
import ipywidgets as widgets
from scipy.stats import expon
import matplotlib.pyplot as plt
from IPython.display import display

# Function to plot Exponential PDF with annotated statistics
def plot_exponential(scale):
    x = np.linspace(-5, 25, 500)
    pdf = expon.pdf(x, scale=scale)

    # Theoretical statistics
    mean = scale
    median = scale * np.log(2)
    mode = 0
    value_range = "0 to ∞"
    q1 = expon.ppf(0.25, scale=scale)
    q3 = expon.ppf(0.75, scale=scale)
    iqr = q3 - q1
    std_dev = scale
    skewness = 2
    kurtosis = 6  # Excess kurtosis = 6

    # Create plot
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.plot(x, pdf, color='darkgreen', linewidth=2)
    ax.fill_between(x, pdf, color='palegreen', alpha=0.5)
    ax.set_title(f'Exponential PDF (scale = {np.round(scale,1)})')
    ax.set_xlabel('x')
    ax.set_ylabel('Density')
    ax.set_xlim(-5, 25)
    ax.grid(axis='y', linestyle='--', alpha=0.7)

    # Statistics boxes
    location_stats = (
        f"Measures of Location\n"
        f"- Mean: {mean:.2f}\n"
        f"- Median: {median:.2f}\n"
        f"- Mode: {mode}"
    )

    dispersion_stats = (
        f"Measures of Dispersion\n"
        f"- Range: {value_range}\n"
        f"- Q1: {q1:.2f}\n"
        f"- Q3: {q3:.2f}\n"
        f"- IQR: {iqr:.2f}\n"
        f"- Std Dev: {std_dev:.2f}"
    )

    shape_stats = (
        f"Measures of Shape\n"
        f"- Skewness: {skewness:.2f}\n"
        f"- Kurtosis: {kurtosis:.2f}"
    )

    # Add text boxes
    ax.text(1.02, 0.95, location_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='whitesmoke'))

    ax.text(1.02, 0.665, dispersion_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='honeydew'))

    ax.text(1.02, 0.30, shape_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='lavender'))

    plt.tight_layout()
    plt.show()

# Interactive widget
scale_slider = widgets.FloatSlider(value=1.0, min=0.1, max=5.0, step=0.1, description='Scale (λ⁻¹):')

ui = widgets.VBox([scale_slider])
out = widgets.interactive_output(plot_exponential, {'scale': scale_slider})

# Display
display(ui, out)

Uniform Distribution#

The Uniform Distribution is a continuous probability distribution that models a situation where all outcomes within a specific range are equally likely. Here, consider a box containing an infinite number of tickets, each labeled with a unique real number between \(a\) and \(b\). If we draw one ticket uniformly at random, then any number \(x \in [a, b]\) is equally likely to be drawn. Since the probability of drawing any exact number is zero in a continuous setting, we describe the distribution using a constant probability density function: \(f(x) = \frac{1}{b - a}\) for \(x \in [a, b]\).
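The constant density and the resulting moments can be verified directly with `scipy.stats.uniform`; a small sketch using the illustrative endpoints \(a = 2\), \(b = 8\):

```python
import numpy as np
from scipy.stats import uniform

a, b = 2.0, 8.0
dist = uniform(loc=a, scale=b - a)  # scipy parameterizes by loc = a, scale = b - a

print(f"f(x) inside [a, b]: {dist.pdf(5.0):.4f}")                    # 1/(b - a)
print(f"Mean: {dist.mean():.4f} (theory: {(a + b) / 2:.4f})")
print(f"Std:  {dist.std():.4f} (theory: {(b - a) / np.sqrt(12):.4f})")
```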

import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt
from scipy.stats import uniform
from IPython.display import display

# Function to plot uniform PDF with statistics
def plot_uniform(a, b):
    if b <= a:
        print("Upper bound must be greater than lower bound.")
        return

    x = np.linspace(-20, 20, 500)
    pdf = uniform.pdf(x, loc=a, scale=b - a)

    # Theoretical statistics
    mean = (a + b) / 2
    median = mean
    mode = "Any value in [a, b]"
    value_range = b - a
    q1 = a + 0.25 * (b - a)
    q3 = a + 0.75 * (b - a)
    iqr = q3 - q1
    std_dev = (b - a) / np.sqrt(12)
    skewness = 0
    kurtosis = -6/5  # Excess kurtosis for uniform

    # Create plot
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.plot(x, pdf, color='darkorange', linewidth=2)
    ax.fill_between(x, pdf, color='moccasin', alpha=0.5)
    ax.set_title(f'Uniform PDF (a = {a}, b = {np.round(b,1)})')
    ax.set_xlabel('x')
    ax.set_ylabel('Density')
    ax.set_xlim(-25, 25)
    ax.grid(axis='y', linestyle='--', alpha=0.7)

    # Statistics boxes
    location_stats = (
        f"Measures of Location\n"
        f"- Mean: {mean:.2f}\n"
        f"- Median: {median:.2f}\n"
        f"- Mode: {mode}"
    )

    dispersion_stats = (
        f"Measures of Dispersion\n"
        f"- Range: {value_range:.2f}\n"
        f"- Q1: {q1:.2f}\n"
        f"- Q3: {q3:.2f}\n"
        f"- IQR: {iqr:.2f}\n"
        f"- Std Dev: {std_dev:.2f}"
    )

    shape_stats = (
        f"Measures of Shape\n"
        f"- Skewness: {skewness:.2f}\n"
        f"- Kurtosis: {kurtosis:.2f}"
    )

    # Add stats boxes
    ax.text(1.02, 0.95, location_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='whitesmoke'))

    ax.text(1.02, 0.665, dispersion_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='honeydew'))

    ax.text(1.02, 0.30, shape_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='lavender'))

    plt.tight_layout()
    plt.show()

# Interactive widgets
a_slider = widgets.FloatSlider(value=-10, min=-10, max=-1, step=0.5, description='a (min):')
b_slider = widgets.FloatSlider(value=10, min=1, max=10, step=0.5, description='b (max):')

ui = widgets.VBox([a_slider, b_slider])
out = widgets.interactive_output(plot_uniform, {'a': a_slider, 'b': b_slider})

# Display
display(ui, out)

Normal Distribution#

The Normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that emerges from the Central Limit Theorem. Here, consider a box containing an infinite number of tickets, each labeled with a real number, such that the labels have a fixed mean \(\mu\) and variance \(\sigma^2\). If we repeatedly draw tickets from this box — say, \(n\) independent draws (with replacement) — and compute the average of the values drawn, the distribution of that average will increasingly resemble a bell-shaped curve with probability density function of the form \(f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\) (with the average's own mean \(\mu\) and variance \(\sigma^2/n\) substituted in), as the number of draws becomes large. Thus, the normal distribution can be seen as the emergent shape that results from aggregating randomness — a natural consequence of repeatedly averaging values from a well-behaved box.
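A quick sketch of this emergence: averaging draws from a decidedly non-normal box (a Uniform(0, 1) box here, chosen purely for illustration) produces sample means whose location and spread match the \(\mathcal{N}(\mu, \sigma^2/n)\) prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 50, 100_000  # draws per average, number of averages

# Box: Uniform(0, 1) tickets → μ = 0.5, σ² = 1/12
means = rng.uniform(0, 1, size=(n_sims, n)).mean(axis=1)

print(f"Mean of averages: {means.mean():.4f} (theory: 0.5000)")
print(f"Std of averages:  {means.std():.4f} (theory: {np.sqrt(1 / 12 / n):.4f})")
```

Plotting a histogram of `means` against the predicted normal density makes the bell shape visible; the statistics alone already agree closely.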

import numpy as np
import ipywidgets as widgets
import matplotlib.pyplot as plt
from scipy.stats import norm
from IPython.display import display

# Function to plot Normal PDF with annotated statistics
def plot_normal(mu, sigma):
    x = np.linspace(-25, 25, 500)
    pdf = norm.pdf(x, mu, sigma)

    # Theoretical statistics
    mean = mu
    median = mu
    mode = mu
    value_range = f"Theoretically Infinite"
    q1 = norm.ppf(0.25, mu, sigma)
    q3 = norm.ppf(0.75, mu, sigma)
    iqr = q3 - q1
    std_dev = sigma
    skewness = 0
    kurtosis = 0  # Excess kurtosis = 0 for normal

    # Create plot
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.plot(x, pdf, color='teal', linewidth=2)
    ax.fill_between(x, pdf, color='lightblue', alpha=0.5)
    ax.set_title(f'Normal PDF (μ = {mu}, σ = {np.round(sigma,1)})')
    ax.set_xlabel('x')
    ax.set_ylabel('Density')
    ax.grid(axis='y', linestyle='--', alpha=0.7)

    # Statistics boxes
    location_stats = (
        f"Measures of Location\n"
        f"- Mean: {mean:.2f}\n"
        f"- Median: {median:.2f}\n"
        f"- Mode: {mode:.2f}"
    )

    dispersion_stats = (
        f"Measures of Dispersion\n"
        f"- Range: {value_range}\n"
        f"- Q1: {q1:.2f}\n"
        f"- Q3: {q3:.2f}\n"
        f"- IQR: {iqr:.2f}\n"
        f"- Std Dev: {std_dev:.2f}"
    )

    shape_stats = (
        f"Measures of Shape\n"
        f"- Skewness: {skewness:.2f}\n"
        f"- Kurtosis: {kurtosis:.2f}"
    )

    # Add text boxes
    ax.text(1.02, 0.95, location_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='whitesmoke'))

    ax.text(1.02, 0.665, dispersion_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='honeydew'))

    ax.text(1.02, 0.30, shape_stats, transform=ax.transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='lavender'))

    plt.tight_layout()
    plt.show()

# Interactive widgets
mu_slider = widgets.FloatSlider(value=0, min=-5, max=5, step=0.5, description='μ (mean):')
sigma_slider = widgets.FloatSlider(value=1, min=0.1, max=5, step=0.1, description='σ (std dev):')

ui = widgets.VBox([mu_slider, sigma_slider])
out = widgets.interactive_output(plot_normal, {'mu': mu_slider, 'sigma': sigma_slider})

# Display
display(ui, out)