Lecture 11: Hypothesis Tests#
Note
This lecture introduces frequently used hypothesis tests in statistics, detailing each test's relevance along with its assumptions, test statistic, rejection criterion, and a working example. The tests are presented in the context of two-tailed tests, where a deviation of the observed value in either direction from the hypothesized value is considered equally undesirable, as well as one-tailed tests, where only a deviation in a specific direction from the hypothesized value is undesirable.
z-Test#
Mean Value Test#
Relevance#
The mean value z-test evaluates whether a sample mean significantly differs from a hypothesized population mean, under the assumption that,
population variance is known
sampled observations are independent and randomly drawn from the population
distribution of the sample means is approximately normal (either due to a large sample size or the underlying population being normal)
Test Statistic#
The test statistic \((z)\) in a z-test quantifies the deviation of the sample mean from the population mean in units of the standard error as follows,
\[
z = \frac{\hat{\mu} - \mu_o}{\sigma / \sqrt{n}}
\]
Where,
\(\hat{\mu}\) is the sample mean
\(\mu_o\) is the hypothesized population mean
\(\sigma\) is the population standard deviation
\(n\) is the sample size
Rejection Criterion#
The null hypothesis can be rejected on the basis of rejection region \((\text{RR})\) of the test statistic \((z)\), which defines the set of values for \(z\) that are considered highly improbable under the assumption that the null hypothesis is true. To this end, if the computed test statistic falls within this rejection region \((z \in \text{RR})\), then the null hypothesis can be rejected. Note, for a two-tailed test, the rejection region is given by \((-\infty, -z_{\alpha/2}) \cup (z_{\alpha/2}, \infty)\). In contrast, for a one-tailed test, \((-\infty, -z_{\alpha})\) forms the rejection region in a left-tailed test, and \((z_{\alpha}, \infty)\) in a right-tailed test.
Alternatively, the null hypothesis can be rejected on the basis of the confidence interval \((\text{CI})\), which provides a range of plausible values for the population mean \((\mu_o)\) under the assumption that the null hypothesis is true. To this end, if the population mean lies outside the confidence interval \((\mu_o \notin \text{CI})\), then the null hypothesis can be rejected. Note, for a two-tailed test, the confidence interval is given by \([\hat{\mu} - z_{\alpha / 2}(\sigma / \sqrt{n}), \hat{\mu} + z_{\alpha / 2}(\sigma / \sqrt{n})]\). In contrast, for a one-tailed test, \((-\infty, \hat{\mu} + z_{\alpha}(\sigma / \sqrt{n}))\) forms the confidence interval in a left-tailed test, and \((\hat{\mu} - z_{\alpha}(\sigma / \sqrt{n}), \infty)\) in a right-tailed test.
Note, p-value is the smallest value of significance level (\(\alpha\)) for which we can reject the null hypothesis. A high p-value implies that the observed data is reasonably consistent under the presumption of the null hypothesis, thereby compelling us to not reject it. On the other hand, a low p-value signifies that the observed outcome is highly improbable under the presumption of null hypothesis, thereby supporting its rejection beyond a reasonable doubt.
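To make the rejection-region bounds and the p-value concrete, here is a minimal sketch in Python (assuming SciPy is available); the values of `alpha` and `z` are illustrative placeholders, not taken from the lecture.

```python
# A minimal sketch of how z-test critical values and p-values can be obtained
# from the standard normal distribution (assumes SciPy is installed).
from scipy.stats import norm

alpha = 0.05   # significance level (illustrative)
z = -2.0       # an example value of the computed test statistic (illustrative)

# critical values defining the rejection region
z_two_tailed = norm.ppf(1 - alpha / 2)   # ~ 1.96;  RR = (-inf, -1.96) U (1.96, inf)
z_one_tailed = norm.ppf(1 - alpha)       # ~ 1.645; one-tailed critical value

# p-values for the three flavours of the test
p_two_tailed   = 2 * norm.cdf(-abs(z))   # two-tailed
p_left_tailed  = norm.cdf(z)             # left-tailed
p_right_tailed = 1 - norm.cdf(z)         # right-tailed

print(z_two_tailed, z_one_tailed, p_two_tailed, p_left_tailed, p_right_tailed)
```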
Example#
An automobile manufacturer claims that its electric SUV has a real-world range of 500 km. The Automotive Research Association of India (ARAI) conducts 100 range tests, yielding a mean of 485 km. Given a population standard deviation of 25 km, test the claim at the 5% significance level.
\(H_o\): The electric SUV meets the claimed range (\(\mu \geq 500\))
\(H_a\): The electric SUV falls short of the claimed range (\(\mu < 500\))
The test statistic is computed as, \(z = (485 - 500) / (25 / \sqrt{100}) = -6\).
For this left-tailed test, at the 5% significance level, the rejection region is given by, \(\text{RR} = (-\infty, -1.645)\). Since \(z \in \text{RR} \Rightarrow \text{reject} \ H_o\).
Alternatively, for this left-tailed test, the confidence interval is given by, \(\text{CI} = (-\infty, 485 + 1.645 \times (25 / \sqrt{100})) = (-\infty, 489.1)\). Since \(\mu_o \notin \text{CI} \Rightarrow \text{reject} \ H_o\)
Further, the p-value for this test is 0.0000001%, implying that there is a one in a billion chance of observing such an outcome under the presumption of the null hypothesis, thereby compelling us to reject the null hypothesis.
Hence, based on the observed outcomes, the electric SUV fails to meet the claimed range of 500 km, beyond a reasonable doubt. Note, this reasonable doubt emerges from a 5% possibility of rejecting a true null hypothesis - Type I error.
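The same computation can be reproduced programmatically. Below is a minimal sketch (assuming NumPy and SciPy are available) that recovers the test statistic, the one-sided critical value, the p-value, and the one-sided confidence bound from the numbers in the example.

```python
# A minimal sketch of the left-tailed z-test from the SUV range example.
import numpy as np
from scipy.stats import norm

mu_0, x_bar, sigma, n, alpha = 500, 485, 25, 100, 0.05

se = sigma / np.sqrt(n)                        # standard error = 2.5
z = (x_bar - mu_0) / se                        # test statistic = -6.0

z_crit = norm.ppf(alpha)                       # ~ -1.645 (left-tailed)
p_value = norm.cdf(z)                          # ~ 1e-9

ci_upper = x_bar + norm.ppf(1 - alpha) * se    # one-sided CI: (-inf, 489.1)

print(f"z = {z:.2f}, reject H0: {z < z_crit}")
print(f"p-value = {p_value:.2e}, CI upper bound = {ci_upper:.1f}")
```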
One-Proportion Test#
Relevance#
The one-proportion z-test evaluates whether the proportion observed in a sample significantly differs from the population proportion, under the assumption that,
population observations are modeled as Bernoulli Trials
sampled observations are independent and randomly drawn from the population
distribution of the sample proportion is approximately normal (due to large sample size)
Test Statistic#
The test statistic in a one-proportion z-test captures how far the observed sample proportion deviates from the hypothesized population proportion, scaled by the standard error, as follows,
\[
z = \frac{\hat{p} - p_o}{\sqrt{p_o (1 - p_o) / n}}
\]
Where,
\(\hat{p}\) is the observed sample proportion
\(p_o\) is the hypothesized population proportion
\(n\) is the sample size.
Rejection Criterion#
The null hypothesis can be rejected on the basis of rejection region \((\text{RR})\) of the test statistic \((z)\), which defines the set of values for \(z\) that are considered highly improbable under the assumption that the null hypothesis is true. To this end, if the computed test statistic falls within this rejection region \((z \in \text{RR})\), then the null hypothesis can be rejected. Note, for a two-tailed test, the rejection region is given by \((-\infty, -z_{\alpha/2}) \cup (z_{\alpha/2}, \infty)\). In contrast, for a one-tailed test, \((-\infty, -z_{\alpha})\) forms the rejection region in a left-tailed test, and \((z_{\alpha}, \infty)\) in a right-tailed test.
Note, p-value is the smallest value of significance level (\(\alpha\)) for which we can reject the null hypothesis. A high p-value implies that the observed data is reasonably consistent under the presumption of the null hypothesis, thereby compelling us to not reject it. On the other hand, a low p-value signifies that the observed outcome is highly improbable under the presumption of null hypothesis, thereby supporting its rejection beyond a reasonable doubt.
Example#
A public health agency claims that at least 70% of citizens are compliant with a new vaccination policy. A random survey of 400 people reveals that 260 of them are compliant. Test the claim of the public health agency at 5% significance level.
\(H_o\): \(p \geq 0.70\)
\(H_a\): \(p < 0.70\)
The test statistic is computed as, \(z = (0.65 - 0.7)/\sqrt{0.7 \times (1 - 0.7)/400} \approx -2.18\)
At 5% significance level, the rejection region is \((-\infty, -1.645)\). Since, \(z \in RR \Rightarrow \text{reject} \ H_o\).
Hence, based on the observed data, there is sufficient statistical evidence to suggest that the actual compliance rate is less than the claimed 70%, beyond a reasonable doubt. Note, this reasonable doubt emerges from a 5% possibility of rejecting a true null hypothesis - Type I error.
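As a cross-check, the following minimal sketch (assuming NumPy and SciPy are available) reproduces the one-proportion z-test above, using the hypothesized proportion in the standard error as the formula prescribes.

```python
# A minimal sketch of the left-tailed one-proportion z-test from the example.
import numpy as np
from scipy.stats import norm

p0, x, n, alpha = 0.70, 260, 400, 0.05

p_hat = x / n                                  # observed proportion = 0.65
se = np.sqrt(p0 * (1 - p0) / n)                # standard error under H0
z = (p_hat - p0) / se                          # ~ -2.18

z_crit = norm.ppf(alpha)                       # ~ -1.645 (left-tailed)
p_value = norm.cdf(z)                          # ~ 0.015

print(f"z = {z:.2f}, reject H0: {z < z_crit}, p-value = {p_value:.4f}")
```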
Two-Proportion Test#
Relevance#
The two-proportion z-test is used to determine whether there is a significant difference between the proportions of two independent groups, under the assumption that,
observations within each group are modeled as independent Bernoulli Trials
the samples are independently and randomly drawn from their respective populations
distribution of the sample proportion is approximately normal (due to large sample size)
Test Statistic#
The test statistic evaluates how far apart the observed proportions in the two samples are, relative to the standard error of the difference under the null hypothesis that the two population proportions are equal, given by,
\[
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
\]
Where,
\(\hat{p}_1\) and \(\hat{p}_2\) are the sample proportions from the respective populations
\(n_1\) and \(n_2\) are the sample sizes
\(\hat{p}\) is the pooled sample proportion, defined as \(\hat{p} = (\hat{p}_1 n_1 + \hat{p}_2 n_2)/(n_1 + n_2)\)
Rejection Criterion#
The null hypothesis can be rejected on the basis of rejection region \((\text{RR})\) of the test statistic \((z)\), which defines the set of values for \(z\) that are considered highly improbable under the assumption that the null hypothesis is true. To this end, if the computed test statistic falls within this rejection region \((z \in \text{RR})\), then the null hypothesis can be rejected. Note, for a two-tailed test, the rejection region is given by \((-\infty, -z_{\alpha/2}) \cup (z_{\alpha/2}, \infty)\). In contrast, for a one-tailed test, \((-\infty, -z_{\alpha})\) forms the rejection region in a left-tailed test, and \((z_{\alpha}, \infty)\) in a right-tailed test.
Note, p-value is the smallest value of significance level (\(\alpha\)) for which we can reject the null hypothesis. A high p-value implies that the observed data is reasonably consistent under the presumption of the null hypothesis, thereby compelling us to not reject it. On the other hand, a low p-value signifies that the observed outcome is highly improbable under the presumption of null hypothesis, thereby supporting its rejection beyond a reasonable doubt.
Example#
A researcher claims that there is a significant difference in support for a new public policy between urban and rural residents. A sample of 300 urban residents yields 180 supporters, while a sample of 250 rural residents yields 130 supporters. Test the claim at 5% significance level.
\(H_o\): \(p_1 = p_2\)
\(H_a\): \(p_1 \neq p_2\)
Pooled proportion is: \(\hat{p} = (180 + 130)/(300 + 250) = 310/550 = 0.564\)
The test statistic is computed as, \(z = (0.60 - 0.52) / \sqrt{0.564(1 - 0.564)(1/300 + 1/250)} \approx 1.88\)
At 5% significance level, the rejection region is \((-\infty, -1.96) \cup (1.96, \infty)\). Since, \(z \notin RR \Rightarrow \text{cannot reject} \ H_o\).
Thus, there is insufficient evidence to conclude that support for the policy differs between urban and rural residents, at least not beyond a reasonable doubt. This doubt arises from the possibility of failing to reject a false null hypothesis - Type II error.
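The two-proportion computation can be verified with a short sketch as well (again assuming NumPy and SciPy); the pooled proportion is used in the standard error, just as in the hand calculation.

```python
# A minimal sketch of the two-tailed two-proportion z-test from the example.
import numpy as np
from scipy.stats import norm

x1, n1 = 180, 300          # urban supporters
x2, n2 = 130, 250          # rural supporters
alpha = 0.05

p1, p2 = x1 / n1, x2 / n2                      # 0.60, 0.52
p_pool = (x1 + x2) / (n1 + n2)                 # ~ 0.564
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                             # ~ 1.88

z_crit = norm.ppf(1 - alpha / 2)               # ~ 1.96 (two-tailed)
p_value = 2 * norm.cdf(-abs(z))                # ~ 0.06

print(f"z = {z:.2f}, reject H0: {abs(z) > z_crit}, p-value = {p_value:.3f}")
```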
t-Test#
One-Sample Test#
Relevance#
The one-sample t-test is used to determine whether the mean of a single sample differs significantly from a known or hypothesized population mean when the population standard deviation is unknown, under the assumption that
sampled observations are independent, continuous, and randomly drawn from the population
underlying population distribution is approximately normal
Test Statistic#
The test statistic in a one-sample t-test measures how many standard errors the sample mean is away from the hypothesized population mean. Since the population standard deviation is unknown, it is estimated from the sample itself. The test statistic is given by:
\[
t = \frac{\hat{\mu} - \mu_o}{s / \sqrt{n}}
\]
Where,
\(\hat{\mu}\) is the sample mean
\(\mu_o\) is the hypothesized population mean
\(s\) is the sample standard deviation
\(n\) is the sample size
This statistic follows a t-distribution with \(n - 1\) degrees of freedom.
Rejection Criterion#
The null hypothesis can be rejected on the basis of rejection region \((\text{RR})\) of the test statistic \((t)\), which defines the set of values for \(t\) that are considered highly improbable under the assumption that the null hypothesis is true. To this end, if the computed test statistic falls within this rejection region \((t \in \text{RR})\), then the null hypothesis can be rejected. Note, for a two-tailed test, the rejection region is given by \((-\infty, -t_{\alpha/2, n - 1}) \cup (t_{\alpha/2, n - 1}, \infty)\). In contrast, for a one-tailed test, \((-\infty, -t_{\alpha, n - 1})\) forms the rejection region in a left-tailed test, and \((t_{\alpha, n - 1}, \infty)\) in a right-tailed test.
Alternatively, the null hypothesis can be rejected on the basis of the confidence interval \((\text{CI})\), which provides a range of plausible values for the population mean \((\mu_o)\) under the assumption that the null hypothesis is true. To this end, if the population mean lies outside the confidence interval \((\mu_o \notin \text{CI})\), then the null hypothesis can be rejected. Note, for a two-tailed test, the confidence interval is given by \([\hat{\mu} - t_{\alpha / 2, n - 1}(s / \sqrt{n}), \hat{\mu} + t_{\alpha / 2, n - 1}(s / \sqrt{n})]\). In contrast, for a one-tailed test, \((-\infty, \hat{\mu} + t_{\alpha, n - 1}(s / \sqrt{n}))\) forms the confidence interval in a left-tailed test, and \((\hat{\mu} - t_{\alpha, n - 1}(s / \sqrt{n}), \infty)\) in a right-tailed test.
Note, p-value is the smallest value of significance level (\(\alpha\)) for which we can reject the null hypothesis. A high p-value implies that the observed data is reasonably consistent under the presumption of the null hypothesis, thereby compelling us to not reject it. On the other hand, a low p-value signifies that the observed outcome is highly improbable under the presumption of null hypothesis, thereby supporting its rejection beyond a reasonable doubt.
Example#
Suppose a battery manufacturer claims that their batteries last an average of 100 hours. A consumer group tests a random sample of 10 batteries and finds a sample mean of 95 hours with a sample standard deviation of 8 hours. At 5% significance level, test whether the batteries last less than the claimed duration.
\(H_o\): \(\mu \geq 100\)
\(H_a\): \(\mu < 100\)
The test statistic is \(t = \frac{95 - 100}{8 / \sqrt{10}} = \frac{-5}{2.53} \approx -1.98\)
At \(df = 9\), the rejection region is \((-\infty, -t_{0.05, 9}) = (-\infty, -1.833)\). Since \(t \in \text{RR} \Rightarrow \text{reject} \ H_o\)
Thus, there is sufficient evidence to suggest that the battery life is less than the claimed 100 hours, beyond a reasonable doubt. Note, this reasonable doubt emerges from a 5% possibility of rejecting a true null hypothesis - Type I error.
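A minimal sketch of the same left-tailed one-sample t-test, working directly from the summary statistics (assuming NumPy and SciPy are available), is given below.

```python
# A minimal sketch of the left-tailed one-sample t-test from the battery example.
import numpy as np
from scipy.stats import t as t_dist

mu_0, x_bar, s, n, alpha = 100, 95, 8, 10, 0.05
df = n - 1

t_stat = (x_bar - mu_0) / (s / np.sqrt(n))     # ~ -1.98
t_crit = t_dist.ppf(alpha, df)                 # ~ -1.833 (left-tailed)
p_value = t_dist.cdf(t_stat, df)               # ~ 0.04

print(f"t = {t_stat:.2f}, reject H0: {t_stat < t_crit}, p-value = {p_value:.3f}")
```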
Two-Sample Test#
Relevance#
The two-sample t-test is used to compare the means of two independent groups to determine if there is a statistically significant difference between them, under the assumption that,
sampled observations within each group are independent, continuous, and randomly drawn from their respective populations
underlying population distributions for both groups are approximately normal
Test Statistic#
The test statistic quantifies how far apart the sample means are, relative to the variability of the observations.
When assuming equal variances, the pooled t-test statistic is calculated as:
\[
t = \frac{\hat{\mu}_1 - \hat{\mu}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
\]
Where,
\(\hat{\mu}_1, \hat{\mu}_2\) are the sample means
\(s_1, s_2\) are the sample standard deviations, and
\(n_1, n_2\) are the sample sizes for the two groups
\(s_p\) is the pooled standard deviation, given by \(\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\)
The t-statistic follows a t-distribution with \(n_1 + n_2 - 2\) degrees of freedom.
Rejection Criterion#
The null hypothesis can be rejected on the basis of rejection region \((\text{RR})\) of the test statistic \((t)\), which defines the set of values for \(t\) that are considered highly improbable under the assumption that the null hypothesis is true. To this end, if the computed test statistic falls within this rejection region \((t \in \text{RR})\), then the null hypothesis can be rejected. Note, for a two-tailed test, the rejection region is given by \((-\infty, -t_{\alpha/2, n_1 + n_2 - 2}) \cup (t_{\alpha/2, n_1 + n_2 - 2}, \infty)\). In contrast, for a one-tailed test, \((-\infty, -t_{\alpha, n_1 + n_2 - 2})\) forms the rejection region in a left-tailed test, and \((t_{\alpha, n_1 + n_2 - 2}, \infty)\) in a right-tailed test.
Note, p-value is the smallest value of significance level (\(\alpha\)) for which we can reject the null hypothesis. A high p-value implies that the observed data is reasonably consistent under the presumption of the null hypothesis, thereby compelling us to not reject it. On the other hand, a low p-value signifies that the observed outcome is highly improbable under the presumption of null hypothesis, thereby supporting its rejection beyond a reasonable doubt.
Example#
A company tests a new training method: 12 employees trained under the new method have a mean performance score of 78 with a standard deviation of 5, while 15 employees under the old method have a mean of 74 with a standard deviation of 6. Test whether the new method results in significantly better scores at the 5% significance level.
\(H_o\): \(\mu_1 = \mu_2\)
\(H_a\): \(\mu_1 > \mu_2\)
To begin with, the pooled standard deviation is \(s_p = \sqrt{\frac{(11)(25) + (14)(36)}{25}} = \sqrt{\frac{275 + 504}{25}} = \sqrt{\frac{779}{25}} = \sqrt{31.16} \approx 5.58\)
The test statistic is computed as, \(t = \frac{78 - 74}{5.58 \sqrt{\frac{1}{12} + \frac{1}{15}}} = \frac{4}{5.58 \times 0.387} \approx 1.85\)
At \(df = 25\), the rejection region for this right-tailed test is given by \((t_{0.05, 25}, \infty) = (1.708, \infty)\). Since \(t \in \text{RR} \Rightarrow \text{reject} \ H_o\)
Based on the given information, the new method leads to significantly better performance - beyond a reasonable doubt. This reasonable doubt emerges from a 5% possibility of rejecting a true null hypothesis - Type I error.
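The pooled two-sample computation can also be scripted. The sketch below (assuming NumPy and SciPy; the `alternative` argument of `ttest_ind_from_stats` requires SciPy 1.6 or later) repeats the hand calculation and uses SciPy's summary-statistics helper as a cross-check.

```python
# A minimal sketch of the right-tailed pooled two-sample t-test from the example.
import numpy as np
from scipy.stats import t as t_dist, ttest_ind_from_stats

m1, s1, n1 = 78, 5, 12     # new training method
m2, s2, n2 = 74, 6, 15     # old training method
alpha = 0.05
df = n1 + n2 - 2

s_p = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)   # pooled std ~ 5.58
t_stat = (m1 - m2) / (s_p * np.sqrt(1 / n1 + 1 / n2))       # ~ 1.85
t_crit = t_dist.ppf(1 - alpha, df)                          # ~ 1.708 (right-tailed)

# cross-check with SciPy's helper for summary statistics
check = ttest_ind_from_stats(m1, s1, n1, m2, s2, n2,
                             equal_var=True, alternative='greater')

print(f"t = {t_stat:.2f}, reject H0: {t_stat > t_crit}")
print(f"scipy: t = {check.statistic:.2f}, p-value = {check.pvalue:.3f}")
```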
Paired Test#
Relevance#
The paired t-test is used to compare means from the same group at two different times (such as before-and-after scenarios), or between two matched samples (such as twins or case-control pairs), under the assumption that,
differences between paired observations are independently and randomly sampled from the population
distribution of these differences in the population is approximately normal
Test Statistic#
The test statistic is calculated on the differences between paired observations, given by
\[
t = \frac{\bar{d}}{s_d / \sqrt{n}}
\]
Where,
\(\bar{d}\) : mean of the differences
\(s_d\) : standard deviation of the differences
\(n\) : number of pairs.
The t-statistic follows a t-distribution with \(n - 1\) degrees of freedom.
Rejection Criterion#
The null hypothesis can be rejected on the basis of rejection region \((\text{RR})\) of the test statistic \((t)\), which defines the set of values for \(t\) that are considered highly improbable under the assumption that the null hypothesis is true. To this end, if the computed test statistic falls within this rejection region \((t \in \text{RR})\), then the null hypothesis can be rejected. Note, for a two-tailed test, the rejection region is given by \((-\infty, -t_{\alpha/2, n - 1}) \cup (t_{\alpha/2, n - 1}, \infty)\). In contrast, for a one-tailed test, \((-\infty, -t_{\alpha, n - 1})\) forms the rejection region in a left-tailed test, and \((t_{\alpha, n - 1}, \infty)\) in a right-tailed test.
Note, p-value is the smallest value of significance level (\(\alpha\)) for which we can reject the null hypothesis. A high p-value implies that the observed data is reasonably consistent under the presumption of the null hypothesis, thereby compelling us to not reject it. On the other hand, a low p-value signifies that the observed outcome is highly improbable under the presumption of null hypothesis, thereby supporting its rejection beyond a reasonable doubt.
Example#
A nutritionist tests whether a new diet lowers blood pressure. 8 patients have their blood pressure measured before and after following the diet. The mean of the differences (before - after) is 5 mmHg, with a standard deviation of 4.5. Test at 5% significance whether the diet is effective.
\(H_o\): \(\mu_d = 0\)
\(H_a\): \(\mu_d > 0\)
The test statistic is, \(t = \frac{5}{4.5 / \sqrt{8}} = \frac{5}{1.59} \approx 3.14\)
At \(df = 7\), the rejection region is \((t_{0.05, 7}, \infty) = (1.895, \infty)\). Since \(t \in \text{RR} \Rightarrow \text{reject} \ H_o\)
Thus, the diet has a statistically significant effect in lowering blood pressure - beyond a reasonable doubt. This reasonable doubt emerges from a 5% possibility of rejecting a true null hypothesis - Type I error.
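Finally, here is a minimal sketch of the paired t-test computed from the summary of the before-minus-after differences (assuming NumPy and SciPy are available).

```python
# A minimal sketch of the right-tailed paired t-test from the diet example.
import numpy as np
from scipy.stats import t as t_dist

d_bar, s_d, n, alpha = 5, 4.5, 8, 0.05         # mean and std of the differences
df = n - 1

t_stat = d_bar / (s_d / np.sqrt(n))            # ~ 3.14
t_crit = t_dist.ppf(1 - alpha, df)             # ~ 1.895 (right-tailed)
p_value = 1 - t_dist.cdf(t_stat, df)           # ~ 0.008

print(f"t = {t_stat:.2f}, reject H0: {t_stat > t_crit}, p-value = {p_value:.3f}")
```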
Note
Beyond these foundational z-tests and t-tests, we will explore other hypothesis tests throughout the course as and when they become relevant. The key takeaway is that each test serves a specific purpose and rests on its own set of assumptions.
Broadly speaking, a statistician must remain aware that decision-making under uncertainty hinges on the careful formulation and testing of hypotheses. The central question is always: Which test is most appropriate for the question at hand, given the data and assumptions?