# Lecture 7: Sampling

```{note}
A central problem in statistics is to obtain information about a population without examining every unit in it. Drawing conclusions about a population from a small subset of its units is called sampling. In this lecture, we explore sampling design: the rules for deciding which units make up the sample, which are crucial to the accuracy and reliability of the results.
```

---

## Sampling Design Process

A population is a collection of units. We are interested in the numerical properties of this population, called parameters (measures of location, dispersion, and shape). We base our estimates and inferences on statistics, which are quantities computed from the observed sample. The sampling design process, i.e., the rules for selecting which units of the population are included in the sample, is critical to ensuring accurate and reliable results. A poorly designed sample can introduce bias, causing estimates of population parameters to consistently overstate or understate the true values.

### Random Sampling

**Definition**: Under random sampling, every unit in the population has an equal chance of being selected. This is the most basic form of probability sampling and provides unbiased estimates if implemented correctly.

**Example**: To estimate the average number of trips per person in a city, a transportation researcher randomly selects individuals from the city’s electoral roll and surveys them about their daily travel patterns. Since each listed individual has an equal chance of selection, the results can be generalized to the population covered by the roll with known margins of error.
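
As a minimal sketch of the selection step, the snippet below draws a simple random sample without replacement using Python's standard `random` module. The `electoral_roll` list of voter IDs, the sample size of 200, and the helper name `simple_random_sample` are illustrative assumptions, not part of the example above.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)


# Hypothetical sampling frame: voter IDs 1..10,000 (illustrative data only).
electoral_roll = list(range(1, 10_001))
respondents = simple_random_sample(electoral_roll, n=200, seed=42)
print(len(respondents), "respondents selected")
```

Note that the draw needs the complete list of units up front, which is the "full sampling frame" requirement of this design.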

### Convenience Sampling

**Definition**: Convenience sampling is a non-probability sampling technique where the sample is drawn from a group that is readily available and easy to contact. 

**Example**: A group of transportation engineers at a metro station decides to survey the first $n$ passengers exiting the platform during the morning rush hour to get a quick understanding of passenger satisfaction. While easy and fast to execute, the results may be biased toward a certain commuter demographic or time-specific behavior.

### Quota Sampling

**Definition**: Quota sampling is a non-probability sampling method where the population is segmented into mutually exclusive subgroups, and the researcher selects individuals non-randomly until pre-set quotas for each group are filled.

**Example**: A mode choice survey aims to gather input from a mix of commuters. The researcher sets quotas such that 40% of respondents are car users, 30% are bus users, 20% are metro users, and 10% are cyclists. Enumerators then approach people until each quota is filled, regardless of how individuals are selected within those categories.
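
The quota-filling logic can be sketched as follows. The target of 100 respondents, the synthetic `stream` of passers-by, and the helper `quota_sample` are assumptions made for illustration. People are accepted in arrival order, which is exactly the non-random step that distinguishes quota sampling from stratified sampling.

```python
from collections import Counter


def quota_sample(stream, quotas, mode_of):
    """Accept people in arrival order until every quota is filled."""
    remaining = dict(quotas)
    accepted = []
    for person in stream:
        mode = mode_of(person)
        if remaining.get(mode, 0) > 0:
            accepted.append(person)
            remaining[mode] -= 1
        if not any(remaining.values()):  # all quotas are filled
            break
    return accepted


# Quotas for a hypothetical target of 100 respondents (40/30/20/10 percent).
quotas = {"car": 40, "bus": 30, "metro": 20, "cycle": 10}

# Synthetic stream of (person_id, mode) pairs, for illustration only.
modes = ["car", "bus", "metro", "cycle"]
stream = [(i, modes[i % 4]) for i in range(1000)]

accepted = quota_sample(stream, quotas, mode_of=lambda person: person[1])
print(Counter(mode for _, mode in accepted))
```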

### Systematic Sampling

**Definition**: In systematic sampling, the first unit is selected at random from an ordered list, and subsequent units are selected at fixed intervals (e.g., every k-th unit).

**Example**: To monitor traffic volume, a field team records every 10th vehicle passing a counting station on a highway during peak hours. The starting vehicle is chosen at random from the first 10, and every 10th vehicle thereafter is recorded for vehicle type and occupancy.
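
A short sketch of the rule "random start, then every $k$-th unit" is given below. The list of 500 vehicle sequence numbers and the helper `systematic_sample` are hypothetical.

```python
import random


def systematic_sample(units, k, seed=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random index in 0..k-1
    return units[start::k]


# Hypothetical sequence numbers of vehicles passing the counting station.
vehicles = list(range(1, 501))
recorded = systematic_sample(vehicles, k=10, seed=7)
print(len(recorded), "vehicles recorded, starting with vehicle", recorded[0])
```

Only one random number is needed, which is why this design is easy to run in the field. If the ordering has a cycle whose period lines up with $k$, the sample can be biased.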

### Stratified Sampling

**Definition**: Stratified sampling involves dividing the population into homogeneous subgroups (strata) based on characteristics such as income level, vehicle ownership, or residential density. Random samples are then drawn independently from each stratum. 

**Example**: A citywide travel demand survey divides the population into strata based on income brackets (e.g., low, middle, high). A random sample of households is selected from each bracket to ensure that travel behavior across economic groups is adequately represented in the analysis.
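
The sketch below draws an independent simple random sample from each stratum. It assumes proportional allocation (the same sampling fraction in every stratum), which is only one of several possible allocation rules. The stratum sizes and the helper `stratified_sample` are made up for illustration.

```python
import random


def stratified_sample(strata, fraction, seed=None):
    """Draw an independent simple random sample from each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        n = max(1, round(fraction * len(units)))  # proportional allocation (assumed)
        sample[name] = rng.sample(units, n)
    return sample


# Hypothetical household IDs grouped by income bracket.
strata = {
    "low": [f"L{i}" for i in range(600)],
    "middle": [f"M{i}" for i in range(300)],
    "high": [f"H{i}" for i in range(100)],
}
sample = stratified_sample(strata, fraction=0.10, seed=1)
print({name: len(units) for name, units in sample.items()})
```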

### Cluster Sampling

**Definition**: Cluster sampling divides the population into groups, or clusters, and selects a random subset of these clusters. Then either all units within the selected clusters are surveyed, or a sample is drawn from within them.

**Example**: A city wants to estimate the average daily number of trips made by households for commuting purposes. Instead of randomly sampling households across the entire city (which can be logistically difficult and expensive), the city is divided into wards or traffic analysis zones (TAZs), which form the clusters. A random selection of a few zones is made, and then all (or a sample of) households within these selected zones are surveyed.
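
A one-stage version of this design, in which every household in a selected zone is surveyed, might look like the sketch below. The 20 zones of 50 households each and the helper `cluster_sample` are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random and survey every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {zone: list(clusters[zone]) for zone in chosen}


# Hypothetical city: 20 traffic analysis zones with 50 households each.
zones = {
    f"TAZ-{z:02d}": [f"TAZ-{z:02d}/HH-{h:03d}" for h in range(50)]
    for z in range(20)
}
surveyed = cluster_sample(zones, n_clusters=4, seed=3)
print(sorted(surveyed), sum(len(hh) for hh in surveyed.values()))
```

The random draw happens at the zone level only, so a household list is needed just for the selected zones rather than for the whole city.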

### Multistage Sampling 

**Definition**: Multistage sampling is a more complex form of cluster sampling in which sampling is carried out in multiple stages. Instead of surveying all units within selected clusters, additional sampling steps are applied within those clusters. This approach combines different sampling techniques (e.g., cluster and stratified or simple random sampling) to improve efficiency and manageability, especially in large or geographically spread populations.

**Example**: To estimate public transport usage across a state, a transportation agency first selects a random sample of districts (first stage). Within each selected district, a random selection of towns or urban blocks is made (second stage). Then, within each chosen block, a sample of households is surveyed regarding their transit habits. This step-by-step narrowing helps balance cost and coverage.
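
The three stages described above can be sketched as nested random draws. The nested dictionary `state` (districts, then blocks, then households), its sizes, and the helper `multistage_sample` are assumptions for illustration. Simple random sampling is used at every stage here, although other rules could be substituted.

```python
import random


def multistage_sample(state, n_districts, n_blocks, n_households, seed=None):
    """Sample districts, then blocks within them, then households within blocks."""
    rng = random.Random(seed)
    sample = []
    for district in rng.sample(sorted(state), n_districts):          # stage 1
        blocks = state[district]
        n_b = min(n_blocks, len(blocks))
        for block in rng.sample(sorted(blocks), n_b):                # stage 2
            households = blocks[block]
            n_h = min(n_households, len(households))
            sample.extend(rng.sample(households, n_h))               # stage 3
    return sample


# Hypothetical state: 10 districts x 6 blocks x 30 households.
state = {
    f"D{d}": {f"B{b}": [f"D{d}-B{b}-H{h}" for h in range(30)] for b in range(6)}
    for d in range(10)
}
households = multistage_sample(state, n_districts=3, n_blocks=2, n_households=5, seed=11)
print(len(households), "households selected")
```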

### Hybrid Sampling

**Definition**: Hybrid sampling designs combine two or more basic sampling methods (such as stratified, cluster, systematic, or simple random sampling) to leverage the strengths of each. This approach allows researchers to tailor the sampling strategy to the complexity of the population and the specific goals of the study, often resulting in more efficient and representative samples.

**Example**: A metropolitan region is conducting a travel behavior survey. First, the population is stratified by income level (stratified sampling). Within each income group, specific neighborhoods are selected as clusters (cluster sampling). Finally, within each selected neighborhood, households are chosen using systematic sampling. This hybrid design ensures diversity in socioeconomic characteristics while controlling for geographic spread and data collection costs.
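
One way to chain the three steps in this example is sketched below. The `region` structure, the choice of 2 neighborhoods per income group, the interval `k = 8`, and the helper `hybrid_sample` are illustrative assumptions.

```python
import random


def hybrid_sample(region, n_neighborhoods, k, seed=None):
    """Stratify by income, pick neighborhoods as clusters, then sample systematically."""
    rng = random.Random(seed)
    sample = {}
    for income, neighborhoods in region.items():                      # strata
        chosen = rng.sample(sorted(neighborhoods), n_neighborhoods)   # clusters
        picked = []
        for name in chosen:
            start = rng.randrange(k)                                  # random start
            picked.extend(neighborhoods[name][start::k])              # every k-th household
        sample[income] = picked
    return sample


# Hypothetical region: 3 income groups x 5 neighborhoods x 40 households.
region = {
    income: {
        f"{income}-N{j}": [f"{income}-N{j}-HH{h:02d}" for h in range(40)]
        for j in range(5)
    }
    for income in ("low", "middle", "high")
}
selected = hybrid_sample(region, n_neighborhoods=2, k=8, seed=5)
print({income: len(households) for income, households in selected.items()})
```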

---

**Sampling Design Process Summary**

| Sampling Method        | Pros                                                                 | Cons                                                                 |
|------------------------|----------------------------------------------------------------------|----------------------------------------------------------------------|
| **Random Sampling**     | - Minimizes selection bias<br>- Allows valid statistical inference<br>- Simple conceptual design | - Requires full sampling frame<br>- Logistically expensive for large populations |
| **Convenience Sampling**| - Quick and easy to implement<br>- Low cost<br>- Useful for pilot studies | - Highly biased<br>- Results not generalizable<br>- No control over representativeness |
| **Quota Sampling**      | - Ensures representation of key groups<br>- Cost-effective<br>- Useful when population proportions are known | - Non-random selection within quotas<br>- Potential interviewer bias<br>- Not suitable for statistical inference |
| **Systematic Sampling** | - Easy to implement in the field<br>- Does not require full random number generation<br>- Works well with ordered lists | - Periodic patterns in the list can introduce bias<br>- Assumes the list order is unrelated to the variable of interest |
| **Stratified Sampling** | - High precision in estimates<br>- Ensures representation across strata<br>- Allows subgroup analysis | - Requires prior knowledge of strata<br>- More complex to organize and analyze |
| **Cluster Sampling**    | - Cost-effective for large, dispersed populations<br>- Reduces travel and admin cost<br>- Useful when a full sampling frame is unavailable | - Higher sampling error than a simple random sample of the same size<br>- May reduce precision if clusters are internally homogeneous |
| **Multistage Sampling** | - Scalable for large populations<br>- Offers operational flexibility<br>- Reduces fieldwork cost | - Complex design and analysis<br>- Risk of compounding sampling errors across stages |
| **Hybrid Sampling**     | - Customizable for complex population structures<br>- Balances precision and cost<br>- Leverages strengths of multiple methods | - Design can become overly complex<br>- Requires careful planning and expertise |

---

**Test Yourself**

Context: A city consists of 700 blocks. Each block contains at least 8 households, with an average of 40, and each household contains at least one person, with an average of 4. To draw the sample, 70 blocks are selected at random; then 8 households are selected at random from each selected block; finally, every person in the selected households is surveyed.

1. What sampling method have we employed?

<form id="Q1">
  <label><input type="radio" name="q1" value="a" onchange="checkRadioQ1()"> Random </label><br>
  <label><input type="radio" name="q1" value="b" onchange="checkRadioQ1()"> Convenience </label><br>
  <label><input type="radio" name="q1" value="c" onchange="checkRadioQ1()"> Quota </label><br>
  <label><input type="radio" name="q1" value="d" onchange="checkRadioQ1()"> Systematic </label><br>
  <label><input type="radio" name="q1" value="e" onchange="checkRadioQ1()"> Stratified </label><br>
  <label><input type="radio" name="q1" value="f" onchange="checkRadioQ1()"> Cluster </label><br>
  <label><input type="radio" name="q1" value="g" onchange="checkRadioQ1()"> Multistage </label><br>
  <label><input type="radio" name="q1" value="h" onchange="checkRadioQ1()"> Hybrid </label><br>
</form>
<p id="feedbackQ1"></p>
<script>
function checkRadioQ1() {
  const options = document.getElementsByName("q1");
  let selectedValue = "";
  for (const option of options) {
    if (option.checked) {
      selectedValue = option.value;
      break;
    }
  }
  const feedback = document.getElementById("feedbackQ1");
  if (selectedValue === "g") {
    feedback.innerHTML = "✅ Correct! This is multistage sampling: blocks are drawn first, then households within the selected blocks.";
  } else {
    feedback.innerHTML = "❌ Incorrect.";
  }
}
</script>

2. Does every resident of the city have an equal chance of being in the sample?

<form id="Q2">
  <label><input type="radio" name="q2" value="a" onchange="checkRadioQ2()"> Yes </label><br>
  <label><input type="radio" name="q2" value="b" onchange="checkRadioQ2()"> No </label><br>
</form>
<p id="feedbackQ2"></p>
<script>
function checkRadioQ2() {
  const options = document.getElementsByName("q2");
  let selectedValue = "";
  for (const option of options) {
    if (option.checked) {
      selectedValue = option.value;
      break;
    }
  }
  const feedback = document.getElementById("feedbackQ2");
  if (selectedValue === "b") {
    feedback.innerHTML = "✅ Correct!";
  } else {
    feedback.innerHTML = "❌ Incorrect. Individuals within the same block are equally likely to be selected, but the selection probability varies across blocks. Every selected block contributes exactly 8 households, so residents of blocks with fewer households are more likely to be sampled.";
  }
}
</script>

3. Is the average number of people per household in the sample guaranteed to equal the population average of 4?

<form id="Q3">
  <label><input type="radio" name="q3" value="a" onchange="checkRadioQ3()"> Yes </label><br>
  <label><input type="radio" name="q3" value="b" onchange="checkRadioQ3()"> No </label><br>
</form>
<p id="feedbackQ3"></p>
<script>
function checkRadioQ3() {
  const options = document.getElementsByName("q3");
  let selectedValue = "";
  for (const option of options) {
    if (option.checked) {
      selectedValue = option.value;
      break;
    }
  }
  const feedback = document.getElementById("feedbackQ3");
  if (selectedValue === "b") {
    feedback.innerHTML = "✅ Correct!";
  } else {
    feedback.innerHTML = "❌ Incorrect. Because certain households have a higher likelihood of being sampled, there is no guarantee that the sample estimate will match the population parameter.";
  }
}
</script>
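
The reasoning behind questions 2 and 3 can be checked with a short calculation. A household is in the sample only if its block is drawn (70 of 700 blocks) and it is then drawn within that block (8 of the block's households), so its inclusion probability is the product of the two. The block sizes tried below (8, 40, and 80 households) are illustrative; the context only fixes the minimum of 8 and the average of 40.

```python
from fractions import Fraction

N_BLOCKS = 700          # blocks in the city
BLOCKS_DRAWN = 70       # blocks selected at the first stage
HOUSEHOLDS_DRAWN = 8    # households selected per chosen block


def inclusion_probability(households_in_block):
    """Probability that a given household (and each person in it) is sampled."""
    p_block = Fraction(BLOCKS_DRAWN, N_BLOCKS)
    p_household = Fraction(HOUSEHOLDS_DRAWN, households_in_block)
    return p_block * p_household


for size in (8, 40, 80):  # illustrative block sizes
    print(f"block with {size} households: p = {inclusion_probability(size)}")
```

Because every person in a selected household is surveyed, a person's inclusion probability equals that of their household, and it shrinks as the block gets larger.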
