Lecture 14: Multivariate Data

Note

In the first few lectures of this course, we explored measures of location, dispersion, and shape to describe the characteristics of single-variable data. In this lecture, we explore ways of communicating the characteristics of multivariate data.

Linearity and Non-linearity#

Linearity refers to the situation where the relationship between the predictor variable \(X\) and the response variable \(Y\) can be well approximated by a straight line. In a scatter plot, this appears as a cloud of points roughly forming a linear trend. Linear relationships are foundational in many statistical models, including simple and multiple linear regression. Non-linearity, on the other hand, occurs when the relationship between \(X\) and \(Y\) cannot be captured adequately by a straight line. This could manifest as curves, exponential growth or decay, plateaus, or other complex patterns. Identifying non-linearity is essential because applying a linear model to non-linear data can lead to biased estimates and misleading inferences. Techniques such as polynomial regression, splines, or transformation of variables are often used to handle non-linearity.
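One way to make the distinction concrete is to fit a straight line by least squares and inspect what is left over. The sketch below uses only the Python standard library and two small data sets invented for illustration (`y_linear` and `y_curved` are hypothetical, not course data). It is a minimal illustration rather than a formal test of linearity.

```python
def fit_line(x, y):
    """Least-squares estimates (intercept, slope) for y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Observed y minus the value predicted by the fitted straight line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def r_squared(x, y):
    """Share of the variation in y explained by the fitted straight line."""
    mean_y = sum(y) / len(y)
    ss_res = sum(r ** 2 for r in residuals(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data, invented for illustration only.
x = list(range(-5, 6))
y_linear = [2.0 * xi + 1.0 for xi in x]   # exactly linear in x
y_curved = [float(xi ** 2) for xi in x]   # U-shaped (quadratic) in x

r2_linear = r_squared(x, y_linear)
r2_curved = r_squared(x, y_curved)
curved_residuals = residuals(x, y_curved)

print(f"R^2, linear data: {r2_linear:.3f}")
print(f"R^2, curved data: {r2_curved:.3f}")
print("Residuals, curved data:", [round(r, 1) for r in curved_residuals])
```

For the curved data the residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than a low R² alone, is what suggests that a straight line is the wrong shape for the relationship.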

Homoscedasticity and Heteroscedasticity#

Homoscedasticity describes a scenario in which the variability of the response variable \(Y\) remains constant across all levels of the predictor \(X\). In graphical terms, if you draw vertical slices through a scatter plot, the spread of the \(Y\) values should be roughly equal across these slices. Homoscedasticity is a key assumption in linear regression, as violations can lead to inefficient estimates and invalid standard errors. In contrast, heteroscedasticity occurs when the variance of \(Y\) changes with \(X\), for example increasing or decreasing as \(X\) increases. A fan-shaped or funnel-shaped pattern in a scatter plot signals heteroscedasticity. It often arises in income data, where higher-income groups tend to show greater variation in expenditure. If it is present, remedies such as heteroscedasticity-robust standard errors, weighted least squares, or variance-stabilizing transformations may be required.
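The vertical-slice idea can be sketched numerically: split the range of \(X\) into equal-width slices and compare the standard deviation of \(Y\) within each. The example below is a rough illustration under stated assumptions. The data are simulated, the helper `spread_by_slice` is a hypothetical name introduced here, and the choice of three slices is arbitrary.

```python
import random
import statistics


def spread_by_slice(x, y, n_slices=3):
    """Standard deviation of y inside equal-width vertical slices of x."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_slices
    slices = [[] for _ in range(n_slices)]
    for xi, yi in zip(x, y):
        # The largest x would land in slice n_slices, so clamp it to the last one.
        index = min(int((xi - lo) / width), n_slices - 1)
        slices[index].append(yi)
    return [statistics.stdev(values) for values in slices]


# Simulated data (an assumption for illustration, not real observations).
random.seed(14)
x = [1 + 9 * i / 299 for i in range(300)]                          # evenly spaced on [1, 10]
y_homo = [5 + 0.5 * xi + random.gauss(0, 2.0) for xi in x]         # constant noise
y_hetero = [5 + 0.5 * xi + random.gauss(0, 0.5 * xi) for xi in x]  # noise grows with x

spread_homo = spread_by_slice(x, y_homo)
spread_hetero = spread_by_slice(x, y_hetero)

print("Spread per slice, constant noise:", [round(s, 2) for s in spread_homo])
print("Spread per slice, growing noise: ", [round(s, 2) for s in spread_hetero])
```

Roughly equal spreads across slices are consistent with homoscedasticity, while a spread that grows from the first slice to the last corresponds to the funnel shape. The trend in the mean also contributes to the spread within a wide slice, so narrower slices, or residuals from a fitted line, give a cleaner comparison.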

Outliers#

Outliers are observations that lie far from the general pattern of the data. They can arise due to data entry errors, measurement anomalies, or genuine variability in the population. A common rule of thumb flags points lying more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile; another flags points lying several standard deviations away from the mean. Outliers can have a disproportionate influence on model estimates, especially in methods like least squares regression, which are sensitive to extreme values. While not all outliers are problematic, their presence should prompt further investigation: they might reveal interesting phenomena, data quality issues, or the need for robust modeling techniques that down-weight their influence.
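The 1.5 × IQR rule can be sketched with Python's standard `statistics` module. The data values and the helper name `iqr_outliers` are hypothetical, chosen only for illustration. Quartile conventions differ between textbooks and software, so the exact fences depend on the method used (here, the `"inclusive"` method of `statistics.quantiles`).

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical sample, invented for illustration only.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
outliers = iqr_outliers(data)

print("Flagged as outliers:", outliers)
print("Mean:  ", statistics.mean(data))
print("Median:", statistics.median(data))
```

Comparing the mean with the median on the same sample shows why extreme values matter: the single large observation pulls the mean well above the median, which barely moves.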

Test Yourself

The relation between household size and monthly expenditure per capita is linear.

The monthly expenditure per capita is heteroscedastic with respect to household size.

The monthly expenditure per capita contains outliers.

Tip

In the next few lectures, we will explore visualization tools, a critical part of the analytical workflow, that help uncover patterns and communicate insights more effectively, particularly for multivariate data. These visual data summaries will enable us to detect trends and spot anomalies, serving as a foundation for more advanced data analysis.
