Lecture 27: Setting up Python#

Note

Python is an open-source, high-level, general-purpose programming language known for its simplicity and readability. Originally developed by Guido van Rossum in the late 1980s and first released in 1991, Python has since grown into one of the most popular programming languages for data analysis, machine learning, web development, automation, and scientific computing. Its strength lies in its extensive standard library, vibrant ecosystem of third-party packages, and ability to integrate seamlessly with other languages and tools.


Installing Python and JupyterLab / VS Code on Windows#

Follow these steps:

Step 1: Install Python#

  1. Visit the official Python website: https://www.python.org/downloads/.

  2. Download the latest stable version of Python (e.g., 3.x series) for Windows.

  3. Run the installer:

    • Important: Check the box “Add Python to PATH” before clicking Install Now.

    • This ensures that Python commands can be run from the terminal.

  4. Verify installation:

    • Open Command Prompt and type: python --version

Step 2: Install VS Code#

  1. Visit the official Visual Studio Code website: https://code.visualstudio.com/.

  2. Download the Windows installer and run it.

  3. During installation:

    • Select the option to add “Open with Code” to the right-click context menu.

    • Optionally, allow adding VS Code to the system PATH.

  4. Launch VS Code after installation.

Step 3: Install the Python Extension for VS Code#

  1. Open VS Code.

  2. Go to the Extensions view (click the square icon on the left sidebar or press Ctrl+Shift+X).

  3. Search for Python and install the extension published by Microsoft.

  4. This extension adds syntax highlighting, IntelliSense, debugging, and Jupyter Notebook support.

Data Types in Python#

Python supports the following basic data types:

  • Character

  • Numeric

  • Integer

  • Logical

  • Complex

Here are some examples:

# Character
x = "CE5540"
print("Type of x is: ", type(x))

# Float
r = 3.14
print("Type of y is: ", type(r))

# Integer
v = int(42)
print("Type of v is: ", type(v))

# Logical
f = True
print("Type of f is: ", type(f))

# Complex
z = 2 + 3j
print("Type of z is: ", type(z))
Type of x is:  <class 'str'>
Type of y is:  <class 'float'>
Type of v is:  <class 'int'>
Type of f is:  <class 'bool'>
Type of z is:  <class 'complex'>

Data Structures in Python#

Python supports the following data structures:

  • Vectors

  • Matrices

  • Lists

  • Data Frames

Here are some examples:

import numpy as np
import pandas as pd

# Vectors
v1 = ["Apple", "Banana", "Mango"]
v2 = [9, 1, 5, 4, 6, 7, 0, 3, 8]
v3 = list(range(1, 6))  # R's 1:5
print("# Vectors")
print(v1)
print(v2)
print(v3)
print(f"Accessing a value in a vector: v1[0] = {v1[0]}. \nNotice that Python uses 0-based indexing!")

# Matrices (NumPy arrays)
data = [9, 1, 5, 4, 6, 7, 0, 3, 8]
## By row
m1 = np.array(data).reshape(3, 3, order='C')
## By column
m2 = np.array(data).reshape(3, 3, order='F')
print("\n# Matrices")
print("m1 (by row):\n", m1)
print("m2 (by column):\n", m2)
print(f"Accessing a value in a matrix: m1[0, 2] = {m1[0, 2]}")

# Dictionaries
l = {"name": "John", "age": 25, "scores": [90, 85, 88]}
print("\n# Dict")
print(l)
print(f"Accessing a value in a dict: l['name'] = {l['name']}")

# Data Frames (pandas)
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [23, 25]})
print("\n# Data Frame")
print(df)
print(f"Accessing a value in a data frame: df['Age'].iloc[0] = {df['Age'].iloc[0]}")
# Vectors
['Apple', 'Banana', 'Mango']
[9, 1, 5, 4, 6, 7, 0, 3, 8]
[1, 2, 3, 4, 5]
Accessing a value in a vector: v1[0] = Apple. 
Notice that Python uses 0-based indexing!

# Matrices
m1 (by row):
 [[9 1 5]
 [4 6 7]
 [0 3 8]]
m2 (by column):
 [[9 4 0]
 [1 6 3]
 [5 7 8]]
Accessing a value in a matrix: m1[0, 2] = 5

# Dict
{'name': 'John', 'age': 25, 'scores': [90, 85, 88]}
Accessing a value in a dict: l['name'] = John

# Data Frame
    Name  Age
0  Alice   23
1    Bob   25
Accessing a value in a data frame: df['Age'].iloc[0] = 23

Control Flow#

Here is how you would write control flow statements in Python

x = 10

if x > 0:
    print("x is a positive number")
elif x < 0:
    print("x is a negative number")
else:
    print("x is zero!")
x is a positive number

Writing Loops in Python#

Python supports both for and while loops.

# For loop
print("# For Loop")
for i in range(1, 6):  # range(1, 6) generates 1, 2, 3, 4, 5
    print("Iteration:", i)

# While loop
print("\n\n# While Loop")
i = 1
while i <= 5:
    print("Count:", i)
    i = i + 1
# For Loop
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5


# While Loop
Count: 1
Count: 2
Count: 3
Count: 4
Count: 5

Writing Functions in Python#

Functions are blocks of code that can be reused. Here’s how to define and call one.

# Factorial Function (Iterative Form)
def factorial_iterative(n):
    result = 1
    for i in range(2, n + 1):  # range is end-exclusive, so use n+1
        result *= i
    return result

# Example usage
print(factorial_iterative(5))  # Output: 120
120
# Factorial Function (Recursive Form)
def factorial_recursive(n):
    if n == 0 or n == 1:
        return 1
    else:
        return n * factorial_recursive(n - 1)

# Example usage
print(factorial_recursive(5))  # Output: 120
120

Summarising Data in Python#

In this segment of the lecture, we will develop measures of location, dispersion, and shape discussed in the previous lecture through 2024 ITUS sample individual dataset.

import pandas as pd

# 2024 ITUS Individual Data (original)
url  = "https://raw.githubusercontent.com/anmpahwa/CE5540/refs/heads/main/resources/ITUS_IND_OG.csv"
data = pd.read_csv(url)  # Loading Data

# Data structure
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533719 entries, 0 to 533718
Data columns (total 23 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   survey_year        533719 non-null  int64  
 1   fsu_serial_no      533719 non-null  int64  
 2   sector             533719 non-null  int64  
 3   nss_region         533719 non-null  int64  
 4   district           533719 non-null  int64  
 5   stratum            533719 non-null  int64  
 6   sub_stratum        533719 non-null  int64  
 7   sub_round          533719 non-null  int64  
 8   fod_sub_region     533719 non-null  int64  
 9   nsc                533719 non-null  int64  
 10  household_id       533719 non-null  int64  
 11  individual_id      533719 non-null  int64  
 12  response_code      533719 non-null  int64  
 13  day_of_week        533719 non-null  int64  
 14  type_of_day        533719 non-null  int64  
 15  relation_to_head   533719 non-null  int64  
 16  gender             533719 non-null  int64  
 17  age                533719 non-null  int64  
 18  marital_status     533719 non-null  int64  
 19  education_level    533719 non-null  int64  
 20  employment_status  533719 non-null  int64  
 21  industry           533719 non-null  int64  
 22  weight             533719 non-null  float64
dtypes: float64(1), int64(22)
memory usage: 93.7 MB
None
# Filtering out into a smaller dataset
data = data[(data["nss_region"] == 241) & (data["employment_status"] == 81)]

# View the filtered data (head for preview)
print(data.head())
     survey_year  fsu_serial_no  sector  nss_region  district  stratum  \
38          2024          30010       1         241        17       13   
210         2024          30013       1         241        26       13   
225         2024          30013       1         241        26       13   
229         2024          30013       1         241        26       13   
312         2024          30014       1         241        29       13   

     sub_stratum  sub_round  fod_sub_region  nsc  ...  day_of_week  \
38            11          2            2420    4  ...            1   
210           10          1            2423    4  ...            4   
225           10          1            2423    4  ...            7   
229           10          1            2423    4  ...            7   
312           11          4            2420    4  ...            7   

     type_of_day  relation_to_head  gender  age  marital_status  \
38             1                 5       1   22               1   
210            1                 5       1   18               1   
225            1                 6       1   18               1   
229            1                 5       1   19               1   
312            1                 5       1   35               1   

     education_level  employment_status  industry    weight  
38                10                 81     99999  208857.0  
210                5                 81     99999  207000.0  
225                5                 81     99999  207000.0  
229                4                 81     99999  207000.0  
312                2                 81     99999  196571.0  

[5 rows x 23 columns]
import pandas as pd
import numpy as np
from scipy import stats

# Measures of Location
z1 = data["age"].mean()     # Mean
z2 = data["age"].median()   # Median
z3 = data["age"].mode()[0]  # Mode (returns Series, take first)

print("Mean:", round(z1, ndigits=3))
print("Median:", round(z2, ndigits=3))
print("Mode:", round(z3, ndigits=3))

# Measures of Dispersion
z1 = data["age"].max() - data["age"].min()      # Range
z2 = stats.iqr(data["age"], nan_policy="omit")  # IQR
z3 = data["age"].std()                          # Standard deviation

print("Range:", round(z1, ndigits=3))
print("IQR:", round(z2, ndigits=3))
print("SD:", round(z3, ndigits=3))

# Measures of Distribution
z1 = stats.skew(data["age"], nan_policy="omit")     # Skewness
z2 = stats.kurtosis(data["age"], nan_policy="omit") # Kurtosis (Fisher's definition)

print("Skewness:", round(z1, ndigits=3))
print("Kurtosis:", round(z2, ndigits=3))
Mean: 26.949
Median: 23.0
Mode: 18
Range: 70
IQR: 11.0
SD: 11.423
Skewness: 1.816
Kurtosis: 4.337

Linear Regression in Python#

# Load necessary libraries
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt  

# Notebook/plot display options
plt.rcParams['figure.figsize'] = (9, 6)
pd.set_option('display.max_columns', None)  # Optional: show all columns when printing

# 2024 ITUS Individual Data (original)
url  = "https://raw.githubusercontent.com/anmpahwa/CE5540/refs/heads/main/resources/ITUS_HHD_DT.csv"
data = pd.read_csv(url)

# Data structure
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139487 entries, 0 to 139486
Data columns (total 21 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Unique_HH_ID         139487 non-null  object 
 1   time_of_year         139487 non-null  int64  
 2   day_of_week          139487 non-null  int64  
 3   sector               139487 non-null  int64  
 4   region               139487 non-null  int64  
 5   district_population  139487 non-null  int64  
 6   gender_ratio         139487 non-null  float64
 7   average_age          139487 non-null  float64
 8   marital_status       139487 non-null  int64  
 9   highest_eduLevel     139487 non-null  int64  
 10  employment_ratio     139487 non-null  float64
 11  family_structure     139487 non-null  float64
 12  household_size       139487 non-null  int64  
 13  religion             139487 non-null  int64  
 14  social_group         139487 non-null  int64  
 15  land_possessed       139487 non-null  int64  
 16  total_expenditure    139487 non-null  int64  
 17  dwelling_unit        139487 non-null  int64  
 18  dwelling_structure   139487 non-null  int64  
 19  weight               139487 non-null  int64  
 20  shopping_choice      139487 non-null  int64  
dtypes: float64(4), int64(16), object(1)
memory usage: 22.3+ MB
None

Scatter Plot#

# Scatter Plot
plt.scatter(
    data["average_age"],
    np.log10(data["total_expenditure"] / data["household_size"]),
    color="steelblue",
    s=15  # matches your earlier size=2 in ggplot (Matplotlib uses points^2)
)
plt.title("Average Age vs Log Monthly Expenditure per Capita", fontsize=12, fontweight="bold")
plt.xlabel("Average Age", fontsize=10)
plt.ylabel("Log Monthly Expenditure per Capita", fontsize=10)
plt.xticks(fontsize=9)
plt.yticks(fontsize=9)

plt.show()
../_images/1879154ac2b5e8bf0fed8a6b6dd36ed3a1d40dbd2b9da9a36db86f582e384e8f.png
# Compute Pearson correlation
correlation = data["average_age"].corr(np.log10(data["total_expenditure"] / data["household_size"]))
print("Correlation between average age and monthly expenditure:", round(correlation, 3))
Correlation between average age and monthly expenditure: 0.264
# Define variables
X = data["average_age"]
y = np.log10(data["total_expenditure"] / data["household_size"])

# Fit linear regression model
X = sm.add_constant(X)
model = sm.OLS(y, X, missing="drop").fit()

# Display summary
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.070
Model:                            OLS   Adj. R-squared:                  0.070
Method:                 Least Squares   F-statistic:                 1.047e+04
Date:                Sun, 10 Aug 2025   Prob (F-statistic):               0.00
Time:                        16:16:59   Log-Likelihood:                -1697.0
No. Observations:              139487   AIC:                             3398.
Df Residuals:                  139485   BIC:                             3418.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           3.3684      0.002   2060.789      0.000       3.365       3.372
average_age     0.0046   4.52e-05    102.317      0.000       0.005       0.005
==============================================================================
Omnibus:                     4880.760   Durbin-Watson:                   0.757
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6665.938
Skew:                           0.374   Prob(JB):                         0.00
Kurtosis:                       3.767   Cond. No.                         90.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Errors

# Sums of squares
sst = np.sum((np.asarray(model.model.endog) - np.asarray(model.model.endog).mean())**2)             # total
ssr = np.sum((np.asarray(model.fittedvalues)  - np.asarray(model.model.endog).mean())**2)          # regression
sse = np.sum((np.asarray(model.model.endog) - np.asarray(model.fittedvalues))**2)                  # error

# Degrees of freedom
n = int(model.nobs)
p = int(model.df_model) + 1   # number of parameters incl. intercept
df_resid = int(model.df_resid)

# Metrics
rse = np.sqrt(sse / df_resid)
R2 = ssr / sst
AdjR2 = 1 - (1 - R2) * (n - 1) / (n - p)

print("SST               :", round(sst, 2))
print("SSR               :", round(ssr, 2))
print("SSE               :", round(sse, 2))
print("RSE               :", round(rse, 2))
print("R-squared         :", round(R2, 3))
print("Adjusted R-squared:", round(AdjR2, 3))
SST               : 8996.15
SSR               : 628.05
SSE               : 8368.1
RSE               : 0.24
R-squared         : 0.07
Adjusted R-squared: 0.07
# Define variables
x = data["average_age"]
y = np.log10(data["total_expenditure"] / data["household_size"])

# Fit simple linear regression y = m*x + b
m, b = np.polyfit(x, y, 1)

# Regression line for plotting
x_line = np.linspace(x.min(), x.max(), 200)
y_line = m * x_line + b

# Scatter
plt.scatter(x, y, color="steelblue", s=15)

# Regression line (no CI, like se = FALSE)
plt.plot(x_line, y_line, color="darkred", linewidth=1.2)

# Labels & styling
plt.title("Regression Plot", fontsize=12, fontweight="bold")
plt.xlabel("Average Age", fontsize=10)
plt.ylabel("Log Monthly Expenditure per Capita (₹)", fontsize=10)
plt.xticks(fontsize=9)
plt.yticks(fontsize=9)

plt.show()
/Users/anmolpahwa/Library/Python/3.13/lib/python/site-packages/IPython/core/pylabtools.py:170: UserWarning: Glyph 8377 (\N{INDIAN RUPEE SIGN}) missing from font(s) Arial.
  fig.canvas.print_figure(bytes_io, **kw)
../_images/749a1c1d71f369900f9bc9d657451249f9220f84cb69032e8e91df95455439ee.png
# Extract fitted values and residuals from statsmodels model
fitted_vals = model.fittedvalues
residuals = model.resid

# Scatter plot of residuals vs fitted values
plt.scatter(fitted_vals, residuals, color="steelblue", s=15)

# Add horizontal zero reference line
plt.axhline(y=0, color="black", linestyle="--", linewidth=0.8)

# Labels & styling
plt.title("Residuals Plot", fontsize=12, fontweight="bold")
plt.xlabel("Fitted Values", fontsize=10)
plt.ylabel("Error", fontsize=10)
plt.xticks(fontsize=9)
plt.yticks(fontsize=9)

plt.show()
../_images/1aebf915eeaba918e2d1f2c1418e770415feadeb24d7e5cabb197200cbd0d50a.png