Lecture 24: Symbolic Regression#

So far, we have explored linear regression, logistic regression, and their applications in modeling relationships between variables. However, these methods assume a predefined model form, linear or logistic in nature. Symbolic regression instead offers data-driven discovery of both the structure and the parameters of the relationship between variables, without imposing a fixed functional form.

What is Symbolic Regression?#

Symbolic regression is a form of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset. Unlike linear or logistic regression, symbolic regression does not start with a predetermined equation structure. Instead, it uses evolutionary algorithms to evolve candidate equations that can model the data.

It combines principles from:

  • Machine Learning: For automated model discovery.

  • Genetic Programming: To evolve symbolic representations.

  • Classical Regression: To fit parameters within the discovered structure.

Symbolic regression is particularly useful when:

  • The underlying relationship between variables is unknown.

  • We suspect the relationship is non-linear or involves interactions not easily captured by standard models.

  • Interpretability of the resulting model is important.

Endogenous and Exogenous Variables#

Like other regression methods, symbolic regression distinguishes between:

  • Endogenous Variable: The response or dependent variable we aim to predict or explain.

  • Exogenous Variables: The independent variables used to explain or predict the endogenous variable.

However, the functional form relating these variables is not predefined but discovered.

Example Use Cases#

  • Physical Sciences: Discovering governing equations from experimental data.

  • Economics: Uncovering non-linear dependencies in economic indicators.

  • Transportation: Modeling complex interactions between traffic flow, speed, and emissions.

Mathematical Formulation#

Instead of assuming a model like:

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon $$

symbolic regression searches for:

$$ Y = f(X_1, X_2, \ldots, X_n) + \epsilon $$

where $f$ is an expression composed of the building blocks below (a minimal sketch of this representation follows the list):

  • Arithmetic operators (+, -, *, /)

  • Analytical functions (sin, cos, log, exp, etc.)

  • Constants and parameters to be optimized
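
To make this search space concrete, here is a minimal illustrative sketch, in plain Python, of how candidate expressions can be represented as trees, randomly generated, and scored against data. This is not how PySR is implemented internally (PySR uses a Julia backend); every name below is invented for illustration, and division is omitted to avoid zero-division handling.

import numpy as np

rng = np.random.default_rng(0)

# Operator sets mirroring the building blocks listed above
BINARY = {"+": np.add, "-": np.subtract, "*": np.multiply}
UNARY = {"sin": np.sin, "cos": np.cos}

def random_expr(max_depth=3):
    """Grow a random expression tree over variables x0, x1 and constants."""
    if max_depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("var", int(rng.integers(0, 2)))  # a variable leaf
        return ("const", float(rng.normal()))        # a constant leaf
    if rng.random() < 0.5:
        return ("unary", rng.choice(list(UNARY)), random_expr(max_depth - 1))
    return ("binary", rng.choice(list(BINARY)),
            random_expr(max_depth - 1), random_expr(max_depth - 1))

def evaluate(node, X):
    """Evaluate an expression tree on a data matrix X of shape (n, 2)."""
    kind = node[0]
    if kind == "var":
        return X[:, node[1]]
    if kind == "const":
        return np.full(X.shape[0], node[1])
    if kind == "unary":
        return UNARY[node[1]](evaluate(node[2], X))
    return BINARY[node[1]](evaluate(node[2], X), evaluate(node[3], X))

def fitness(node, X, y):
    """Mean squared error of a candidate expression (lower is better)."""
    return float(np.mean((evaluate(node, X) - y) ** 2))

# Tiny demonstration: score one random candidate on synthetic data
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo[:, 0] ** 2 + np.sin(X_demo[:, 1])
candidate = random_expr()
print(candidate, fitness(candidate, X_demo, y_demo))

A genetic-programming search builds on exactly this representation: it keeps a population of such trees, mutates and recombines the best-scoring ones, and repeats. In essence, this is the loop that PySR runs at scale, together with constant optimization, in its Julia backend.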

PySR: Symbolic Regression in Python#

PySR is a high-performance symbolic regression library that combines evolutionary search with a focus on simplicity and interpretability of the discovered models.

Installation#

pip install pysr

Note that PySR compiles its Julia backend the first time a model is fit, so the first run takes noticeably longer than later ones (this is visible in the output below).

Basic Example#

Let us explore symbolic regression using PySR.

# Step 1: Import Libraries
import numpy as np
from pysr import PySRRegressor
# Step 2: Generate Synthetic Data
np.random.seed(0)
X = np.random.randn(1000, 2)
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * np.random.randn(1000)
# Step 3: Model Fitting with PySR
model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "cos", "exp", "log", "abs"],
    populations=5,
    progress=True
)

model.fit(X, y)
Compiling Julia backend...
[ Info: Started!
───────────────────────────────────────────────────────────────────────────────────────────────────
Complexity  Loss       Score      Equation
1           2.247e+00  0.000e+00  y = 0.93362
2           1.143e+00  6.756e-01  y = abs(x₀)
3           4.341e-01  9.684e-01  y = x₀ * x₀
5           1.973e-01  3.944e-01  y = (x₀ * x₀) + x₁
6           9.130e-03  3.073e+00  y = (x₀ * x₀) + sin(x₁)
8           9.103e-03  1.439e-03  y = ((x₀ * x₀) + -0.0051229) + sin(x₁)
10          9.097e-03  3.693e-04  y = (sin(x₁ * 1.0051) + (x₀ * x₀)) + -0.0050434
12          9.092e-03  2.580e-04  y = sin(x₁) + ((((x₀ * x₀) + 0.00052394) / 0.99751) + -0.0080193)
27          9.091e-03  1.016e-05  y = (sin(x₁ + (cos((x₁ + sin(cos(x₀ * (x₀ + x₀)) * -0.0096079)) + x₀) * -0.0033448)) + ((x₀ * x₀) / 0.9981)) + -0.0058046
28          9.087e-03  4.265e-04  y = (sin(x₁ + (cos(sin(x₀ + ((x₁ * (exp(x₀) + x₀)) * cos(-0.0093952))) + x₀) * -0.0093952)) + (x₀ * (x₀ / 0.99992))) + -0.00085283
29          9.059e-03  3.082e-03  y = (sin(x₁) + (((x₀ * x₀) + -0.0013693) / 0.99985)) + (cos(sin(((((x₁ * x₀) * exp(x₀)) * x₁) + x₀) * 1.6642) + x₀) * -0.013314)
───────────────────────────────────────────────────────────────────────────────────────────────────
[ Info: Final population:
[ Info: Results saved to:
  - outputs\20250715_200304_Ub2Sxa\hall_of_fame.csv
# Step 4: Viewing Discovered Equations
print(model)
PySRRegressor.equations_ = [
	    pick     score                                           equation      loss  complexity
	0         0.000000                                          0.9336204  2.247133           1
	1         0.675641                                            abs(x0)  1.143408           2
	2         0.968433                                            x0 * x0  0.434126           3
	3         0.394401                                     (x0 * x0) + x1  0.197262           5
	4   >>>>  3.073007                                (x0 * x0) + sin(x1)  0.009130           6
	5         0.001439               ((x0 * x0) + -0.005122854) + sin(x1)  0.009103           8
	6         0.000369  (sin(x1 * 1.0051208) + (x0 * x0)) + -0.0050433893  0.009097          10
	7         0.000258  sin(x1) + ((((x0 * x0) + 0.0005239351) / 0.997...  0.009092          12
	8         0.000010  (sin(x1 + (cos((x1 + sin(cos(x0 * (x0 + x0)) *...  0.009091          27
	9         0.000426  (sin(x1 + (cos(sin(x0 + ((x1 * (exp(x0) + x0))...  0.009087          28
	10        0.003082  (sin(x1) + (((x0 * x0) + -0.0013692942) / 0.99...  0.009059          29
]

The row marked >>>> is the equation PySR selects as the best trade-off between loss and complexity; note that it recovers the generating function x₀² + sin(x₁) from Step 2.
# Step 5: Making Predictions
y_pred = model.predict(X)
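
To quantify fit quality and recover the selected equation in symbolic form, a short follow-up along the lines of the sketch below can be used. model.sympy() is PySR's accessor for the selected equation as a SymPy expression; the error computation itself is illustrative.

# Compare predictions against the observed response
mse = np.mean((y - y_pred) ** 2)
print(f"Mean squared error: {mse:.4f}")

# Retrieve the selected equation as a SymPy expression
print("Selected equation:", model.sympy())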

Strengths of Symbolic Regression#

  • Interpretability: Yields human-readable equations.

  • Flexibility: Does not assume any functional form.

  • Exploration of Complex Relationships: Captures non-linearities, interactions, and transformations.

Limitations#

  • Computational Intensity: The evolutionary search is computationally demanding.

  • Overfitting Risk: Especially with noisy data or an unconstrained search; a mitigation sketch follows this list.

  • Parameter Sensitivity: The choice of operators and evolutionary settings can influence results.
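
One way to reduce overfitting and parameter sensitivity is to constrain the search itself. The sketch below uses PySR's options for capping equation complexity (maxsize), penalizing complexity during the search (parsimony), and selecting the final equation by balancing accuracy against complexity (model_selection); the specific values are illustrative, not recommendations.

constrained_model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*"],
    unary_operators=["sin", "cos"],
    maxsize=12,              # cap on equation complexity
    parsimony=0.001,         # penalty on complexity during the search
    model_selection="best",  # trade off accuracy against complexity
)
constrained_model.fit(X, y)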

Applications in Transportation and Logistics#

Symbolic regression can help:

  • Model fuel consumption as a function of speed, load, and terrain (a toy sketch follows this list).

  • Discover non-linear dependencies in traffic flow models.

  • Derive empirical formulas from simulation or sensor data in smart logistics.
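
As a toy illustration of the first use case, one could generate synthetic fuel-consumption data and let PySR search for the governing relationship. The data-generating form below is assumed purely for demonstration, not an established fuel-consumption model.

import numpy as np
from pysr import PySRRegressor

rng = np.random.default_rng(42)
speed = rng.uniform(20, 120, 500)       # speed in km/h
load = rng.uniform(0, 10, 500)          # payload in tonnes
grade = rng.uniform(-0.05, 0.05, 500)   # road grade as a fraction

# Hypothetical ground truth: quadratic in speed, linear in load and grade
fuel = 0.002 * speed**2 + 0.5 * load + 40.0 * grade + rng.normal(0, 0.5, 500)

X_fuel = np.column_stack([speed, load, grade])
fuel_model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*"],
    maxsize=15,
)
fuel_model.fit(X_fuel, fuel)

If the search succeeds, the hall of fame should contain an equation close to the quadratic-plus-linear form used to generate the data.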

Summary#

Symbolic Regression bridges the gap between black-box machine learning models and interpretable regression by automatically discovering mathematical models from data. With tools like PySR, researchers can not only model complex phenomena but also understand them through concise, symbolic expressions.


In the next lecture, we will explore Generalized Linear Models (GLM), which extend linear models to accommodate different types of response variables and link functions.