Lecture 24: Symbolic Regression#
So far, we have explored linear regression, logistic regression, and their applications in modeling relationships between variables. These methods, however, assume a predefined functional form (linear or logistic). Symbolic regression instead discovers both the structure and the parameters of the relationship between variables from the data itself, without imposing a fixed functional form.
What is Symbolic Regression?#
Symbolic regression is a form of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset. Unlike linear or logistic regression, symbolic regression does not start with a predetermined equation structure. Instead, it uses evolutionary algorithms to evolve candidate equations that can model the data.
It combines principles from:
Machine Learning: For automated model discovery.
Genetic Programming: To evolve symbolic representations.
Classical Regression: To fit parameters within the discovered structure.
Symbolic regression is particularly useful when:
The underlying relationship between variables is unknown.
We suspect the relationship is non-linear or involves interactions not easily captured by standard models.
Interpretability of the resulting model is important.
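To make the genetic-programming component concrete, here is a minimal, self-contained sketch of how candidate equations can be represented as expression trees and generated at random; a full evolutionary search would additionally mutate and recombine such trees and keep the fittest. This is a toy illustration of the idea, not PySR's implementation; all names here (`OPS`, `random_tree`, `evaluate`) are our own.

```python
# Toy illustration of genetic-programming expression trees (not PySR's internals)
import random
import numpy as np

# The operator set defines the search space, analogous to PySR's
# binary_operators / unary_operators arguments
OPS = {"+": np.add, "*": np.multiply, "sin": np.sin}

def random_tree(depth=2):
    """Build a random expression tree over variables x0, x1 and constants."""
    if depth == 0 or random.random() < 0.3:
        # Leaf node: a variable name or a random constant
        return random.choice(["x0", "x1", round(random.uniform(-2, 2), 2)])
    op = random.choice(list(OPS))
    n_args = 1 if op == "sin" else 2
    return (op, *(random_tree(depth - 1) for _ in range(n_args)))

def evaluate(tree, x0, x1):
    """Recursively evaluate an expression tree on data arrays."""
    if isinstance(tree, tuple):
        op, *args = tree
        return OPS[op](*(evaluate(a, x0, x1) for a in args))
    return {"x0": x0, "x1": x1}.get(tree, tree)  # variable leaf or constant

random.seed(1)
tree = random_tree()
print(tree)  # a nested tuple such as ('+', 'x1', ('sin', 'x0'))
x0, x1 = np.random.randn(5), np.random.randn(5)
print(evaluate(tree, x0, x1))
```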
Endogenous and Exogenous Variables#
Like other regression methods, symbolic regression distinguishes between:
Endogenous Variable: The response or dependent variable we aim to predict or explain.
Exogenous Variables: The independent variables used to explain or predict the endogenous variable.
However, the functional form relating these variables is not predefined but discovered.
Example Use Cases#
Physical Sciences: Discovering governing equations from experimental data.
Economics: Uncovering non-linear dependencies in economic indicators.
Transportation: Modeling complex interactions between traffic flow, speed, and emissions.
Mathematical Formulation#
Instead of assuming a model like:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon
$$

symbolic regression searches for:

$$
Y = f(X_1, X_2, \ldots, X_n) + \epsilon
$$

where $f$ is an expression composed of:
Arithmetic operators (+, -, *, /)
Analytical functions (sin, cos, log, exp, etc.)
Constants and parameters to be optimized
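For instance (a hypothetical target, not one derived from data), the search might converge on an expression such as

$$
Y = 3.2\, e^{-0.5 X_1} \cos(2 X_2) + \epsilon,
$$

where the structure (the product of an exponential and a cosine) and the constants (3.2, 0.5, 2) are all discovered rather than specified in advance.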
PySR: Symbolic Regression in Python#
PySR is a high-performance symbolic regression library that combines evolutionary search with a focus on simplicity and interpretability of the discovered models.
Installation#
```bash
pip install pysr
```

PySR runs its search on a Julia backend; on first use it sets up and compiles this backend automatically, which is why the first `fit` call prints "Compiling Julia backend..." below.
Basic Example#
Let us explore symbolic regression using PySR.
```python
# Step 1: Import Libraries
import numpy as np
from pysr import PySRRegressor

# Step 2: Generate Synthetic Data
np.random.seed(0)
X = np.random.randn(1000, 2)  # two exogenous variables x0, x1
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * np.random.randn(1000)  # y = x0^2 + sin(x1) + noise

# Step 3: Model Fitting with PySR
model = PySRRegressor(
    niterations=40,                                       # evolutionary search iterations
    binary_operators=["+", "-", "*", "/"],                # two-argument operators in the search space
    unary_operators=["sin", "cos", "exp", "log", "abs"],  # one-argument functions
    populations=5,                                        # populations evolved in parallel
    progress=True,                                        # show a progress bar
)
model.fit(X, y)
```
```text
Compiling Julia backend...
[ Info: Started!
──────────────────────────────────────────────────────────────────
Complexity  Loss       Score      Equation
1           2.247e+00  0.000e+00  y = 0.93362
2           1.143e+00  6.756e-01  y = abs(x₀)
3           4.341e-01  9.684e-01  y = x₀ * x₀
5           1.973e-01  3.944e-01  y = (x₀ * x₀) + x₁
6           9.130e-03  3.073e+00  y = (x₀ * x₀) + sin(x₁)
8           9.103e-03  1.439e-03  y = ((x₀ * x₀) + -0.0051229) + sin(x₁)
10          9.097e-03  3.693e-04  y = (sin(x₁ * 1.0051) + (x₀ * x₀)) + -0.0050434
12          9.092e-03  2.580e-04  y = sin(x₁) + ((((x₀ * x₀) + 0.00052394) / 0.99751) + -0.0080193)
27          9.091e-03  1.016e-05  y = (sin(x₁ + (cos((x₁ + sin(cos(x₀ * (x₀ + x₀)) * -0.0096079)) + x₀) * -0.0033448)) + ((x₀ * x₀) / 0.9981)) + -0.0058046
28          9.087e-03  4.265e-04  y = (sin(x₁ + (cos(sin(x₀ + ((x₁ * (exp(x₀) + x₀)) * cos(-0.0093952))) + x₀) * -0.0093952)) + (x₀ * (x₀ / 0.99992))) + -0.00085283
29          9.059e-03  3.082e-03  y = (sin(x₁) + (((x₀ * x₀) + -0.0013693) / 0.99985)) + (cos(sin(((((x₁ * x₀) * exp(x₀)) * x₁) + x₀) * 1.6642) + x₀) * -0.013314)
──────────────────────────────────────────────────────────────────
[ Info: Final population:
[ Info: Results saved to:
 - outputs\20250715_200304_Ub2Sxa\hall_of_fame.csv
```
```python
# Step 4: Viewing Discovered Equations
print(model)
```
```text
PySRRegressor.equations_ = [
    pick     score                                           equation  \
0         0.000000                                          0.9336204
1         0.675641                                            abs(x0)
2         0.968433                                            x0 * x0
3         0.394401                                     (x0 * x0) + x1
4   >>>>  3.073007                                (x0 * x0) + sin(x1)
5         0.001439               ((x0 * x0) + -0.005122854) + sin(x1)
6         0.000369  (sin(x1 * 1.0051208) + (x0 * x0)) + -0.0050433893
7         0.000258  sin(x1) + ((((x0 * x0) + 0.0005239351) / 0.997...
8         0.000010  (sin(x1 + (cos((x1 + sin(cos(x0 * (x0 + x0)) *...
9         0.000426  (sin(x1 + (cos(sin(x0 + ((x1 * (exp(x0) + x0))...
10        0.003082  (sin(x1) + (((x0 * x0) + -0.0013692942) / 0.99...

        loss  complexity
0   2.247133           1
1   1.143408           2
2   0.434126           3
3   0.197262           5
4   0.009130           6
5   0.009103           8
6   0.009097          10
7   0.009092          12
8   0.009091          27
9   0.009087          28
10  0.009059          29
]
```
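In this table, the row marked `>>>>` is the equation selected under the default `model_selection='best'`, which trades off loss against complexity; here it correctly recovers $x_0^2 + \sin(x_1)$. The fitted model also exposes the result programmatically (the methods below are part of the PySRRegressor API):

```python
# Selected equation as a SymPy expression
print(model.sympy())  # e.g. x0**2 + sin(x1)

# Selected equation rendered as LaTeX
print(model.latex())

# Full table of candidates as a pandas DataFrame
print(model.equations_[["complexity", "loss", "score", "equation"]])
```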
```python
# Step 5: Making Predictions
y_pred = model.predict(X)
```
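Because the data-generating function is known in this synthetic example, we can sanity-check the fit directly; the snippet below continues the running example (`y_true` is just our name for the noiseless target):

```python
# Compare predictions against noisy targets and the noiseless ground truth
y_true = X[:, 0] ** 2 + np.sin(X[:, 1])
print("MSE vs noisy y:   ", np.mean((y_pred - y) ** 2))
print("MSE vs true f(X): ", np.mean((y_pred - y_true) ** 2))
```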
Strengths of Symbolic Regression#
Interpretability: Yields human-readable equations.
Flexibility: Does not assume any functional form.
Exploration of Complex Relationships: Captures non-linearities, interactions, and transformations.
Limitations#
Computational Intensity: The evolutionary search is computationally demanding.
Overfitting Risk: Especially with noisy data or excessive search depth.
Parameter Sensitivity: The choice of operators and evolutionary settings can influence results.
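In practice, the overfitting and sensitivity risks above are usually managed through PySR's search controls; `maxsize` and `parsimony` are documented PySRRegressor parameters, though the values below are illustrative rather than recommendations:

```python
from pysr import PySRRegressor

# A more constrained search: smaller expressions, explicit complexity penalty
constrained_model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "cos"],
    maxsize=15,       # cap the complexity of candidate expressions
    parsimony=0.001,  # penalize complexity when scoring candidates
)
```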
Applications in Transportation and Logistics#
Symbolic regression can help:
Model fuel consumption as a function of speed, load, and terrain (sketched after this list).
Discover non-linear dependencies in traffic flow models.
Derive empirical formulas from simulation or sensor data in smart logistics.
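As a sketch of the first use case, one could let PySR search for a fuel-consumption law from observations of speed and load. The data-generating formula and all names below are hypothetical, purely for illustration:

```python
import numpy as np
from pysr import PySRRegressor

# Hypothetical synthetic data: fuel use grows quadratically with speed
# and linearly with load
rng = np.random.default_rng(42)
speed = rng.uniform(20, 120, 500)   # km/h
load = rng.uniform(0.5, 3.0, 500)   # tonnes
fuel = 0.002 * speed**2 + 2.0 * load + rng.normal(0, 0.2, 500)

X_fuel = np.column_stack([speed, load])
fuel_model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "*"],
)
fuel_model.fit(X_fuel, fuel)
print(fuel_model.sympy())  # ideally close to 0.002*x0**2 + 2.0*x1
```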
Summary#
Symbolic Regression bridges the gap between black-box machine learning models and interpretable regression by automatically discovering mathematical models from data. With tools like PySR, researchers can not only model complex phenomena but also understand them through concise, symbolic expressions.
In the next lecture, we will explore Generalized Linear Models (GLM), which extend linear models to accommodate different types of response variables and link functions.