This fork extends the original code with three features:

- Initial guesses for programs can be provided as equations over the variable names `X0`, `X1`, ... for features (e.g. `'1.5*X0 + 10*X1/X2'`), passed as a list of strings via the optional parameter `previous_programs` of the modified `SymbolicRegressor`.
- Setting the new optional parameter `optimize` of `SymbolicRegressor` to `True` triggers symbolic program simplification via sympy and optimization of numerical program parameters via scipy.
- Setting the new optional parameter `n_program_sum` of `SymbolicRegressor` to an integer larger than 1 causes the first column of the observation input to be interpreted as a weight `w0`, the following `n_features` columns as program feature input, the next column as weight `w1`, and so on, such that a program P is evaluated as a sum from `i=1` to `n_program_sum` over `w_i * P(features_i)`.
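The weighted-sum evaluation triggered by `n_program_sum` can be sketched as follows. This is an illustrative stand-alone snippet, not the fork's internal code; the function and variable names are hypothetical, and only the column layout described above is assumed:

```python
# Sketch of the n_program_sum input layout: each observation row is
# [w0, features_0..., w1, features_1, ...] and a program P is evaluated
# as sum_i w_i * P(features_i). Names here are illustrative.

def weighted_program_sum(row, program, n_features, n_program_sum):
    """Evaluate sum_{i=1}^{n_program_sum} w_i * P(features_i)."""
    total = 0.0
    stride = 1 + n_features  # one weight column plus n_features feature columns
    for i in range(n_program_sum):
        start = i * stride
        weight = row[start]
        features = row[start + 1 : start + stride]
        total += weight * program(features)
    return total

# Example program P(X) = 1.5*X0 + X1, summed over two weighted blocks:
program = lambda x: 1.5 * x[0] + x[1]
row = [2.0, 1.0, 0.5,   # w0=2.0, features (1.0, 0.5) -> P = 2.0
       1.0, 2.0, 1.0]   # w1=1.0, features (2.0, 1.0) -> P = 4.0
print(weighted_program_sum(row, program, n_features=2, n_program_sum=2))  # 8.0
```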
Additional extensions:

- Optional parameter `penalties` is a dictionary of function-specific weights used as program penalties, e.g. `{'add': 2.0, 'var': 1.0, 'coeff': 1.5}`, including penalties for variables (`'var'`) and numerical coefficients (`'coeff'`).
- Optional parameter `force_coeff` inserts factors of one before numerical optimization, so that, e.g., sums of features with different physical units whose summands lack numerical pre-factors can be avoided.
- Use `gplearn._programparser.program_to_math` to convert the `list` representation of a program to a mathematical expression with the standard math operators `*`, `/`, `+`, `-`, etc. instead of `mul(...)` etc., e.g. `mathstring = program_to_math(est_gp._program.program)`.
- Implementation of a modified AIC metric `aic0`. Use it together with `parsimony_coefficient=2.0` to properly penalize operators, variables, and numerical coefficients as degrees of freedom.
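To illustrate the kind of conversion `program_to_math` performs, here is a simplified sketch that turns a prefix-ordered token list into an infix string. It is not the fork's implementation: the real function operates on gplearn's internal program representation, while plain strings and numbers are used here for clarity:

```python
# Illustrative only: convert a prefix-ordered program list such as
# ['mul', 1.5, 'add', 'X0', 'X1'] (i.e. mul(1.5, add(X0, X1)))
# into an infix string using standard math operators.
BINARY_OPS = {'add': '+', 'sub': '-', 'mul': '*', 'div': '/'}

def prefix_to_infix(program):
    """Consume tokens in prefix order and build an infix expression string."""
    tokens = iter(program)

    def build():
        token = next(tokens)
        if token in BINARY_OPS:
            left, right = build(), build()
            return '({} {} {})'.format(left, BINARY_OPS[token], right)
        return str(token)  # variable name or numerical coefficient

    return build()

print(prefix_to_infix(['mul', 1.5, 'add', 'X0', 'X1']))  # (1.5 * (X0 + X1))
```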
Original README below:
gplearn implements Genetic Programming in Python, with a scikit-learn inspired and compatible API.
While Genetic Programming (GP) can be used to perform a very wide variety of tasks, gplearn is purposefully constrained to solving symbolic regression problems. This is motivated by the scikit-learn ethos of having powerful estimators that are straightforward to implement.
Symbolic regression is a machine learning technique that aims to identify an underlying mathematical expression that best describes a relationship. It begins by building a population of naive random formulas to represent a relationship between known independent variables and their dependent variable targets in order to predict new data. Each successive generation of programs is then evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations.
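The evolve-and-select loop described above can be illustrated with a deliberately tiny toy example. This is not gplearn's implementation: instead of formula trees, each "program" is just a `(slope, intercept)` pair fitted to `y = 3x + 1`, but the same generational pattern applies — random initial population, select the fittest, evolve the next generation from them:

```python
# Toy illustration of the generational loop: random population,
# fitness-based selection, mutation-driven evolution.
import random

random.seed(0)
xs = [float(i) for i in range(10)]
ys = [3.0 * x + 1.0 for x in xs]  # target relationship: y = 3x + 1

def fitness(program):
    # Sum of squared errors of the candidate "formula" a*x + b (lower is fitter).
    a, b = program
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

# A population of naive random "formulas".
population = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(50)]

for generation in range(200):
    # Select the fittest individuals from the population...
    population.sort(key=fitness)
    parents = population[:10]
    # ...and evolve the next generation from them via (here) mutation only.
    population = [(a + random.gauss(0, 0.1), b + random.gauss(0, 0.1))
                  for a, b in parents for _ in range(5)]

best = min(population, key=fitness)
```

gplearn's genetic operations are richer (crossover and several mutation variants on expression trees), but the selection-then-evolution structure is the same.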
gplearn retains the familiar scikit-learn fit/predict API and works with the existing scikit-learn pipeline and grid search modules. The package attempts to squeeze a lot of functionality into a scikit-learn-style API. While there are a lot of parameters to tweak, reading the documentation should make the more relevant ones clear for your problem.
gplearn supports regression through the SymbolicRegressor, binary classification with the SymbolicClassifier, as well as transformation for automated feature engineering with the SymbolicTransformer, which is designed to support regression problems, but should also work for binary classification.
gplearn is built on scikit-learn and a fairly recent copy (0.22.1+) is required for installation. If you come across any issues in running or installing the package, please submit a bug report.