Regressions

regressions.py: Class Regressions to test many regression models

class vulpes.automl.regressions.Regressions(*, models_to_try: str | List[Tuple[str, Any]] = 'all', custom_scorer: Dict[str, Any] = {'adj_r2': None, 'mae': make_scorer(mean_absolute_error, greater_is_better=False), 'mape': make_scorer(mean_absolute_percentage_error, greater_is_better=False), 'r2': make_scorer(r2_score), 'rmse': make_scorer(mean_squared_error, greater_is_better=False, squared=True)}, preprocessing: Pipeline | str = 'default', use_cross_validation: bool = True, cv: Any = 'default', test_size: float = 0.2, shuffle: bool = True, sort_result_by: str = 'MAPE', ascending: bool = True, save_results: bool = False, path_results: str = '', additional_model_params: Dict[str, Any] = {}, random_state: int | None = None, verbose: int = 0)[source]

Bases: CoreVulpes

Object to train many regression models. All parameters are optional and can be modified.

Parameters:
  • models_to_try (Union[str, List[Tuple[str, Any]]], optional) –

    List of models to try. It can be either a string naming a predefined list of models (“all”, …) or a list of tuples (model name, model class), e.g. (“RandomForestRegressor”, sklearn.ensemble.RandomForestRegressor).

    Defaults to “all” (train all the available regression algorithms).

  • custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_REG.

  • preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.

  • use_cross_validation (bool, optional) – whether to apply cross-validation. Defaults to True.

  • cv (Any, optional) – cross-validation object. It can be a predefined cross-validation setting (“default”, “timeseries”, …), an iterable object, a cross-validation object from sklearn, etc. Defaults to “default”: it applies a StratifiedShuffleSplit if a groups object is given when applying the fit method, otherwise, it uses a RepeatedKFold. In both cases, n_splits is set to 5.

  • test_size (float, optional) – size of the test set when splitting. Defaults to 0.2.

  • shuffle (bool, optional) – whether the samples are shuffled when splitting the given dataset. Defaults to True.

  • sort_result_by (str, optional) – metric used to sort the final dataframe. Defaults to “MAPE”.

  • ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to True.

  • save_results (bool, optional) – if True, save the results in a csv file. Defaults to False.

  • path_results (str, optional) – path to use when saving the results. Defaults to “”.

  • additional_model_params (Dict[str, Any], optional) – dictionary of parameters to apply to each element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 on every preprocessing step and/or model that has an n_estimators parameter. Defaults to {}.

  • random_state (int, optional) – random state variable. It is applied to every model and element of the pipeline. Defaults to None.

  • verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.
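
The default settings described above can be written out by hand with scikit-learn. The sketch below is only a reading of the parameter descriptions, not the code vulpes runs internally: the toy data, the column names, the Ridge model and n_repeats=2 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_absolute_error, r2_score
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numerical features: median imputation, then standardization.
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# "category" and object features: one-hot encoding.
preprocessing = ColumnTransformer([
    ("num", numeric, make_column_selector(dtype_include=np.number)),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_include=["category", "object"])),
])

# Scorers are name:scorer pairs built with make_scorer.
# greater_is_better=False makes scikit-learn negate the metric.
scoring = {
    "r2": make_scorer(r2_score),
    "mae": make_scorer(mean_absolute_error, greater_is_better=False),
}

# Toy dataset (assumption): two numeric columns and one categorical column.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "color": pd.Categorical(rng.choice(["red", "blue"], size=n)),
})
df.loc[::10, "x2"] = np.nan  # a few missing values for the imputer
y = (3.0 * df["x1"].to_numpy()
     + 2.0 * (df["color"] == "red").to_numpy()
     + rng.normal(scale=0.1, size=n))

# 5 splits as in the "default" cv description; n_repeats=2 is an assumption.
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
model = Pipeline([("preprocessing", preprocessing), ("regressor", Ridge())])
results = cross_validate(model, df, y, cv=cv, scoring=scoring)

mean_r2 = float(results["test_r2"].mean())
mean_mae = float(results["test_mae"].mean())  # negative: the sign is flipped
```

A Pipeline such as preprocessing and a dictionary such as scoring have the types that the preprocessing and custom_scorer parameters are documented to accept.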

Examples

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from vulpes.automl import Regressions
>>> X, y = make_regression(
...     n_samples=20, n_features=1, random_state=0, noise=4.0,
...     bias=100.0)
>>> regressions = Regressions()
>>> df_models = regressions.fit(X, y)
>>> df_models
| Model                       |    R2    |   RMSE    | ...
|-----------------------------|----------|-----------|------
| LassoCV                     | 0.997649 | 19.818497 | ...
| HuberRegressor              | 0.997776 | 19.881912 | ...
| Lars                        | 0.997694 | 19.555598 | ...
| TransformedTargetRegressor  | 0.997748 | 19.298391 | ...
| ...                         | ...      | ...       | ...

fit(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) → DataFrame[source]

Fit many models

Parameters:
  • X (Array_like) – Input dataset

  • y (Array_like) – Target variable

  • sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.

  • groups (Array_like, optional) – groups to use during cross validation to stratify. Defaults to None.

Raises:
  • ValueError – the cross-validation object has the wrong type, or the cross-validation failed

  • RuntimeError – Error when fitting a model

Returns:

dataframe with the goodness-of-fit metrics evaluated for each model.

Return type:

pd.DataFrame
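
To make the shape of such a results dataframe concrete, the sketch below builds a comparable table by hand with scikit-learn and pandas. It is not how fit is implemented: it uses a single train/test split rather than cross-validation, and the three models, the two metrics and the sample size are chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor, Lars, LassoCV
from sklearn.metrics import mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split

# Same (name, class) tuple format that models_to_try is documented to accept.
models_to_try = [
    ("LassoCV", LassoCV),
    ("HuberRegressor", HuberRegressor),
    ("Lars", Lars),
]

X, y = make_regression(
    n_samples=200, n_features=1, random_state=0, noise=4.0, bias=100.0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

# Fit each model and collect one row of metrics per model.
rows = []
for name, model_class in models_to_try:
    model = model_class().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append({
        "Model": name,
        "R2": r2_score(y_test, y_pred),
        "MAPE": mean_absolute_percentage_error(y_test, y_pred),
    })

# Mirror sort_result_by="MAPE" and ascending=True: lowest error first.
df_models = (
    pd.DataFrame(rows)
    .sort_values("MAPE", ascending=True)
    .reset_index(drop=True)
)
```

Sorting in ascending order suits error metrics such as MAPE, where lower is better; for a score such as R2, descending order would put the best model first.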

Examples: