vulpes.automl package

Submodules

vulpes.automl.classifiers module

classifiers.py: Class Classifiers to test many classification models

class vulpes.automl.classifiers.Classifiers(*, models_to_try: str | List[Tuple[str, Any]] = 'all', custom_scorer: Dict[str, Any] = {'accuracy': make_scorer(accuracy_score), 'auprc': make_scorer(pr_auc_score, needs_proba=True), 'auroc': make_scorer(roc_auc_score, needs_proba=True, multi_class=ovo, average=macro), 'avg_precision': make_scorer(avg_precision, needs_proba=True), 'balanced_accuracy': make_scorer(balanced_accuracy_score), 'f1': make_scorer(f1_score, average=macro), 'precision': make_scorer(precision_score, average=macro), 'recall': make_scorer(recall_score, average=macro)}, preprocessing: Pipeline | str = 'default', use_cross_validation: bool = True, cv: Any = 'default', test_size: float = 0.2, shuffle: bool = False, sort_result_by: str = 'Balanced Accuracy', ascending: bool = False, save_results: bool = False, path_results: str = '', additional_model_params: Dict[str, Any] = {}, random_state: int = 42, verbose: int = 0)[source]

Bases: CoreVulpes

Object to train many classifiers. All parameters are optional and can be modified.

Parameters:
  • models_to_try (Union[str, List[Tuple[str, Any]]], optional) –

    List of models to try. It can be either a string that corresponds to a predefined list of models (“all”, …) or a list of tuples (name of a model, class of a model), e.g. (“RandomForestClassifier”, sklearn.ensemble.RandomForestClassifier).

    Defaults to “all” (train all the available classification algorithms).

  • custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_CLF.

  • preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.

  • use_cross_validation (bool, optional) – whether to apply cross-validation. Defaults to True.

  • cv (Any, optional) – cross-validation object. It can be a predefined cross-validation setting (“default”, “timeseries”, …), an iterable object, a cross-validation object from sklearn, etc. Defaults to “default”: it applies a StratifiedShuffleSplit if a groups object is given when applying the fit method, otherwise, it uses a RepeatedKFold. In both cases, n_splits is set to 5.

  • test_size (float, optional) – size of the test set when splitting. Defaults to 0.2.

  • shuffle (bool, optional) – whether the algorithm shuffles the samples when splitting the given dataset. Defaults to False.

  • sort_result_by (str, optional) – metric by which to sort the final dataframe. Defaults to “Balanced Accuracy”.

  • ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to False.

  • save_results (bool, optional) – if True, save the results in a csv file. Defaults to False.

  • path_results (str, optional) – path to use when saving the results. Defaults to “”.

  • additional_model_params (Dict[str, Any], optional) – dictionary of parameters to apply to every element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 for every preprocessing step and/or model that has an n_estimators parameter. Defaults to {}.

  • random_state (int, optional) – random state. It is applied to every model and every element of the pipeline. Defaults to 42.

  • verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.
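The additional_model_params entry above describes forwarding a shared parameter only to the components that accept it. A minimal sketch of that idea in plain Python follows. The classes Scaler and Forest and the helper accepted_params are hypothetical stand-ins for illustration, not vulpes or scikit-learn objects, and vulpes may implement the filtering differently.

```python
import inspect


class Scaler:
    """Hypothetical preprocessing step with no n_estimators parameter."""

    def __init__(self, with_mean=True):
        self.with_mean = with_mean


class Forest:
    """Hypothetical model that accepts n_estimators and random_state."""

    def __init__(self, n_estimators=10, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state


def accepted_params(component_class, extra_params):
    """Keep only the entries of extra_params that the constructor accepts."""
    accepted = inspect.signature(component_class).parameters
    return {k: v for k, v in extra_params.items() if k in accepted}


extra = {"n_estimators": 100}
scaler = Scaler(**accepted_params(Scaler, extra))  # n_estimators is ignored
forest = Forest(**accepted_params(Forest, extra))  # n_estimators is applied
```

Filtering by constructor signature lets one dictionary be shared across heterogeneous pipeline elements without raising on the ones that lack the parameter.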

Examples

>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from vulpes.automl import Classifiers
>>> dataset = load_iris()
>>> X = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
>>> y = dataset["target"]
>>> classifiers = Classifiers()
>>> df_models = classifiers.fit(X, y)
>>> df_models
| Model                         | Balanced Accuracy | Accuracy | ...
|-------------------------------|-------------------|----------|------
| LinearDiscriminantAnalysis    | 0.977625          | 0.977333 | ...
| MLPClassifier                 | 0.973753          | 0.973333 | ...
| QuadraticDiscriminantAnalysis | 0.973219          | 0.973333 | ...
| KNeighborsClassifier          | 0.971702          | 0.969333 | ...
| ...                           | ...               | ...      | ...

fit(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) DataFrame[source]

Fit many models

Parameters:
  • X (Array_like) – Input dataset

  • y (Array_like) – Target variable

  • sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.

  • groups (Array_like, optional) – groups to use during cross validation to stratify. Defaults to None.

Raises:
  • ValueError – cross validation wrong type, or failed

  • RuntimeError – Error when fitting a model

Returns:

dataframe with the goodness-of-fit metrics evaluated for each model.

Return type:

pd.DataFrame

Examples:
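The default sort key for classifiers is “Balanced Accuracy” in descending order. The sketch below, in plain Python, shows why that metric can rank models differently from plain accuracy on imbalanced data. Balanced accuracy is computed here as the mean of per-class recall; the model names and predictions are made up for illustration, and the list of dictionaries only mimics the shape of the returned dataframe.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Imbalanced target: eight samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predictions = {  # hypothetical models
    "always_majority": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "balanced_model": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
}

rows = [
    {
        "Model": name,
        "Balanced Accuracy": balanced_accuracy(y_true, y_pred),
        "Accuracy": accuracy(y_true, y_pred),
    }
    for name, y_pred in predictions.items()
]
# Equivalent of sort_result_by="Balanced Accuracy", ascending=False.
rows.sort(key=lambda row: row["Balanced Accuracy"], reverse=True)
```

Both hypothetical models reach the same accuracy, but only the second one recovers the minority class, so it ranks first.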

vulpes.automl.clustering module

clustering.py: Class Clustering to test many clustering algorithms

class vulpes.automl.clustering.Clustering(*, models_to_try: str | ~typing.List[~typing.Tuple[str, ~typing.Any]] = 'all', custom_scorer: ~typing.Dict[str, ~typing.Any] = {'calinski_harabasz': <function calinski_harabasz_score>, 'davies_bouldin': <function davies_bouldin_score>, 'silhouette': <function silhouette_score>}, preprocessing: ~sklearn.pipeline.Pipeline | str = 'default', sort_result_by: str = 'Davies–Bouldin Index', ascending: bool = True, save_results: bool = False, path_results: str = '', additional_model_params: ~typing.Dict[str, ~typing.Any] = {'eps': 0.5, 'min_samples': 5, 'nb_clusters': 3}, random_state: int | None = None, verbose: int = 0)[source]

Bases: CoreVulpes

Object to run many clustering algorithms. All parameters are optional and can be modified.

Parameters:
  • models_to_try (Union[str, List[Tuple[str, Any]]], optional) –

    List of models to try. It can be either a string that corresponds to a predefined list of models (“all”, …) or a list of tuples (name of a model, class of a model), e.g. (“KMeans”, sklearn.cluster.KMeans).

    Defaults to “all” (train all the available clustering algorithms).

  • custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_CLT.

  • preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.

  • sort_result_by (str, optional) – metric by which to sort the final dataframe. Defaults to “Davies–Bouldin Index”.

  • ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to True.

  • save_results (bool, optional) – if True, save the results in a csv file. Defaults to False.

  • path_results (str, optional) – path to use when saving the results. Defaults to “”.

  • additional_model_params (Dict[str, Any], optional) – dictionary of parameters to apply to every element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 for every preprocessing step and/or model that has an n_estimators parameter. Defaults to {“nb_clusters”: 3, “min_samples”: 5, “eps”: 0.5}.

  • random_state (int, optional) – random state. It is applied to every model and every element of the pipeline. Defaults to None.

  • verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.

Examples

>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from vulpes.automl import Clustering
>>> dataset = load_iris()
>>> X = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
>>> clustering = Clustering()
>>> df_models = clustering.fit(X)
>>> df_models
| Model                   | Calinski-Harabasz Index | ...
|-------------------------|-------------------------|----------
| AgglomerativeClustering | 502.821564              | ...
| MeanShift               | 509.703427              | ...
| Birch                   | 458.472511              | ...
| SpectralClustering      | 410.369441              | ...
| ...                     | ...                     | ...

fit(X: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None) DataFrame[source]

Fit many clustering algorithms

Parameters:
  • X (Array_like) – Input dataset

  • sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.

Raises:

RuntimeError – Error when fitting a model

Returns:

dataframe with the goodness-of-fit metrics evaluated for each clustering algorithm.

Return type:

pd.DataFrame

Examples:
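Clustering results are sorted by “Davies–Bouldin Index” in ascending order by default, because a lower value indicates tighter, better separated clusters. The sketch below computes the index by hand in plain Python for two partitions of the same four points. It uses the standard definition (average, over clusters, of the worst ratio of summed within-cluster scatter to centroid distance); the data and partition names are made up, and the library itself relies on a scorer function rather than this code.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points (tuples)."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition given as a list of point lists."""
    centroids = [centroid(points) for points in clusters]
    # Scatter: mean distance between the points of a cluster and its centroid.
    scatter = [
        sum(math.dist(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst_ratios) / k


# Two hypothetical partitions of the same four points.
partitions = {
    "left_vs_right": [[(0, 0), (0, 2)], [(10, 0), (10, 2)]],
    "bottom_vs_top": [[(0, 0), (10, 0)], [(0, 2), (10, 2)]],
}
scores = {name: davies_bouldin(parts) for name, parts in partitions.items()}
# Equivalent of ascending=True: the lowest index comes first.
ranking = sorted(scores, key=scores.get)
```

The partition that groups nearby points gets the lower index and is ranked first.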

vulpes.automl.corevulpes module

corevulpes.py: Parent class that contains common methods shared between children classes

class vulpes.automl.corevulpes.CoreVulpes[source]

Bases: ABC

Parent class with shared methods between the classes Classifiers, Regressions and Clustering

build_best_models(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None, nb_models: int = 5, sort_result_by: str | None = None, ascending: bool | None = None, voting: str = 'hard', weights: List | DataFrame | Series | ndarray | Any | None = None) DataFrame[source]

When many models have been fitted, create an aggregated model using a voting system by selecting the nb_models best models based on the metric sort_result_by.

Parameters:
  • X (Array_like) – dataset to fit the ‘best model’

  • y (Array_like) – response/outcome variable

  • sample_weight (Array_like, optional) – sample weight. Defaults to None.

  • groups (Array_like, optional) – groups to stratify during cross validation. Defaults to None.

  • nb_models (int, optional) – number of models to select when creating the aggregated model. Defaults to 5.

  • sort_result_by (str, optional) – metric used to rank the trained models and select the best ones. Defaults to None.

  • ascending (bool, optional) – if ascending=True, the lower the metric sort_result_by is, the better the model is. Defaults to None.

  • voting (str, optional) – “hard” or “soft”. If “soft”, use the predicted probabilities of the different estimators to make a prediction. Defaults to “hard”.

  • weights (Array_like, optional) – assign a different weight to each estimator. Defaults to None, which gives every estimator the same weight.

Raises:
  • ValueError – Voting Clustering not available

  • NotFittedError – Fit models before building a ‘best model’

  • ValueError – fewer fitted models than the parameter nb_models

  • NotImplementedError – Voting Clustering not available

  • NotImplementedError – Sample weight not available

  • ValueError – Wrong type of cross validation

  • RuntimeError – Error when fitting an estimator

Returns:

Performance of the aggregated model on different metrics

Return type:

pd.DataFrame

Examples:
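The voting and weights parameters control how the selected models are combined. The plain-Python sketch below illustrates the general difference between hard and soft voting for a single sample. The helper names and the probabilities are made up for illustration, and this is not the code vulpes runs internally.

```python
def argmax(values):
    """Index of the largest value (the lowest index wins ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def hard_vote(probas, weights=None):
    """Each estimator votes for its most likely class; votes are weighted."""
    weights = weights or [1.0] * len(probas)
    tally = [0.0] * len(probas[0])
    for proba, weight in zip(probas, weights):
        tally[argmax(proba)] += weight
    return argmax(tally)


def soft_vote(probas, weights=None):
    """Average the predicted probabilities, then pick the most likely class."""
    weights = weights or [1.0] * len(probas)
    total = sum(weights)
    averaged = [
        sum(w * proba[c] for proba, w in zip(probas, weights)) / total
        for c in range(len(probas[0]))
    ]
    return argmax(averaged)


# Predicted probabilities of classes 0 and 1 from three hypothetical estimators.
probas = [[0.55, 0.45], [0.55, 0.45], [0.10, 0.90]]

hard_label = hard_vote(probas)  # two estimators out of three prefer class 0
soft_label = soft_vote(probas)  # the confident third estimator tips the average
weighted_label = hard_vote(probas, weights=[1, 1, 3])
```

Hard voting ignores how confident each estimator is, whereas soft voting lets a confident estimator outweigh two hesitant ones; weights shift the balance in either mode.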

create_pipeline(model_name: str, model: Any) Pipeline[source]

Create a pipeline by combining an optional preprocessing pipeline and the given model

Parameters:
  • model_name (str) – name of the model (lowercase)

  • model (Any) – Model at the end of the pipeline

Returns:

Pipeline with a preprocessing task (if not set to None) and the given model

Return type:

Pipeline

Examples:

get_best_model() Pipeline[source]

Get the model created and fitted by the method build_best_models

Raises:
  • NotFittedError – Models not trained

  • NotFittedError – Best model not calculated

Returns:

‘Best model’ using multiple fitted models

Return type:

Pipeline

Examples:

get_fitted_models() Dict[str, Pipeline | List[Pipeline]][source]

Get a dictionary with the fitted models

Raises:

NotFittedError – Models have not been fitted yet

Returns:

Dictionary with, for each model, either the fitted model or all the models fitted during cross-validation.

Return type:

Dict[str, Union[Pipeline, List[Pipeline]]]

Examples:

missing_data(X: List | DataFrame | Series | ndarray | Any) DataFrame[source]

Evaluate the absolute count and the percentage of missing data in a particular dataset

Parameters:

X (Array_like) – Dataset

Returns:

Absolute count and percentage of missing data in X

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([["a", "x"],
                       [np.nan, "y"],
                       ["a", np.nan],
                       ["b", np.nan]],
                      dtype="category",
                      columns=["feature1", "feature2"])
>>> from vulpes.automl import Classifiers
>>> classifiers = Classifiers()
>>> classifiers.missing_data(df)
|          | Total Missing | Percentage (%) |
|----------|--------------:|---------------:|
| feature2 |             2 |           50.0 |
| feature1 |             1 |           25.0 |
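The computation behind this report can be sketched without pandas. The version below is a plain-Python approximation under stated assumptions: the helper names are hypothetical, missing values are taken to be None or NaN, and rows are ordered by decreasing count as in the output above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_summary(columns):
    """Return (column, missing count, missing percentage), largest count first."""
    rows = []
    for name, values in columns.items():
        count = sum(is_missing(v) for v in values)
        rows.append((name, count, 100.0 * count / len(values)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Same data as in the example above, stored column by column.
data = {
    "feature1": ["a", float("nan"), "a", "b"],
    "feature2": ["x", "y", float("nan"), float("nan")],
}
summary = missing_summary(data)
```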
predefined_cv(X: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) Any[source]

Convert a cross validation string (self.cv parameter) into a predefined cross validation object

Parameters:
  • X (Array_like, optional) – if necessary, X is the dataset. Defaults to None.

  • groups (Array_like, optional) – if necessary, groups is an array-like object on which we stratify to create the different folds. Defaults to None.

Raises:
  • ValueError – raised if the string doesn’t correspond to a predefined cross-validation

Returns:

Cross validation object

Return type:

Any

Examples:
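The general pattern of turning a string into a cross-validation strategy, and raising ValueError for an unknown string, can be sketched in plain Python as below. Everything here is an assumption made for illustration: the keys “kfold” and “timeseries”, the splitter functions, and resolve_cv are hypothetical, and vulpes returns scikit-learn cross-validation objects rather than these generators.

```python
def kfold_splits(n_samples, n_splits=5):
    """Contiguous, unshuffled k-fold: each fold is used once as the test set."""
    indices = list(range(n_samples))
    sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def timeseries_splits(n_samples, n_splits=5):
    """Expanding window: always train on the past, test on the next block."""
    test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


PREDEFINED_CV = {"kfold": kfold_splits, "timeseries": timeseries_splits}


def resolve_cv(cv):
    """Map a known string to a splitter; pass any other object through."""
    if isinstance(cv, str):
        if cv not in PREDEFINED_CV:
            raise ValueError(f"Unknown cross-validation setting: {cv!r}")
        return PREDEFINED_CV[cv]
    return cv


folds = list(resolve_cv("timeseries")(6, n_splits=2))
```

Keeping the string lookup in one place means that user-supplied splitters and predefined names follow the same code path afterwards.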

predefined_list_models(models_to_try: str | List[Tuple[str, Any]] = 'all') List[Tuple[str, Any]][source]

If models_to_try isn’t a list of models but is a string, return the corresponding predefined list of models to test

Parameters:
  • models_to_try (Union[str, List[Tuple[str, Any]]], optional) – string naming a predefined list of models, or a list of tuples (name of model, model). Defaults to “all”.

Raises:
  • ValueError – raised if models_to_try is a string that doesn't correspond to any predefined list of models

Returns:

list of tuple (name of the model, model)

Return type:

List[Tuple[str, Any]]

Examples:

predefined_preprocessing() Pipeline[source]

Return a predefined preprocessing pipeline if self.preprocessing is a string. Otherwise, check that self.preprocessing is a Pipeline or None (in which case no preprocessing is applied).

Raises:
  • ValueError – Unknown string

  • TypeError – self.preprocessing is not a string, not None, and not a Pipeline object

Returns:

None or the preprocessing Pipeline to apply to each model

Return type:

Pipeline

Examples:

predict(X: List | DataFrame | Series | ndarray | Any, *, dataframe_format: bool = True) DataFrame | Dict[str, ndarray][source]

Evaluate all the fitted models on the dataset X

Parameters:
  • X (Array_like) – Array-like object on which we’ll make prediction(s)

  • dataframe_format (bool, optional) – if True, the result is a dataframe with all the predictions for all the models. Defaults to True.

Returns:

Dictionary or dataframe with the predictions

Return type:

Union[pd.DataFrame, Dict[str, np.ndarray]]

Examples:

predict_best(X: List | DataFrame | Series | ndarray | Any) ndarray[source]

Evaluate the fitted ‘best model’ on the array-like X

Parameters:

X (Array_like) – dataset

Returns:

array of predictions

Return type:

np.ndarray

Examples:

predict_proba(X: List | DataFrame | Series | ndarray | Any) Dict[str, ndarray][source]

Based on the fitted models, make many probability predictions on the dataset X

Parameters:

X (Array_like) – Dataset

Returns:

dictionary with, for each model, an array of the corresponding predicted probabilities

Return type:

Dict[str, np.ndarray]

Examples:

predict_proba_best(X: List | DataFrame | Series | ndarray | Any) ndarray[source]

Evaluate the fitted ‘best model’ on the array-like X and return probabilities

Parameters:

X (Array_like) – dataset

Returns:

predicted probabilities

Return type:

np.ndarray

Examples:

remove_proba_metrics(dic_scorer: Dict[str, _ProbaScorer | _PredictScorer]) Dict[str, _ProbaScorer | _PredictScorer][source]

Take a dictionary of metrics used to evaluate the goodness-of-fit of classifiers as input. Return a new version of this dictionary with only the metrics that can be calculated without predicted probabilities (metrics that need them, e.g. AUROC, are removed).

Parameters:
  • dic_scorer (Dict[str, Union[_ProbaScorer, _PredictScorer]]) – dictionary of metrics

Returns:

filtered dictionary of metrics

Return type:

Dict[str, Union[_ProbaScorer, _PredictScorer]]

Examples:
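The filtering idea can be sketched with a small stand-in for a scorer. In the plain-Python example below, the Scorer dataclass, its needs_proba flag and drop_proba_scorers are hypothetical and only illustrate the behaviour described above; the real method works on scikit-learn scorer objects.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Scorer:
    """Hypothetical stand-in for a scorer built with make_scorer."""

    score_func: Callable
    needs_proba: bool = False


def drop_proba_scorers(scorers: Dict[str, Scorer]) -> Dict[str, Scorer]:
    """Return a new dictionary without the scorers that need probabilities."""
    return {name: s for name, s in scorers.items() if not s.needs_proba}


def placeholder_metric(y_true, y_pred):
    """Dummy metric; only the needs_proba flag matters in this sketch."""
    return 0.0


scorers = {
    "accuracy": Scorer(placeholder_metric),
    "f1": Scorer(placeholder_metric),
    "auroc": Scorer(placeholder_metric, needs_proba=True),
}
label_only = drop_proba_scorers(scorers)
```

Such a filter is useful when a model exposes predict but not predict_proba, since the remaining metrics can still be evaluated.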

vulpes.automl.regressions module

regressions.py: Class Regressions to test many regression models

class vulpes.automl.regressions.Regressions(*, models_to_try: str | List[Tuple[str, Any]] = 'all', custom_scorer: Dict[str, Any] = {'adj_r2': None, 'mae': make_scorer(mean_absolute_error, greater_is_better=False), 'mape': make_scorer(mean_absolute_percentage_error, greater_is_better=False), 'r2': make_scorer(r2_score), 'rmse': make_scorer(mean_squared_error, greater_is_better=False, squared=True)}, preprocessing: Pipeline | str = 'default', use_cross_validation: bool = True, cv: Any = 'default', test_size: float = 0.2, shuffle: bool = True, sort_result_by: str = 'MAPE', ascending: bool = True, save_results: bool = False, path_results: str = '', additional_model_params: Dict[str, Any] = {}, random_state: int | None = None, verbose: int = 0)[source]

Bases: CoreVulpes

Object to train many regressions. All parameters are optional and can be modified.

Parameters:
  • models_to_try (Union[str, List[Tuple[str, Any]]], optional) –

    List of models to try. It can be either a string that corresponds to a predefined list of models (“all”, …) or a list of tuples (name of a model, class of a model), e.g. (“RandomForestRegressor”, sklearn.ensemble.RandomForestRegressor).

    Defaults to “all” (train all the available regression algorithms).

  • custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_REG.

  • preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.

  • use_cross_validation (bool, optional) – whether or not we apply a cross-validation. Defaults to True.

  • cv (Any, optional) – cross-validation object. It can be a predefined cross-validation setting (“default”, “timeseries”, …), an iterable object, a cross-validation object from sklearn, etc. Defaults to “default”: it applies a StratifiedShuffleSplit if a groups object is given when applying the fit method, otherwise, it uses a RepeatedKFold. In both cases, n_splits is set to 5.

  • test_size (float, optional) – size of the test set when splitting. Defaults to 0.2.

  • shuffle (bool, optional) – whether the algorithm shuffles the samples when splitting the given dataset. Defaults to True.

  • sort_result_by (str, optional) – metric by which to sort the final dataframe. Defaults to “MAPE”.

  • ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to True.

  • save_results (bool, optional) – if True, save the results in a csv file. Defaults to False.

  • path_results (str, optional) – path to use when saving the results. Defaults to “”.

  • additional_model_params (Dict[str, Any], optional) – dictionary of parameters to apply to every element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 for every preprocessing step and/or model that has an n_estimators parameter. Defaults to {}.

  • random_state (int, optional) – random state. It is applied to every model and every element of the pipeline. Defaults to None.

  • verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.

Examples

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from vulpes.automl import Regressions
>>> X, y = make_regression(
        n_samples=20, n_features=1, random_state=0, noise=4.0,
        bias=100.0)
>>> regressions = Regressions()
>>> df_models = regressions.fit(X, y)
>>> df_models
| Model                       |    R2    |   RMSE    | ...
|-----------------------------|----------|-----------|------
| LassoCV                     | 0.997649 | 19.818497 | ...
| HuberRegressor              | 0.997776 | 19.881912 | ...
| Lars                        | 0.997694 | 19.555598 | ...
| TransformedTargetRegressor  | 0.997748 | 19.298391 | ...
| ...                         | ...      | ...       | ...

fit(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) DataFrame[source]

Fit many models

Parameters:
  • X (Array_like) – Input dataset

  • y (Array_like) – Target variable

  • sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.

  • groups (Array_like, optional) – groups to use during cross validation to stratify. Defaults to None.

Raises:
  • ValueError – cross validation wrong type, or failed

  • RuntimeError – Error when fitting a model

Returns:

dataframe with the goodness-of-fit metrics evaluated for each model.

Return type:

pd.DataFrame

Examples:
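Regression results are sorted by “MAPE” in ascending order by default, since a lower error is better. The plain-Python sketch below computes MAPE (as a fraction, not a percentage) and RMSE for two made-up sets of predictions and sorts them the same way. The model names and values are hypothetical, and the list of dictionaries only mimics the shape of the returned dataframe.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, expressed as a fraction."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [100.0, 200.0]
predictions = {  # hypothetical models
    "model_a": [110.0, 180.0],
    "model_b": [101.0, 230.0],
}

rows = [
    {"Model": name, "MAPE": mape(y_true, y_pred), "RMSE": rmse(y_true, y_pred)}
    for name, y_pred in predictions.items()
]
# Equivalent of sort_result_by="MAPE", ascending=True.
rows.sort(key=lambda row: row["MAPE"])
```

In this made-up case the second model has the lower MAPE but the higher RMSE, so the choice of sort_result_by changes which model appears first.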

Module contents