vulpes.automl package
Submodules
vulpes.automl.classifiers module
classifiers.py: Class Classifiers to test many classification models
- class vulpes.automl.classifiers.Classifiers(*, models_to_try: str | List[Tuple[str, Any]] = 'all', custom_scorer: Dict[str, Any] = {'accuracy': make_scorer(accuracy_score), 'auprc': make_scorer(pr_auc_score, needs_proba=True), 'auroc': make_scorer(roc_auc_score, needs_proba=True, multi_class=ovo, average=macro), 'avg_precision': make_scorer(avg_precision, needs_proba=True), 'balanced_accuracy': make_scorer(balanced_accuracy_score), 'f1': make_scorer(f1_score, average=macro), 'precision': make_scorer(precision_score, average=macro), 'recall': make_scorer(recall_score, average=macro)}, preprocessing: Pipeline | str = 'default', use_cross_validation: bool = True, cv: Any = 'default', test_size: float = 0.2, shuffle: bool = False, sort_result_by: str = 'Balanced Accuracy', ascending: bool = False, save_results: bool = False, path_results: str = '', additional_model_params: Dict[str, Any] = {}, random_state: int = 42, verbose: int = 0)[source]
Bases:
CoreVulpes
Object to train many classifiers. All the parameters are optional and can be modified.
- Parameters:
models_to_try (Union[str, List[Tuple[str, Any]]], optional) – List of models to try. It can be either a string that corresponds to a predefined list of models (“all”, …) or a list of tuples (name of a model, class of a model), e.g. (“RandomForestClassifier”, sklearn.ensemble.RandomForestClassifier). Defaults to “all” (train all the available classification algorithms).
custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_CLF.
preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.
use_cross_validation (bool, optional) – whether or not to apply cross-validation. Defaults to True.
cv (Any, optional) – cross-validation object. It can be a predefined cross-validation setting (“default”, “timeseries”, …), an iterable object, a cross-validation object from sklearn, etc. Defaults to “default”: it applies a StratifiedShuffleSplit if a groups object is given when applying the fit method, otherwise, it uses a RepeatedKFold. In both cases, n_splits is set to 5.
test_size (float, optional) – size of the test set when splitting. Defaults to 0.2.
shuffle (bool, optional) – whether or not the algorithm shuffles the samples when splitting the given dataset. Defaults to False.
sort_result_by (str, optional) – metric used to sort the final dataframe. Defaults to “Balanced Accuracy”.
ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to False.
save_results (bool, optional) – if True, save the results in a CSV file. Defaults to False.
path_results (str, optional) – path to use when saving the results. Defaults to “”.
additional_model_params (Dict[str, Any], optional) – dictionary of parameters to be applied to each element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 for every preprocessing task and/or model that exposes this parameter. Defaults to {}.
random_state (int, optional) – random state seed, applied to every model and every element of the pipeline. Defaults to 42.
verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.
Examples
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from vulpes.automl import Classifiers
>>> dataset = load_iris()
>>> X = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
>>> y = dataset["target"]
>>> classifiers = Classifiers()
>>> df_models = classifiers.fit(X, y)
>>> df_models
| Model                         | Balanced Accuracy | Accuracy | ... |
|-------------------------------|-------------------|----------|-----|
| LinearDiscriminantAnalysis    | 0.977625          | 0.977333 | ... |
| MLPClassifier                 | 0.973753          | 0.973333 | ... |
| QuadraticDiscriminantAnalysis | 0.973219          | 0.973333 | ... |
| KNeighborsClassifier          | 0.971702          | 0.969333 | ... |
| ...                           | ...               | ...      | ... |
- fit(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) DataFrame [source]
Fit many models
- Parameters:
X (Array_like) – Input dataset
y (Array_like) – Target variable
sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.
groups (Array_like, optional) – groups to use during cross validation to stratify. Defaults to None.
- Raises:
ValueError – wrong cross-validation type, or cross-validation failed
RuntimeError – Error when fitting a model
- Returns:
dataframe with the goodness-of-fit metrics evaluated for each model.
- Return type:
pd.DataFrame
Examples:
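A minimal sketch, reusing X and y from the class-level example above; the uniform weights are illustrative, since sample_weight is optional:

>>> import numpy as np
>>> classifiers = Classifiers()
>>> weights = np.ones(len(y))  # illustrative: uniform sample weights
>>> df_models = classifiers.fit(X, y, sample_weight=weights)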
vulpes.automl.clustering module
clustering.py: Class Clustering to test many clustering algorithms
- class vulpes.automl.clustering.Clustering(*, models_to_try: str | ~typing.List[~typing.Tuple[str, ~typing.Any]] = 'all', custom_scorer: ~typing.Dict[str, ~typing.Any] = {'calinski_harabasz': <function calinski_harabasz_score>, 'davies_bouldin': <function davies_bouldin_score>, 'silhouette': <function silhouette_score>}, preprocessing: ~sklearn.pipeline.Pipeline | str = 'default', sort_result_by: str = 'Davies–Bouldin Index', ascending: bool = True, save_results: bool = False, path_results: str = '', additional_model_params: ~typing.Dict[str, ~typing.Any] = {'eps': 0.5, 'min_samples': 5, 'nb_clusters': 3}, random_state: int | None = None, verbose: int = 0)[source]
Bases:
CoreVulpes
Object to train many clustering algorithms. All the parameters are optional and can be modified.
- Parameters:
models_to_try (Union[str, List[Tuple[str, Any]]], optional) – List of models to try. It can be either a string that corresponds to a predefined list of models (“all”, …) or a list of tuples (name of a model, class of a model), e.g. (“KMeans”, sklearn.cluster.KMeans). Defaults to “all” (train all the available clustering algorithms).
custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_CLT.
preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.
sort_result_by (str, optional) – metric used to sort the final dataframe. Defaults to “Davies–Bouldin Index”.
ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to True.
save_results (bool, optional) – if True, save the results in a CSV file. Defaults to False.
path_results (str, optional) – path to use when saving the results. Defaults to “”.
additional_model_params (Dict[str, Any], optional) – dictionary of parameters to be applied to each element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 for every preprocessing task and/or model that exposes this parameter. Defaults to {“nb_clusters”: 3, “min_samples”: 5, “eps”: 0.5}.
random_state (int, optional) – random state seed, applied to every model and every element of the pipeline. Defaults to None.
verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.
Examples
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from vulpes.automl import Clustering
>>> dataset = load_iris()
>>> X = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
>>> clustering = Clustering()
>>> df_models = clustering.fit(X)
>>> df_models
| Model                   | Calinski-Harabasz Index | ... |
|-------------------------|-------------------------|-----|
| AgglomerativeClustering | 502.821564              | ... |
| MeanShift               | 509.703427              | ... |
| Birch                   | 458.472511              | ... |
| SpectralClustering      | 410.369441              | ... |
| ...                     | ...                     | ... |
- fit(X: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None) DataFrame [source]
Fit many clustering algorithms
- Parameters:
X (Array_like) – Input dataset
sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.
- Raises:
RuntimeError – Error when fitting a model
- Returns:
dataframe with the goodness-of-fit metrics evaluated for each clustering algorithm.
- Return type:
pd.DataFrame
Examples:
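A minimal sketch, reusing X from the class-level example above; the uniform weights are illustrative, since sample_weight is optional:

>>> import numpy as np
>>> clustering = Clustering()
>>> w = np.ones(len(X))  # illustrative: uniform sample weights
>>> df_models = clustering.fit(X, sample_weight=w)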
vulpes.automl.corevulpes module
corevulpes.py: Parent class that contains common methods shared between children classes
- class vulpes.automl.corevulpes.CoreVulpes[source]
Bases:
ABC
Parent class with shared methods between the classes Classifiers, Regressions and Clustering
- build_best_models(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None, nb_models: int = 5, sort_result_by: str | None = None, ascending: bool | None = None, voting: str = 'hard', weights: List | DataFrame | Series | ndarray | Any | None = None) DataFrame [source]
When many models have been fitted, create an aggregated model using a voting system by selecting the nb_models best models based on the metric sort_result_by.
- Parameters:
X (Array_like) – dataset to fit the ‘best model’
y (Array_like) – response/outcome variable
sample_weight (Array_like, optional) – sample weight. Defaults to None.
groups (Array_like, optional) – groups to stratify during cross validation. Defaults to None.
nb_models (int, optional) – number of models to select when creating the aggregated model. Defaults to 5.
sort_result_by (str, optional) – metric used to rank the trained models and select the best ones. Defaults to None.
ascending (bool, optional) – if ascending=True, the lower the metric sort_result_by is, the better the model is. Defaults to None.
voting (str, optional) – “hard” or “soft”. If “soft”, use the predicted probabilities of the different estimators to make a prediction. Defaults to “hard”.
weights (Array_like, optional) – assign a different weight to each estimator. Defaults to None (equal weights).
- Raises:
ValueError – Voting Clustering not available
NotFittedError – Fit models before building a ‘best model’
ValueError – fewer fitted models than the parameter nb_models
NotImplementedError – Voting Clustering not available
NotImplementedError – Sample weight not available
ValueError – Wrong type of cross validation
RuntimeError – Error when fitting an estimator
- Returns:
Performance of the aggregated model on different metrics
- Return type:
pd.DataFrame
Examples:
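A hedged sketch, assuming X and y from the Classifiers example above; fit must be called before building the aggregated model:

>>> classifiers = Classifiers()
>>> df_models = classifiers.fit(X, y)
>>> df_best = classifiers.build_best_models(X, y, nb_models=5, voting="soft")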
- create_pipeline(model_name: str, model: Any) Pipeline [source]
Create a pipeline by combining an optional preprocessing pipeline and the given model
- Parameters:
model_name (str) – name of the model (lowercase)
model (Any) – Model at the end of the pipeline
- Returns:
Pipeline with a preprocessing task (if not set to None) and the given model
- Return type:
Pipeline
Examples:
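A sketch under the assumption that the model is passed as an instance; the lowercase model name follows the convention documented above:

>>> from sklearn.ensemble import RandomForestClassifier
>>> pipe = classifiers.create_pipeline("randomforestclassifier",
...                                    RandomForestClassifier())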
- get_best_model() Pipeline [source]
Get the model created and fitted by the method build_best_models
- Raises:
NotFittedError – Models not trained
NotFittedError – Best model not calculated
- Returns:
‘Best model’ using multiple fitted models
- Return type:
Pipeline
Examples:
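Illustrative usage, assuming X and y as above; build_best_models must have been called first:

>>> df_best = classifiers.build_best_models(X, y)
>>> best_pipe = classifiers.get_best_model()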
- get_fitted_models() Dict[str, Pipeline | List[Pipeline]] [source]
Get a dictionary with the fitted models
- Raises:
NotFittedError – Models have not been fitted yet
- Returns:
Dictionary with, for each model, either the fitted model or all the fitted models (when cross-validation is used).
- Return type:
Dict[str, Union[Pipeline, List[Pipeline]]]
Examples:
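Illustrative usage after fitting:

>>> df_models = classifiers.fit(X, y)
>>> fitted = classifiers.get_fitted_models()
>>> names = list(fitted)  # one key per fitted model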
- missing_data(X: List | DataFrame | Series | ndarray | Any) DataFrame [source]
Evaluate the absolute count and the percentage of missing data in a particular dataset
- Parameters:
X (Array_like) – Dataset
- Returns:
Absolute count and percentage of missing data in X
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([["a", "x"], [np.nan, "y"], ["a", np.nan], ["b", np.nan]],
...                   dtype="category", columns=["feature1", "feature2"])
>>> classifiers.missing_data(df)
|          | Total Missing | Percentage (%) |
|----------|--------------:|---------------:|
| feature2 |             2 |           50.0 |
| feature1 |             1 |           25.0 |
- predefined_cv(X: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) Any [source]
Convert a cross validation string (self.cv parameter) into a predefined cross validation object
- Parameters:
X (Array_like, optional) – if necessary, X is the dataset. Defaults to None.
groups (Array_like, optional) – if necessary, groups is an array-like object on which we stratify to create the different folds. Defaults to None.
- Raises:
ValueError – raised if the string doesn’t correspond to a predefined cross validation
- Returns:
Cross validation object
- Return type:
Any
Examples:
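A sketch, assuming X from the Classifiers example above; resolves the “default” string into a concrete cross-validation object:

>>> clf = Classifiers(cv="default")
>>> cv_obj = clf.predefined_cv(X)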
- predefined_list_models(models_to_try: str | List[Tuple[str, Any]] = 'all') List[Tuple[str, Any]] [source]
If models_to_try isn’t a list of models but is a string, return the corresponding predefined list of models to test
- Parameters:
models_to_try (Union[str, List[Tuple[str, Any]]], optional) – string identifying a predefined list of models, or a list of tuples (name of model, model). Defaults to “all”.
- Raises:
ValueError – raised if models_to_try is a string that doesn’t correspond to any predefined list of models
- Returns:
list of tuple (name of the model, model)
- Return type:
List[Tuple[str, Any]]
Examples:
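Illustrative usage with the documented “all” predefined list:

>>> models = classifiers.predefined_list_models("all")
>>> name, model = models[0]  # each entry is a (name, model) tuple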
- predefined_preprocessing() Pipeline [source]
Return a predefined preprocessing pipeline if self.preprocessing is a string. Otherwise, check that self.preprocessing is either a Pipeline or None (in that case, no preprocessing is applied).
- Raises:
ValueError – Unknown string
TypeError – self.preprocessing is not a string, not None, and not a Pipeline object
- Returns:
None, or the preprocessing Pipeline to apply to each model
- Return type:
Pipeline
Examples:
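Illustrative usage; resolves the preprocessing configured at construction time:

>>> clf = Classifiers(preprocessing="default")
>>> pipe = clf.predefined_preprocessing()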
- predict(X: List | DataFrame | Series | ndarray | Any, *, dataframe_format: bool = True) DataFrame | Dict[str, ndarray] [source]
Evaluate all the fitted models on the dataset X
- Parameters:
X (Array_like) – Array-like object on which we’ll make prediction(s)
dataframe_format (bool, optional) – if True, the result is a dataframe with the predictions of all the models. Defaults to True.
- Returns:
Dictionary or dataframe with the predictions
- Return type:
Union[pd.DataFrame, Dict[str, np.ndarray]]
Examples:
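Illustrative usage on an already fitted Classifiers object; X as above:

>>> df_preds = classifiers.predict(X, dataframe_format=True)
>>> dict_preds = classifiers.predict(X, dataframe_format=False)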
- predict_best(X: List | DataFrame | Series | ndarray | Any) ndarray [source]
Evaluate the fitted ‘best model’ on the array-like X
- Parameters:
X (Array_like) – dataset
- Returns:
array of predictions
- Return type:
np.ndarray
Examples:
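Illustrative usage; requires build_best_models to have been called first:

>>> y_pred = classifiers.predict_best(X)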
- predict_proba(X: List | DataFrame | Series | ndarray | Any) Dict[str, ndarray] [source]
Based on the fitted models, make many probability predictions on the dataset X
- Parameters:
X (Array_like) – Dataset
- Returns:
dictionary with, for each model, an array of the corresponding predicted probabilities
- Return type:
Dict[str, np.ndarray]
Examples:
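Illustrative usage on an already fitted Classifiers object; the keys are presumably the model names, per the return type above:

>>> probas = classifiers.predict_proba(X)  # dict: model name -> probability array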
- predict_proba_best(X: List | DataFrame | Series | ndarray | Any) ndarray [source]
Evaluate the fitted ‘best model’ on the array-like X and return probabilities
- Parameters:
X (Array_like) – dataset
- Returns:
predicted probabilities
- Return type:
np.ndarray
Examples:
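Illustrative usage; requires a fitted ‘best model’:

>>> proba_best = classifiers.predict_proba_best(X)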
- remove_proba_metrics(dic_scorer: Dict[str, _ProbaScorer | _PredictScorer]) Dict[str, _ProbaScorer | _PredictScorer] [source]
Take a dictionary of metrics used to evaluate the goodness-of-fit of classifiers as input. Return a new version of this dictionary that keeps only the metrics that don’t need probabilities to be computed (probability-based metrics such as AUROC are removed)
- Parameters:
dic_scorer (Dict[str, Union[_ProbaScorer, _PredictScorer]]) – dictionary of metrics
- Returns:
filtered dictionary of metrics
- Return type:
Dict[str, Union[_ProbaScorer, _PredictScorer]]
Examples:
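A sketch using sklearn scorers, mirroring the class defaults above; the auroc scorer needs probabilities and is therefore filtered out:

>>> from sklearn.metrics import make_scorer, accuracy_score, roc_auc_score
>>> scorers = {"accuracy": make_scorer(accuracy_score),
...            "auroc": make_scorer(roc_auc_score, needs_proba=True)}
>>> filtered = classifiers.remove_proba_metrics(scorers)  # "auroc" is dropped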
vulpes.automl.regressions module
regressions.py: Class Regressions to test many regression models
- class vulpes.automl.regressions.Regressions(*, models_to_try: str | List[Tuple[str, Any]] = 'all', custom_scorer: Dict[str, Any] = {'adj_r2': None, 'mae': make_scorer(mean_absolute_error, greater_is_better=False), 'mape': make_scorer(mean_absolute_percentage_error, greater_is_better=False), 'r2': make_scorer(r2_score), 'rmse': make_scorer(mean_squared_error, greater_is_better=False, squared=True)}, preprocessing: Pipeline | str = 'default', use_cross_validation: bool = True, cv: Any = 'default', test_size: float = 0.2, shuffle: bool = True, sort_result_by: str = 'MAPE', ascending: bool = True, save_results: bool = False, path_results: str = '', additional_model_params: Dict[str, Any] = {}, random_state: int | None = None, verbose: int = 0)[source]
Bases:
CoreVulpes
Object to train many regressions. All the parameters are optional and can be modified.
- Parameters:
models_to_try (Union[str, List[Tuple[str, Any]]], optional) – List of models to try. It can be either a string that corresponds to a predefined list of models (“all”, …) or a list of tuples (name of a model, class of a model), e.g. (“RandomForestRegressor”, sklearn.ensemble.RandomForestRegressor). Defaults to “all” (train all the available regression algorithms).
custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_REG.
preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.
use_cross_validation (bool, optional) – whether or not to apply cross-validation. Defaults to True.
cv (Any, optional) – cross-validation object. It can be a predefined cross-validation setting (“default”, “timeseries”, …), an iterable object, a cross-validation object from sklearn, etc. Defaults to “default”: it applies a StratifiedShuffleSplit if a groups object is given when applying the fit method, otherwise, it uses a RepeatedKFold. In both cases, n_splits is set to 5.
test_size (float, optional) – size of the test set when splitting. Defaults to 0.2.
shuffle (bool, optional) – whether or not the algorithm shuffles the samples when splitting the given dataset. Defaults to True.
sort_result_by (str, optional) – metric used to sort the final dataframe. Defaults to “MAPE”.
ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to True.
save_results (bool, optional) – if True, save the results in a CSV file. Defaults to False.
path_results (str, optional) – path to use when saving the results. Defaults to “”.
additional_model_params (Dict[str, Any], optional) – dictionary of parameters to be applied to each element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 for every preprocessing task and/or model that exposes this parameter. Defaults to {}.
random_state (int, optional) – random state seed, applied to every model and every element of the pipeline. Defaults to None.
verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.
Examples
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from vulpes.automl import Regressions
>>> X, y = make_regression(
...     n_samples=20, n_features=1, random_state=0, noise=4.0, bias=100.0)
>>> regressions = Regressions()
>>> df_models = regressions.fit(X, y)
>>> df_models
| Model                      | R2       | RMSE      | ... |
|----------------------------|----------|-----------|-----|
| LassoCV                    | 0.997649 | 19.818497 | ... |
| HuberRegressor             | 0.997776 | 19.881912 | ... |
| Lars                       | 0.997694 | 19.555598 | ... |
| TransformedTargetRegressor | 0.997748 | 19.298391 | ... |
| ...                        | ...      | ...       | ... |
- fit(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) DataFrame [source]
Fit many models
- Parameters:
X (Array_like) – Input dataset
y (Array_like) – Target variable
sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.
groups (Array_like, optional) – groups to use during cross validation to stratify. Defaults to None.
- Raises:
ValueError – wrong cross-validation type, or cross-validation failed
RuntimeError – Error when fitting a model
- Returns:
dataframe with the goodness-of-fit metrics evaluated for each model.
- Return type:
pd.DataFrame
Examples:
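A minimal sketch, reusing X and y from the class-level example above; the uniform weights are illustrative, since sample_weight is optional:

>>> import numpy as np
>>> regressions = Regressions()
>>> w = np.ones(len(y))  # illustrative: uniform sample weights
>>> df_models = regressions.fit(X, y, sample_weight=w)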