vulpes.automl package
Submodules
vulpes.automl.classifiers module
classifiers.py: Class Classifiers to test many classification models
- class vulpes.automl.classifiers.Classifiers(*, models_to_try: str | List[Tuple[str, Any]] = 'all', custom_scorer: Dict[str, Any] = {'accuracy': make_scorer(accuracy_score), 'auprc': make_scorer(pr_auc_score, needs_proba=True), 'auroc': make_scorer(roc_auc_score, needs_proba=True, multi_class=ovo, average=macro), 'avg_precision': make_scorer(avg_precision, needs_proba=True), 'balanced_accuracy': make_scorer(balanced_accuracy_score), 'f1': make_scorer(f1_score, average=macro), 'precision': make_scorer(precision_score, average=macro), 'recall': make_scorer(recall_score, average=macro)}, preprocessing: Pipeline | str = 'default', use_cross_validation: bool = True, cv: Any = 'default', test_size: float = 0.2, shuffle: bool = False, sort_result_by: str = 'Balanced Accuracy', ascending: bool = False, save_results: bool = False, path_results: str = '', additional_model_params: Dict[str, Any] = {}, random_state: int = 42, verbose: int = 0)[source]
Bases:
CoreVulpes
Object to train many classifiers. All the parameters are optional and can be modified.
- Parameters:
models_to_try (Union[str, List[Tuple[str, Any]]], optional) – List of models to try. It can be either a string that corresponds to a predefined list of models (“all”, …) or a list of tuples (name of a model, class of a model), e.g. (“RandomForestClassifier”, sklearn.ensemble.RandomForestClassifier). Defaults to “all” (train all the available classification algorithms).
custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_CLF.
preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.
use_cross_validation (bool, optional) – whether or not to apply cross-validation. Defaults to True.
cv (Any, optional) – cross-validation object. It can be a predefined cross-validation setting (“default”, “timeseries”, …), an iterable object, a cross-validation object from sklearn, etc. Defaults to “default”: it applies a StratifiedShuffleSplit if a groups object is given when applying the fit method, otherwise, it uses a RepeatedKFold. In both cases, n_splits is set to 5.
test_size (float, optional) – size of the test set when splitting. Defaults to 0.2.
shuffle (bool, optional) – whether or not the algorithm shuffles the samples when splitting the given dataset. Defaults to False.
sort_result_by (str, optional) – metric used to sort the final dataframe. Defaults to “Balanced Accuracy”.
ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to False.
save_results (bool, optional) – if True, save the results in a CSV file. Defaults to False.
path_results (str, optional) – path to use when saving the results. Defaults to “”.
additional_model_params (Dict[str, Any], optional) – dictionary of parameters to be applied to each element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 for every preprocessing task and/or model that exposes this parameter. Defaults to {}.
random_state (int, optional) – random state seed, applied to every model and every element of the pipeline. Defaults to 42.
verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.
Examples
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from vulpes.automl import Classifiers
>>> dataset = load_iris()
>>> X = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
>>> y = dataset["target"]
>>> classifiers = Classifiers()
>>> df_models = classifiers.fit(X, y)
>>> df_models
| Model                         | Balanced Accuracy | Accuracy | ... |
|-------------------------------|-------------------|----------|-----|
| LinearDiscriminantAnalysis    | 0.977625          | 0.977333 | ... |
| MLPClassifier                 | 0.973753          | 0.973333 | ... |
| QuadraticDiscriminantAnalysis | 0.973219          | 0.973333 | ... |
| KNeighborsClassifier          | 0.971702          | 0.969333 | ... |
| ...                           | ...               | ...      | ... |
- fit(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) DataFrame [source]
Fit many models
- Parameters:
X (Array_like) – Input dataset
y (Array_like) – Target variable
sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.
groups (Array_like, optional) – groups to use during cross validation to stratify. Defaults to None.
- Raises:
ValueError – wrong cross-validation type, or cross-validation failed
RuntimeError – Error when fitting a model
- Returns:
dataframe with the goodness-of-fit metrics evaluated for each model.
- Return type:
pd.DataFrame
Examples:
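A minimal sketch, reusing X and y from the class-level example above; the uniform weights are illustrative, since sample_weight is optional:

>>> import numpy as np
>>> classifiers = Classifiers()
>>> weights = np.ones(len(y))  # illustrative: uniform sample weights
>>> df_models = classifiers.fit(X, y, sample_weight=weights)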
vulpes.automl.clustering module
clustering.py: Class Clustering to test many clustering algorithms
- class vulpes.automl.clustering.Clustering(*, models_to_try: str | ~typing.List[~typing.Tuple[str, ~typing.Any]] = 'all', custom_scorer: ~typing.Dict[str, ~typing.Any] = {'calinski_harabasz': <function calinski_harabasz_score>, 'davies_bouldin': <function davies_bouldin_score>, 'silhouette': <function silhouette_score>}, preprocessing: ~sklearn.pipeline.Pipeline | str = 'default', sort_result_by: str = 'Davies–Bouldin Index', ascending: bool = True, save_results: bool = False, path_results: str = '', additional_model_params: ~typing.Dict[str, ~typing.Any] = {'eps': 0.5, 'min_samples': 5, 'nb_clusters': 3}, random_state: int | None = None, verbose: int = 0)[source]
Bases:
CoreVulpes
Object to train many clustering algorithms. All the parameters are optional and can be modified.
- Parameters:
models_to_try (Union[str, List[Tuple[str, Any]]], optional) – List of models to try. It can be either a string that corresponds to a predefined list of models (“all”, …) or a list of tuples (name of a model, class of a model), e.g. (“KMeans”, sklearn.cluster.KMeans). Defaults to “all” (train all the available clustering algorithms).
custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_CLT.
preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.
sort_result_by (str, optional) – metric used to sort the final dataframe. Defaults to “Davies–Bouldin Index”.
ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to True.
save_results (bool, optional) – if True, save the results in a CSV file. Defaults to False.
path_results (str, optional) – path to use when saving the results. Defaults to “”.
additional_model_params (Dict[str, Any], optional) – dictionary of parameters to be applied to each element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 for every preprocessing task and/or model that exposes this parameter. Defaults to {“nb_clusters”: 3, “min_samples”: 5, “eps”: 0.5}.
random_state (int, optional) – random state seed, applied to every model and every element of the pipeline. Defaults to None.
verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.
Examples
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from vulpes.automl import Clustering
>>> dataset = load_iris()
>>> X = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
>>> clustering = Clustering()
>>> df_models = clustering.fit(X)
>>> df_models
| Model                   | Calinski-Harabasz Index | ... |
|-------------------------|-------------------------|-----|
| AgglomerativeClustering | 502.821564              | ... |
| MeanShift               | 509.703427              | ... |
| Birch                   | 458.472511              | ... |
| SpectralClustering      | 410.369441              | ... |
| ...                     | ...                     | ... |
- fit(X: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None) DataFrame [source]
Fit many clustering algorithms
- Parameters:
X (Array_like) – Input dataset
sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.
- Raises:
RuntimeError – Error when fitting a model
- Returns:
dataframe with the goodness-of-fit metrics evaluated for each clustering algorithm.
- Return type:
pd.DataFrame
Examples:
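A minimal sketch, reusing X from the class-level example above; the uniform weights are illustrative, since sample_weight is optional:

>>> import numpy as np
>>> clustering = Clustering()
>>> w = np.ones(len(X))  # illustrative: uniform sample weights
>>> df_models = clustering.fit(X, sample_weight=w)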
vulpes.automl.corevulpes module
corevulpes.py: Parent class that contains common methods shared between children classes
- class vulpes.automl.corevulpes.CoreVulpes[source]
Bases:
ABC
Parent class with shared methods between the classes Classifiers, Regressions and Clustering
- build_best_models(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None, nb_models: int = 5, sort_result_by: str | None = None, ascending: bool | None = None, voting: str = 'hard', weights: List | DataFrame | Series | ndarray | Any | None = None) DataFrame [source]
When many models have been fitted, create an aggregated model using a voting system by selecting the nb_models best models based on the metric sort_result_by.
- Parameters:
X (Array_like) – dataset to fit the ‘best model’
y (Array_like) – response/outcome variable
sample_weight (Array_like, optional) – sample weight. Defaults to None.
groups (Array_like, optional) – groups to stratify during cross validation. Defaults to None.
nb_models (int, optional) – number of models to select when creating the aggregated model. Defaults to 5.
sort_result_by (str, optional) – metric used to rank the trained models and select the best ones. Defaults to None.
ascending (bool, optional) – if ascending=True, the lower the metric sort_result_by is, the better the model is. Defaults to None.
voting (str, optional) – “hard” or “soft”. If “soft”, use the predicted probabilities of the different estimators to make a prediction. Defaults to “hard”.
weights (Array_like, optional) – assign a different weight to each estimator. Defaults to None (equal weights).
- Raises:
ValueError – Voting Clustering not available
NotFittedError – Fit models before building a ‘best model’
ValueError – fewer fitted models than the parameter nb_models
NotImplementedError – Voting Clustering not available
NotImplementedError – Sample weight not available
ValueError – Wrong type of cross validation
RuntimeError – Error when fitting an estimator
- Returns:
Performance of the aggregated model on different metrics
- Return type:
pd.DataFrame
Examples:
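A hedged sketch, assuming X and y from the Classifiers example above; fit must be called before building the aggregated model:

>>> classifiers = Classifiers()
>>> df_models = classifiers.fit(X, y)
>>> df_best = classifiers.build_best_models(X, y, nb_models=5, voting="soft")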
- create_pipeline(model_name: str, model: Any) Pipeline [source]
Create a pipeline by combining an optional preprocessing pipeline and the given model
- Parameters:
model_name (str) – name of the model (lowercase)
model (Any) – Model at the end of the pipeline
- Returns:
Pipeline with a preprocessing task (if not set to None) and the given model
- Return type:
Pipeline
Examples:
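A sketch under the assumption that the model is passed as an instance; the lowercase model name follows the convention documented above:

>>> from sklearn.ensemble import RandomForestClassifier
>>> pipe = classifiers.create_pipeline("randomforestclassifier",
...                                    RandomForestClassifier())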
- get_best_model() Pipeline [source]
Get the model created and fitted by the method build_best_models
- Raises:
NotFittedError – Models not trained
NotFittedError – Best model not calculated
- Returns:
‘Best model’ using multiple fitted models
- Return type:
Pipeline
Examples:
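Illustrative usage, assuming X and y as above; build_best_models must have been called first:

>>> df_best = classifiers.build_best_models(X, y)
>>> best_pipe = classifiers.get_best_model()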
- get_fitted_models() Dict[str, Pipeline | List[Pipeline]] [source]
Get a dictionary with the fitted models
- Raises:
NotFittedError – Models have not been fitted yet
- Returns:
Dictionary with, for each model, either the fitted model or all the fitted models (when cross-validation is used).
- Return type:
Dict[str, Union[Pipeline, List[Pipeline]]]
Examples:
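Illustrative usage after fitting:

>>> df_models = classifiers.fit(X, y)
>>> fitted = classifiers.get_fitted_models()
>>> names = list(fitted)  # one key per fitted model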
- missing_data(X: List | DataFrame | Series | ndarray | Any) DataFrame [source]
Evaluate the absolute count and the percentage of missing data in a particular dataset
- Parameters:
X (Array_like) – Dataset
- Returns:
Absolute count and percentage of missing data in X
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([["a", "x"], [np.nan, "y"], ["a", np.nan], ["b", np.nan]],
...                   dtype="category", columns=["feature1", "feature2"])
>>> classifiers.missing_data(df)
|          | Total Missing | Percentage (%) |
|----------|--------------:|---------------:|
| feature2 |             2 |           50.0 |
| feature1 |             1 |           25.0 |
- predefined_cv(X: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) Any [source]
Convert a cross validation string (self.cv parameter) into a predefined cross validation object
- Parameters:
X (Array_like, optional) – if necessary, X is the dataset. Defaults to None.
groups (Array_like, optional) – if necessary, groups is an array-like object on which we stratify to create the different folds. Defaults to None.
- Raises:
ValueError – raised if the string doesn’t correspond to a predefined cross validation
- Returns:
Cross validation object
- Return type:
Any
Examples:
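A sketch, assuming X from the Classifiers example above; resolves the “default” string into a concrete cross-validation object:

>>> clf = Classifiers(cv="default")
>>> cv_obj = clf.predefined_cv(X)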
- predefined_list_models(models_to_try: str | List[Tuple[str, Any]] = 'all') List[Tuple[str, Any]] [source]
If models_to_try isn’t a list of models but is a string, return the corresponding predefined list of models to test
- Parameters:
models_to_try (Union[str, List[Tuple[str, Any]]], optional) – string identifying a predefined list of models, or a list of tuples (name of model, model). Defaults to “all”.
- Raises:
ValueError – raised if models_to_try is a string that doesn’t correspond to any predefined list of models
- Returns:
list of tuple (name of the model, model)
- Return type:
List[Tuple[str, Any]]
Examples:
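Illustrative usage with the documented “all” predefined list:

>>> models = classifiers.predefined_list_models("all")
>>> name, model = models[0]  # each entry is a (name, model) tuple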
- predefined_preprocessing() Pipeline [source]
Return a predefined preprocessing pipeline if self.preprocessing is a string. Otherwise, check that self.preprocessing is either a Pipeline or None (in that case, no preprocessing is applied).
- Raises:
ValueError – Unknown string
TypeError – self.preprocessing is not a string, not None, and not a Pipeline object
- Returns:
None, or the preprocessing Pipeline to apply to each model
- Return type:
Pipeline
Examples:
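Illustrative usage; resolves the preprocessing configured at construction time:

>>> clf = Classifiers(preprocessing="default")
>>> pipe = clf.predefined_preprocessing()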
- predict(X: List | DataFrame | Series | ndarray | Any, *, dataframe_format: bool = True) DataFrame | Dict[str, ndarray] [source]
Evaluate all the fitted models on the dataset X
- Parameters:
X (Array_like) – Array-like object on which we’ll make prediction(s)
dataframe_format (bool, optional) – if True, the result is a dataframe with the predictions of all the models. Defaults to True.
- Returns:
Dictionary or dataframe with the predictions
- Return type:
Union[pd.DataFrame, Dict[str, np.ndarray]]
Examples:
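Illustrative usage on an already fitted Classifiers object; X as above:

>>> df_preds = classifiers.predict(X, dataframe_format=True)
>>> dict_preds = classifiers.predict(X, dataframe_format=False)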
- predict_best(X: List | DataFrame | Series | ndarray | Any) ndarray [source]
Evaluate the fitted ‘best model’ on the array-like X
- Parameters:
X (Array_like) – dataset
- Returns:
array of predictions
- Return type:
np.ndarray
Examples:
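Illustrative usage; requires build_best_models to have been called first:

>>> y_pred = classifiers.predict_best(X)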
- predict_proba(X: List | DataFrame | Series | ndarray | Any) Dict[str, ndarray] [source]
Based on the fitted models, make many probability predictions on the dataset X
- Parameters:
X (Array_like) – Dataset
- Returns:
dictionary with, for each model, an array of the corresponding predicted probabilities
- Return type:
Dict[str, np.ndarray]
Examples:
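Illustrative usage on an already fitted Classifiers object; the keys are presumably the model names, per the return type above:

>>> probas = classifiers.predict_proba(X)  # dict: model name -> probability array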
- predict_proba_best(X: List | DataFrame | Series | ndarray | Any) ndarray [source]
Evaluate the fitted ‘best model’ on the array-like X and return probabilities
- Parameters:
X (Array_like) – dataset
- Returns:
predicted probabilities
- Return type:
np.ndarray
Examples:
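Illustrative usage; requires a fitted ‘best model’:

>>> proba_best = classifiers.predict_proba_best(X)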
- remove_proba_metrics(dic_scorer: Dict[str, _ProbaScorer | _PredictScorer]) Dict[str, _ProbaScorer | _PredictScorer] [source]
Take a dictionary of metrics used to evaluate the goodness-of-fit of classifiers as input. Return a new version of this dictionary that keeps only the metrics that don’t need probabilities to be computed (probability-based metrics such as AUROC are removed)
- Parameters:
dic_scorer (Dict[str, Union[_ProbaScorer, _PredictScorer]]) – dictionary of metrics
- Returns:
filtered dictionary of metrics
- Return type:
Dict[str, Union[_ProbaScorer, _PredictScorer]]
Examples:
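A sketch using sklearn scorers, mirroring the class defaults above; the auroc scorer needs probabilities and is therefore filtered out:

>>> from sklearn.metrics import make_scorer, accuracy_score, roc_auc_score
>>> scorers = {"accuracy": make_scorer(accuracy_score),
...            "auroc": make_scorer(roc_auc_score, needs_proba=True)}
>>> filtered = classifiers.remove_proba_metrics(scorers)  # "auroc" is dropped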
vulpes.automl.regressions module
regressions.py: Class Regressions to test many regression models
- class vulpes.automl.regressions.Regressions(*, models_to_try: str | List[Tuple[str, Any]] = 'all', custom_scorer: Dict[str, Any] = {'adj_r2': None, 'mae': make_scorer(mean_absolute_error, greater_is_better=False), 'mape': make_scorer(mean_absolute_percentage_error, greater_is_better=False), 'r2': make_scorer(r2_score), 'rmse': make_scorer(mean_squared_error, greater_is_better=False, squared=True)}, preprocessing: Pipeline | str = 'default', use_cross_validation: bool = True, cv: Any = 'default', test_size: float = 0.2, shuffle: bool = True, sort_result_by: str = 'MAPE', ascending: bool = True, save_results: bool = False, path_results: str = '', additional_model_params: Dict[str, Any] = {}, random_state: int | None = None, verbose: int = 0)[source]
Bases:
CoreVulpes
Object to train many regressions. All the parameters are optional and can be modified.
- Parameters:
models_to_try (Union[str, List[Tuple[str, Any]]], optional) – List of models to try. It can be either a string that corresponds to a predefined list of models (“all”, …) or a list of tuples (name of a model, class of a model), e.g. (“RandomForestRegressor”, sklearn.ensemble.RandomForestRegressor). Defaults to “all” (train all the available regression algorithms).
custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_REG.
preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.
use_cross_validation (bool, optional) – whether or not to apply cross-validation. Defaults to True.
cv (Any, optional) – cross-validation object. It can be a predefined cross-validation setting (“default”, “timeseries”, …), an iterable object, a cross-validation object from sklearn, etc. Defaults to “default”: it applies a StratifiedShuffleSplit if a groups object is given when applying the fit method, otherwise, it uses a RepeatedKFold. In both cases, n_splits is set to 5.
test_size (float, optional) – size of the test set when splitting. Defaults to 0.2.
shuffle (bool, optional) – whether or not the algorithm shuffles the samples when splitting the given dataset. Defaults to True.
sort_result_by (str, optional) – metric used to sort the final dataframe. Defaults to “MAPE”.
ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to True.
save_results (bool, optional) – if True, save the results in a CSV file. Defaults to False.
path_results (str, optional) – path to use when saving the results. Defaults to “”.
additional_model_params (Dict[str, Any], optional) – dictionary of parameters to be applied to each element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 for every preprocessing task and/or model that exposes this parameter. Defaults to {}.
random_state (int, optional) – random state seed, applied to every model and every element of the pipeline. Defaults to None.
verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.
Examples
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from vulpes.automl import Regressions
>>> X, y = make_regression(
...     n_samples=20, n_features=1, random_state=0, noise=4.0, bias=100.0)
>>> regressions = Regressions()
>>> df_models = regressions.fit(X, y)
>>> df_models
| Model                      | R2       | RMSE      | ... |
|----------------------------|----------|-----------|-----|
| LassoCV                    | 0.997649 | 19.818497 | ... |
| HuberRegressor             | 0.997776 | 19.881912 | ... |
| Lars                       | 0.997694 | 19.555598 | ... |
| TransformedTargetRegressor | 0.997748 | 19.298391 | ... |
| ...                        | ...      | ...       | ... |
- fit(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) DataFrame [source]
Fit many models
- Parameters:
X (Array_like) – Input dataset
y (Array_like) – Target variable
sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.
groups (Array_like, optional) – groups to use during cross validation to stratify. Defaults to None.
- Raises:
ValueError – wrong cross-validation type, or cross-validation failed
RuntimeError – Error when fitting a model
- Returns:
dataframe with the goodness-of-fit metrics evaluated for each model.
- Return type:
pd.DataFrame
Examples:
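A minimal sketch, reusing X and y from the class-level example above; the uniform weights are illustrative, since sample_weight is optional:

>>> import numpy as np
>>> regressions = Regressions()
>>> w = np.ones(len(y))  # illustrative: uniform sample weights
>>> df_models = regressions.fit(X, y, sample_weight=w)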