Classifiers
classifiers.py: Class Classifiers to test many classification models
- class vulpes.automl.classifiers.Classifiers(*, models_to_try: str | List[Tuple[str, Any]] = 'all', custom_scorer: Dict[str, Any] = {'accuracy': make_scorer(accuracy_score), 'auprc': make_scorer(pr_auc_score, needs_proba=True), 'auroc': make_scorer(roc_auc_score, needs_proba=True, multi_class=ovo, average=macro), 'avg_precision': make_scorer(avg_precision, needs_proba=True), 'balanced_accuracy': make_scorer(balanced_accuracy_score), 'f1': make_scorer(f1_score, average=macro), 'precision': make_scorer(precision_score, average=macro), 'recall': make_scorer(recall_score, average=macro)}, preprocessing: Pipeline | str = 'default', use_cross_validation: bool = True, cv: Any = 'default', test_size: float = 0.2, shuffle: bool = False, sort_result_by: str = 'Balanced Accuracy', ascending: bool = False, save_results: bool = False, path_results: str = '', additional_model_params: Dict[str, Any] = {}, random_state: int = 42, verbose: int = 0)[source]
Bases: CoreVulpes
Object to train many classifiers. All the parameters are optional and can be modified.
- Parameters:
models_to_try (Union[str, List[Tuple[str, Any]]], optional) – list of models to try. It can be either a string that names a predefined list of models (“all”, …) or a list of tuples (model name, model class), e.g. (“RandomForestClassifier”, sklearn.ensemble.RandomForestClassifier). Defaults to “all” (train all the available classification algorithms).
custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary with pairs name:scorer where the scorer is created using the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_CLF.
preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.
use_cross_validation (bool, optional) – whether to apply cross-validation. Defaults to True.
cv (Any, optional) – cross-validation object. It can be a predefined cross-validation setting (“default”, “timeseries”, …), an iterable object, a cross-validation object from sklearn, etc. Defaults to “default”: it applies a StratifiedShuffleSplit if a groups object is given when applying the fit method, otherwise, it uses a RepeatedKFold. In both cases, n_splits is set to 5.
test_size (float, optional) – size of the test set when splitting. Defaults to 0.2.
shuffle (bool, optional) – whether to shuffle the samples when splitting the given dataset. Defaults to False.
sort_result_by (str, optional) – metric by which to sort the final dataframe. Defaults to “Balanced Accuracy”.
ascending (bool, optional) – whether to sort the final dataframe in ascending order. Defaults to False.
save_results (bool, optional) – if True, save the results in a csv file. Defaults to False.
path_results (str, optional) – path to use when saving the results. Defaults to “”.
additional_model_params (Dict[str, Any], optional) – dictionary of parameters to apply to each element of the pipeline. E.g. {“n_estimators”: 100} sets n_estimators to 100 on every preprocessing step and/or model that has an n_estimators parameter. Defaults to {}.
random_state (int, optional) – random state, applied to every model and every element of the pipeline. Defaults to 42.
verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.
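The “default” preprocessing described above can be approximated with plain scikit-learn. The following is a minimal sketch based only on that description, not the library's actual implementation; the step names, the toy dataframe and the handle_unknown and sparse_threshold settings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numerical features: median imputation followed by standard scaling.
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Route columns by dtype: "category"/object columns are one-hot encoded,
# numerical columns go through the imputer and the scaler.
column_transformer = ColumnTransformer(
    [
        ("categorical", OneHotEncoder(handle_unknown="ignore"),  # assumption
         make_column_selector(dtype_include=["category", "object"])),
        ("numerical", numeric_steps,
         make_column_selector(dtype_include=np.number)),
    ],
    sparse_threshold=0.0,  # assumption: always return a dense array
)

# Wrapped in a Pipeline, since the parameter is documented as a Pipeline object.
default_like_preprocessing = Pipeline([("preprocessing", column_transformer)])

# Hypothetical toy data with one object, one category and one numerical column.
X_demo = pd.DataFrame({
    "colour": pd.Series(["red", "blue", "red", "green"], dtype="object"),
    "size": pd.Categorical(["S", "M", "M", "L"]),
    "weight": [1.0, np.nan, 3.0, 5.0],
})

X_transformed = default_like_preprocessing.fit_transform(X_demo)
print(X_transformed.shape)
```

According to the parameter description, a Pipeline of this kind can be passed as preprocessing=default_like_preprocessing in place of a predefined name.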
Examples
>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from vulpes.automl import Classifiers
>>> dataset = load_iris()
>>> X = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
>>> y = dataset["target"]
>>> classifiers = Classifiers()
>>> df_models = classifiers.fit(X, y)
>>> df_models
| Model                         | Balanced Accuracy | Accuracy | ... |
|-------------------------------|-------------------|----------|-----|
| LinearDiscriminantAnalysis    | 0.977625          | 0.977333 | ... |
| MLPClassifier                 | 0.973753          | 0.973333 | ... |
| QuadraticDiscriminantAnalysis | 0.973219          | 0.973333 | ... |
| KNeighborsClassifier          | 0.971702          | 0.969333 | ... |
| ...                           | ...               | ...      | ... |
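The models_to_try and custom_scorer parameters take a list of (name, class) tuples and a dictionary of make_scorer objects. The sketch below builds both with scikit-learn and evaluates them by hand on a single hold-out split, only to show the shape of these objects. The names my_models, my_scorers and scores are hypothetical, and the manual loop is a stand-in for illustration, not how Classifiers evaluates models internally.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score, make_scorer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# List of (model name, model class) tuples, as described for models_to_try.
my_models = [
    ("RandomForestClassifier", RandomForestClassifier),
    ("KNeighborsClassifier", KNeighborsClassifier),
]

# Dictionary of name: scorer pairs, as described for custom_scorer.
my_scorers = {
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
    "f1": make_scorer(f1_score, average="macro"),
}

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Manual evaluation (illustration only): fit each model with its default
# parameters and apply every scorer to the held-out split.
scores = {}
for name, model_class in my_models:
    model = model_class().fit(X_train, y_train)
    scores[name] = {
        metric: scorer(model, X_test, y_test)
        for metric, scorer in my_scorers.items()
    }

for name, metrics in scores.items():
    print(name, metrics)
```

Going by the signature above, the same two objects would be passed as Classifiers(models_to_try=my_models, custom_scorer=my_scorers).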
- fit(X: List | DataFrame | Series | ndarray | Any, y: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None, groups: List | DataFrame | Series | ndarray | Any | None = None) DataFrame [source]
Fit many models
- Parameters:
X (Array_like) – Input dataset
y (Array_like) – Target variable
sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.
groups (Array_like, optional) – groups to use during cross validation to stratify. Defaults to None.
- Raises:
ValueError – if the cross-validation object has the wrong type, or if cross-validation failed
RuntimeError – Error when fitting a model
- Returns:
dataframe with the goodness-of-fit metrics evaluated for each model.
- Return type:
pd.DataFrame
Examples: