Clustering

clustering.py: Class Clustering to test many clustering algorithms

class vulpes.automl.clustering.Clustering(*, models_to_try: str | ~typing.List[~typing.Tuple[str, ~typing.Any]] = 'all', custom_scorer: ~typing.Dict[str, ~typing.Any] = {'calinski_harabasz': <function calinski_harabasz_score>, 'davies_bouldin': <function davies_bouldin_score>, 'silhouette': <function silhouette_score>}, preprocessing: ~sklearn.pipeline.Pipeline | str = 'default', sort_result_by: str = 'Davies–Bouldin Index', ascending: bool = True, save_results: bool = False, path_results: str = '', additional_model_params: ~typing.Dict[str, ~typing.Any] = {'eps': 0.5, 'min_samples': 5, 'nb_clusters': 3}, random_state: int | None = None, verbose: int = 0)[source]

Bases: CoreVulpes

Object to train many clustering algorithms. All the parameters are optional and can be modified.

Parameters:
  • models_to_try (Union[str, List[Tuple[str, Any]]], optional) –

    List of models to try. It can be either a string that corresponds to a predefined list of models (“all”, …), or a list of tuples (name of a model, class of a model), e.g. (“KMeans”, sklearn.cluster.KMeans).

    Defaults to “all” (train all the available clustering algorithms).

  • custom_scorer (Dict[str, Any], optional) – metrics to calculate after fitting a model. Dictionary of name:scorer pairs, where each scorer is created with the function make_scorer from sklearn. Defaults to CUSTOM_SCORER_CLT.

  • preprocessing (Union[Pipeline, str], optional) – preprocessing pipeline to use. It can be None (no preprocessing), a predefined preprocessing pipeline (“default”, …) or a Pipeline object from sklearn. Defaults to “default”: it applies a OneHotEncoder to “category” and object features, and it applies a SimpleImputer (median strategy) and a StandardScaler to numerical features.

  • sort_result_by (str, optional) – metric used to sort the final dataframe. Defaults to “Davies–Bouldin Index”.

  • ascending (bool, optional) – sort the final dataframe in ascending order. Defaults to True.

  • save_results (bool, optional) – if True, save the results to a CSV file. Defaults to False.

  • path_results (str, optional) – path to use when saving the results. Defaults to “”.

  • additional_model_params (Dict[str, Any], optional) – dictionary of parameters to apply to every element of the pipeline that accepts them. E.g. {“n_estimators”: 100} sets n_estimators on every preprocessing task and/or model that has that parameter. Defaults to {“nb_clusters”: 3, “min_samples”: 5, “eps”: 0.5}.

  • random_state (int, optional) – random state variable, applied to every model and element of the pipeline. Defaults to None.

  • verbose (int, optional) – if greater than 1, print the warnings. Defaults to 0.
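The three default entries in custom_scorer correspond to sklearn's built-in clustering metrics. As a rough sketch of what each scorer computes, they can be applied by hand to the labels of any fitted clusterer (the model and data below are illustrative, not part of vulpes):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# The three default scorers, computed directly on the cluster labels:
scores = {
    "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    "silhouette": silhouette_score(X, labels),                # in [-1, 1], higher is better
}
```

Note that Davies–Bouldin is the one metric where lower values indicate better clustering, which is why the class defaults to sorting by it in ascending order.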

Examples

>>> import pandas as pd
>>> from sklearn.datasets import load_iris
>>> from vulpes.automl import Clustering
>>> dataset = load_iris()
>>> X = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
>>> clustering = Clustering()
>>> df_models = clustering.fit(X)
>>> df_models
| Model                   | Calinski-Harabasz Index | ...
|-------------------------|-------------------------|----
| AgglomerativeClustering | 502.821564              | ...
| MeanShift               | 509.703427              | ...
| Birch                   | 458.472511              | ...
| SpectralClustering      | 410.369441              | ...
| ...                     | ...                     | ...
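Conceptually, fit trains each candidate model and collects the goodness-of-fit metrics into one sorted dataframe. A hypothetical, simplified version of that loop written with plain sklearn and pandas (vulpes itself is not needed here, and the real implementation differs):

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score, silhouette_score

X = load_iris().data

# Fit each candidate model and score its labels:
rows = []
for name, model in [
    ("KMeans", KMeans(n_clusters=3, n_init=10, random_state=42)),
    ("AgglomerativeClustering", AgglomerativeClustering(n_clusters=3)),
    ("Birch", Birch(n_clusters=3)),
]:
    labels = model.fit_predict(X)
    rows.append({
        "Model": name,
        "Davies-Bouldin Index": davies_bouldin_score(X, labels),
        "Silhouette": silhouette_score(X, labels),
    })

# Mirror the default sort_result_by="Davies–Bouldin Index", ascending=True:
df_models = pd.DataFrame(rows).sort_values("Davies-Bouldin Index",
                                           ascending=True)
```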
fit(X: List | DataFrame | Series | ndarray | Any, *, sample_weight: List | DataFrame | Series | ndarray | Any | None = None) DataFrame[source]

Fit many clustering algorithms.

Parameters:
  • X (Array_like) – Input dataset

  • sample_weight (Array_like, optional) – sample weights to apply when fitting. Defaults to None.

Raises:

RuntimeError – Error when fitting a model

Returns:

dataframe with the goodness-of-fit metrics evaluated for each clustering algorithm.

Return type:

pd.DataFrame

Examples:
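Assuming sample_weight is forwarded to each estimator whose fit accepts it (an assumption about the implementation, not confirmed by this page), the underlying mechanics can be sketched with sklearn's KMeans, which supports per-sample weights directly:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
rng = np.random.default_rng(0)
weights = rng.uniform(0.5, 1.5, size=len(X))  # one weight per sample

# sample_weight biases the centroid computation toward heavier samples:
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X, sample_weight=weights)
```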