Main Parameters
The main class is CLASSIX, which includes both clustering and visualization.
- class classix.clustering.CLASSIX(sorting='pca', radius=0.5, minPts=0, group_merging='distance', norm=True, scale=1.5, post_alloc=True, verbose=1)[source]
CLASSIX: Fast and explainable clustering based on sorting.
In most cases, the user only needs to set the hyperparameters sorting, radius, and minPts. For more flexible clustering, consider the other hyperparameters such as group_merging, scale, and post_alloc.
Parameters
- sorting: str, {'pca', 'norm-mean', 'norm-orthant', None}, default='pca'
Sorting method used for the aggregation phase.
'pca': sort data points by their first principal component
'norm-mean': shift data to have zero mean and then sort by 2-norm values
'norm-orthant': shift data to the positive orthant and then sort by 2-norm values
None: aggregate the raw data without any sorting
- radius: float, default=0.5
Tolerance that controls the aggregation. If the distance between a starting point and an object is less than or equal to the tolerance, the object is allocated to the group that the starting point belongs to. For details, see [1].
- group_merging: str, {'density', 'distance', None}, default='distance'
The method for merging the groups.
- 'density': two groups are merged if the density of data points in their intersection is at least as high as the smaller of the two groups' densities. This option uses a disjoint-set structure to speed up agglomeration.
- 'distance': two groups are merged if the distance between their starting points is at most scale*radius (the parameter above). This option uses a disjoint-set structure to speed up agglomeration.
For more details, see [1]. If group_merging is set to None, the clustering returns the group labels formed by aggregation as the cluster labels.
- minPts: int, default=0
Clusters with fewer than minPts points are classified as abnormal clusters. The data points in an abnormal cluster are redistributed to the nearest normal cluster. When set to 0, no redistribution is performed.
- norm: boolean, default=True
Whether to normalize the data according to the chosen sorting.
- scale: float, default=1.5
Used for distance-based merging: when the distance between the starting points of two distinct groups is smaller than scale*radius, the two groups are merged.
- post_alloc: boolean, default=True
Whether to allocate the outliers to the closest groups, and hence to the corresponding clusters. If False, all outliers are labeled -1.
- verbose: boolean or int, default=1
Whether to print the logs.
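The roles of sorting and radius in the aggregation phase can be illustrated with a small pure-Python sketch. This is a simplified illustration, not the library's implementation: the function name `aggregate` is hypothetical, it uses only the 'norm-mean' style of sorting, it skips the normalization controlled by norm, and it omits the early-termination tricks a fast implementation would use.

```python
import math

def aggregate(points, radius=0.5):
    """Greedy aggregation sketch: returns (group_labels, starting_point_indices)."""
    n = len(points)
    dim = len(points[0])
    # 'norm-mean' style sorting: shift the data to zero mean, then sort by 2-norm.
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centred = [[p[d] - mean[d] for d in range(dim)] for p in points]
    order = sorted(range(n), key=lambda i: math.hypot(*centred[i]))

    labels = [-1] * n          # -1 means "not yet assigned to a group"
    starting_points = []       # index of the starting point of each group
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue           # already absorbed by an earlier starting point
        group = len(starting_points)
        starting_points.append(i)
        labels[i] = group
        # Assign every unassigned point within `radius` of the starting point.
        for j in order[pos + 1:]:
            if labels[j] == -1 and math.dist(centred[i], centred[j]) <= radius:
                labels[j] = group
    return labels, starting_points

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.2)]
group_labels, starting_points = aggregate(points, radius=0.5)
print(group_labels, starting_points)
```

A smaller radius produces more, tighter groups; a larger radius produces fewer groups before any merging takes place.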
Attributes
- groups_: numpy.ndarray
Group labels from the aggregation.
- splist_: numpy.ndarray
List of starting points formed in the aggregation.
- labels_: numpy.ndarray
Cluster labels of the data objects.
- group_outliers_: numpy.ndarray
Indices of outliers (at the aggregation-group level), i.e., indices of abnormal groups within the clusters with fewer data points than minPts.
- clean_index_: numpy.ndarray
Index of the data without outliers. Given data X, the data without outliers can be exported by X_clean = X[classix.clean_index_,:] while the outliers can be exported by Outliers = X[~classix.clean_index_,:]
- connected_pairs_: list
List of connected group labels.
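The relationship between minPts, the outlier label -1, and a boolean mask such as clean_index_ can be sketched in plain Python. The helper name `flag_small_clusters` is hypothetical, and the sketch mirrors only the post_alloc=False behaviour described above (points in clusters with fewer than minPts points are labeled -1 rather than redistributed); it is not the library's code.

```python
from collections import Counter

def flag_small_clusters(labels, minPts):
    """Return (clean_mask, relabelled) for a list of cluster labels.

    clean_mask[i] is True when point i sits in a cluster with at least
    minPts points; relabelled marks all other points with -1.
    """
    counts = Counter(labels)
    clean_mask = [counts[lab] >= minPts for lab in labels]
    relabelled = [lab if keep else -1 for lab, keep in zip(labels, clean_mask)]
    return clean_mask, relabelled

labels = [0, 0, 0, 1, 2, 2]
X = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (9.0, 9.0), (4.0, 4.1), (4.1, 4.0)]

clean_mask, relabelled = flag_small_clusters(labels, minPts=2)
# Plain-Python analogue of X[clean_index_, :] and X[~clean_index_, :].
X_clean = [x for x, keep in zip(X, clean_mask) if keep]
outliers = [x for x, keep in zip(X, clean_mask) if not keep]
print(relabelled, outliers)
```

With minPts=0 every point passes the check, which matches the statement that no redistribution is performed in that case.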
Methods:
- fit(data):
Cluster the data and save the parameters of the model. The labels can be extracted via self.labels_.
- fit_transform(data):
Cluster the data and return the labels. The labels can also be extracted via self.labels_.
- predict(data):
After clustering the in-sample data, predict labels for out-of-sample data. Each data point is allocated to the cluster of the nearest starting point from the aggregation stage.
- explain(index1, index2):
Explain the computed clustering. The optional parameters index1 and index2 (int) are the indices of the data points to explain.
References
[1] X. Chen and S. Güttel. Fast and explainable clustering based on sorting, 2022.
- clustering(data, agg_labels, splist, sorting='pca', radius=0.5, method='distance', minPts=0)[source]
Merge groups after aggregation.
Parameters
- data: numpy.ndarray
The array-like input of shape (n_samples,).
- agg_labels: numpy.ndarray
Group labels from the aggregation.
- splist: numpy.ndarray
List of starting points formed in the aggregation.
- sorting: str
The sorting method used for aggregation, default='pca'; other options: 'norm-mean', 'norm-orthant', 'z-pca', or None.
- radius: float, default=0.5
Tolerance that controls the aggregation and hence the whole clustering process. During aggregation, if the distance between a starting point and an object is less than or equal to the tolerance, the object is allocated to the group that the starting point belongs to.
- method: str
The method for merging groups, default='distance'; other options: 'density', 'mst-distance', and 'scc-distance'.
- minPts: int, default=0
The threshold, in the range [0, infinity), that determines the noise level. When set to 0, the algorithm does not check for noise.
Returns
- centers: numpy.ndarray
The centers of the clusters.
- clabels: numpy.ndarray
The cluster labels of the data.
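The 'distance' merging rule (merge two groups when their starting points are at most scale*radius apart, using a disjoint-set structure) can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the function name `merge_groups_by_distance` is hypothetical, starting points are passed as coordinate tuples, and all pairs are compared, which a fast implementation would avoid.

```python
import math

def merge_groups_by_distance(starting_points, radius=0.5, scale=1.5):
    """Return one cluster id per group, merging groups whose starting
    points are at most scale * radius apart (transitively)."""
    parent = list(range(len(starting_points)))

    def find(i):
        # Find the root of i, shortening the path as we go (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i in range(len(starting_points)):
        for j in range(i + 1, len(starting_points)):
            if math.dist(starting_points[i], starting_points[j]) <= scale * radius:
                union(i, j)

    # Relabel the roots as consecutive cluster ids 0, 1, 2, ...
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(starting_points))]

sps = [(0.0, 0.0), (0.6, 0.0), (1.2, 0.0), (5.0, 5.0)]
cluster_of_group = merge_groups_by_distance(sps, radius=0.5, scale=1.5)

# Map per-point group labels (as produced by aggregation) to cluster labels.
agg_labels = [0, 0, 1, 2, 2, 3]
clabels = [cluster_of_group[g] for g in agg_labels]
print(cluster_of_group, clabels)
```

Groups 0 and 2 are farther apart than scale*radius = 0.75, but they still end up in one cluster because both are connected to group 1; this chaining is what the disjoint-set structure captures.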
- explain(index1=None, index2=None, showsplist=True, max_colwidth=None, replace_name=None, plot=False, figsize=(11, 6), figstyle='ggplot', savefig=False, ind_color='k', ind_marker_size=150, sp_fcolor='tomato', sp_alpha=0.05, sp_pad=0.5, sp_fontsize=None, sp_bbox=None, dp_fcolor='bisque', dp_alpha=0.6, dp_pad=2, dp_fontsize=None, dp_bbox=None, cmap='turbo', cmin=0.07, cmax=0.97, color='red', connect_color='green', alpha=0.5, cline_width=0.5, axis='off', figname=None, fmt='pdf', *argv, **kwargs)[source]
self.explain(object/index) # prints an explanation of why a data point is in its cluster (or is an outlier)
self.explain(object1/index1, object2/index2) # prints an explanation of why object1 and object2 are either in the same cluster or in distinct clusters
Here we unify the terminology:
- data points
- groups (made up of data points, formed by aggregation)
- clusters (made up of groups)
Parameters
- index1: int or numpy.ndarray, optional
Input object1 (with index 'index1') for explanation.
- index2: int or numpy.ndarray, optional
Input object2 (with index 'index2') for explanation; when given, object1 and object2 are compared.
- showsplist: boolean, default=True
Whether to show the starting point information, which includes the number of data points (NumPts), the corresponding clusters, and the associated coordinates. This only applies when both index1 and index2 are None.
- max_colwidth: int, optional
Maximum width, in characters, at which each column is truncated. By default, no limit.
- replace_name: str or list, optional
Replace the indices with names.
* For example, for indices 1 and 1300 we have
classix.explain(1, 1300, plot=False, figstyle="seaborn") # or classix.explain(obj1, obj4)
The data point 1 is in group 9 and the data point 1300 is in group 8, both of which were merged into cluster #0. These two groups are connected via groups 9 -> 2 -> 8.
* If we specify replace_name, then the output will be
classix.explain(1, 1300, replace_name=["Peter Meyer", "Anna Fields"], figstyle="seaborn")
The data point Peter Meyer is in group 9 and the data point Anna Fields is in group 8, both of which were merged into cluster #0. These two groups are connected via groups 9 -> 2 -> 8.
- plot: boolean, default=False
Whether to visualize the explanation.
- figsize: tuple, default=(11, 6)
The size of the visualization figure.
- figstyle: str, default="ggplot"
The style of the visualization; see https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html
- savefig: boolean, default=False
Whether to save the figure. The figure is saved in the folder named "img".
- ind_color: str, default='k'
Color used to visualize the data points with indices index1 and index2.
- ind_marker_size: float, default=150
Marker size used to visualize the data points with indices index1 and index2.
- sp_fcolor: str, default='tomato'
The color of the text box for starting points.
- sp_alpha: float, default=0.05
The transparency of the text box for starting points.
- sp_pad: float, default=0.5
The size of the text box for starting points.
- sp_bbox: dict, optional
Dict with properties for patches.FancyBboxPatch for starting points.
- sp_fontsize: int, optional
The font size of the text for starting points.
- dp_fcolor: str, default='bisque'
The color of the text box for the specified data objects.
- dp_alpha: float, default=0.6
The transparency of the text box for the specified data objects.
- dp_pad: int, default=2
The size of the text box for the specified data objects.
- dp_fontsize: int, optional
The font size of the text for the specified data objects.
- dp_bbox: dict, optional
Dict with properties for patches.FancyBboxPatch for the specified data objects.
- cmap: str, default='turbo'
The colormap to be employed.
- cmin: int or float, default=0.07
The minimum of the color range.
- cmax: int or float, default=0.97
The maximum of the color range.
- color: str, default='red'
Color of the starting point labels in the visualization.
- alpha: float, default=0.5
Scalar or None.
- cline_width: float, default=0.5
The patch line width of the circles around starting points.
- figname: str, optional
The figure name for the image to be saved.
- fmt: str, default='pdf'
The format of the image to be saved; the other choice is 'png'.
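The explanation text above reports a chain such as "9 -> 2 -> 8" linking two groups in the same cluster. One way to obtain such a chain from a list of connected group pairs (as stored in connected_pairs_) is a breadth-first search. The sketch below is an assumption about how such a path could be computed, not a description of the library's internals; the function name `group_path` and the sample pairs are hypothetical.

```python
from collections import deque

def group_path(connected_pairs, start, goal):
    """Shortest chain of groups from start to goal, or None if not connected."""
    # Build an undirected adjacency map from the list of connected pairs.
    adjacency = {}
    for a, b in connected_pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    previous = {start: None}   # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through `previous` to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None

pairs = [(9, 2), (2, 8), (8, 5)]
path = group_path(pairs, 9, 8)
chain = " -> ".join(str(g) for g in path)
print(chain)
```

If no chain exists, the two groups belong to different clusters, which corresponds to the "distinct clusters" branch of the explanation.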
- explain_viz(figsize=(12, 8), figstyle='ggplot', savefig=False, fontsize=None, bbox={'alpha': 0.3, 'facecolor': 'tomato', 'pad': 2}, axis='off', fmt='pdf')[source]
Visualize the starting points and data points.
- fit(data)[source]
Cluster the data and return the associated cluster labels.
Parameters
- data: numpy.ndarray
The ndarray-like input of shape (n_samples,)
- fit_transform(data)[source]
Cluster the data and return the associated cluster labels.
Parameters
- data: numpy.ndarray
The ndarray-like input of shape (n_samples,)
Returns
- labels: numpy.ndarray
Index of the cluster each sample belongs to.
- form_starting_point_clusters_table(aggregate=False)[source]
Form the column details for the starting point and cluster information.
- outlier_filter(labels, min_samples=None, min_samples_rate=0.1)[source]
Filter outliers in terms of min_samples
- predict(data)[source]
Allocate the data to their nearest clusters.
Parameters
- data: numpy.ndarray
The ndarray-like input of shape (n_samples,)
Returns
- labels: numpy.ndarray
The predicted clustering labels.
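The allocation rule for out-of-sample data (assign each new point to the cluster of its nearest starting point) can be sketched in a few lines. This is a minimal illustration, assuming starting points are available as coordinate tuples and that a group-to-cluster mapping is known; the names `predict_labels` and `cluster_of_group` are hypothetical, and any normalization applied during fitting is ignored here.

```python
import math

def predict_labels(new_points, starting_points, cluster_of_group):
    """Assign each new point the cluster of its nearest starting point."""
    labels = []
    for p in new_points:
        # Index of the group whose starting point is closest to p.
        nearest = min(range(len(starting_points)),
                      key=lambda g: math.dist(p, starting_points[g]))
        labels.append(cluster_of_group[nearest])
    return labels

starting_points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
cluster_of_group = [0, 0, 1]   # groups 0 and 1 were merged into cluster 0
predicted = predict_labels([(0.9, 0.1), (4.0, 4.0)], starting_points, cluster_of_group)
print(predicted)
```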
- visualize_linkage(scale=1.5, figsize=(8, 8), labelsize=24, markersize=320, plot_boundary=False, bound_color='red', path='.', fmt='pdf')[source]
Visualize the linkage in the distance clustering.
Parameters
- scale: float
Used for distance-based merging: when the distance between the starting points of two distinct groups is smaller than scale*radius, the two groups are merged.
- labelsize: int
The font size of the ticks.
- markersize: int
The size of the markers for starting points.
- plot_boundary: boolean
If True, plot the group boundaries around the starting points.
- bound_color: str
The color of the group boundaries drawn with the specified radius.
- path: str
Relative file location for figure storage.
- fmt: str
The format of the image to be saved, default 'pdf'; the other choice is 'png'.