classix.CLASSIX
- class classix.CLASSIX(sorting='pca', radius=0.5, minPts=1, group_merging='distance', norm=True, mergeScale=1.5, post_alloc=True, mergeTinyGroups=True, memory=True, verbose=1, short_log_form=True)
CLASSIX: Fast and explainable clustering based on sorting.
The main parameters are radius and minPts.
Parameters
- sorting : str, {‘pca’, ‘norm-mean’, ‘norm-orthant’, None}, default=‘pca’
Sorting method used for the aggregation phase.
‘pca’: sort data points by their first principal component
‘norm-mean’: shift data to have zero mean and then sort by 2-norm values
‘norm-orthant’: shift data to positive orthant and then sort by 2-norm values
None: aggregate the raw data without any sorting
- radius : float, default=0.5
Tolerance that controls the aggregation. If the distance between a group center and an object is less than or equal to the tolerance, the object is allocated to the group of that group center. For details, we refer to [1].
- group_merging : str, {‘density’, ‘distance’, None}, default=‘distance’
The method for the merging of groups.
- ‘density’: two groups are merged if the density of data points in their intersection is at least as high as the smaller of the two groups' densities. This option uses the disjoint set structure to speed up merging.
- ‘distance’: two groups are merged if the distance between their group centers is at most mergeScale*radius (with radius the parameter above). This option uses the disjoint set structure to speed up merging.
For more details, we refer to [1]. If group_merging is set to None, the method will return the labels formed by aggregation as cluster labels.
- minPts : int, default=1
Clusters with fewer than minPts points are classified as abnormal clusters. The data points in an abnormal cluster will be redistributed to the nearest normal cluster. When set to 1, no redistribution is performed.
- norm : boolean, default=True
Whether to normalize the data; the normalization is associated with the chosen sorting.
- mergeScale : float, default=1.5
Used for distance-based merging: when the distance between the centers of two distinct groups is smaller than mergeScale*radius, the two groups are merged.
- post_alloc : boolean, default=True
Whether to allocate the outliers to their closest groups, and hence to the corresponding clusters. If False, all outliers are labeled -1.
- mergeTinyGroups : boolean, default=True
If False, the group merging ignores all groups with fewer than minPts points.
- algorithm : str, default=‘bf’
Algorithm to merge connected groups.
‘bf’: Use brute force routines to speed up the merging of connected groups.
‘set’: Use disjoint set structure to merge connected groups.
- memory : boolean, default=True
If Cython memoryviews are disabled, a fast algorithm with less efficient memory consumption is triggered, since precomputation for aggregation is used. Setting this to True uses a memory-efficient computation instead. If Cython memoryviews are in effect, this parameter can be ignored.
- verbose : boolean or int, default=1
Whether to print the logs or not.
- short_log_form : boolean, default=True
Whether to use the short log form, which truncates the list of clusters.
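The interplay of sorting, radius and mergeScale described above can be illustrated with a small standard-library sketch. It is not the CLASSIX implementation: the function names norm_orthant_keys, aggregate and merge_groups are hypothetical, data normalization is omitted, and the early exit in the aggregation loop is an assumption that relies on the sort keys differing by no more than the distance between the points.

```python
import math


def norm_orthant_keys(points):
    """Sort keys in the spirit of 'norm-orthant': shift to the positive orthant, take 2-norms."""
    dim = len(points[0])
    mins = [min(p[d] for p in points) for d in range(dim)]
    return [math.sqrt(sum((p[d] - mins[d]) ** 2 for d in range(dim))) for p in points]


def aggregate(points, keys, radius):
    """Greedy aggregation: each unassigned point (in key order) starts a new group."""
    order = sorted(range(len(points)), key=lambda i: keys[i])
    group_of = [-1] * len(points)
    centers = []  # index of the starting point of each group
    for pos, i in enumerate(order):
        if group_of[i] != -1:
            continue
        group_of[i] = len(centers)
        centers.append(i)
        for j in order[pos + 1:]:
            # Assumed early exit: keys differ by at most the point distance,
            # so no later point in the ordering can lie within the radius.
            if keys[j] - keys[i] > radius:
                break
            if group_of[j] == -1 and math.dist(points[i], points[j]) <= radius:
                group_of[j] = group_of[i]
    return group_of, centers


def merge_groups(points, centers, radius, merge_scale=1.5):
    """Union groups whose centers are within merge_scale * radius (disjoint set)."""
    parent = list(range(len(centers)))

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a in range(len(centers)):
        for b in range(a + 1, len(centers)):
            if math.dist(points[centers[a]], points[centers[b]]) <= merge_scale * radius:
                parent[find(b)] = find(a)
    relabel = {}  # map each root to a consecutive cluster label
    return [relabel.setdefault(find(g), len(relabel)) for g in range(len(centers))]


points = [(0.0, 0.0), (0.3, 0.0), (0.6, 0.0), (0.9, 0.1), (5.0, 0.0), (5.2, 0.0)]
keys = norm_orthant_keys(points)
group_of, centers = aggregate(points, keys, radius=0.5)
cluster_of_group = merge_groups(points, centers, radius=0.5)
labels = [cluster_of_group[g] for g in group_of]
print(group_of)  # [0, 0, 1, 1, 2, 2]
print(labels)    # [0, 0, 0, 0, 1, 1]
```

In this toy data set the first two groups have centers 0.6 apart, which is within mergeScale*radius = 0.75, so they end up in one cluster, while the distant third group stays separate.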
Attributes
- groups_ : numpy.ndarray
Group labels from the aggregation.
- splist_ : numpy.ndarray
List of group centers formed in the aggregation.
- labels_ : numpy.ndarray
Cluster labels for the data objects.
- group_outliers_ : numpy.ndarray
Indices of outliers at the level of aggregation groups, i.e., indices of abnormal groups within clusters that have fewer than minPts data points.
- clusterSizes_ : array
The cardinality of each cluster.
- groupCenters_ : array
Indices of the starting points in the original data order.
- nrDistComp_ : float
The number of distance computations.
- dataScale_ : float
The value of the data scaling.
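To make the relation between cluster cardinalities, minPts and group-level outliers concrete, here is a minimal sketch in plain Python. The helper small_cluster_groups and its inputs group_to_cluster and group_sizes are hypothetical names for illustration; they are not attributes of the class.

```python
from collections import Counter


def small_cluster_groups(group_to_cluster, group_sizes, min_pts):
    """Return cluster cardinalities and the groups lying in clusters with < min_pts points."""
    cluster_sizes = Counter()
    for cluster, size in zip(group_to_cluster, group_sizes):
        cluster_sizes[cluster] += size  # a cluster's size is the sum of its group sizes
    outlier_groups = [
        g for g, cluster in enumerate(group_to_cluster)
        if cluster_sizes[cluster] < min_pts
    ]
    return dict(cluster_sizes), outlier_groups


# Four groups: groups 0 and 1 form cluster 0, groups 2 and 3 are singleton clusters.
group_to_cluster = [0, 0, 1, 2]
group_sizes = [5, 3, 1, 2]
sizes, outliers = small_cluster_groups(group_to_cluster, group_sizes, min_pts=3)
print(sizes)     # {0: 8, 1: 1, 2: 2}
print(outliers)  # [2, 3]
```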
Methods
- fit(data):
Cluster the data and save the model parameters. The labels can be extracted by calling self.labels_.
- fit_transform(data):
Cluster data and return labels. The labels can also be extracted by calling self.labels_.
- predict(data):
After clustering the in-sample data, predict the out-of-sample data. Each data point is allocated to the cluster of its nearest starting point from the aggregation stage.
- gcIndices(ids):
Return the group center (i.e., starting point) location in the data.
- explain(index1, index2, …):
Explain the computed clustering. The indices index1 and index2 are optional parameters (int) corresponding to the indices of the data points.
- load_group_centers(self):
Load group centers.
- load_cluster_centers(self):
Load cluster centers.
- getPath(index1, index2, include_dist=False):
Return the indices of connected data points between index1 data and index2 data.
- normalization(data):
Normalize the data according to the fitted model.
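The nearest-starting-point rule described for predict can be sketched as follows. This is an illustration only, with hypothetical names (predict_nearest, starting_points, group_to_cluster); it assumes the new points have already been normalized in the same way as the training data and uses a plain linear search.

```python
import math


def predict_nearest(new_points, starting_points, group_to_cluster):
    """Give each new point the cluster label of its nearest starting point."""
    labels = []
    for p in new_points:
        nearest_group = min(
            range(len(starting_points)),
            key=lambda g: math.dist(p, starting_points[g]),
        )
        labels.append(group_to_cluster[nearest_group])
    return labels


starting_points = [(0.0, 0.0), (0.6, 0.0), (5.0, 0.0)]
group_to_cluster = [0, 0, 1]  # the first two groups were merged into cluster 0
predicted = predict_nearest([(0.5, 0.2), (4.0, 1.0)], starting_points, group_to_cluster)
print(predicted)  # [0, 1]
```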
References
[1] X. Chen and S. Güttel. Fast and explainable clustering based on sorting, 2022.
- __init__(sorting='pca', radius=0.5, minPts=1, group_merging='distance', norm=True, mergeScale=1.5, post_alloc=True, mergeTinyGroups=True, memory=True, verbose=1, short_log_form=True)
Methods
- __init__([sorting, radius, minPts, ...])
- calculate_group_centers(data, labels): Compute data center for each label according to label sequence.
- explain([index1, index2, cmap, showalldata, ...]): self.explain(object/index) prints an explanation of why a data point is in its cluster (or is an outlier); self.explain(object1/index1, object2/index2) prints an explanation of why object1 and object2 are in the same or in distinct clusters.
- explain_viz([showalldata, alpha, cmap, ...]): Visualize the starting points and data points.
- fit(data): Cluster the data and return the associated cluster labels.
- fit_transform(data): Cluster the data and return the associated cluster labels.
- form_starting_point_clusters_table([aggregate]): Form the column details for the group centers and cluster information.
- gc2ind(spid)
- gcIndices(ids)
- getPath(index1, index2[, include_dist]): Get the indices of connected data points between index1 data and index2 data.
- load_cluster_centers(): Load cluster centers.
- load_group_centers(): Load group centers.
- merging(data, agg_labels, splist, ind, sort_vals): Merge groups after aggregation.
- normalization(data): Normalize the data by the fitted model.
- outlier_filter([min_samples, min_samples_rate]): Filter outliers in terms of min_samples or min_samples_rate.
- pprint_format(items[, truncate]): Format item value for clusters.
- predict(data[, memory]): Allocate the data to their nearest clusters.
- reassign_labels(labels): Renumber the labels to 0, 1, 2, 3, ...
- visualize_linkage([scale, figsize, ...]): Visualize the linkage in the distance clustering.
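As an illustration of renumbering labels to 0, 1, 2, 3, ..., the sketch below maps arbitrary cluster labels to consecutive integers. The ordering rule (larger clusters receive smaller labels, ties resolved by first appearance) is an assumption made for this example, and renumber_labels is a hypothetical stand-in rather than the library method.

```python
from collections import Counter


def renumber_labels(labels):
    """Map arbitrary labels to 0, 1, 2, ... (assumed order: largest cluster first)."""
    counts = Counter(labels)
    # most_common() sorts by count, descending; equal counts keep first-seen order.
    mapping = {old: new for new, (old, _) in enumerate(counts.most_common())}
    return [mapping[label] for label in labels]


renumbered = renumber_labels([7, 7, 3, 9, 9, 9])
print(renumbered)  # [1, 1, 2, 0, 0, 0]
```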
Attributes
clusterSizes_
groupCenters_
group_merging
minPts
radius
sorting