classix.CLASSIX

class classix.CLASSIX(sorting='pca', radius=0.5, minPts=1, group_merging='distance', norm=True, mergeScale=1.5, post_alloc=True, mergeTinyGroups=True, memory=True, verbose=1, short_log_form=True)[source]

CLASSIX: Fast and explainable clustering based on sorting.

The main parameters are radius and minPts.

Parameters

sorting : str, {'pca', 'norm-mean', 'norm-orthant', None}, default='pca'

Sorting method used for the aggregation phase.

  • ‘pca’: sort data points by their first principal component

  • ‘norm-mean’: shift data to have zero mean and then sort by 2-norm values

  • ‘norm-orthant’: shift data to positive orthant and then sort by 2-norm values

  • None: aggregate the raw data without any sorting
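The two norm-based options can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper name `sort_order` is hypothetical, and "shift data to positive orthant" is assumed here to mean subtracting the coordinate-wise minimum. The 'pca' option is left out to keep the sketch dependency-free.

```python
import math

def sort_order(points, sorting="norm-mean"):
    """Return the indices that order `points` by the chosen sort key (sketch)."""
    n, dim = len(points), len(points[0])
    if sorting is None:
        return list(range(n))  # keep the raw data order
    if sorting == "norm-mean":
        # shift so that every coordinate has zero mean
        shift = [sum(p[j] for p in points) / n for j in range(dim)]
    elif sorting == "norm-orthant":
        # assumption: shift by the coordinate-wise minimum so all values are >= 0
        shift = [min(p[j] for p in points) for j in range(dim)]
    else:
        raise ValueError("this sketch covers 'norm-mean', 'norm-orthant' and None only")
    # sort key: 2-norm of each shifted point
    keys = [math.sqrt(sum((p[j] - shift[j]) ** 2 for j in range(dim))) for p in points]
    return sorted(range(n), key=keys.__getitem__)

points = [(0.0, 0.0), (4.0, 0.0), (2.0, 0.0), (2.0, 3.0)]
order_mean = sort_order(points, "norm-mean")        # [2, 0, 1, 3]
order_orthant = sort_order(points, "norm-orthant")  # [0, 2, 3, 1]
```

The two shifts produce different orderings of the same data, which is why the choice of sorting can change which points become group centers during aggregation.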

radius : float, default=0.5

Tolerance to control the aggregation. If the distance between a group center and an object is less than or equal to the tolerance, the object will be allocated to the group which the group center belongs to. For details, we refer to [1].
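To make the role of the tolerance concrete, here is a minimal sketch of a greedy aggregation pass. It assumes the points are already sorted and that the first unassigned point always starts a new group; the function name `aggregate` is hypothetical and the library's actual routine may differ (see [1]).

```python
import math

def aggregate(points, radius):
    """Greedy aggregation over points that are assumed to be already sorted."""
    labels = [-1] * len(points)  # -1 marks "not yet assigned"
    centers = []                 # index of the group center of each group
    for i, p in enumerate(points):
        if labels[i] != -1:
            continue
        group = len(centers)
        centers.append(i)        # first unassigned point starts a new group
        labels[i] = group
        for j in range(i + 1, len(points)):
            # allocate every unassigned point within `radius` of the center
            if labels[j] == -1 and math.dist(p, points[j]) <= radius:
                labels[j] = group
    return labels, centers

points = [(0.0, 0.0), (0.3, 0.0), (0.6, 0.0), (1.0, 0.0), (1.4, 0.0)]
labels, centers = aggregate(points, radius=0.5)
# labels  -> [0, 0, 1, 1, 2]
# centers -> [0, 2, 4]
```

A smaller radius yields more, finer groups; a larger radius yields fewer, coarser ones.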

group_merging : str, {'density', 'distance', None}, default='distance'

The method for the merging of groups.

  • ‘density’: two groups are merged if the density of data points in their intersection is at least as high as the smaller of the two groups' densities. This option uses the disjoint set structure to speed up merging.

  • ‘distance’: two groups are merged if the distance between their group centers is at most mergeScale*radius (the parameter above). This option uses the disjoint set structure to speed up merging.

For more details, we refer to [1]. If group_merging is set to None, the method will return the labels formed by aggregation as cluster labels.
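The 'distance' rule combined with a disjoint set structure can be sketched as follows. This is an illustrative brute-force version over all pairs of group centers, not the library's code; `merge_by_distance` and its arguments are hypothetical names.

```python
import math

def merge_by_distance(centers, radius, merge_scale=1.5):
    """Union groups whose centers lie within merge_scale * radius of each other."""
    parent = list(range(len(centers)))  # disjoint set: each group is its own root

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= merge_scale * radius:
                parent[find(j)] = find(i)  # union the two sets

    # relabel the roots as consecutive cluster ids 0, 1, 2, ...
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(centers))]

centers = [(0.0, 0.0), (0.6, 0.0), (1.2, 0.0), (5.0, 0.0)]
cluster_of_group = merge_by_distance(centers, radius=0.5, merge_scale=1.5)
# cluster_of_group -> [0, 0, 0, 1]
```

Note that the first and third centers are 1.2 apart, more than the threshold of 0.75, yet they end up in the same cluster because both are linked to the second center. Merging is transitive, which the disjoint set handles naturally.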

minPts : int, default=1

Clusters with fewer than minPts points are classified as abnormal clusters. The data points in an abnormal cluster will be redistributed to the nearest normal cluster. When set to 1, no redistribution is performed.
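A rough sketch of the redistribution step is given below. It assumes that "nearest normal cluster" means the normal cluster with the nearest centroid, which is an assumption of this example rather than something stated above; the helper `redistribute` is hypothetical.

```python
import math
from collections import Counter

def redistribute(points, labels, min_pts):
    """Move points of clusters smaller than min_pts to the nearest normal cluster."""
    sizes = Counter(labels)
    normal = [c for c in sizes if sizes[c] >= min_pts]
    if not normal:
        return list(labels)
    # assumption: "nearest" is measured to the centroid of each normal cluster
    centroids = {}
    for c in normal:
        members = [p for p, l in zip(points, labels) if l == c]
        centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    new_labels = []
    for p, l in zip(points, labels):
        if sizes[l] >= min_pts:
            new_labels.append(l)  # point already sits in a normal cluster
        else:
            new_labels.append(min(normal, key=lambda c: math.dist(p, centroids[c])))
    return new_labels

points = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1),
          (5.0, 5.0), (5.2, 5.0), (5.1, 5.1), (4.0, 4.0)]
labels = [0, 0, 0, 1, 1, 1, 2]
new_labels = redistribute(points, labels, min_pts=3)
# new_labels -> [0, 0, 0, 1, 1, 1, 1]
```

With `min_pts=1` every cluster counts as normal, so the labels come back unchanged, matching the statement that no redistribution is performed in that case.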

norm : boolean, default=True

Whether to normalize the data associated with the sorting.

mergeScale : float, default=1.5

Used with distance-based merging: when the distance between the group centers of two distinct groups is smaller than mergeScale*radius, the two groups are merged.

post_alloc : boolean, default=True

Whether to allocate outliers to the closest groups, and hence to the corresponding clusters. If False, all outliers are labeled -1.

mergeTinyGroups : boolean, default=True

If False, group merging ignores all groups with fewer than minPts points.

algorithm : str, default='bf'

Algorithm to merge connected groups.

  • ‘bf’: Use brute force routines to speed up the merging of connected groups.

  • ‘set’: Use disjoint set structure to merge connected groups.

memory : boolean, default=True

If Cython memoryviews are disabled, a fast algorithm with less efficient memory consumption is triggered, since precomputation for aggregation is used. Setting this to True uses memory-efficient computation instead. If Cython memoryviews are in effect, this parameter can be ignored.

verbose : boolean or int, default=1

Whether to print the logs or not.

short_log_form : boolean, default=True

Whether or not to use short log form to truncate the clusters list.

Attributes

groups_ : numpy.ndarray

Group labels from the aggregation.

splist_ : numpy.ndarray

List of group centers formed in the aggregation.

labels_ : numpy.ndarray

Cluster labels for the data objects.

group_outliers_ : numpy.ndarray

Indices of outliers at the level of aggregation groups, i.e., indices of the groups that belong to abnormal clusters with fewer than minPts data points.

clusterSizes_ : array

The cardinality of each cluster.

groupCenters_ : array

The indices of the starting points (group centers) in the original data order.

nrDistComp_ : float

The number of distance computations.

dataScale_ : float

The value of data scaling.

Methods

fit(data):

Cluster the data and store the fitted model parameters. The labels can then be read from self.labels_.

fit_transform(data):

Cluster the data and return the labels. The labels can also be read from self.labels_.

predict(data):

After clustering the in-sample data, predict labels for out-of-sample data. Each data point is allocated to the cluster of its nearest starting point from the aggregation stage.

gcIndices(ids):

Return the group center (i.e., starting point) location in the data.

explain(index1, index2, …):

Explain the computed clustering. The indices index1 and index2 are optional parameters (int) corresponding to the indices of the data points.

load_group_centers():

Load group centers.

load_cluster_centers():

Load cluster centers.

getPath(index1, index2, include_dist=False):

Return the indices of connected data points between index1 data and index2 data.

normalization(data):

Normalize the data according to the fitted model.
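The allocation rule described for predict can be sketched in a few lines. This is only an illustration of the nearest-starting-point idea; `predict_labels`, `starting_points` and `cluster_of_group` are hypothetical names, and any normalization of the new data by the fitted model is omitted.

```python
import math

def predict_labels(new_points, starting_points, cluster_of_group):
    """Give each new point the cluster label of its nearest starting point."""
    out = []
    for p in new_points:
        # index of the group whose starting point is closest to p
        nearest = min(range(len(starting_points)),
                      key=lambda g: math.dist(p, starting_points[g]))
        out.append(cluster_of_group[nearest])
    return out

starting_points = [(0.0, 0.0), (0.6, 0.0), (5.0, 0.0)]  # one per aggregation group
cluster_of_group = [0, 0, 1]                            # cluster label of each group
predicted = predict_labels([(0.5, 0.1), (4.0, 0.0)], starting_points, cluster_of_group)
# predicted -> [0, 1]
```

Only the starting points are compared against, not the full training data, so the cost per new point grows with the number of groups rather than the number of samples.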

References

[1] X. Chen and S. Güttel. Fast and explainable clustering based on sorting, 2022.

__init__(sorting='pca', radius=0.5, minPts=1, group_merging='distance', norm=True, mergeScale=1.5, post_alloc=True, mergeTinyGroups=True, memory=True, verbose=1, short_log_form=True)[source]

Methods

__init__([sorting, radius, minPts, ...])

calculate_group_centers(data, labels)

Compute data center for each label according to label sequence.

explain([index1, index2, cmap, showalldata, ...])

self.explain(object/index) prints an explanation for why a point object is in its cluster (or is an outlier). self.explain(object1/index1, object2/index2) prints an explanation for why object1 and object2 are either in the same or in distinct clusters.

explain_viz([showalldata, alpha, cmap, ...])

Visualize the starting points and data points.

fit(data)

Cluster the data and store the fitted model parameters; the labels are available from self.labels_.

fit_transform(data)

Cluster the data and return the associated cluster labels.

form_starting_point_clusters_table([aggregate])

Form the column details for the group center and cluster information.

gc2ind(spid)

gcIndices(ids)

getPath(index1, index2[, include_dist])

Get the indices of connected data points between index1 data and index2 data.

load_cluster_centers()

Load cluster centers.

load_group_centers()

Load group centers.

merging(data, agg_labels, splist, ind, sort_vals)

Merge groups after aggregation.

normalization(data)

Normalize the data by the fitted model.

outlier_filter([min_samples, min_samples_rate])

Filter outliers in terms of min_samples or min_samples_rate.

pprint_format(items[, truncate])

Format item value for clusters.

predict(data[, memory])

Allocate the data to their nearest clusters.

reassign_labels(labels)

Renumber the labels to 0, 1, 2, 3, ...

visualize_linkage([scale, figsize, ...])

Visualize the linkage in the distance clustering.

Attributes

clusterSizes_

groupCenters_

group_merging

minPts

radius

sorting