Main Parameters

The main class is CLASSIX, which provides clustering and visualization.

class classix.clustering.CLASSIX(sorting='pca', radius=0.5, minPts=0, group_merging='distance', norm=True, scale=1.5, post_alloc=True, verbose=1)[source]

CLASSIX: Fast and explainable clustering based on sorting.

In most cases, users only need to set the hyperparameters sorting, radius, and minPts. For more flexible clustering, consider the other hyperparameters such as group_merging, scale, and post_alloc.

Parameters

sorting: str, {‘pca’, ‘norm-mean’, ‘norm-orthant’, None}, default=‘pca’

Sorting method used for the aggregation phase.

  • ‘pca’: sort data points by their first principal component

  • ‘norm-mean’: shift data to have zero mean and then sort by 2-norm values

  • ‘norm-orthant’: shift data to positive orthant and then sort by 2-norm values

  • None: aggregate the raw data without any sorting
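The two norm-based options above can be illustrated with a small standard-library sketch. It is not the library's implementation: the function names are hypothetical, and "shift to the positive orthant" is assumed here to mean subtracting the per-coordinate minimum.

```python
import math

def sort_order_norm_mean(points):
    """Indices of points ordered by 2-norm after shifting to zero mean."""
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    keys = [math.sqrt(sum((p[j] - mean[j]) ** 2 for j in range(dim)))
            for p in points]
    return sorted(range(len(points)), key=keys.__getitem__)

def sort_order_norm_orthant(points):
    """Indices ordered by 2-norm after subtracting the per-coordinate minimum
    (an assumed reading of 'shift data to positive orthant')."""
    dim = len(points[0])
    low = [min(p[j] for p in points) for j in range(dim)]
    keys = [math.sqrt(sum((p[j] - low[j]) ** 2 for j in range(dim)))
            for p in points]
    return sorted(range(len(points)), key=keys.__getitem__)

points = [(0.0, 0.0), (4.0, 0.0), (1.0, 1.0)]
order_mean = sort_order_norm_mean(points)
order_orthant = sort_order_norm_orthant(points)
```

On this toy input the two options produce different orderings, which is why the choice of sorting can change which points become starting points.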

radius: float, default=0.5

Tolerance to control the aggregation. If the distance between a starting point and an object is less than or equal to the tolerance, the object will be allocated to the group which the starting point belongs to. For details, we refer users to [1].
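The role of the tolerance can be sketched as a greedy pass over the sorted points. This is a simplified illustration, not the library's code: the function name aggregate is hypothetical, and the sketch omits any early termination based on the sort values.

```python
import math

def aggregate(points, order, radius):
    """Greedy aggregation sketch: walk the points in sorted order; each
    unassigned point becomes a starting point and absorbs every unassigned
    point within `radius` of it."""
    labels = [-1] * len(points)
    starting_points = []  # indices of the starting points, one per group
    for i in order:
        if labels[i] != -1:
            continue  # already absorbed by an earlier starting point
        group = len(starting_points)
        starting_points.append(i)
        for j in order:
            if labels[j] == -1 and math.dist(points[i], points[j]) <= radius:
                labels[j] = group
    return labels, starting_points

points = [(0.0, 0.0), (0.3, 0.0), (1.0, 0.0), (1.2, 0.0)]
group_labels, sp_indices = aggregate(points, order=[0, 1, 2, 3], radius=0.5)
```

A smaller radius yields more, tighter groups; a larger radius yields fewer groups.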

group_merging: str, {‘density’, ‘distance’, None}, default=‘distance’

The method for merging the groups.

  • ‘density’: two groups are merged if the density of data points in their intersection is at least as high as the smaller of the two groups' densities. This option uses the disjoint-set structure to speed up agglomeration.

  • ‘distance’: two groups are merged if the distance between their starting points is at most scale*radius (the parameter above). This option uses the disjoint-set structure to speed up agglomeration.

For more details, we refer to [1]. If group_merging is set to None, the clustering returns the labels formed by aggregation as the cluster labels.
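The ‘distance’ rule combined with a disjoint-set structure might look like the following sketch. The helper names are hypothetical and the all-pairs loop is for clarity only; the library's actual implementation may differ.

```python
import math

def find(parent, i):
    """Return the root of i, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def merge_by_distance(starting_points, radius, scale=1.5):
    """Merge groups whose starting points are at most scale*radius apart."""
    n = len(starting_points)
    parent = list(range(n))
    for a in range(n):
        for b in range(a + 1, n):
            if math.dist(starting_points[a], starting_points[b]) <= scale * radius:
                root_a, root_b = find(parent, a), find(parent, b)
                if root_a != root_b:
                    parent[root_b] = root_a  # union the two sets
    # Relabel the roots as consecutive cluster ids 0, 1, 2, ...
    relabel = {}
    return [relabel.setdefault(find(parent, i), len(relabel)) for i in range(n)]

sps = [(0.0, 0.0), (0.6, 0.0), (1.2, 0.0), (5.0, 0.0)]
cluster_of_group = merge_by_distance(sps, radius=0.5, scale=1.5)
```

Note that groups 0 and 2 end up in the same cluster even though their starting points are farther apart than scale*radius, because both are linked to group 1.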

minPts: int, default=0

Clusters with fewer than minPts points are classified as abnormal clusters. The data points in an abnormal cluster are redistributed to the nearest normal cluster. When set to 0, no redistribution is performed.
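The redistribution step can be sketched as follows. How "nearest" is measured is not specified here, so this sketch assumes the distance to the centroid of each normal cluster; that choice and the function name are illustrative assumptions only.

```python
import math
from collections import Counter

def redistribute_small_clusters(points, labels, min_pts):
    """Reassign points of clusters with fewer than min_pts members to the
    nearest normal cluster (nearest centroid is an assumption of this sketch)."""
    sizes = Counter(labels)
    normal = sorted(c for c, n in sizes.items() if n >= min_pts)
    if not normal:
        return list(labels)  # nothing to redistribute to
    centroids = {}
    for c in normal:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    new_labels = []
    for p, lab in zip(points, labels):
        if lab in centroids:
            new_labels.append(lab)
        else:
            new_labels.append(min(normal, key=lambda c: math.dist(p, centroids[c])))
    return new_labels

pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
       (5.0, 0.0), (5.1, 0.0), (5.2, 0.0), (4.0, 0.0)]
labs = [0, 0, 0, 1, 1, 1, 2]
merged = redistribute_small_clusters(pts, labs, min_pts=2)
unchanged = redistribute_small_clusters(pts, labs, min_pts=0)
```

With min_pts=0 every cluster counts as normal, so the labels are returned unchanged, matching the documented behavior.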

norm: boolean, default=True

Whether to normalize the data associated with the sorting.

scale: float, default=1.5

Used for distance-based merging: when the distance between the starting points of two distinct groups is smaller than scale*radius, the two groups are merged.

post_alloc: boolean, default=True

Whether to allocate the outliers to the closest groups, and hence to the corresponding clusters. If False, all outliers are labeled as -1.

verbose: boolean or int, default=1

Whether to print the logs.

Attributes

groups_: numpy.ndarray

Group labels from aggregation.

splist_: numpy.ndarray

List of starting points formed in the aggregation.

labels_: numpy.ndarray

Cluster labels for the data objects.

group_outliers_: numpy.ndarray

Indices of outliers at the aggregation-group level, i.e., indices of abnormal groups within clusters that have fewer than minPts data points.

clean_index_: numpy.ndarray

Index of the data points that are not outliers. Given data X, the data without outliers can be extracted with X_clean = X[classix.clean_index_,:], and the outliers with Outliers = X[~classix.clean_index_,:].

connected_pairs_: list

List of connected group labels.
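The explain output quoted in this document reports chains such as "groups 9 -> 2 -> 8". Assuming the connected pairs are stored as pairs of group labels (an assumption about the layout, not confirmed here), such a chain could be recovered with a breadth-first search. The function name group_path is hypothetical.

```python
from collections import deque

def group_path(connected_pairs, start, goal):
    """Shortest chain of connected groups from start to goal, or None."""
    neighbors = {}
    for a, b in connected_pairs:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        group = queue.popleft()
        if group == goal:
            path = []
            while group is not None:  # walk back to the start
                path.append(group)
                group = previous[group]
            return path[::-1]
        for nb in neighbors.get(group, []):
            if nb not in previous:
                previous[nb] = group
                queue.append(nb)
    return None

path = group_path([(9, 2), (2, 8), (3, 4)], start=9, goal=8)
no_path = group_path([(9, 2), (2, 8), (3, 4)], start=9, goal=4)
```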

Methods:

fit(data):

Cluster the data and store the fitted model parameters. The labels can be extracted by calling self.labels_.

fit_transform(data):

Cluster data and return labels. The labels can also be extracted by calling self.labels_.

predict(data):

After clustering the in-sample data, predict the out-of-sample data. Data are allocated to the cluster of the nearest starting point found in the aggregation stage.

explain(index1, index2):

Explain the computed clustering. The indices index1 and index2 are optional parameters (int) corresponding to the indices of the data points.

References

[1] X. Chen and S. Güttel. Fast and explainable clustering based on sorting, 2022.

clustering(data, agg_labels, splist, sorting='pca', radius=0.5, method='distance', minPts=0)[source]

Merge groups after aggregation.

Parameters

data: numpy.ndarray

Array-like input of shape (n_samples,).

agg_labels: numpy.ndarray

Group labels from aggregation.

splist: numpy.ndarray

List formed in the aggregation storing starting points.

sorting: str

The sorting method used for aggregation, default=‘pca’; other options: ‘norm-mean’, ‘norm-orthant’, ‘z-pca’, or None.

radius: float, default=0.5

Tolerance to control the aggregation hence the whole clustering process. For aggregation, if the distance between a starting point and an object is less than or equal to the tolerance, the object will be allocated to the group which the starting point belongs to.

method: str

The method for merging groups, default=‘distance’; other options: ‘density’, ‘mst-distance’, and ‘scc-distance’.

minPts: int, default=0

The threshold, in the range [0, infinity), that determines the noise level. When set to 0, the algorithm does not check for noise.

Returns

centers: numpy.ndarray

The centers of the clusters.

clabels: numpy.ndarray

The cluster labels of the data.

explain(index1=None, index2=None, showsplist=True, max_colwidth=None, replace_name=None, plot=False, figsize=(11, 6), figstyle='ggplot', savefig=False, ind_color='k', ind_marker_size=150, sp_fcolor='tomato', sp_alpha=0.05, sp_pad=0.5, sp_fontsize=None, sp_bbox=None, dp_fcolor='bisque', dp_alpha=0.6, dp_pad=2, dp_fontsize=None, dp_bbox=None, cmap='turbo', cmin=0.07, cmax=0.97, color='red', connect_color='green', alpha=0.5, cline_width=0.5, axis='off', figname=None, fmt='pdf', *argv, **kwargs)[source]

self.explain(object/index) # prints an explanation of why the point object is in its cluster (or is an outlier)

self.explain(object1/index1, object2/index2) # prints an explanation of why object1 and object2 are either in the same cluster or in distinct clusters

Here we unify the terminology:

  • data points

  • groups (made up of data points, formed by aggregation)

  • clusters (made up of groups)

Parameters

index1: int or numpy.ndarray, optional

Input object1 [with index ‘index1’] for explanation.

index2: int or numpy.ndarray, optional

Input object2 [with index ‘index2’] for explanation, and compare objects [with indices ‘index1’ and ‘index2’].

showsplist: boolean, default=True

Whether to show the starting point information, which includes the number of data points (NumPts), the corresponding clusters, and the associated coordinates. This only applies when both index1 and index2 are None.

max_colwidth: int, optional

Max width to truncate each column in characters. By default, no limit.

replace_name: str or list, optional

Replace the index with a name.

  • For example, for indices 1 and 1300 we have

classix.explain(1, 1300, plot=False, figstyle="seaborn") # or classix.explain(obj1, obj4)

The data point 1 is in group 9 and the data point 1300 is in group 8, both of which were merged into cluster #0. These two groups are connected via groups 9 -> 2 -> 8.

  • If we specify replace_name, the output will be

classix.explain(1, 1300, replace_name=["Peter Meyer", "Anna Fields"], figstyle="seaborn")

The data point Peter Meyer is in group 9 and the data point Anna Fields is in group 8, both of which were merged into cluster #0. These two groups are connected via groups 9 -> 2 -> 8.

plot: boolean, default=False

Whether to visualize the explanation.

figsize: tuple, default=(11, 6)

Determine the size of visualization figure.

figstyle: str, default=“ggplot”

The style of the visualization; see https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html

savefig: boolean, default=False

Whether to save the figure. The figure is saved in the folder named “img”.

ind_color: str, default=‘k’

Color for visualization of data with indices index1 and index2.

ind_marker_size: float, optional

Size for visualization of data with indices index1 and index2.

sp_fcolor: str, default=‘tomato’

The color of the text box for starting points.

sp_alpha: float, default=0.05

The transparency of the text box for starting points.

sp_pad: int or float, default=0.5

The size of text box for starting points.

sp_bbox: dict, optional

Dict with properties for patches.FancyBboxPatch for starting points.

sp_fontsize: int, optional

The font size of the text for starting points.

dp_fcolor: str, default=‘bisque’

The color of the text box for the specified data objects.

dp_alpha: float, default=0.6

The transparency of the text box for the specified data objects.

dp_pad: int, default=2

The size of text box for specified data objects.

dp_fontsize: int, optional

The font size of the text for the specified data objects.

dp_bbox: dict, optional

Dict with properties for patches.FancyBboxPatch for specified data objects.

cmap: str, default=‘turbo’

The colormap to be employed.

cmin: int or float, default=0.07

The minimum color range.

cmax: int or float, default=0.97

The maximum color range.

color: str, default=‘red’

Color for text of starting points labels in visualization.

alpha: float, default=0.5

Scalar or None.

cline_width: float, default=0.5

Set the patch linewidth of circle for starting points.

figname: str, optional

Set the figure name for the image to be saved.

fmt: str, default=‘pdf’

The format of the image to be saved; the other choice is ‘png’.

explain_viz(figsize=(12, 8), figstyle='ggplot', savefig=False, fontsize=None, bbox={'alpha': 0.3, 'facecolor': 'tomato', 'pad': 2}, axis='off', fmt='pdf')[source]

Visualize the starting points and data points.

fit(data)[source]

Cluster the data and return the associated cluster labels.

Parameters

data: numpy.ndarray

The ndarray-like input of shape (n_samples,)

fit_transform(data)[source]

Cluster the data and return the associated cluster labels.

Parameters

data: numpy.ndarray

The ndarray-like input of shape (n_samples,)

Returns

labels: numpy.ndarray

Index of the cluster each sample belongs to.

form_starting_point_clusters_table(aggregate=False)[source]

Form the column details for the starting point and cluster information.

outlier_filter(labels, min_samples=None, min_samples_rate=0.1)[source]

Filter outliers in terms of min_samples

predict(data)[source]

Allocate the data to their nearest clusters.

Parameters

data: numpy.ndarray

The ndarray-like input of shape (n_samples,)

Returns

labels: numpy.ndarray

The predicted clustering labels.
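As the method summary states, out-of-sample data are allocated to the cluster of the nearest starting point. A minimal sketch of that rule follows; the function name and the argument sp_cluster_labels (the cluster label of each starting point) are hypothetical, and any data normalization the library applies is omitted.

```python
import math

def predict_nearest(new_points, starting_points, sp_cluster_labels):
    """Assign each new point the cluster label of its nearest starting point."""
    predictions = []
    for p in new_points:
        nearest = min(range(len(starting_points)),
                      key=lambda k: math.dist(p, starting_points[k]))
        predictions.append(sp_cluster_labels[nearest])
    return predictions

sps = [(0.0, 0.0), (0.6, 0.0), (5.0, 0.0)]
sp_labels = [0, 0, 1]  # two groups merged into cluster 0, one group in cluster 1
predicted = predict_nearest([(0.5, 0.1), (4.0, 0.0)], sps, sp_labels)
```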

visualize_linkage(scale=1.5, figsize=(8, 8), labelsize=24, markersize=320, plot_boundary=False, bound_color='red', path='.', fmt='pdf')[source]

Visualize the linkage in the distance clustering.

Parameters

scale: float, default=1.5

Used for distance-based merging: when the distance between the starting points of two distinct groups is smaller than scale*radius, the two groups are merged.

labelsize: int

The fontsize of ticks.

markersize: int

The size of the markers for starting points.

plot_boundary: boolean

If True, plot the boundary of the group around each starting point.

bound_color: str

The color for the boundary for groups with the specified radius.

path: str

Relative file location for figure storage.

fmt: str, default=‘pdf’

The format of the image to be saved; the other choice is ‘png’.