API Reference

classix.cython_is_available(verbose=0)[source]

Check if CLASSIX is using Cython.

classix.loadData(name='vdu_signals')[source]

Load built-in sample data.

Parameters:

name (str, {'vdu_signals', 'Iris', 'Dermatology', 'Ecoli', 'Glass', 'Banknote', 'Seeds', 'Phoneme', 'Wine', 'Covid3MC', 'CovidENV'}, default='vdu_signals') – Identifier of the built-in dataset.

Returns:

X, y – Data and ground-truth labels (if available).

Return type:

numpy.ndarray

class classix.CLASSIX(sorting='pca', radius=0.5, minPts=1, group_merging='distance', mergeScale=1.5, post_alloc=True, mergeTinyGroups=True, verbose=1, short_log_form=True)[source]

CLASSIX: Fast and explainable clustering based on sorting.

The main parameters are radius and minPts.

Parameters:

sorting (str, {'pca', 'norm-mean', 'norm-orthant', None}, default='pca') –

Sorting method used for the aggregation phase.

  • 'pca': sort data points by their first principal component

  • 'norm-mean': shift data to have zero mean and then sort by 2-norm values

  • 'norm-orthant': shift data to positive orthant and then sort by 2-norm values

  • None: aggregate the raw data without any sorting
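The two norm-based sort keys can be sketched in a few lines. This is an illustration only, using plain Python lists rather than NumPy arrays; the helper names `norm_mean_keys`, `norm_orthant_keys`, and `argsort` are hypothetical and not part of the classix API. The shift to the positive orthant is assumed here to subtract the column minima, which may differ from what the library does.

```python
import math

def norm_mean_keys(points):
    """Sort keys for 'norm-mean': centre the data, then take each point's 2-norm."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [math.sqrt(sum((p[j] - mean[j]) ** 2 for j in range(d))) for p in points]

def norm_orthant_keys(points):
    """Sort keys for 'norm-orthant': subtract column minima (assumed), then take the 2-norm."""
    d = len(points[0])
    low = [min(p[j] for p in points) for j in range(d)]
    return [math.sqrt(sum((p[j] - low[j]) ** 2 for j in range(d))) for p in points]

def argsort(keys):
    """Indices that put the keys in ascending order (stable for ties)."""
    return sorted(range(len(keys)), key=keys.__getitem__)

points = [(0.0, 0.0), (4.0, 0.0), (2.0, 0.0)]
order_mean = argsort(norm_mean_keys(points))        # point 2 sits at the mean, so it comes first
order_orthant = argsort(norm_orthant_keys(points))  # point 0 is the orthant origin
```

The resulting index order is what the aggregation phase would then traverse.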

radius (float, default=0.5) –

Tolerance to control the aggregation. If the distance between a group center and an object is less than or equal to the tolerance, the object will be allocated to the group which the group center belongs to. For details, we refer to [1].
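A minimal sketch of this allocation rule, assuming plain Python lists and sorting by the first coordinate as a stand-in for the configured sorting key. The function name `aggregate` is illustrative and not part of the classix API, and the library's own implementation may differ (see [1] for the actual algorithm).

```python
import math

def aggregate(points, radius):
    """Greedy aggregation sketch: returns (group label per point, indices of group centers)."""
    # Sort by the first coordinate as a stand-in for the configured sorting key (assumption).
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    labels = [-1] * len(points)
    centers = []
    for i in order:
        if labels[i] != -1:
            continue  # already absorbed by an earlier group
        group = len(centers)
        centers.append(i)  # the first unassigned point starts a new group
        labels[i] = group
        for j in order:
            # Allocate every unassigned point within `radius` of this group center.
            if labels[j] == -1 and math.dist(points[i], points[j]) <= radius:
                labels[j] = group
    return labels, centers

points = [(0.0, 0.0), (0.3, 0.0), (1.0, 0.0), (1.2, 0.1), (3.0, 0.0)]
labels, centers = aggregate(points, radius=0.5)
```

A smaller radius yields more, tighter groups; a larger radius yields fewer, coarser ones.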

group_merging (str, {'density', 'distance', None}, default='distance') –

The method used to merge groups.

  • 'distance': two groups are merged if the distance between their group centers is at most mergeScale*radius.

  • 'density': two groups are merged if the density of data points in their intersection is at least as high as the smaller of the two groups' densities. This option uses a disjoint-set structure for the merging.

If group_merging is set to None, the method will return the labels formed by aggregation as the cluster labels.
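The 'distance' rule can be sketched with a small disjoint-set (union-find) structure. This is an illustration under stated assumptions, not the library's code: `merge_groups` is a hypothetical name, group centers are passed as coordinate tuples, and using union-find for the distance option is a choice made for this sketch.

```python
import math

def merge_groups(centers, radius, merge_scale=1.5):
    """Distance-based merging sketch: returns one cluster label per group."""
    parent = list(range(len(centers)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in range(len(centers)):
        for b in range(a + 1, len(centers)):
            # Merge two groups whose centers are at most merge_scale * radius apart.
            if math.dist(centers[a], centers[b]) <= merge_scale * radius:
                parent[find(a)] = find(b)

    # Relabel the roots as consecutive cluster ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(g), len(ids)) for g in range(len(centers))]

centers = [(0.0, 0.0), (0.6, 0.0), (1.2, 0.0), (5.0, 0.0)]
clusters = merge_groups(centers, radius=0.5, merge_scale=1.5)
```

Groups 0 and 2 are farther apart than the threshold of 0.75, yet they end up in the same cluster because both are close enough to group 1.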

minPts (int, default=1) –

Clusters with fewer than minPts points are classified as abnormal clusters. The data points in an abnormal cluster will be redistributed to the nearest normal cluster. When set to 1, no redistribution is performed.
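A simplified sketch of the redistribution step, assuming plain Python lists. The name `redistribute_small_clusters` is hypothetical, and "nearest normal cluster" is interpreted here as the cluster of the closest point that belongs to a normal cluster; the library may measure closeness differently (for example at the group level).

```python
import math
from collections import Counter

def redistribute_small_clusters(points, labels, min_pts):
    """Reassign points of clusters with fewer than min_pts points to the nearest normal cluster."""
    sizes = Counter(labels)
    normal = {c for c, size in sizes.items() if size >= min_pts}
    if not normal:
        return list(labels)  # nothing is large enough to absorb the small clusters
    anchors = [i for i, c in enumerate(labels) if c in normal]
    new_labels = list(labels)
    for i, c in enumerate(labels):
        if c not in normal:
            # Assumption: "nearest" means the cluster of the closest normal point.
            nearest = min(anchors, key=lambda j: math.dist(points[i], points[j]))
            new_labels[i] = labels[nearest]
    return new_labels

points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 0.0), (5.1, 0.0), (5.2, 0.0), (4.0, 0.0)]
labels = [0, 0, 0, 1, 1, 1, 2]
relabeled = redistribute_small_clusters(points, labels, min_pts=2)
```

With `min_pts=1` every cluster counts as normal, so the labels come back unchanged.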

mergeScale (float, default=1.5) –

Used with distance-based merging: when the distance between the group centers of two distinct groups is smaller than mergeScale*radius, the two groups merge.

post_alloc (boolean, default=True) –

Whether to allocate outliers to their closest groups, and hence to the corresponding clusters. If False, all outliers will be labeled as -1.

mergeTinyGroups (boolean, default=True) –

If False, the group merging will ignore all groups with fewer than minPts points.

verbose (boolean or int, default=1) –

Whether to print the logs or not.

short_log_form (boolean, default=True) –

Whether to use the short log form, which truncates the list of clusters.

groups_

Groups labels of aggregation.

Type:

numpy.ndarray

splist_

List of group centers formed in the aggregation.

Type:

numpy.ndarray

labels_

Cluster labels of the data objects.

Type:

numpy.ndarray

group_outliers_

Indices of outliers (aggregation groups level), i.e., indices of abnormal groups within the clusters with fewer data points than minPts points.

Type:

numpy.ndarray

clusterSizes_

The cardinality of each cluster.

Type:

array

groupCenters_

The indices of the starting points (group centers) in the original data order.

Type:

array

nrDistComp_

The number of distance computations.

Type:

float

dataScale_

The value of data scaling.

Type:

float

fit(data):

Cluster the data and save the parameters of the fitted model. The labels can be extracted via self.labels_.

fit_transform(data):

Cluster data and return labels. The labels can also be extracted by calling self.labels_.

predict(data):

After clustering the in-sample data, predict labels for out-of-sample data. Each data point is allocated to the cluster of its nearest starting point from the aggregation stage.

gcIndices(ids):

Return the group center (i.e., starting point) location in the data.

explain(index1, index2, ...):

Explain the computed clustering. The indices index1 and index2 are optional parameters (int) corresponding to the indices of the data points.

load_group_centers(self):

Load group centers.

load_cluster_centers(self):

Load cluster centers.

getPath(index1, index2, include_dist=False):

Return the indices of connected data points between index1 data and index2 data.

preprocessing(data):

Normalize the data according to the fitted model.

References

[1] X. Chen and S. Güttel. Fast and explainable clustering based on sorting, https://arxiv.org/abs/2202.01456, 2022.

explain(index1=None, index2=None, cmap='jet', showalldata=False, showallgroups=False, showsplist=False, max_colwidth=None, replace_name=None, plot=False, figsize=(10, 7), figstyle='default', savefig=False, bcolor='#f5f9f9', obj_color='k', width=1.5, obj_msize=160, sp1_color='lime', sp2_color='cyan', sp_fcolor='tomato', sp_marker='+', sp_size=72, sp_mcolor='k', sp_alpha=0.05, sp_pad=0.5, sp_fontsize=10, sp_bbox=None, sp_cmarker='+', sp_csize=110, sp_ccolor='crimson', sp_clinewidths=2.7, dp_fcolor='white', dp_alpha=0.5, dp_pad=2, dp_fontsize=10, dp_bbox=None, show_all_grp_circle=False, show_connected_grp_circle=False, show_obj_grp_circle=True, color='red', connect_color='green', alpha=0.3, cline_width=2, add_arrow=True, arrow_linestyle='--', arrow_fc='darkslategrey', arrow_ec='k', arrow_linewidth=1, arrow_shrinkA=2, arrow_shrinkB=2, directed_arrow=0, axis='off', include_dist=False, show_connected_label=True, figname=None, fmt='pdf')[source]

self.explain(object/index) # prints an explanation of why a point object is in its cluster (or is an outlier)

self.explain(object1/index1, object2/index2) # prints an explanation of why object1 and object2 are either in the same or in distinct clusters

Here we unify the terminology:

  • data points

  • groups (made up of data points, formed by aggregation)

  • clusters (made up of groups)

Parameters:
  • index1 (int or numpy.ndarray, optional) – Input object1 [with index ‘index1’] for explanation.

  • index2 (int or numpy.ndarray, optional) – Input object2 [with index ‘index2’] for explanation, and compare objects [with indices ‘index1’ and ‘index2’].

  • cmap (str, default='jet') – Colormap for the scatter plot.

  • showalldata (boolean, default=False) – Whether to show all data points in the global view when there are too many data points to plot.

  • showallgroups (boolean, default=False) – Whether or not to show the start points marker.

  • showsplist (boolean, default=False) – Whether to show the group center information, which includes the number of data points (NumPts), the corresponding clusters, and the associated coordinates. This only applies when both index1 and index2 are None.

  • max_colwidth (int, optional) – Max width to truncate each column in characters. By default, no limit.

  • replace_name (str or list, optional) –

    Replace the index with a name. For example, for indices 1 and 1300 we have

    classix.explain(1, 1300, plot=False, figstyle="seaborn") # or classix.explain(obj1, obj4)

    The data point 1 is in group 9 and the data point 1300 is in group 8, both of which were merged into cluster #0. The two groups are connected via groups 9 -> 2 -> 8.

    If we specify replace_name, then the output will be

    classix.explain(1, 1300, replace_name=["Peter Meyer", "Anna Fields"], figstyle="seaborn")

    The data point Peter Meyer is in group 9 and the data point Anna Fields is in group 8, both of which were merged into cluster #0. The two groups are connected via groups 9 -> 2 -> 8.

  • plot (boolean, default=False) – Whether to visualize the explanation.

  • figsize (tuple, default=(10, 7)) – The size of the explanation figure.

  • figstyle (str, default="default") – The style of the visualization. See https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html

  • savefig (boolean, default=False) – Whether to save the figure. The figure will be saved in the folder named “img”.

  • bcolor (str, default="#f5f9f9") – Color for figure background.

  • obj_color (str, default="k") – Color for the text of data of index1 and index2.

  • obj_msize (float, default=160) – Size of the markers for data of index1 and index2.

  • sp_fcolor (str, default='tomato') – The color marked for group centers text box.

  • sp_marker (str, default="+") – The marker for the start points.

  • sp_size (int, default=72) – The marker size for the start points.

  • sp_mcolor (str, default='k') – The color of the scatter markers for the start points.

  • sp_alpha (float, default=0.05) – The transparency of the text box for group centers.

  • sp_pad (float, default=0.5) – The size of text box for group centers.

  • sp_bbox (dict, optional) – Dict with properties for patches.FancyBboxPatch for group centers.

  • sp_fontsize (int, optional) – The fontsize for text marked for group centers.

  • sp_cmarker (str, default="+") – The marker for the connected group centers.

  • sp_csize (int, default=110) – The marker size for the connected group centers.

  • sp_ccolor (str, default="crimson") – The marker color for the connected group centers.

  • sp_clinewidths (float, default=2.7) – The marker line width for the connected group centers.

  • dp_fcolor (str, default='white') – The color marked for specified data objects text box.

  • dp_alpha (float, default=0.5) – The value setting for transparency of text box for specified data objects.

  • dp_pad (int, default=2) – The size of text box for specified data objects.

  • dp_fontsize (int, optional) – The fontsize for text marked for specified data objects.

  • dp_bbox (dict, optional) – Dict with properties for patches.FancyBboxPatch for specified data objects.

  • show_all_grp_circle (bool, default=False) – Whether to show the periphery of all groups within the objects’ clusters (only applies when the data dimension is at most 2).

  • show_connected_grp_circle (bool, default=False) – Whether to show the periphery of all connected groups within the objects’ clusters (only applies when the data dimension is at most 2).

  • show_obj_grp_circle (bool, default=True) – Whether to show the periphery of the objects’ groups (only applies when the data dimension is at most 2).

  • color (str, default='red') – Color for text of group centers labels in visualization.

  • alpha (float, default=0.3) – Transparency of data points. Scalar or None.

  • cline_width (float, default=2) – Set the patch linewidth of circle for group centers.

  • add_arrow (bool, default=True) – Whether to add arrows for connected paths.

  • arrow_linestyle (str, default='--') – Linestyle for arrow.

  • arrow_fc (str, default='darkslategrey') – Face color for arrow.

  • arrow_ec (str, default='k') – Edge color for arrow.

  • arrow_linewidth (float, default=1) – Set the linewidth of the arrow edges.

  • directed_arrow (int, default=0) – Whether the arrow edges are directed. Takes values in {-1, 0, 1}: 0 means undirected, and -1 gives the edge direction opposite to 1.

  • arrow_shrinkA (float, default=2) – Shrinking factor of the tail of the arrow.

  • arrow_shrinkB (float, default=2) – Shrinking factor of the head of the arrow.

  • axis (boolean or str, default='off') – Whether to add x and y axes to the plot.

  • include_dist (boolean, default=False) – Whether or not to include distance information to compute the shortest path between objects.

  • show_connected_label (boolean, default=True) – Whether or not to show the named labels of the connected data points, where the named labels are given by pandas dataframe index.

  • figname (str, optional) – Set the figure name for the image to be saved.

  • fmt (str, default='pdf') – The format of the image to be saved; the other choice is 'png'.

explain_viz(showalldata=False, alpha=0.5, cmap='Set3', figsize=(10, 7), showallgroups=False, figstyle='default', bcolor='white', width=0.5, sp_marker='+', sp_mcolor='k', savefig=False, fontsize=None, bbox=None, axis='off', fmt='pdf')[source]

Visualize the starting points and data points.

fit(data)[source]

Cluster the data and store the fitted model. The cluster labels are available afterwards as self.labels_.

Parameters:

data (numpy.ndarray) – The ndarray-like input of shape (n_samples,)

fit_transform(data)[source]

Cluster the data and return the associated cluster labels.

Parameters:

data (numpy.ndarray) – The ndarray-like input of shape (n_samples,)

Returns:

labels – Index of the cluster each sample belongs to.

Return type:

numpy.ndarray

form_starting_point_clusters_table(aggregate=False)[source]

Form the column details for the group center and cluster information.

getPath(index1, index2, include_dist=False)[source]

Get the indices of connected data points between index1 data and index2 data.

Parameters:
  • index1 (int) – Index for data point.

  • index2 (int) – Index for data point.

Returns:

connected_points – Indices of the connected data points.

Return type:

numpy.ndarray
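One way to picture such a connecting path is a breadth-first search over groups whose centers lie close enough to be linked. The sketch below is an assumption-laden illustration, not the library's implementation: `shortest_group_path` and its `threshold` argument are hypothetical, and it returns group indices rather than data point indices.

```python
import math
from collections import deque

def shortest_group_path(centers, start, goal, threshold):
    """Breadth-first search over groups whose centers lie within `threshold` of each other."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        g = queue.popleft()
        if g == goal:
            path = []
            while g is not None:  # walk the predecessor links back to the start
                path.append(g)
                g = previous[g]
            return path[::-1]
        for h in range(len(centers)):
            if h not in previous and math.dist(centers[g], centers[h]) <= threshold:
                previous[h] = g
                queue.append(h)
    return []  # the two groups are not connected

centers = [(0.0, 0.0), (0.6, 0.0), (1.2, 0.0), (5.0, 0.0)]
path = shortest_group_path(centers, start=0, goal=2, threshold=0.75)
no_path = shortest_group_path(centers, start=0, goal=3, threshold=0.75)
```

Here `path` passes through the intermediate group 1, while `no_path` is empty because group 3 is isolated.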

load_cluster_centers()[source]

Load cluster centers.

load_group_centers()[source]

Load group centers.

merging(data, agg_labels, splist, ind, sort_vals, radius=0.5, method='distance', minPts=1)[source]

Merge groups after aggregation.

Parameters:
  • data (numpy.ndarray) – The input that is array-like of shape (n_samples,).

  • agg_labels (list) – Groups labels of aggregation.

  • splist (numpy.ndarray) – List formed in the aggregation storing group centers.

  • ind (numpy.ndarray) – Sort values.

  • radius (float, default=0.5) – Tolerance to control the aggregation hence the whole clustering process. For aggregation, if the distance between a starting point and an object is less than or equal to the tolerance, the object will be allocated to the group which the starting point belongs to.

  • method (str, default='distance') – The method for merging groups; other options: 'density', 'mst-distance', and 'scc-distance'.

  • minPts (int, default=1) – The threshold, in the range [0, infinity), that determines the noise level. When set to 0, the algorithm does not check for noise.

Returns:

labels – The cluster labels of the data.

Return type:

numpy.ndarray

outlier_filter(min_samples=None, min_samples_rate=0.1)[source]

Filter outliers in terms of min_samples or min_samples_rate.

pprint_format(items, truncate=True)[source]

Format item value for clusters.

predict(data)[source]

Allocate the data to their nearest clusters.

Parameters:

data (numpy.ndarray) – The ndarray-like input of shape (n_samples,)

Returns:

labels – The predicted clustering labels.

Return type:

numpy.ndarray
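Nearest-center allocation can be sketched as follows, assuming the group centers and a group-to-cluster mapping are available as plain Python lists. The function below is a stand-alone illustration, not the CLASSIX method itself, and the `group_centers` and `group_to_cluster` arguments are hypothetical.

```python
import math

def predict(new_points, group_centers, group_to_cluster):
    """Label each new point with the cluster of its nearest group center."""
    labels = []
    for p in new_points:
        nearest = min(range(len(group_centers)), key=lambda g: math.dist(p, group_centers[g]))
        labels.append(group_to_cluster[nearest])
    return labels

group_centers = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
group_to_cluster = [0, 0, 1]  # groups 0 and 1 were merged into cluster 0
predicted = predict([(0.9, 0.2), (4.0, 4.5)], group_centers, group_to_cluster)
```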

preprocessing(data)[source]

Normalize the data by the fitted model.

timing()[source]

Print timing information for the five phases of CLASSIX clustering:
  1. t1_prepare: The initial data preparation, which mainly comprises data scaling and the computation of the first two principal axes.

  2. t2_aggregate: This phase aggregates all data points into groups determined by the radius parameter of CLASSIX.

  3. t3_merge: The computed groups will be merged into clusters when their group centers (starting points) are sufficiently close.

  4. t4_minPts: Clusters with fewer than minPts points will be dissolved into their groups, and each of the groups will then be reassigned to a large enough cluster.

  5. t5_finalize: Any cleanup activities.

visualize_linkage(scale=1.5, figsize=(10, 7), labelsize=24, markersize=320, plot_boundary=False, bound_color='red', path='.', fmt='pdf')[source]

Visualize the linkage in the distance clustering.

Parameters:
  • scale (float) – Used for distance-based merging: when the distance between the group centers of two distinct groups is smaller than scale*radius, the two groups merge.

  • labelsize (int) – The fontsize of ticks.

  • markersize (int) – The size of the markers for group centers.

  • plot_boundary (boolean) – If True, plot the boundary of the groups around the group centers.

  • bound_color (str) – The color for the boundary for groups with the specified radius.

  • path (str) – Relative file location for figure storage.

  • fmt (str, default='pdf') – The format of the image to be saved; the other choice is 'png'.

classix.calculate_cluster_centers(data, labels)[source]

Calculate the mean centers of clusters from given data.
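The computation amounts to averaging the points that share a label. A dependency-free sketch, assuming plain Python lists; `cluster_means` is a hypothetical stand-in, and the return format of the actual function may differ.

```python
def cluster_means(data, labels):
    """Return {label: mean point} for every label that occurs in `labels`."""
    sums, counts = {}, {}
    for point, label in zip(data, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for j, value in enumerate(point):
            acc[j] += value  # accumulate coordinate-wise sums per cluster
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

data = [(0.0, 0.0), (2.0, 4.0), (10.0, 10.0)]
labels = [0, 0, 1]
centers = cluster_means(data, labels)
```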

classix.preprocessing(data, base)[source]

Initial data preparation of CLASSIX.