Explainable Clustering

CLASSIX can explain its clustering results, either as a global overview of all groups and clusters or for individual data points selected by index.

To accompany the explanation with a plot, set plot to True.

Global insight

from sklearn import datasets
import numpy as np
from classix import CLASSIX

X, y = datasets.make_blobs(n_samples=5000, centers=2, n_features=2, cluster_std=1, random_state=1)

clx = CLASSIX(sorting='pca', group_merging='density', radius=0.5, verbose=1, minPts=4)
clx.fit(X)

clx.explain(plot=True, savefig=True, figsize=(10,10))

The output is:

A clustering of 5000 data points with 2 features has been performed.
The radius parameter was set to 0.50 and MinPts was set to 4.
As the provided data has been scaled by a factor of 1/6.01,
data points within a radius of R=0.50*6.01=3.01 were aggregated into groups.
In total 7903 comparisons were required (1.58 comparisons per data point).
This resulted in 14 groups, each uniquely associated with a starting point.
These 14 groups were subsequently merged into 2 clusters.
A list of all starting points is shown below.
----------------------------------------
Group  NrPts  Cluster  Coordinates
0     398      0     -1.19 -1.09
1    1073      0     -0.65 -1.15
2     553      0     -1.17 -0.56
3     466      0     -0.67 -0.65
4       6      0     -0.19 -0.88
5       3      0     -0.72 -0.03
6       1      0     -0.22 -0.28
7     470      1       0.31 0.21
8     675      1       0.18 0.71
9     579      1       0.86 0.19
10     763      1       0.69 0.67
11       6      1       0.42 1.35
12       5      1       1.24 0.59
13       2      1        1.0 1.08
----------------------------------------
In order to explain the clustering of individual data points,
use .explain(ind1) or .explain(ind1, ind2) with indices of the data points.
_images/explain_viz.png
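The figures in this report can be cross-checked by hand. The sketch below is not part of CLASSIX; it copies the Group, NrPts and Cluster columns from the table above into a plain list (the variable names are chosen here for illustration) and recomputes the cluster sizes and the comparisons-per-point ratio.

```python
from collections import defaultdict

# (group, NrPts, cluster) copied from the starting-point table above
groups = [
    (0, 398, 0), (1, 1073, 0), (2, 553, 0), (3, 466, 0),
    (4, 6, 0), (5, 3, 0), (6, 1, 0),
    (7, 470, 1), (8, 675, 1), (9, 579, 1), (10, 763, 1),
    (11, 6, 1), (12, 5, 1), (13, 2, 1),
]

# Sum the group sizes per cluster
cluster_sizes = defaultdict(int)
for _, nr_pts, cluster in groups:
    cluster_sizes[cluster] += nr_pts

n_points = sum(cluster_sizes.values())
# 7903 is the number of comparisons reported in the output above
comparisons_per_point = 7903 / n_points

print(dict(cluster_sizes))
print(n_points, round(comparisons_per_point, 2))
```

The group sizes add up to the 5000 input points, split evenly between the two clusters, which matches the two equally sized blobs generated by make_blobs.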

Track single data

Continuing from the previous steps, we can examine a specific data point by referring to its index. For example, to track the data point with index 0:

clx.explain(0,  plot=True, savefig=True, fmt='PNG')

Output:

The data point is in group 2, which has been merged into cluster #0.
_images/None0.png

Comparison insight

The following two examples compare the cluster assignments of a pair of data points.

clx.explain(0, 2000, plot=True, savefig=True, fmt='png')

The data point 0 is in group 2, which has been merged into cluster 0.
The data point 2000 is in group 10, which has been merged into cluster 1.
There is no path of overlapping groups between these clusters.
_images/None0_2000.png

clx.explain(0, 2008, plot=True, savefig=True, fmt='png')

The data point 0 is in group 2 and the data point 2008 is in group 4,
both of which were merged into cluster #0.
These two groups are connected via groups 2 <-> 1 <-> 4.
_images/None0_2008.png
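The phrase "path of overlapping groups" can be read as a path in a graph whose nodes are groups and whose edges join overlapping groups. The following is a minimal sketch of that idea using breadth-first search. It is not CLASSIX's internal code: the function group_path and the adjacency dictionary are hypothetical, and only the links 2-1 and 1-4 are taken from the output above (the link 8-10 is an assumption added so that cluster 1 has an edge of its own).

```python
from collections import deque

def group_path(adjacency, start, goal):
    """Return a shortest chain of overlapping groups from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        group = queue.popleft()
        if group == goal:
            # Walk back through the predecessors to rebuild the chain
            path = []
            while group is not None:
                path.append(group)
                group = previous[group]
            return path[::-1]
        for neighbour in adjacency.get(group, ()):
            if neighbour not in previous:
                previous[neighbour] = group
                queue.append(neighbour)
    return None

# Hypothetical overlap graph: 2-1 and 1-4 come from the output above,
# 8-10 is an assumed link inside cluster 1.
overlaps = {2: [1], 1: [2, 4], 4: [1], 8: [10], 10: [8]}

same_cluster = group_path(overlaps, 2, 4)
different_clusters = group_path(overlaps, 2, 10)
print(same_cluster, different_clusters)
```

With this graph, groups 2 and 4 are linked through group 1, while no chain exists from group 2 to group 10, mirroring the two explanations above.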

Case study of industry data

Here, we turn our attention to practical data. As above, we start by importing the required packages.

import time
import numpy as np
import classix

To load the industry data provided by Kamil, we call loadData with the argument 'vdu_signals'. For the clustering itself, we keep the default parameters except for radius, which we set to 1 (group_merging is passed explicitly as 'distance').

data = classix.loadData('vdu_signals')
clx = classix.CLASSIX(radius=1, group_merging='distance')

Note

Besides 'vdu_signals', the function loadData also supports several typical UCI datasets for clustering: 'Iris', 'Dermatology', 'Ecoli', 'Glass', 'Banknote', 'Seeds', 'Phoneme', and 'Wine'.

Then, we fit the CLASSIX model to the data and record the run time:

st = time.time()
clx.fit_transform(data)
et = time.time()
print("consume time:", et - st)

Output:

CLASSIX(sorting='pca', radius=1, minPts=0, group_merging='distance')
The 2028780 data points were aggregated into 36 groups.
In total 3920623 comparisons were required (1.93 comparisons per data point).
The 36 groups were merged into 4 clusters with the following sizes:
    * cluster 0 : 2008943
    * cluster 1 : 16920
    * cluster 2 : 1800
    * cluster 3 : 1117
Try the .explain() method to explain the clustering.
consume time: 1.1904590129852295
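For repeated timings, the start/stop pattern above can be wrapped in a small helper. The sketch below is an illustration only: the helper name timed is hypothetical, it uses time.perf_counter (a monotonic clock intended for measuring intervals) rather than time.time, and it times a stand-in workload instead of clx.fit_transform so that it runs without the dataset.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; with CLASSIX one would pass clx.fit_transform and the data instead
result, elapsed = timed(sorted, [3, 1, 2])
print("consume time:", elapsed)
```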

If you set radius to 0.5, you get the following output:

CLASSIX(sorting='pca', radius=0.5, minPts=0, group_merging='distance')
The 2028780 data points were aggregated into 93 groups.
In total 6252385 comparisons were required (3.08 comparisons per data point).
The 93 groups were merged into 7 clusters with the following sizes:
    * cluster 0 : 2008943
    * cluster 1 : 16909
    * cluster 2 : 1800
    * cluster 3 : 900
    * cluster 4 : 180
    * cluster 5 : 37
    * cluster 6 : 11
Try the .explain() method to explain the clustering.
consume time: 1.3505780696868896

From this, we can see a big gap between the sizes of cluster 4 (180 points) and cluster 5 (37 points), so we can assume that clusters with fewer than 38 points consist of outliers. We therefore set minPts to 38, after which we obtain the same result as with a radius of 1. You can also set the parameter post_alloc to False; all outliers are then labeled -1 instead of being assigned by the allocation strategy. Although outliers are hard to define and capture in most cases, this example shows how to choose an appropriate value of minPts to separate outliers or to handle them based on distance.

As above, we can view the whole picture of the data simply with
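The effect of the threshold can be checked directly on the cluster sizes reported for radius 0.5. This is a plain-Python sketch and does not call CLASSIX; the dictionary is copied from the output above and min_pts is a local variable standing in for the minPts parameter.

```python
# Cluster sizes reported for radius=0.5 in the output above
sizes = {0: 2008943, 1: 16909, 2: 1800, 3: 900, 4: 180, 5: 37, 6: 11}
min_pts = 38  # candidate value for minPts

# Clusters with fewer than min_pts points would be treated as outliers
small = {label: n for label, n in sizes.items() if n < min_pts}
n_outliers = sum(small.values())

print(small, n_outliers)
print(sum(sizes.values()))  # total number of data points
```

Only clusters 5 and 6 fall below the threshold, so 48 of the 2028780 points would be affected.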

clx.explain(plot=True)

You can also specify other parameters to customize the visualization and make it easier to analyze. For example, you can enlarge the font size of the starting point labels by increasing sp_fontsize, or change the shape of the figure by tuning figsize. For more details about parameter settings, see the API Reference. So, we try:

clx.explain(plot=True, figsize=(24,10), sp_fontsize=12)
_images/kamil_explain_viz.png
A clustering of 2028780 data points with 2 features has been performed.
The radius parameter was set to 1.00 and MinPts was set to 0.
As the provided data has been scaled by a factor of 1/2.46,
data points within a radius of R=1.00*2.46=2.46 were aggregated into groups.
In total 3920623 comparisons were required (1.93 comparisons per data point).
This resulted in 36 groups, each uniquely associated with a starting point.
These 36 groups were subsequently merged into 4 clusters.
A list of all starting points is shown below.
----------------------------------------
Group   NrPts  Cluster  Coordinates
0     10560     1      16.35 3.26
1      1800     2      15.81 1.85
2      2580     1      15.38 3.47
3       656     1      14.83 4.33
4       177     1      13.87 4.59
5      1058     1       12.9 4.23
6       392     1        12.0 4.8
7       664     1      11.98 2.94
8       806     1       11.6 3.88
9        18     1      10.89 3.15
10         9     1      10.66 2.05
11       128     3        9.0 1.93
12        45     3       8.04 1.51
13        23     3       7.82 2.55
14       183     3       6.97 0.56
15       146     3       6.93 2.06
16       138     3       6.23 1.33
17        47     3       6.16 2.79
18        40     3      5.81 -0.33
19       317     3        5.4 0.69
20        50     3       5.31 2.03
21       576     0      3.06 -0.02
22     12001     0      2.25 -0.61
23         2     0        2.0 0.94
24     76469     0      1.87 -1.56
25     47743     0      1.38 -0.07
26    500225     0      1.04 -1.01
27    145955     0        0.7 0.69
28     16456     0       0.6 -1.91
29    506281     0      0.38 -0.25
30    455788     0      -0.04 1.37
31     13196     0     -0.05 -1.16
32    110364     0      -0.36 0.42
33    123548     0      -0.89 1.92
34       274     0       -1.2 0.96
35        65     0       -1.87 1.7
----------------------------------------
In order to explain the clustering of individual data points,
use .explain(ind1) or .explain(ind1, ind2) with indices of the data points.

We can see that most of the data points are allocated to groups 26 to 33, all of which belong to cluster 0.

To track or compare individual data points by index, you can call, for example,
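To put a number on this observation, the sketch below sums the NrPts column for the cluster 0 groups in the table above. It is an illustrative calculation on values copied from the table, not a CLASSIX call.

```python
# NrPts per group for the groups assigned to cluster 0, copied from the table above
cluster0_groups = {
    21: 576, 22: 12001, 23: 2, 24: 76469, 25: 47743,
    26: 500225, 27: 145955, 28: 16456, 29: 506281, 30: 455788,
    31: 13196, 32: 110364, 33: 123548, 34: 274, 35: 65,
}
n_points = 2028780  # total number of data points reported above

# Points that fall into groups 26 to 33
dense = sum(n for group, n in cluster0_groups.items() if 26 <= group <= 33)
share = dense / n_points

print(dense, round(share, 3))
print(sum(cluster0_groups.values()))  # size of cluster 0
```

Groups 26 to 33 hold roughly 92% of all data points, and the cluster 0 groups together add up to the 2008943 points reported for cluster 0 in the fit output.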

clx.explain(14940, 16943,  plot=True, savefig=True, sp_fontsize=10)
_images/kamil_14940_16943.png
The data point 14940 is in group 7, which has been merged into cluster 1.
The data point 16943 is in group 11, which has been merged into cluster 3.
There is no path of overlapping groups between these clusters.

The output explains why the two data points end up in different clusters and indicates how far apart or close together they are.
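One rough way to quantify "how far" is the Euclidean distance between the starting points of the two groups involved. The sketch below takes the coordinates of groups 7 and 11 from the table above. This is an assumption about what is useful to look at, not a value reported by explain, and it measures the starting points rather than the data points 14940 and 16943 themselves.

```python
import math

# Starting point coordinates copied from the table above
start_group_7 = (11.98, 2.94)   # group of data point 14940
start_group_11 = (9.0, 1.93)    # group of data point 16943

# Euclidean distance between the two starting points (math.dist needs Python 3.8+)
distance = math.dist(start_group_7, start_group_11)
print(round(distance, 2))
```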