cplearn: A novel inference tool for high-dimensional complex data via density-geometry correlations

Authors

Chandra Sekhar Mukherjee, Joonyoung Bae, and Jiapeng Zhang

Example run-through with CIFAR10-noisy-clip-embedding.

The UMAP of the embeddings looks as follows (note the overlaps):

#Load CoreSPECT and configuration module
from cplearn.corespect import CorespectModel
from cplearn.corespect.config import CoreSpectConfig

#Initial parameters.
cfg = CoreSpectConfig(
    q=40,               #Determines neighborhood size for the underlying q-NN graph
    r=40,               #Neighborhood radius parameter for ascending random walk with FlowRank
    core_frac=0.3,      #Fraction of points in the top-layer
    densify=False,      #Densifying different parts of the data to reduce fragmentation
    granularity=0.5,    #Higher granularity finds more local cores but can lead to missing out on weaker clusters.
    resolution=0.5      #Resolution for clustering with Leiden (more clustering methods will be added later)
).configure()


# Run CoreSPECT. X is the data array with n rows (here, the CIFAR10-noisy-clip-embedding data), loaded beforehand.
model = CorespectModel(X, **cfg.unpack()).run(fine_grained=True, propagate=True)

'''
Main components of model:
model.layers_: Contains a list of lists. Each list consists of a subset of indices (between 0 and n-1, where n := X.shape[0]).
The first list corresponds to the indices that form the cores; the subsequent lists contain the outer layers.

model.labels_: n-sized integer array.
    If propagate==False: Contains clustering labels for the core (model.layers_[0]) indices, and -1 everywhere else.
    If propagate==True:  Contains clustering labels for all the points.

'''
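The layer and label structure described above can be inspected with a few lines of plain Python. The following is a minimal sketch, not part of cplearn: `summarize_layers` is a hypothetical helper, the toy `layers` and `labels` merely stand in for `model.layers_` and `model.labels_` (with propagate=False), and it assumes the layers are disjoint and together cover all n indices.

```python
def summarize_layers(layers, labels):
    """Per layer: (index, size, cumulative fraction of all points, number of labelled points)."""
    n = len(labels)
    rows, seen = [], 0
    for k, layer in enumerate(layers):
        seen += len(layer)
        # A label of -1 marks a point that has not been assigned to a cluster.
        labelled = sum(1 for i in layer if labels[i] != -1)
        rows.append((k, len(layer), seen / n, labelled))
    return rows


# Toy stand-ins (assumption): 8 points, core layer first, only core points labelled.
layers = [[0, 3, 4], [1, 5], [2, 6, 7]]
labels = [0, -1, -1, 0, 1, -1, -1, -1]

rows = summarize_layers(layers, labels)
for k, size, frac, labelled in rows:
    print(f"layer {k}: {size} points, {frac:.1%} covered so far, {labelled} labelled")
```

With real output, passing `model.layers_` and `model.labels_` in place of the toy lists should give the same kind of summary.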
#Visualizing the outcomes:

#Step 1: Generate UMAP skeleton.
import umap
reducer = umap.UMAP()
X_umap = reducer.fit_transform(X)

Layer-wise visualization with corespect labels (move the slider to see the transition):

Next, we present the layer extraction with a new visualization tool (we use the labels from model.labels_)

#Step 2: Initialize the coremap module.
from cplearn.coremap import Coremap
cmap = Coremap(model, global_umap=X_umap, fast_view=True)

'''
If fast_view==True, we reuse the UMAP skeleton as-is and only display it layer by layer.
If fast_view==False, we generate our own layer-wise visualization with the coremap algorithm.
'''


#Step 3: Layer-wise visualization (you can use your own labels instead of model.labels_)
from cplearn.coremap.vizualizer import visualize_coremap
fig = visualize_coremap(cmap, model.labels_, use_webgl=True)
fig.show()

Layer-wise visualization with ground-truth labels:

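from cplearn.coremap.vizualizer import visualize_coremap
# label: n-sized array of ground-truth class labels (not defined in this snippet; load it with the dataset)
fig = visualize_coremap(cmap, label, use_webgl=True)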
fig.show()

Additional note:

As we can see, the points in the core layer are very well-separated w.r.t. their ground-truth labels as well! As more points are added, overlaps increase.

We note that our result has more clusters (14) than the true number of ground-truth clusters (10). Of the extra clusters, 3 are small sub-parts, and only cluster 2 is broken into two roughly equal halves. This can be improved by changing the parameters, for example by increasing q and r, decreasing granularity and resolution, and setting densify to 'rw'.
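One way to check which ground-truth classes end up split across several predicted clusters is to cross-tabulate the two labelings. The sketch below is illustrative only: `split_report` and its `min_share` threshold are hypothetical names, and the toy `pred` and `truth` lists stand in for `model.labels_` and the ground-truth labels.

```python
from collections import Counter, defaultdict


def split_report(pred, truth, min_share=0.05):
    """For each ground-truth class, map predicted cluster -> share of that class,
    keeping only clusters that hold at least `min_share` of the class."""
    per_class = defaultdict(Counter)
    for p, t in zip(pred, truth):
        per_class[t][p] += 1
    report = {}
    for t, counts in per_class.items():
        total = sum(counts.values())
        report[t] = {p: c / total for p, c in counts.most_common()
                     if c / total >= min_share}
    return report


# Toy stand-ins (assumption): class 0 stays intact, class 1 is split into two halves.
truth = [0] * 10 + [1] * 10
pred = [0] * 10 + [1] * 5 + [2] * 5

report = split_report(pred, truth)
for cls, shares in sorted(report.items()):
    print(f"class {cls}: {shares}")
```

A class whose entry lists two clusters with comparable shares is a candidate for the kind of split described above; small shares correspond to small sub-parts.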

NMI of clustering vs. layers:

Finally, we also plot the NMI values as points are added from the core layer towards the outer layers.

Initially, the NMI value is very high and remains around 0.9 until 75% of the points are considered. It then dips, which possibly indicates very-hard-to-separate regions with complex geometry in the outer layers.
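A curve of this kind can be computed by evaluating NMI on the cumulative union of layers, from the core outwards. The following is a minimal sketch under stated assumptions, not the authors' plotting code: `nmi` and `nmi_by_layer` are hypothetical helpers, NMI is normalized by the arithmetic mean of the two entropies (the default convention of `sklearn.metrics.normalized_mutual_info_score`, which could be used instead), and the toy lists stand in for `model.layers_`, `model.labels_` with propagate=True, and the ground-truth labels.

```python
import math
from collections import Counter


def nmi(a, b):
    """Normalized mutual information, normalized by the mean of the two entropies."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    h_a = -sum(c / n * math.log(c / n) for c in ca.values())
    h_b = -sum(c / n * math.log(c / n) for c in cb.values())
    mi = sum(c / n * math.log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items())
    denom = 0.5 * (h_a + h_b)
    # Both labelings trivial (a single cluster each): treat as perfect agreement.
    return 1.0 if denom == 0 else mi / denom


def nmi_by_layer(layers, pred, truth):
    """Return (fraction of points included, NMI) after adding each layer in order."""
    seen, curve = [], []
    for layer in layers:
        seen.extend(layer)
        score = nmi([pred[i] for i in seen], [truth[i] for i in seen])
        curve.append((len(seen) / len(pred), score))
    return curve


# Toy stand-ins (assumption): the last layer contains the mislabelled points.
layers = [[0, 1, 2, 3], [4, 5], [6, 7]]
truth = [0, 0, 1, 1, 0, 1, 0, 1]
pred = [0, 0, 1, 1, 0, 1, 1, 0]

curve = nmi_by_layer(layers, pred, truth)
for frac, score in curve:
    print(f"{frac:.0%} of points: NMI = {score:.3f}")
```

In the toy data the score stays at its maximum for the first two layers and drops once the outermost layer is included, which mirrors the qualitative behaviour described above.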