Research
In my PhD so far, I have focused on unsupervised learning, starting with statistical models and extending to real-world data.
Unsupervised learning on statistical models (stochastic block model):
At the start of my PhD, I wanted to explore random structures(such as random graphs) and learning theory. At this point, I came across a stochastic block model (SBM), one of the most fundamental statistical models for graph clustering.
Overview:
The simple case can be defined as follows. A graph is built on n vertices, where the vertices have a "hidden" partition into two communities. Then, each pair of vertices belonging to the same community is connected by an edge with some probability p. Each pair of vertices from different communities is connected with probability q (with p>q; assume p=0.51 and q=0.49, for example). Then, given such a graph, the task is to recover these hidden communities with high probability. This is one of the most well-studied problems in clustering, with several important and beautiful results in the last 40 years (Read references in [1] for an in-depth review). However, we observed that some important problems are unresolved.Unbalanced SBM:
Vanilla algorithms: The "power" of power method:
At this point, we observed that the algorithms that the previously state-of-the-art algorithms for the aforementioned problems, as well as our algorithms, were somewhat complex. In contrast, practitioners often use simple algorithms (such as spectral clustering) to recover clusters on real-world graphs. Thus, it seemed that the algorithms were complicated in simplifying the proofs and not boosting the actual performance of the algorithm!Indeed, this phenomenon was observed by mathematicians such as Emmanuel Abbe and Van Vu when considering spectral algorithms. They conjectured that a simple SVD-based projection of the adjacency matrix should separate the communities.
Motivated by this, we showed that a simple power method can recover the communities and is logarithmically tight compared to best-known bounds [1]. Our algorithm is very simple. You first centralize the adjacency matrix of the graph and then take log(n)-th power of this matrix. We showed that in this powered matrix, rows belonging to vertices from the same community would have much less Euclidean distance than the inter-community rows. As a consequence,
i) We resolved the conjecture of Vu when the size of all communities is the same (balanced SBM) via a connection between SVD projection and the power of a matrix.
ii) Obtained the first parameter-free algorithm that overcomes the small cluster barrier. In comparison, previous works needed knowledge of the probability parameters p and q to recover large clusters in the presence of small clusters.
Thus, our analysis had new implications both in the balanced and unbalanced SBM. To prove the correctness of this simple algorithm, we devised certain random partition based ideas to analyze low-degree polynomials of random variables that we think may be of independent interest.
Modeling and analysis of real-world graphs:
Two of the primary reasons behind the popularity of SBM are its ability to capture certain real-world networks and its simplicity, which allows inference and understanding of various statistical and computational phenomena. However, the model can be too simple to capture more complex graphs that may arise in real-world scenarios. Motivated by this, we aimed to explore the structural properties of real-world graphs with underlying communities.
Overview:
In this direction, we focused on single-cell RNA seq data, a very influential data type in biology that has been crucial in identifying genes responsible for different types of cancer, among many other applications. Here, each data point corresponds to a cell, and a fundamental task is to partition the cells according to their underlying cell type, which is costly to obtain through biological experiments alone, necessitating the use of clustering algorithms. The standard pipeline isData(10,000+ features) ->PCA(50-100 dimensions)-> Embedding onto a graph->graph clustering.
The data is first passed through PCA to reduce dimensionality and noise (due to experimental error and biological variance). In this direction in [2] we captured the denoising ability of PCA via a novel metric called compression ratio. We designed an outlier detection algorithm that improves the separability of the underlying communities in the data. We proved the effectiveness of our algorithm in a novel random vector mixture model and verified it extensively on several real-world datasets.Multi-core-periphery with communities (MCPC)
In the aforementioned pipeline, once the data is embedded onto a graph (with a datapoint-vertex correspondence), one applies graph clustering algorithms to recover the underlying communities (cell-type). Here, it is important to note that the clustering algorithm's success depends on the correctness of our assumption about the graph's communities. One of the most popular assumptions is community structure where all vertices from a community have more intra-community edges than inter-community edges (For example, SBM graphs follow this assumption). However, algorithms based on this assumption often have subpar performance on real-world datasets.We applied our algorithms to a large set of real-world single-cell datasets. We observed that the top-ranked vertices contain sufficient vertices from all the underlying (ground truth) communities. Yet, they are better separable by popular graph clustering algorithms than the whole graph.
Currently, we are working on further improving the algorithms and better understanding the presence of MCPC structures in real-world graphs.
Publications
Chandra Sekhar Mukherjee and Jiapeng Zhang, Balanced Ranking with Relative Centrality: A multi-core periphery perspective. ICLR 2025
Chandra Sekhar Mukherjee, Nikhil Deorkar, and Jiapeng Zhang, Capturing the Denoising Effect of PCA via Compression Ratio. NeurIPS 2024.
Chandra Sekhar Mukherjee and Jiapeng Zhang, Detecting Hidden Communities by Power Iterations with Connections to Vanilla Spectral Algorithms. SODA 2024.
Chandra Sekhar Mukherjee, Pan Peng, and Jiapeng Zhang, Recovering Unbalanced Communities in the Stochastic Block Model With Application to Clustering with a Faulty Oracle. Neurips 2023.
Some pre-PhD works
Previously, I worked under the guidance of Prof. Subhamoy Maitra during my time at ISI Kolkata, working on quantum computation and information.
Subhamoy Maitra, Chandra Sekhar Mukherjee, Pantelimon Stanica, and Deng Tang, On Boolean Functions with Low Polynomial Degree and Higher Order Sensitivity, 2021.
Suman Dutta, Subhamoy Maitra, and Chandra Sekhar Mukherjee, Following Forrelation – Quantum Algorithms in Exploring Boolean Functions’ Spectra, Advances in Mathematics of Communications. 2021.
Ajeet Kumar, Subhamoy Maitra, and Chandra Sekhar Mukherjee, On approximate real mutually unbiased bases in square dimension, Cryptography and Communications, 2021.
Chandra Sekhar Mukherjee, Subhamoy Maitra, Vineet Gaurav, and Dibyendu Roy, Preparing Dicke States on a Quantum Computer, IEEE Transactions on Quantum Engineering. 2020.
Monographs
- Chandra Sekhar Mukherjee, Dibyendu Roy, and Subhamoy Maitra, Design and Cryptanalysis of ZUC–A Stream Cipher in Mobile Telephony, SpringerBriefs on Cyber Security Systems and Networks. 2021.