Chapter 5 Clustering

5.1 Hierarchical Spectral Clustering

To identify cellular subpopulations, CellTrails performs hierarchical clustering in the lower-dimensional space by minimizing a square error criterion (Ward 1963). To determine the number of clusters, CellTrails conducts an unsupervised post-hoc analysis, which assumes that differential expression of assayed features distinguishes cellular stages.

Hierarchical clustering in the latent space generates a cluster dendrogram. CellTrails uses this dendrogram to identify the maximal fragmentation of the data space, i.e. the lowest cutting height at which every resulting cluster still contains at least a given fraction of samples. Then, proceeding from this height towards the root, CellTrails iteratively joins sibling clusters that do not differ in at least a given number of differentially expressed features. Statistical significance is tested with a two-sample non-parametric linear rank test that accounts for censored values (Peto and Peto 1972). The null hypothesis is rejected using the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995) at a given significance level.

The number of clusters can affect the outcome of the trajectory reconstruction, so this step might require some parameter tuning depending on the input data (for more information on the parameters, call ?findStates).
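The idea of maximal fragmentation can be illustrated with base R alone. The following is a minimal sketch on simulated data, not the CellTrails implementation: the helper max_fragmentation and the toy matrix latent are hypothetical names introduced here, and Ward clustering is done with hclust(method = "ward.D2").

```r
# Toy latent space: 200 simulated samples in 3 dimensions (not CellTrails data)
set.seed(1)
latent <- rbind(matrix(rnorm(300, mean = 0), ncol = 3),
                matrix(rnorm(300, mean = 4), ncol = 3))

# Ward's minimum-variance clustering on Euclidean distances
tree <- hclust(dist(latent), method = "ward.D2")

# Hypothetical helper: largest number of clusters k such that every
# cluster still holds at least `min_frac` of all samples
max_fragmentation <- function(tree, min_frac = 0.05, k_max = 50) {
  n <- length(tree$order)
  min_n <- ceiling(min_frac * n)
  best <- 1L
  for (k in seq_len(min(k_max, n))) {
    sizes <- table(cutree(tree, k = k))
    # Splitting further can only shrink the smallest cluster, so stop here
    if (min(sizes) < min_n) break
    best <- k
  }
  best
}

k0 <- max_fragmentation(tree, min_frac = 0.05)
init <- cutree(tree, k = k0)   # initial cluster label per sample
table(init)                    # cluster sizes at maximal fragmentation
```

The resulting labels correspond to the initial clusters from which sibling merging would start; the merging itself depends on the differential expression test described above.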

cl <- findStates(exBundle, min_size=0.01, min_feat=5, max_pval=1e-4, min_fc=2)
## Initialized 25 clusters with a minimum size of 10 samples each.
## Performing post-hoc test ...
## Found 11 states.
head(cl)
## [1] S7  S1  S4  S11 S9  S8 
## Levels: S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11
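
The post-hoc test reported in the output above decides, for each pair of sibling clusters, whether enough features differ. A rough sketch of that decision on simulated data follows. It rests on several assumptions: wilcox.test stands in for the censoring-aware Peto-Peto test, min_fc is treated as a linear fold change applied to log2-scale values, and all object names are hypothetical.

```r
set.seed(2)
n_feat <- 20

# Simulated log2 expression of two sibling clusters (features x samples)
a <- matrix(rnorm(n_feat * 30), nrow = n_feat)
b <- matrix(rnorm(n_feat * 30), nrow = n_feat)
b[1:6, ] <- b[1:6, ] + 3   # six features shifted in cluster b

# Per-feature p-values; wilcox.test is only a stand-in for the
# Peto-Peto test used by CellTrails
pval <- vapply(seq_len(n_feat),
               function(i) wilcox.test(a[i, ], b[i, ])$p.value,
               numeric(1))

# Benjamini-Hochberg adjustment for multiple testing
padj <- p.adjust(pval, method = "BH")

# Absolute log2 fold change between cluster means
lfc <- abs(rowMeans(b) - rowMeans(a))

# Thresholds mirroring the findStates call above
max_pval <- 1e-4
min_fc <- 2
min_feat <- 5

# Siblings are joined if too few features pass both thresholds
n_de <- sum(padj < max_pval & lfc > log2(min_fc))
merge_siblings <- n_de < min_feat
```

Raising min_feat or lowering max_pval in this sketch makes merging more likely, which is the same direction in which the corresponding findStates parameters reduce the number of states.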

The clusters identified by CellTrails are referred to as states along the trajectory. The replacement function states<- stores the clusters in the SingleCellExperiment object.

# Set clusters
states(exBundle) <- cl

State assignments are stored as sample metainformation and can be retrieved via either colData or states. Since CellTrails operates on a SingleCellExperiment object, its results can easily be used by other packages. For example, a principal component analysis can be visualized with scater (McCarthy et al. 2017):

# Not run:
# library(scater)

# Plot scater PCA with CellTrails cluster information
scater::plotPCA(exBundle, colour_by="CellTrails.state")

Please note that the Bioconductor package scater is not part of CellTrails and may need to be installed first.

5.2 Using Alternative Methods

Technically, the function states<- allows any clustering result to be assigned to a SingleCellExperiment object. Any numeric, character or factor vector containing the cluster assignment of each sample is accepted.
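
As an illustration, the sketch below derives an alternative clustering with k-means from base R and converts it into a factor of the kind described above. The toy matrix latent and the vector alt are hypothetical; with real data the vector presumably needs one entry per sample in the object, in sample order. The final assignment is left commented out because it requires CellTrails and the exBundle object.

```r
set.seed(3)

# Toy latent space standing in for the real embedding (100 samples)
latent <- rbind(matrix(rnorm(150, mean = 0), ncol = 3),
                matrix(rnorm(150, mean = 5), ncol = 3))

# Alternative clustering: k-means with two centers
km <- kmeans(latent, centers = 2, nstart = 10)

# One label per sample, as a factor
alt <- factor(paste0("S", km$cluster))
table(alt)

# With a CellTrails object at hand, the assignment would then be:
# states(exBundle) <- alt
```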

5.3 Visualization

As before, we can visualize the approximated lower-dimensional manifold and colorize each sample by its assigned state.

# States are now listed as phenotype
phenoNames(exBundle)
## [1] "fm143"  "origin" "state"
# Show manifold
plotManifold(exBundle, color_by="phenoName", name="state")

The function plotStateSize generates a barplot showing the absolute sizes of each state.

plotStateSize(exBundle)

Further, violin plots can be produced to show the expression distribution of a feature per state. Each point displays the feature’s expression value in a single sample. Each violin is a density plot mirrored about its vertical axis.

plotStateExpression(exBundle, feature_name="CALB2")

References

Ward, JH. 1963. “Hierarchical Grouping to Optimize an Objective Function.” Journal of the American Statistical Association 58: 236–44.

Peto, R, and J Peto. 1972. “Asymptotically Efficient Rank Invariant Test Procedures (with Discussion).” Journal of the Royal Statistical Society Series A 135: 185–206.

Benjamini, Y, and Y Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society Series B 57: 289–300.

McCarthy, DJ, KR Campbell, ATL Lun, and QF Wills. 2017. “Scater: Pre-Processing, Quality Control, Normalisation and Visualisation of Single-Cell RNA-Seq Data in R.” Bioinformatics 33 (8): 1179–86.