Dimensionality reduction with applications to single cell RNA sequencing

Per Unneberg

02 October, 2017

Outline

Statistics in the big data era

Dimensionality reduction

Applications to single-cell RNA sequencing

Statistics in the big data era

Background on classical statistics (1930’s)

Tables are long and lean

Data today

Tables are wide - e.g. scRNAseq with few observations (cells) and many variables (genes)

Problems:

  • many variables
  • few observations
  • noisy data
  • missing data
  • multiple responses

Implications:

  • high degree of correlation
  • difficult to analyse with conventional (classical) methods

Univariate analyses of multivariate data

Prostate cancer data set

Data from (Hastie and Tibshirani 2009)

Terminology

Data

  • input (independent variables): e.g. height and weight of a person, denoted by \(\mathbf{x}\); also called features, covariates, or attributes
  • output (response variable): typically represented by \(y\)

Data types

Output \(y\) is typically one of

  • categorical/nominal: variable takes values in a finite set \(\{1, \dots, C\}\) of \(C\) classes
  • real-valued/continuous: continuous variables
  • ordinal: categorical variable that is ordered, e.g. A-F grades

Input \(x\) can be any of the above, or more complex data structures, such as images

(Murphy 2012)

Data analysis objectives

  • exploratory data analysis: visualize data for quality control, discovering trends, and identifying outliers, among other things
  • classification: given an input \(\mathbf{x}\), assign it to one of \(C\) classes in the categorical output \(y\)
  • regression: fit input \(\mathbf{x}\) to continuous response \(y\); model building and prediction
  • clustering: group input \(\mathbf{x}\) according to “similarity”
  • discover latent factors: dimensionality reduction of \(\mathbf{x}\) to discover (unobserved) latent factors that describe most of the variability in the data; typically used when the number of features \(p\) greatly exceeds the number of observations \(n\) (\(p \gg n\))

Classification and regression are examples of supervised learning, where the data consist of input-output pairs; clustering and dimensionality reduction are examples of unsupervised learning, where the data consist of inputs only.

Supervised learning

Example - classification of flowers

Iris data set

Supervised learning

Example - linear regression (data from (Murphy 2012))
lm(y ~ x) 
Linear regression

Linear regression minimizes the sum of squared vertical distances (residuals) from the points to the regression line.
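
To make the call above concrete, here is a minimal sketch on simulated data (the simulated x and y are illustrative stand-ins, not the data set shown in the figure):

# simulate data with a known linear relationship plus noise
set.seed(42)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

# fit the least-squares line and inspect the estimated intercept and slope
fit <- lm(y ~ x)
coef(fit)

# overlay the fitted regression line on the scatter plot
plot(x, y)
abline(fit, col = "red")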

Unsupervised learning

Example - clustering (data from (Murphy 2012))
Height and weight

Goals

  1. estimate distribution over number of clusters \(K\)
  2. determine which cluster a point belongs to

A harder problem: we do not know the answer beforehand. How do we choose \(K\) wisely? This is a question of model selection.
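
As an illustration of the model-selection problem, here is a minimal sketch with k-means on simulated two-dimensional data, using the total within-cluster sum of squares over a range of \(K\) as a crude ("elbow") criterion; the simulated data and the choice of k-means are assumptions for illustration only:

set.seed(1)
# two synthetic groups in two dimensions (think height and weight)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# total within-cluster sum of squares for K = 1, ..., 6
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "K", ylab = "total within-cluster SS")

# cluster assignment for the chosen K
cl <- kmeans(X, centers = 2, nstart = 10)$cluster
table(cl)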

Unsupervised learning

Example - dimensionality reduction with PCA (data from (Murphy 2012))

Here a 3-dimensional point cloud is projected onto a 2-dimensional subspace, which approximates the data reasonably well. The motivation behind this technique is that although the data may appear high dimensional, there may only be a small number of degrees of variability, corresponding to latent factors.

Question: how do we interpret the so-called principal components?

Dimensionality reduction

PCA

Developed by Pearson (1901) and later formalized by Hotelling (1933)

What is PCA?

The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set

(Jolliffe 2002)

In other words, if we have a data table with \(n\) samples and \(p\) variables (typically \(p \gg n\) in scRNAseq), we want to project the data onto some lower dimension \(k \leq p\) while retaining as much of the variation as possible.

Geometric interpretation

PCA finds principal components that are a series of (orthogonal) projections of the data. The components are orthogonal and ordered by decreasing variance: here PC1 is aligned with the direction of highest variance, and PC2 with the direction of second-highest variance, subject to being uncorrelated with PC1. PCA essentially rotates the set of points around their mean so that they align with the principal components. The principal components span a subspace and intersect at the centroid (green point).

What is an orthogonal projection?

Look at 1-dimensional PCA (project 2D to 1D)

Here, the 1-dimensional subspace is a line. The blue points are orthogonal projections of the data points onto the line. The projections minimize the squared lengths of the dashed lines, subject to the condition that the variance along the line is maximized.

Orthogonal projections and maximal variance

An animation says more than a million words…

Incidentally, I haven’t said anything about where the centroid is located - here it is at the origin. This can (and should) always be achieved by mean-centering the data prior to analysis. The R function prcomp does this behind the scenes.
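
A quick sanity check of this in R (synthetic data; prcomp centers by default, so explicit centering gives identical scores):

set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)

p1 <- prcomp(X)                                       # centering happens behind the scenes
p2 <- prcomp(scale(X, center = TRUE, scale = FALSE))  # centering done manually
all.equal(p1$x, p2$x)                                 # identical score matrices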

PCA and singular value decomposition

Denote the data by the \(n \times p\) matrix \(\mathbf{X}\). There exists a factorization called singular value decomposition of the form

\[\begin{equation*} \mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V^T} \end{equation*}\]

where

  • \(\mathbf{U}\): \(n \times p\) orthogonal matrix whose columns \(\mathbf{u_j}\) are called left singular vectors
  • \(\mathbf{V}\): \(p \times p\) orthogonal matrix whose columns \(\mathbf{v_j}\) are called right singular vectors
  • \(\mathbf{D}\): \(p\times p\) diagonal matrix with diagonal elements \(d_1 \geq d_2 \geq \cdots \geq d_p \geq 0\).

The columns of \(\mathbf{U}\mathbf{D}\) are the score vectors of the principal components, and the columns of \(\mathbf{V}\) are the loadings. The first principal component can also be written \(\mathbf{z_1}=\mathbf{X}\mathbf{v_1}=\mathbf{U}\mathbf{D}\mathbf{V^T}\mathbf{v_1}=\mathbf{u_1}d_1\), where \(\mathsf{Var(}{\mathbf{z_1}}\mathsf{)}=\frac{d^2_1}{n}\), and so on for the remaining components.
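
This correspondence can be checked in R with svd and prcomp on mean-centered data (synthetic data; note that R's var uses an \(n-1\) denominator, whereas the formula above divides by \(n\)):

set.seed(1)
X  <- matrix(rnorm(50 * 4), ncol = 4)
Xc <- scale(X, center = TRUE, scale = FALSE)    # mean-center the columns

s <- svd(Xc)                                    # X = U D V^T
p <- prcomp(X)

# scores equal UD (up to a sign flip per component)
all.equal(abs(s$u %*% diag(s$d)), abs(p$x), check.attributes = FALSE)

# variance of the first score vector equals d_1^2 / (n - 1) with the sample denominator
all.equal(drop(var(Xc %*% s$v[, 1])), s$d[1]^2 / (nrow(X) - 1))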

Projection points

Back to our PCA plot.

For a given point \(x_i\), the projection point on PC1 is given by \(u_{i1}d_1v_1\). Here, \(x_i\) has a higher PC1 score than \(x_j\).

Interpretation of PCs - loadings

One goal of dimensionality reduction via PCA is to capture the “essence” of the data, revealing underlying latent factors that explain most of the variability, here the PCs. Now one important question is: how do we interpret the PCs? What do they “represent”?

One approach is to use the loadings, which represent the covariances of the original variables with the (unit-scaled) principal components. A biplot displays the scores together with the loadings of the original variables, showing how strongly each original variable contributes to each component.
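
In R, a biplot can be drawn directly from a prcomp fit; the built-in iris measurements are used here purely as an illustration:

# scores and loadings in one display: points are observations, arrows are variables
p <- prcomp(iris[, 1:4], scale. = TRUE)
biplot(p)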

Interpretation of PCs - overlaying metadata

Height and weight

Coloring by metadata also facilitates interpretation.

Standardization

Since PCA aligns PCs along directions of maximum variance, it is sensitive to differences in scale, e.g. caused by different measurement units. Therefore it is standard practice to scale variables to unit variance when running PCA.
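
With prcomp this is the scale. argument (off by default); a sketch on the iris measurements shows how the split of variance across components changes:

p_raw    <- prcomp(iris[, 1:4])                 # unscaled: high-variance variables dominate PC1
p_scaled <- prcomp(iris[, 1:4], scale. = TRUE)  # each variable scaled to unit variance

summary(p_raw)$importance[2, ]                  # proportion of variance per PC
summary(p_scaled)$importance[2, ]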

PCA as data compression algorithm and the PCA model

\(\mathbf{X}\) can be approximated using only the first \(q\) principal components, i.e.:

\[ \mathbf{X} \approx \mathbf{U}_q \mathbf{D}_q \mathbf{V}_q^T \]

where \(\mathbf{U}_q\), \(\mathbf{D}_q\) and \(\mathbf{V}_q\) retain only the first \(q\) columns (and diagonal elements) of the corresponding matrices.

In essence, PCA can be seen as a compression algorithm in which we throw away the smallest terms (smallest variances) in the diagonal matrix and approximate \(\mathbf{X}\). Any given PC \(j\) explains a proportion \(\pi_j\) of the variance, defined as \(\pi_j = d_j^2/\sum_{i=1}^p d_i^2\) (the variances are proportional to the squared singular values).

By introducing an error term, the PCA model can be written as

\[ \mathbf{X} = \mathbf{U}_q \mathbf{D}_q \mathbf{V}_q^T + \mathbf{E} \]
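
A sketch of the rank-\(q\) reconstruction and the explained-variance proportions using svd on synthetic, mean-centered data (the choice \(q = 3\) is arbitrary):

set.seed(1)
X  <- matrix(rnorm(100 * 10), ncol = 10)
Xc <- scale(X, center = TRUE, scale = FALSE)
s  <- svd(Xc)

q    <- 3
Xhat <- s$u[, 1:q] %*% diag(s$d[1:q]) %*% t(s$v[, 1:q])   # rank-q approximation
mean((Xc - Xhat)^2)                                       # mean squared reconstruction error

pi_j <- s$d^2 / sum(s$d^2)                                # proportion of variance per component
cumsum(pi_j)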

What PCA is revisited

Unfortunately […] there is a lot of confusion about what PCA is really about

(Pachter 2014)

PCA is

  • not a statistical method, according to one textbook
  • a statistical method, according to another
  • more useful as a visualization technique than as an analytical method
  • simply a mathematical algorithm

Still, less important than knowing what PCA is, or how it is implemented, is knowing how and when to use it. Remember to think about what you are doing when you use PCA.

Other dimensionality reduction techniques

  • principal components analysis
  • multidimensional scaling
  • partial least squares / projection to latent structures (PLS)
  • orthogonal partial least squares / projections to latent structures (OPLS)
  • PLS-DA (discriminant analysis)
  • OPLS-DA
  • biplot analysis
  • k-means clustering
  • hierarchical clustering
  • self-organizing maps
  • t-SNE

t-distributed stochastic neighbor embedding (t-SNE)

In contrast to PCA, t-SNE (Maaten and Hinton 2008) is a non-linear dimensionality reduction technique

It models the high-dimensional points in such a way that points that are close in high-dimensional space end up close together in the low-dimensional space

The algorithm is non-linear and adapts to the underlying data density, using different transformations for different regions

The perplexity input parameter roughly corresponds to the number of effective neighbors a point has. Typical values are between 5 and 50.

In addition, the algorithm is stochastic, meaning successive runs with identical parameters may produce different results
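
A minimal t-SNE run with the Rtsne package on synthetic data (the package choice is an assumption here; the scater plotTSNE calls shown later use the same underlying algorithm), fixing the random seed because of the stochasticity just mentioned:

library(Rtsne)

set.seed(100)                               # results depend on the random seed
X <- matrix(rnorm(500 * 10), ncol = 10)     # illustrative high-dimensional data

fit <- Rtsne(X, dims = 2, perplexity = 30)  # perplexity ~ effective number of neighbors
plot(fit$Y, xlab = "t-SNE 1", ylab = "t-SNE 2")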

The perplexity effect

Adapted from (Wattenberg, Viégas, and Johnson 2016)

If the perplexity is too small, local variation dominates; if it is too large (more than the number of points), the algorithm behaves unexpectedly.

Choosing the number of iterations

Adapted from (Wattenberg, Viégas, and Johnson 2016)

Run to completion: "pinched" shapes may indicate that the algorithm was stopped too soon.

Cluster size means nothing

Adapted from (Wattenberg, Viégas, and Johnson 2016)

You cannot see relative cluster sizes in a t-SNE plot.

Between-cluster distances might not mean anything

Adapted from (Wattenberg, Viégas, and Johnson 2016)

At perplexity=50, global structure is retained, but if we increase input cluster sizes, we must modify perplexity accordingly. Seeking global geometry requires fine-tuning perplexity.

Applications to single-cell RNA sequencing

Example data set

The following examples are based on a study by (Zeisel et al. 2015). The data can be obtained from (Amit Zeisel and Linnarsson 2015). The analyses have been performed in R using utility functions from the scater package (“Scater,” n.d.). The analysis is based on the simpleSingleCell bioconductor tutorial (Aaron T. L. Lun and Marioni 2017).

The authors sequenced RNA from \(>\) 3000 single cells to “perform a molecular census of the primary somatosensory cortex (S1) and the hippocampal CA1 region”.

Quality control

PCA can be used for quality control by basing it on the quality metrics for each cell, e.g.:

  • the total number of reads
  • the total number of features (e.g. expressed genes)
  • proportion of mitochondrial or spike-in reads
  • batch / lane

or by using (filtered) expression values to get an overview of clustering patterns.

Outliers may be indicative of low-quality cells. Use quality metrics to aid interpretation of principal components.

Quality control

# load scater and compute per-cell quality metrics; the data has been loaded,
# preprocessed and stored in an SCESet object named 'sce'
library(scater)
sce <- calculateQCMetrics(sce, feature_controls=list(Spike=is.spike, Mt=is.mito))
plotPCA(sce, pca_data_input="pdata")

Interpretation? No clear outlier or cluster. Add metadata information.

Quality control

Add metadata information (tissue and total mRNA mol) to facilitate interpretation.

plotPCA(sce, colour_by="tissue", size_by="total mRNA mol",
        pca_data_input="pdata", legend="none") 

First PC seems to separate by number of mRNA molecules in cell.

Quality control

Size by number of expressed genes (calculated by calculateQCMetrics)

plotPCA(sce, colour_by="tissue", size_by="total_features",
        pca_data_input="pdata", legend="none") 

An even clearer trend - cells with few expressed features sit in the right part of the plot.

Quality control

Look at higher principal components.

plotPCA(sce, colour_by="tissue", size_by="total_features",
        pca_data_input="pdata", legend="none", ncomponents=3) 

Quality control

PCA done on expression values

plotPCA(sce, colour_by=as.data.frame(pData(sce)[,"tissue"]),
        size_by=data.frame(rep(2, dim(sce)[2])), legend="none", ncomponents=2) 

Some separation by tissue, but more clusters are clearly present.

Quality control

Size by library size (total counts in millions, calculated by calculateQCMetrics)

plotPCA(sce, colour_by=as.data.frame(pData(sce)[,"tissue"]),
        size_by=as.data.frame(pData(sce)[,"total_counts"]/1e6),
        legend="none", ncomponents=2) 

Data processing and filtering

There are many more quality control steps to perform. Here, we have only discussed the use of PCA.

Before proceeding with data exploration with t-SNE/PCA and clustering we need to process and filter the data via a number of steps:

  • outlier removal: remove outliers based on the distribution of QC metrics, such as the total number of features (see the sketch after this list)
  • remove uninteresting genes: remove genes with low average expression
  • normalization: account for library size
  • check for technical factors: block for uninteresting factors, e.g. sex
  • identify correlated HVGs: highly variable genes
  • remove batch effects: remove effects accounted for by blocking factor
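
As referenced in the first bullet, a minimal sketch of the outlier-removal step using scater's isOutlier on the number of expressed features per cell (the three-MAD cutoff is an assumption, following common practice):

library(scater)

# flag cells with unusually few expressed features (low library complexity)
low.features <- isOutlier(sce$total_features, nmads = 3, type = "lower", log = TRUE)
table(low.features)

# keep only the cells that are not flagged as outliers
sce <- sce[, !low.features]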

Data exploration of correlated highly variable genes

# 'chosen' holds the selected correlated HVGs, 'top.hvg' the top-ranked HVG,
# and 'fontsize' a ggplot2 theme adjustment, all defined earlier in the workflow
tsne1 <- plotTSNE(sce, exprs_values="norm_exprs", colour_by=top.hvg,
    perplexity=10, rand_seed=100, feature_set=chosen) + fontsize
tsne2 <- plotTSNE(sce, exprs_values="norm_exprs", colour_by="Mog",
    perplexity=10, rand_seed=100, feature_set=chosen) + fontsize
multiplot(tsne1, tsne2, cols=2)

Dimensionality reduction with t-SNE is applied to a subset of highly variable genes (HVGs) that have been obtained through multiple data processing and filtering steps.

Clustering cells into putative subpopulations

Perform hierarchical clustering on Euclidean distances between cells and use a dynamic tree cut to define clusters. The block-like pattern is consistent with the dimensionality reduction results shown previously.
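
A sketch of this clustering step, assuming the normalized expression values are available through scater's norm_exprs accessor and that 'chosen' holds the selected HVGs as in the t-SNE call above (the Ward linkage and the dynamicTreeCut defaults are assumptions):

library(dynamicTreeCut)

# Euclidean distances between cells, computed on the chosen expression values
expr.mat  <- norm_exprs(sce)[chosen, ]      # genes x cells, hence the transpose below
cell.dist <- dist(t(expr.mat))

# hierarchical clustering followed by a dynamic tree cut to define clusters
cell.tree <- hclust(cell.dist, method = "ward.D2")
clusters  <- unname(cutreeDynamic(cell.tree, distM = as.matrix(cell.dist), verbose = 0))
table(clusters)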

Detecting marker genes between subpopulations

The goal: identify marker genes for each subpopulation. Many markers are not uniquely up- or down-regulated in a cluster; rather, a combination of markers is used to characterize a subpopulation.
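
One way to approach this is scran's findMarkers, which ranks genes by how well they separate each cluster from the others (a hedged sketch; the positional expression-matrix/cluster-vector interface is an assumption about the scran version in use):

library(scran)

# rank genes separating each cluster from the remaining clusters,
# using the normalized expression values and the cluster labels from the previous step
markers <- findMarkers(norm_exprs(sce), clusters)

# top candidate markers for the first cluster
head(markers[[1]])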

Summary

Dimensionality reduction techniques, with a focus on PCA (and t-SNE)

Using PCA:

  • find latent structures
  • loadings, marker data to aid interpretation of components
  • center data!
  • scaling matters

t-SNE:

  • perplexity
  • distances and cluster sizes don’t necessarily mean anything

References

Aaron T. L. Lun, Davis J. McCarthy, and John C. Marioni. 2017. simpleSingleCell Bioconductor Workflow. https://www.bioconductor.org/help/workflows/simpleSingleCell/.

Amit Zeisel, Peter Lönnerberg, Ana B. Muñoz Manchado, and Sten Linnarsson. 2015. Single-Cell RNA-Seq Data for Mouse Cortex and Hippocampus (data portal). http://linnarssonlab.org/cortex/.

Hastie, Trevor, and Rob Tibshirani. 2009. Data Sets for The Elements of Statistical Learning. https://web.stanford.edu/~hastie/ElemStatLearn/datasets.

Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using T-SNE.” Journal of Machine Learning Research 9 (Nov): 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html.

Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning Series). The MIT Press.

Jolliffe, I. T. 2002. Principal Component Analysis. Springer. http://www.springer.com/la/book/9780387954424.

“Scater.” n.d. Bioconductor. http://bioconductor.org/packages/scater/.

Wattenberg, Martin, Fernanda Viégas, and Ian Johnson. 2016. “How to Use T-SNE Effectively.” Distill 1 (10): e2. doi:10.23915/distill.00002.

Zeisel, Amit, Ana B. Muñoz Manchado, Simone Codeluppi, Peter Lönnerberg, Gioele La Manno, Anna Juréus, Sueli Marques, et al. 2015. “Cell Types in the Mouse Cortex and Hippocampus Revealed by Single-Cell RNA-Seq.” Science, February, aaa1934. doi:10.1126/science.aaa1934.