單細胞數據高級分析之初步降維和聚類 | Dimensionality reduction | Clustering


個人的一些碎碎念:

聚類,直覺就能想到kmeans聚類,另外還有一個hierarchical clustering,但是單細胞里面都用得不多,為什么?印象中只有一個scoring model是用kmean進行粗聚類。(10x就是先做PCA,再用kmeans聚類的)

鑒於單細胞的教程很多,也有不下於10種針對單細胞的聚類方法了。

降維往往是和聚類在一起的,所以似乎有點難以區分。

PCA到底是降維、聚類還是可視化的方法,t-SNE呢?

其實稍微思考一下,PCA、t-SNE還有下面的diffusionMap,都是一種降維方法。區別就在於PCA是完全的線性變換得到PC,t-SNE和diffusionMap都是非線性的。

為什么降維?因為我們特征太多了,基因都是萬級的,降維之后才能用kmeans啥的。其次就是,降維了才能可視化!我們可視化的最高維度就是三維,幾萬維是無法可視化的。但paper里,我們最多選前兩維,三維在平面上的效果還不如二維。

聚類策略:

聚類還要什么策略?不就是選好特征之后,再選一個k就得到聚類的結果了嗎?是的,常規分析確實沒有什么高深的東西。

但通常我們不是為了聚類而聚類,我們的結果是為生物學問題而服務的,如果從任何角度都無法解釋你的聚類結果,那你還聚什么類,總不可能在paper里就寫我們聚類了,得到了一些marker,然后就沒了下文把!

什么問題?

什么叫針對問題的聚類呢?下面這篇文章就是針對具體問題的聚類。先知:我們知道我們細胞里有些污染的細胞,如何通過聚類將他們識別出來?

這種具體的問題就沒法通過跑常規流程來解決了,得想辦法!

 

Dimensionality reduction.

Throughout the manuscript we use diffusion maps, a non-linear dimensionality reduction technique37. We calculate a cell-to-cell distance matrix using 1 - Pearson correlation and use the diffuse function of the diffusionMap R package with default parameters to obtain the first 50 DMCs. To determine the significant DMCs, we look at the reduction of eigenvalues associated with DMCs. We determine all dimensions with an eigenvalue of at least 4% relative to the sum of the first 50 eigenvalues as significant, and scale all dimensions to have mean 0 and standard deviation of 1.

有點超前(另類),用diffusionMap來降維,計算了細胞-細胞的距離,得到50個DMC,鑒定出顯著的DMC,scale一下。

Initial clustering of all cells.

To identify contaminating cell populations and assess  overall heterogeneity in the data, we clustered all single cells. We first combined all Drop-seq samples and normalized the data (21,566 cells, 10,791 protein-coding genes detected in at least 3 cells and mean UMI at least 0.005) using regularized negative binomial regression as outlined above (correcting for sequencing depth related factors and cell cycle). We identified 731 highly variable genes; that is, genes for which the z-scored standard deviation was at least 1. We used the variable genes to perform dimensionality reduction using diffusion maps as outlined above (with relative eigenvalue cutoff of 2%), which returned 10 significant dimensions.

For clustering we used a modularity optimization algorithm that finds community structure in the data with Jaccard similarities (neighbourhood size 9, Euclidean distance in diffusion map coordinates) as edge weights between cells38. With the goal of over-clustering the data to identify rare populations, the small neighbourhood size resulted in 15 clusters, of which two were clearly separated from the rest and expressed marker genes expected from contaminating cells (Neurod6 from excitatory neurons, Igfbp7 from epithelial cells). These cells represent rare cellular contaminants in the original sample (2.6% and 1%), and were excluded from further analysis, leaving 20,788 cells.

鑒定出了highly variable genes,還是為了降噪(其實特征選擇可以很自由,只是好雜志偏愛這種策略,你要是純手動選,人家就不樂意了)。

Jaccard相似度,來聚類。

要想鑒定rare populations,就必須over-clustering!!!居然將k設置為15,然后通過marker來篩選出contaminating cells。

 

確實從中學習了很多,自己手寫代碼就是不一樣,比純粹的跑軟件要強很多。

# cluster cells and remove contaminating populations
cat('Doing initial clustering\n')
cl <- cluster.the.data.simple(cm, expr, 9, seed=42)
md$init.cluster <- cl$clustering
# detection rate per cluster for some marker genes
goi <- c('Igfbp7', 'Col4a1', 'Neurod2', 'Neurod6')
det.rates <- apply(cm[goi, ] > 0, 1, function(x) aggregate(x, by=list(cl$clustering), FUN=mean)$x)
contam.clusters <- sort(unique(cl$clustering))[apply(det.rates > 1/3, 1, any)]
use.cells <- !(cl$clustering %in% contam.clusters)
cat('Of the', ncol(cm), 'cells', sum(!use.cells), 'are determined to be part of a contaminating cell population.\n')
cm <- cm[, use.cells]
expr <- expr[, use.cells]
md <- md[use.cells, ]

  

# for clustering
# ev.red.th: relative eigenvalue cutoff of 2%
dim.red <- function(expr, max.dim, ev.red.th, plot.title=NA, do.scale.result=FALSE) {
  cat('Dimensionality reduction via diffusion maps using', nrow(expr), 'genes and', ncol(expr), 'cells\n')
  if (sum(is.na(expr)) > 0) {
    dmat <- 1 - cor(expr, use = 'pairwise.complete.obs')
  } else {
    # distance 0 <=> correlation 1
    dmat <- 1 - cor(expr)
  }
  
  max.dim <- min(max.dim, nrow(dmat)/2)
  dmap <- diffuse(dmat, neigen=max.dim, maxdim=max.dim)
  ev <- dmap$eigenvals
  # relative eigenvalue cutoff of 2%, something like PCA
  ev.red <- ev/sum(ev)
  evdim <- rev(which(ev.red > ev.red.th))[1]
  
  if (is.character(plot.title)) {
    # Eigenvalues, We observe a substantial eigenvalue drop-off after the initial components, demonstrating that the majority of the variance is captured in the first few dimensions.
    plot(ev, ylim=c(0, max(ev)), main = plot.title)
    abline(v=evdim + 0.5, col='blue')
  }
  
  evdim <- max(2, evdim, na.rm=TRUE)
  cat('Using', evdim, 'significant DM coordinates\n')
  
  colnames(dmap$X) <- paste0('DMC', 1:ncol(dmap$X))
  res <- dmap$X[, 1:evdim]
  if (do.scale.result) {
    res <- scale(dmap$X[, 1:evdim])
  } 
  return(res)
}

# jaccard similarity
# rows in 'mat' are cells
jacc.sim <- function(mat, k) {
  # generate a sparse nearest neighbor matrix
  nn.indices <- get.knn(mat, k)$nn.index
  j <- as.numeric(t(nn.indices))
  i <- ((1:length(j))-1) %/% k + 1
  nn.mat <- sparseMatrix(i=i, j=j, x=1)
  rm(nn.indices, i, j)
  # turn nn matrix into SNN matrix and then into Jaccard similarity
  snn <- nn.mat %*% t(nn.mat)
  snn.summary <- summary(snn)
  snn <- sparseMatrix(i=snn.summary$i, j=snn.summary$j, x=snn.summary$x/(2*k-snn.summary$x))
  rm(snn.summary)
  return(snn)
}


cluster.the.data.simple <- function(cm, expr, k, sel.g=NA, min.mean=0.001, 
                                    min.cells=3, z.th=1, ev.red.th=0.02, seed=NULL, 
                                    max.dim=50) {
  if (all(is.na(sel.g))) {
    # no genes specified, use most variable genes
    # filter min.cells and min.mean
    # cm only for filtering
    goi <- rownames(expr)[apply(cm[rownames(expr), ]>0, 1, sum) >= min.cells & apply(cm[rownames(expr), ], 1, mean) >= min.mean]
    # gene sum
    sspr <- apply(expr[goi, ]^2, 1, sum)
    # scale the expression of all genes, only select the gene above z.th
    # need to plot the hist
    sel.g <- goi[scale(sqrt(sspr)) > z.th]
  }
  cat(sprintf('Selected %d variable genes\n', length(sel.g)))
  sel.g <- intersect(sel.g, rownames(expr))
  cat(sprintf('%d of these are in expression matrix.\n', length(sel.g)))
  
  if (is.numeric(seed)) {
    set.seed(seed)
  }
  
  dm <- dim.red(expr[sel.g, ], max.dim, ev.red.th, do.scale.result = TRUE)
  
  sim.mat <- jacc.sim(dm, k)
  
  gr <- graph_from_adjacency_matrix(sim.mat, mode='undirected', weighted=TRUE, diag=FALSE)
  cl <- as.numeric(membership(cluster_louvain(gr)))
  
  results <- list()
  results$dm <- dm
  results$clustering <- cl
  results$sel.g <- sel.g
  results$sim.mat <- sim.mat
  results$gr <- gr
  cat('Clustering table\n')
  print(table(results$clustering))
  return(results)
}

  

 

 

  


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM