Seurat: FindAllMarkers pct deviates from manual calculation

Created on 19 May 2019  Â·  15Comments  Â·  Source: satijalab/seurat

Dear Seurat Team,

Seurat is a great package and I am currently using it very frequently for my research project. It's great to have such tool. Recently, I stumbled over something:

I think I know what pct.1 and pct.2 means when you run FindAllMarkers (pct.1 is the fraction of expressing cells for a given gene (nUMI > 0) within the cluster that is currently being analyzed for DE genes, whereas pct.2 is the same thing for all other cells which are not in that cluster).

If this is wrong you can stop reading from here.
If not:
I have been looking at some UMAP-maps with cells coloured for gene expression and felt that pct.1 and/or pct.2 may not always be correct. So, I calculated those fractions myself and found that they deviated from the output of FindAllMarkers.

Now, I could make a dick of myself if I did anything wrong outside of Seurat. I am trying though to provide a reproducible example below. If I made a mistake I am more than happy learn from it and to just have that issue solved for me.

Thanks.

library(tidyverse)
library(Seurat)

# import a Seurat object (v3) with scRNA UMI counts where the standard workflow has been conducted including cluster annotation with FindNeighbors and FindClusters
example.SO <- readRDS()

# fetch cell barcodes and their designation to clusters (I hope "seurat_clusters" works for you)
example.SO.data <- 
  FetchData(example.SO, c("seurat_clusters")) %>%
  rownames_to_column("cell.barcode")

# run Seurats function to find differentially expressed genes between clusters
example.SO.marker <- FindAllMarkers(example.SO, logfc.threshold = 0.2, only.pos = FALSE, min.pct = 0.0001, return.thresh = 0.05)

# now manually:
# get the raw count data from the seurat object
example.SO.counts <- GetAssayData(example.SO, slot = "counts", assay = "RNA")

# caculate pct.1 (the fraction of expressing cells for given gene in one cluster) manually from raw data with a dplyr pipe; I also did a little different yielding the same results'
example.SO.counts.summary <-
  as.data.frame(t(as.matrix(example.SO.counts[which(rownames(example.SO.counts) %in% example.SO.marker$gene[1:10]),]))) %>% # just grab the first 10 genes (arbitrary)
  rownames_to_column("cell.barcode") %>%
  left_join(example.SO.data, by = "cell.barcode") %>%
  gather(key = "gene", value = "nUMI", c(-cell.barcode, -seurat_clusters)) %>%
  group_by(seurat_clusters, gene) %>%
  summarise(non.expressing.cells = sum(nUMI == 0), expressing.cells = sum(nUMI > 0)) %>%
  ungroup() %>%
  dplyr::mutate(total.cells = non.expressing.cells + expressing.cells) %>%
  dplyr::mutate(frac.of.expr.cells.per.cluster = expressing.cells/total.cells)

# compare pct.1 from FindAllMarkers and frac.of.expr.cells.per.cluster from manual calculation - why are some so different whereas other seem to be correct?

comparison <- 
  example.SO.marker %>% 
  left_join(example.SO.counts.summary, by = c("cluster" = "seurat_clusters", "gene" = "gene")) %>% 
  drop_na()

Most helpful comment

This makes sense, and verifies that there is not a bug. Integrated data can be positive or negative, so its hard to interpret fractions of expressing cells in this context.

Note that we do not recommend performing DE tests on integrated data, which would avoid this issue. You can explore https://satijalab.org/seurat/v3.0/immune_alignment.html for an example of how to perform DE analysis after integration (both across cell types, and within cell-types but across conditions).

All 15 comments

Hi,

Can you check that the current idents of example.SO are set to "seurat_clusters"? To be sure you can add this line before running FindAllMarkers.

Idents(example.SO) <- "seurat_clusters"

Hello,
yes, I added the line as you suggested. The problem persists. "active.ident" equals "seurat_clusters".

(https://www.dropbox.com/s/bdfomowv1orbf5d/Screenshot%202019-05-24%2022.17.35.png?dl=0)

Bro, this is really interesting! Is it possible that some gene names are duplicated in your dataset, maybe that's why this is happening?

So, the 10 arbitrary genes which I pull out are cleary unique.
If add the line
length(rownames(example.SO.counts)) == length(unique(rownames(example.SO.counts)))
I get TRUE as return. So, I do not have duplicate gene names. Is it that what you meant?

I will provide two examples. You can see the big UMAP-map with annotated clusters. In a small window I coloured cells into expressers (nUMI > 0, red) and non-expressers (nUMI == 0, blue). Also, I inserted the output of FindAllMarkers (please, don't be bothered by extra lines or columns which stem from my analysis-loop; okay you may argue that I made a mistake somewhere but lets pretend I did not).

(i) CCL20 - https://www.dropbox.com/s/zt03wf1s2ixkjf9/pct.problem.png?dl=0
pct.1 in Cluster 3 may be correct (hard to judge by naked eye) but pct.2 which should refer to all other cells, not being in Cluster 3 is 80 % ?! I think this is either not correct or I do not interpret the pct values correctly.

(ii) GNLY - https://www.dropbox.com/s/i19e3nmv0oqo4qx/pct.problem.2.png?dl=0
Let's look at pct.1 for Cluster 1: 73 % of cells are expressing GNLY in this cluster? There is clearly a deviation between what I see on the map and the output of FindAllMarkers.

It is still possible that I made a mistake at some point, okay. I just wanted to add some support for my claims.

Yours, Chris.

Sorry, one more thing. If these deviating pct indicate that the "wrong" groups of cells are picked to calculate logFC values then those logFCs may also be incorrect. That's another thing I am worried about. As you know, such results are important for one's research project and for future decisions.

Damn...that's crazy, I don't know enough to comment but I hope it's not a bug. Let's see what the developers say. Can you provide a rdata file so that we can try and replicate your observations?

Hi Chris,

I'm not able to replicate the issue with the code you provided and our example dataset. E.g.

library(tidyverse)
library(Seurat)

# import a Seurat object (v3) with scRNA UMI counts where the standard workflow has been conducted including cluster annotation with FindNeighbors and FindClusters
data(pbmc_small)
example.SO <- pbmc_small
example.SO$seurat_clusters <- example.SO$RNA_snn_res.1

# fetch cell barcodes and their designation to clusters (I hope "seurat_clusters" works for you)
example.SO.data <- 
  FetchData(example.SO, c("seurat_clusters")) %>%
  rownames_to_column("cell.barcode")

# run Seurats function to find differentially expressed genes between clusters
example.SO.marker <- FindAllMarkers(example.SO, logfc.threshold = 0.2, only.pos = FALSE, min.pct = 0.0001, return.thresh = 0.05)

# now manually:
# get the raw count data from the seurat object
example.SO.counts <- GetAssayData(example.SO, slot = "counts", assay = "RNA")

# caculate pct.1 (the fraction of expressing cells for given gene in one cluster) manually from raw data with a dplyr pipe; I also did a little different yielding the same results'
example.SO.counts.summary <-
  as.data.frame(t(as.matrix(example.SO.counts[which(rownames(example.SO.counts) %in% example.SO.marker$gene),]))) %>% 
  rownames_to_column("cell.barcode") %>%
  left_join(example.SO.data, by = "cell.barcode") %>%
  gather(key = "gene", value = "nUMI", c(-cell.barcode, -seurat_clusters)) %>%
  group_by(seurat_clusters, gene) %>%
  summarise(non.expressing.cells = sum(nUMI == 0), expressing.cells = sum(nUMI > 0)) %>%
  ungroup() %>%
  dplyr::mutate(total.cells = non.expressing.cells + expressing.cells) %>%
  dplyr::mutate(frac.of.expr.cells.per.cluster = expressing.cells/total.cells)

# compare pct.1 from FindAllMarkers and frac.of.expr.cells.per.cluster from manual calculation - why are some so different whereas other seem to be correct?

comparison <- 
  example.SO.marker %>% 
  left_join(example.SO.counts.summary, by = c("cluster" = "seurat_clusters", "gene" = "gene")) %>% 
  drop_na()

all(comparison$pct.1 == round(comparison$frac.of.expr.cells.per.cluster, 3))
> TRUE

I also tested the same code but substituting pbmc_small for the 3K pbmc object from the 3k tutorial and still found them to agree entirely. Can you provide a dataset where they disagree?

Yes, please allow a few more days for me to prepare a rds file of a Seurat object for you which I am okay with to share with the public.

tl;dr:
Using a seurat object with integrated data and setting default.assay to "integrated" may be a reason for what I described above.

I tried two things:
(i) I run my code (the one above) with a Seurat object that contained non-integrated data (e.g. only one sample was incorporated in that object). Result: pct.1 and frac.of.expr.cells.per.cluster are equal.
(ii) I run the code with a Seurat object that contains integrated data. Result: pct.1 and frac.of.expr.cells.per.cluster are unequal.

Could you check if you find the same thing as I when using an object with integrated data.
I still feel a bit uncomfortable uploading data from our research even though it would be hard for anyone to get something out of it. So, if can move forward on that issue with what I suggested, it may be possible to avoid that I am upload anything.

Thanks.

This makes sense, and verifies that there is not a bug. Integrated data can be positive or negative, so its hard to interpret fractions of expressing cells in this context.

Note that we do not recommend performing DE tests on integrated data, which would avoid this issue. You can explore https://satijalab.org/seurat/v3.0/immune_alignment.html for an example of how to perform DE analysis after integration (both across cell types, and within cell-types but across conditions).

Thank you for solving this issue and I appreciate all of your tutorials and vignettes by the way. I want to point out two things though from a users point of view which I find counter intuitive:
(i) Yes, I also noticed that integrated data can be positive or negative but that refers to a data slot where values have been offset (or similar, sorry if I miss the proper English term) across cells and/or samples in order to make them comparable. But it is unexpected that this turns non-expressers into expressers or vice-versa. Or it is counter intuitive that FindMarkers-function uses this data slot if it contains quite processed values. Couldn't this be avoided somehow? Because ...
(ii) ... this means, within the Seurat ecosystem there are functions that incompatible with each other and wrong or misleading outputs are generated. Even though I spent many many hours working through tutorials and so on I was not aware of that issue. Still, you could put it on my ignorance or that I did not study harder but some kind of warning message somewhere or so would have helped me in this case.

Maybe, or probably, I do not understand all of the developer's concepts but as I said I am explaining from a user's point of view.

I got the same ERROR when I caculated pct1 for every gene. My code as followed:

cluster<-unique(Idents(immune.combined))
data<-as.matrix(immune.combined@assays$RNA[1:dim(immune.combined@assays$RNA)[1],1:dim(immune.combined@assays$RNA)[2]])
#data<-data[c('Cd3d','Cd3e','Cd3g','Cd4','Cd8a','Cd8b1'),]
#yao<-data.frame()
yao<-data.frame(Gene=rownames(data))
for (i in sort(cluster)){
#   a<-data[c('tomm20a','tomm20a.1','si:ch73-1a9.3'),names(Idents(immune.combined)[Idents(immune.combined)==i])]
    a<-data[,names(Idents(immune.combined)[Idents(immune.combined)==i])]
    b<-as.matrix(apply(a,1,function(x){round(length(x[x!=0])/ncol(a),digit = 3)}))
    yao<-cbind(yao,b[,1])
}
colnames(yao)<-c('Gene',sort(cluster))

And my normalize method is sctransform.

> packageVersion("Seurat")
[1] ‘3.0.0.9100’

As commented by satijalab the critical thing is to use the integrated assay with the FindAllMarkers function. Did you set up an integrated assay? If yes and it has been used for the FindAllMarkers function, then the pct values which are output by that function are wrong. (At least they were wrong with the Seurat version that I used at that time.) Given your code is correct your own values may reflect the true pct values.
I also wrote some code myself to circumvent this problem.

As commented by satijalab the critical thing is to use the integrated assay with the FindAllMarkers function. Did you set up an integrated assay? If yes and it has been used for the FindAllMarkers function, then the pct values which are output by that function are wrong. (At least they were wrong with the Seurat version that I used at that time.) Given your code is correct your own values may reflect the true pct values.
I also wrote some code myself to circumvent this problem.

Thank for you reply, It just only one sample and did not set up an integrated assay .What is more, I used source code of seurat function, and get the same result of my code,

i<-26
cells.1<-names(Idents(immune.combined)[Idents(immune.combined)==i])
data2 <- switch(EXPR = "data",'scale.data' = counts,pbmc2_1.2)
pct.1 <- round(x = rowSums(x = data['tomm20a', cells.1, drop = FALSE] > 0)/length(x = cells.1),digits = 3)
#tomm20a 
 #0.682 

I will circumvent this problem by my own code.

Was this page helpful?
0 / 5 - 0 ratings