Seurat: Is there a way to regress out ribosomal genes from the dimensional reduction?

Created on 7 Sep 2018 · 14Comments · Source: satijalab/seurat

Hi,

I read from the Seurat webpage about a vignette to remove the cell cycle-related genes from dimensional reduction. It was really helpful and then it showed some promising result on my data. However, I dug around to see if there is a way to regress out the ribosomal genes from the dimensional reduction and haven't seen an approach to do this. Is there a way or function that support similar processing as the cell cycle genes for ribosomal genes? Given that I have extracted some of the ribosomal genes already.

Again the goal is to exclude both cell-cycle genes and ribosomal genes in the dimensional reduction.

Thank you so much in advance!

Source

bwang258

Most helpful comment

You can replace the mitochondrial reduction with ribosomal reduction (just change the gene set). However we don't recommend this, as biological differences can be easily blended with technical ones

satijalab on 9 Sep 2018

👍4

All 14 comments

You can replace the mitochondrial reduction with ribosomal reduction (just change the gene set). However we don't recommend this, as biological differences can be easily blended with technical ones

satijalab on 9 Sep 2018

👍4

Hi everyone,

I've tried to regress out ribosomal genes from my datasets (I've tried without regressing them out first but it was too confounding and decided to try get rid of them).
I've used similar lines as for mitochondrial genes:

#mitochondrial genes
mito.genes <- grep(pattern = "^MT-", x = rownames(x = Mgcoculture@data), value = TRUE)
percent.mito <- Matrix::colSums([email protected][mito.genes, ])/Matrix::colSums([email protected])
Mgcoculture <- AddMetaData(object = Mgcoculture, metadata = percent.mito, col.name = "percent.mito")
#look at ribosomal genes
RPS.genes <- grep(pattern = "^RPS", x = rownames(x = Mgcoculture@data), value = TRUE)
percent.RPS <- Matrix::colSums([email protected][RPS.genes, ])/Matrix::colSums([email protected])
Mgcoculture <- AddMetaData(object = Mgcoculture, metadata = percent.RPS, col.name = "percent.RPS")
RPL.genes <- grep(pattern = "^RPL", x = rownames(x = Mgcoculture@data), value = TRUE)
percent.RPL <- Matrix::colSums([email protected][RPL.genes, ])/Matrix::colSums([email protected])
Mgcoculture <- AddMetaData(object = Mgcoculture, metadata = percent.RPL, col.name = "percent.RPL")
#filter cells
Mgcoculture <- FilterCells(object = Mgcoculture, subset.names = c("nGene", "percent.mito"),
low.thresholds = c(200, -Inf), high.thresholds = c(3000, 0.2))
Mgcoculture <- NormalizeData(object = Mgcoculture, normalization.method = "LogNormalize",
scale.factor = 10000)
Mgcoculture <- FindVariableGenes(object = Mgcoculture, mean.function = ExpMean, dispersion.function = LogVMR,
x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5)
Mgcoculture <- ScaleData(object = Mgcoculture, vars.to.regress = c("nUMI", "percent.mito", "percent.RPS", "percent.RPL"))
Then I run my PCA:
Mgcoculture <- RunPCA(object = Mgcoculture, pc.genes = [email protected], do.print = TRUE, pcs.print = 1:12, genes.print = 10)

However, ribosomal genes are still part of the main PCs. I've double-checked and the genes that come up in the PCs are part of the "RPL.genes" and "RPS.genes" list.
I'm wondering whether I am not regressing them out or if maybe I should change the "pc.genes" in the RunPCA function so that PCA is run on the scaled data.
Any ideas?

Thanks for the help!

Cesco

cesco-lemon on 22 Nov 2018

👍2

Hi @cesco-lemon ,

I am not sure if I can help you because I ended up keeping ribosomal genes in dimensional reduction as they turned out to be biologically relevant to one of my samples. However, I did had the same issue as yours initially. The code was roughly the same as yours as well. I ended up seeing ribosomal genes in PC and even in the cluster markers, in which ribosomal genes actually contributed to PC1 and PC2.

My way out was to manually remove these genes out of the var.genes list before PCA. I am not sure if your suggestion would work since pc.genes parameter is reading a gene list and the rownames of scale.data contains all the genes.

Not sure if that helps, we could keep the discussion going though.

bwang258 on 22 Nov 2018

Hey @bwang258,

Yeah I' getting them everywhere, in PCs and in cluster markers, it is really annoying. I think it has mostly to do with technical issues and the samples make-up itself and I'd like to get rid of them because they're not really giving me real biological differences. I can see the biological relevance of my PCs but ribosomal genes are giving me so much noise that I am sure I am losing more important biological differences. I ended up filtering out cells that have too high of a percentage of ribosomal genes to at list get a go at the data and see what it looks like but I need those cells as part of the dataset too.

I thought that by scaling the data I was automatically removing these genes in vars.to.regress right? That's why I was so confused, isn't RunPCA using the scaled data to run the analysis? Otherwise it make no sense running the RegressOut function in ScaleData if nothing gets regressed out...

@bwang258 could you provide the code you used to manually remove the genes from var.genes? That would definitely save my life for now!

@satijalab any idea of why my code isn't working? Because theoretically I am regressing out ribosomal genes by using the same workflow as for mitochondrial genes, right? Does it mean that I am not regressing out mitochondrial genes either? Maybe I am not fully understanding what "regress out" entails but based on your cell cycle gene vignette my code should get rid of the variability coming from the gene lists I've provided.

Thank you so much for the help!

cesco-lemon on 25 Nov 2018

Hi @cesco-lemon ,

Here is the code:

#taking your parameters and ribosomal genes as an example
RPS.genes <- grep(pattern = "^RPS", x = rownames(x = Mgcoculture@data), value = TRUE)
#find variable genes and store them in the var.genes
Mgcoculture <- FindVariableGenes(object = Mgcoculture, mean.function = ExpMean, dispersion.function = LogVMR,x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5)
#remove RPS genes from the var.genes
var.genes.no.RPS <- dpylr::setdiff([email protected], RPS.genes)
#run PCA with the new gene list
Mgcoculture <- RunPCA(object = Mgcoculture, pc.genes = var.genes.no.RPS, do.print = TRUE, pcs.print = 1:12, genes.print = 10)

That's basically how I did it. Not sure if this is the right thing to do and my cluster assignment changed drastically when doing it. Maybe there is a better way to do it. Hope it helps.

bwang258 on 26 Nov 2018

@cesco-lemon Hey, Did you end up removing them? I am getting the exact same problem, interested in what happened next. THANK YOU

ZeqiNiu on 12 May 2020

@cesco-lemon Hello ,
Same here,
I would like to remove the Rpl/Rps genes but I am getting the same problem as you guys so I am also interested to find a way which work fine.

PierreGCD on 29 May 2020

Hi all! One of my clusters has RPL genes as its top 20 markers. I have read in many threads about removing RPL genes etc, but was wondering if anyone could explain or share relevant papers regarding what a high RPL/RPS expression could indicate (ie, cellular stress/dying cells?) and how you've decided whether to keep or remove them. Thank you!

laijen000 on 14 Jun 2020

👍1

In my experience/opinion, if you have a cluster whose marker genes are mainly (almost all) RPS/RPL and/or mitochondrial then it is technical, that is, for those cells you just picked the most abundant transcripts. In this case, you should remove the cells in those clusters from the analysis and go back to square one. But I would not remove the genes. Still, if the variance in the population of mito/rps/rpl is high, you could regress them out as for cell cycle.
On the other hand, if the RPS/RPL genes arise from a DGE comparing clusters from one sample to the other, then it could be biologically relevant or it could be that the cells are too similar and the statistics picks up only the most abundant just because they are easier to detect. In this case, I would first check whether the % of total RPL/RPS is not completely different (say 80% vs 20%) because then the DGE is statistically flawed. But if this is not the case I would report a biologically relevant difference between cell types of ribosomal proteins expression

josegarciamanteiga on 23 Jun 2020

👍3

@laijen000 Hello! Have you found any papers which link high ribosomal subunit gene expression and cellular stress in single cell data? Would really appreciate if you could give me a pointer ;)
Cheers

Drosophilist on 12 Jul 2020

I've been playing around with removing ribosomal genes myself, no definitive answer to whether it's advised in some cases, but I did make a regex that selects all human ribosomal genes, can easily be used in PercentageFeatureSet for example, or to remove downstream in DE tests. If anyone sees something wrong with it that would be great to know!

"^RP[SL][[:digit:]]|^RPLP[[:digit:]]|^RPSA"

GBeattie on 13 Oct 2020

@GBeattie Hello! Thanks for this, I'll try this one. How many recommended %ribosomal genes did you set to remove cells?

levinhein on 28 Oct 2020

@levinhein I chose not to remove cells, I either remove the ribosomal genes at the start (i.e. remove them from the counts matrix), or remove them after the differential expression analysis, so they don't show up in visualisations.

GBeattie on 29 Oct 2020

@GBeattie That sounds ... cool.
Actually I saw some paper did the same thing:

We performed Seurat-based filtering of cells based on the number of detected genes per cell (500 to 7000) and the percentage of mitochondrial genes expressed (<10%). The mitochondrial genes and ribosomal genes were also removed from the gene expression matrix.

Is there any paper explain this strategy or it's just an experience-based approach?
Wang.