Hi!
After reading the SCTransform preprint, I was wondering about the way unwanted sources of variation are regressed out with SCTransform.
To my understanding, the negative binomial model of sctransform is capable of regressing out additional variables, like this:
sctransform::vst(data, latent_var_nonreg=c("var1", "var2"))
But looking inside the code of Seurat::SCTransform(), it seems to me this is not what happens. If I understand it correctly, Seurat::SCTransform() first runs sctransform::vst() with latent_var_nonreg=NULL, and then runs Seurat::ScaleData() on the result with vars.to.regress set to our latent variables, i.e. something like this:
res <- sctransform::vst(data, latent_var_nonreg=NULL)$y
res <- Seurat::ScaleData(res, vars.to.regress=c("var1", "var2"))
In other words, it first fits the regularized negative binomial model to regress out the effect of cell UMI counts, and then fits another linear model on the Pearson residuals from the first model to regress out the other variables.
If this is true, my question is, why is it done this way? And would you recommend trying to regress out additional latent variables directly in the negative binomial model and then use ScaleData() just for scaling?
Thank you
Adam
Hi, Adam
You are right: latent_var_nonreg can be used to regress out additional latent variables.
Actually, after sctransform::vst() there is a default step, vst.out$umi_corrected <- correct_counts(x = vst.out, umi = umi, show_progress = verbose), which corrects the UMI counts and aims to eliminate the influence of sequencing depth on them.
If we set latent_var_nonreg in sctransform::vst(), the re-computed UMI counts will also be affected by your latent variables, so the corrected UMI counts may be corrected too much.
But when you set vars.to.regress, the corrected UMI counts are computed first, and ScaleData() is then run on the Pearson residuals to regress out your latent variables.
Currently, we suggest you use vars.to.regress to regress out your latent variables.
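For example, that recommendation would look something like this (a minimal sketch, not from the thread; seurat_obj, "var1" and "var2" are placeholder names for your object and its meta.data columns):
# sequencing depth is handled inside the regularized NB model;
# the extra latent variables are regressed out of the Pearson residuals via ScaleData()
seurat_obj <- Seurat::SCTransform(seurat_obj, vars.to.regress = c("var1", "var2"))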
Hi @yuhanH, I just hit upon this thread and now have a related question. It used to be that we passed nUMI to ScaleData() as a variable to regress out. If I understand correctly, is it no longer necessary to include UMI counts in vars.to.regress as a latent variable when using SCTransform?
Right. You can see that nCount_RNA should be the same for all cells in the UMI-corrected data.
Also, UMI counts do not need to be set in vars.to.regress, because sequencing depth is already handled in the regularized negative binomial model.
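If you want to verify this on your own object, a quick check could look like the following (a sketch, not from the thread; seurat_obj is assumed to have already been run through SCTransform):
# per-cell totals of the corrected UMI counts stored in the SCT assay
corrected_counts <- Seurat::GetAssayData(seurat_obj, assay = "SCT", slot = "counts")
summary(Matrix::colSums(corrected_counts))
Note that the question further down in this thread reports that these per-cell totals are not all identical in practice.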
I wanted to return briefly to this question. In clinical samples, it is pretty common to have large patient effects. From what I gather, data integration or tools like fastMNN are well suited to this issue. But there are also cases more like traditional "batch" effects (e.g. the same two patients spread across two different runs/dates) in which the composition of the samples is likely to be pretty similar. In these cases, would you suggest adding batch to vars.to.regress? Going back to the vst function itself? Or some alternative version (e.g. going through the SCTransform/Integration process twice, once with patient and once with batch)? I suppose one could ask similarly about cell-cycle regression.
[Side note: The batch effect is actually nicely reduced with SCTransform relative to the standard approach]
For obvious batch effects (such as different patients), we would suggest you run the integration process.
Usually, two lanes from the same patient should not have very strong batch effects, and you can add the lane as a variable in vars.to.regress.
Or you can run SCTransform first and see whether there is a batch effect between the different lanes at all.
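A quick way to eyeball that (a sketch; "lane" is assumed to be a column in the object's meta.data, and the dims are placeholders):
seurat_obj <- Seurat::SCTransform(seurat_obj)
seurat_obj <- Seurat::RunPCA(seurat_obj)
seurat_obj <- Seurat::RunUMAP(seurat_obj, dims = 1:30)
# if cells separate clearly by lane here, consider adding the lane to vars.to.regress
Seurat::DimPlot(seurat_obj, group.by = "lane")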
Alas, the batch effects in this (badly designed) experiment involve two different 10x runs with different chemistry versions -- meaning that batch effects are actually quite substantial. But because the individual 10x runs (and even patients) are not well balanced with respect to tissue composition, data integration via CCA seems to be overkill (though fastMNN works nicely).
I've had good success using SCTransform at the patient level and including regression terms for 10x chemistry and/or batch date. This leaves me integrating data between patients, and that seems to work well. But the process leaves me with a couple of general questions.
1) For SCTransform, is there any practical difference between including batch-like variables in vars.to.regress vs. the batch_var option passed to sctransform::vst? The two options have very similar impacts on clustering, but perhaps there are some additional consequences for downstream analyses. [And is either appropriate for things like cell-cycle regression?]
2) For SCTransform, there seem to be two approaches one could take (both sketched below).
a) Perform SCTransform at the lowest granularity (individual samples/batches) followed by a merge into the patient level.
b) Merge the raw counts and include a sample/batch term in SCTransform (question 1)
Is there any reason to prefer one approach vs. the other?
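To make the two options concrete, a rough sketch of both (placeholder object patient_obj with a "batch" column in meta.data; this is only an illustration, not a recommendation of either):
# (a) SCTransform each sample/batch separately, then merge to the patient level
obj_list <- Seurat::SplitObject(patient_obj, split.by = "batch")
obj_list <- lapply(obj_list, Seurat::SCTransform)
merged_a <- merge(obj_list[[1]], y = obj_list[-1])
# (b) keep the raw counts merged and include the batch term in the model instead
merged_b <- Seurat::SCTransform(patient_obj, vars.to.regress = "batch")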
Note: There do seem to be some practical differences: for (a) you get downstream errors during data integration of the sort
Error in scale.data[anchor.features, ] : subscript out of bounds
In addition: Warning message:
In GetResidual(object = object.list[[i]], features = anchor.features, :
The following 2579 features do not exist in all SCT models: CXCL8..
suggesting some issues in terms of what happens to the SCT data after a merge.
Thanks!
> Right. You can see that nCount_RNA should be the same for all cells in the UMI-corrected data. Also, UMI counts do not need to be set in vars.to.regress, because sequencing depth is already handled in the regularized negative binomial model.
Hello. Do you mean that the sum of the corrected counts per cell in the counts slot from SCT should be the same for each cell? Maybe it's a naive question, but I get different corrected counts per cell in that slot.