Hi Team Seurat,
Similar to issue #1547,
I integrated samples across multiple batch conditions and diets after performing SCTransform (according to your most recent vignette for integration with SCTransform - Compiled: 2019-07-16). Now, I have a Seurat object with 3 assays: RNA, SCT, and Integrated.
So I have a couple of questions regarding my workflow:
For downstream DE analysis, the scale.data slot in the SCT assay has disappeared after integration. Therefore, I assume I cannot use Pearson residuals for DE analysis. Is it possible and valid instead to use values from the "data" slot of the SCT assay (log-normalized corrected values) for the MAST test?
I want to subset a specific cell type (cluster) and examine subtypes in this cell type. So, my here is my workflow:
SCT_integrated <- IntegrateData(anchorset = SCT_Integrated.anchors, normalization.method = "SCT", features.to.integrate = rownames(SCT_Integrated))
SCT_integrated <- RunPCA(SCT_integrated)
SCT_integrated <- FindNeighbors(SCT_integrated, dims = 1:15)
SCT_integrated <- RunUMAP(SCT_integrated, dims = 1:15)
SCT_integrated <- FindClusters(SCT_integrated)
control_subset <- subset(SCT_integrated, orig.ident = 'Chow')
DefaultAssay(control_subset) <- "RNA"
control_subset <- FindVariableFeatures(control_subset, selection.method = "vst", nfeatures = 3000)
DefaultAssay(control_subset) <- "integrated"
control_subset <- ScaleData(control_subset, verbose = FALSE, vars.to.regress = c( "percent.mt"))
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE)
control_subset <- FindNeighbors(control_subset, dims = 1:15)
control_subset <- RunUMAP(control_subset, dims = 1:15)
control_subset <- FindClusters(control_subset)
I want to know:
a. Is it valid to set features.to.integrate to all the genes in the original Seurat object if I want run subclustering on the subset using its integrated assay?
b. Is it necessary to run FindVariableFeatures on the RNA assay of the subset and get new variables to use in PCA in order to properly cluster the subset? I am worried that the top variable features of the original Seurat Object are not the same variable features of the new subset. Also, instead of changing the default assay to "RNA", finding the variable features, and changing the default assay back to "integrated", would it be make more sense to just delete those lines of code and just change:
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE) to
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE, features = Variable Features(control_subset))
c. Should FindVariableFeatures be run on the RNA assay, the integrated assay, or the SCT assay? (I assume if I just need to delete the 3 lines of code I just mentioned above and change
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE) to
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE, features = Variable Features(control_subset)),
then the answer is to run it on the integrated assay).
d. Should ScaleData be run on the subset prior to PCA even though the subset comes from an integrated object prepped from SCT? (I ask because in the new integration vignette, it explicitly mentions not to run ScaleData after running the IntegrateData function)?
I have a conceptual question about the batch-correction (integration) model developed by Seurat (the one from the most recent vignette for integration with SCTransform - Compiled: 2019-07-16). Does this batch-correction overfit the data so much so such that legitimate biological differences in gene expression profiles of cells from different diets (HFD, LFD, Chow) are gone? Since the data I am analyzing comes from different diets as well as different batches, will batch-correction make me unable to determine differences in gene expression of cells from different diets? If so, would only performing batch correction on batches of the same diet and merging all the diets together without batch correction be a valid method of retaining gene expression differences between diet but not batches?
If I decide that batch correction is not required for my samples, could I subset cells from my original Seurat Object (after running Quality Control and clustering on it), set the assay to "RNA", and and run the standard SCTransform pipeline. In other words, is this workflow valid:
SCT_not_integrated <- FindClusters(SCT_not_integrated)
control_subset <- subset(SCT_not_integrated, orig.ident = 'Chow')
DefaultAssay(control_subset) <- "RNA"
control_subset <- SCTransform(control_subset, vars.to.regress = "percent.mt") %>% RunPCA() %>% FindNeighbors(dims = 1:15) %>% RunUMAP(dims = 1:15) %>% FindClusters()
In addition, since I am not integrating the subset, is it recommended to use the "scale.data" slot in the SCT assay for DE analysis or continue using the "data" slot in the SCT assay for this subset?
Thanks for all your help
@attal-kush I hope its okay to piggyback of your question. I have been following the SCTransform integration tutorial and it doesn't mention how to FindClusters or identify cluster specific markers. I simply used the FindNeighbors and FindClusters command in order to create the 'seurat_clusters' list in the meta.data. Could you please let me know if the steps below are the correct way to go about identifying clusters and markers?
Steps
Are these the correct steps to follow? I just want to make sure the Seurat Team agrees with my workflow for identifying the cell clusters and conserved markers for the integrated and sctransform analysis.
Thank you!
Hello Seurat Team,
Thank you for the wonderful package. I have similar questions as @attal-kush with regards to reclustering of a subset of an integrated object. The ideal workflow is not clear to me and perusing the vignettes and past issues did not clarify it fully.
Is this workflow indeed the best? Or should we go directly onto integrated dataset and RunPCA?
control_subset <- subset(SCT_integrated, orig.ident = 'Chow')
DefaultAssay(control_subset) <- "RNA"
control_subset <- FindVariableFeatures(control_subset, selection.method = "vst", nfeatures = 3000)
DefaultAssay(control_subset) <- "integrated"
control_subset <- ScaleData(control_subset, verbose = FALSE, vars.to.regress = c( "percent.mt"))
control_subset <- RunPCA(control_subset, npcs = 30, verbose = FALSE)
control_subset <- FindNeighbors(control_subset, dims = 1:15)
control_subset <- RunUMAP(control_subset, dims = 1:15)
control_subset <- FindClusters(control_subset)
@MediciPrime That looks correct to me, though your resolution=0.2 parameter is quite low.
I would also like to know the recommended way of doing this.
Thank you for your reply @j-andrews7!
Hi all, I'm also interested to this topic: what is the best way to subset and reclustering data starting from an integrating dataset? After subsetting clusters of interest (subsetting by ident) I have a Seurat object with RNA, SCT and integrated assay, and dimensional reduction (pca, tsne, umap) coming from the original Seurat object. The integrated assay consists of 3000 features comings from the original integration analysis (so choosed from the whole dataset, and not only from cells of the subset).
Thanks!!
Alberto
Hi all,
I'm also interested in understanding better how to do this. I'm writing here to be sure to receive an email when somebody will post an explanation here :-)
My intuition would have been that when I focus on 1 cluster, I can start the whole process (including integration) as if it was a fresh dataset, especially given the fact that the integration has focused on genes that were highly variable in the whole dataset but may not be variable at all in the population I'm focusing on... But reading a few posts and issues here, it's not the way to go and I would like to understand why and to know how to do it properly.
Thank you @satijalab for this amazing tool and the amazing tutorials !!!!
All the best,
Alice
a) My approach would be to just run FindClusters() with a higher resolution on the whole dataset until the desired subclustering is reached. Then find the DEGs between 2 clusters with FindMarkers(ident.1=, ident.2=).
b) Running FindVariableGenes() and RunPCA() again on the integrated dataset does not seem helpful to me because the limited feature space of 3000 is not changed. The alternative would be to subset() the population of interest and run the complete preprocessing including integration only on those cells again. That enables to change the feature space.
But I'm also curious how others approach this!
EDIT:
After discussing with colleagues and reading other articles I decided to go for option b)
See this paper also:

Hi all, I'm also interested in this issue, and wonder what is the best way to subset and reclustering data starting from an integrating dataset?
As suggested by #2042, you can change the set of features to be integrated by using the features.to.integrate argument in IntegrateData. By default, this is set to the VariableFeatures. How to set the 'features.to.integrate' as all the features?
Hi All,
I am also stuck on this issue too. I followed a similar approach to @amayer21 with regards to treating the data set as new after removing clusters/cells. Which of course included re-calculating the variable genes (on the "RNA" Slot) and re-integration. For the same reasons, I felt this was the most intuitive way. From my understanding, including all genes into the "Feature.to.integrate" functions will give you a gene matrix of all genes altered based on the integration, but the PCA analysis and subsequent non-linear dimensionality reduction and clustering will still be calculated based on the 2000 features found in the "Find.Integration.anchors" functions (unless otherwise stated), which change depending on the original data used, ie subsetted or whole.
From reading the other issues posted regarding the topic it appears that any kind of re-analysis prior to integration is not recommended, and that once subsetted a integrated data set should just be re-scaled and the pipeline followed on from this point on.
Similar to @amayer21 I am wondering what the best way to approach this is, and why treating a subsetted data set as new is not the correct way to run an integrated analysis pipeline?
Thank you @satijalab for this amazing analysis package.
Best wishes
Jordan
Hi All,
I followed a similar approach to @attal-kush. Does anyone has found a better solution to re-project a cluster of the dataset?
I know that we shouldn't rescale subsetted data from an integrated object but is it possible to RunUMAP on the subsetted data so I can at least get a plot? I have increased the resolution on FindClusters to analyze the integrated object and get my cluster of interested subclustered enough for DEG analysis but would simply like a new UMAP plot to visualize expression within that group of clusters.
Hello,
Still in the same situation.
@timoast , how can we finally tackle this issue?
It seems that a repeated possibility would be to change the features.to.integrate argument in IntegrateData to all_common_features between the different integrated datasets, however I have a quite big dataset (100.000 cells) and I'm experiencing memory issues:
Error in subCsp_rows(x, i, drop = drop) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
In any case, could this workflow (slightly modified from the one from @attal-kush) be accepted to subcluster from an integrated object?
commongenes_tokeep <- Reduce(intersect, lapply([email protected], rownames))
SCT_integrated <- IntegrateData(anchorset = SCT_Integrated.anchors, normalization.method = "SCT", features.to.integrate = commongenes_tokeep)
SCT_integrated <- RunPCA(SCT_integrated)
SCT_integrated <- RunUMAP(SCT_integrated, dims = 1:30)
SCT_integrated <- FindNeighbors(SCT_integrated, dims = 1:30)
SCT_integrated <- FindClusters(SCT_integrated, resolution=0.8)
DefaultAssay(Sample_subset) <- "RNA"
Sample_subset <- subset(SCT_integrated, idents = c(0, 2, 3, 7))
Sample_subset <- ScaleData(Sample_subset, vars.to.regress = c("percent.mt")
Sample_subset <- RunPCA(Sample_subset)
Sample_subset <- RunUMAP(Sample_subset, dims = 1:30)
Sample_subset <- FindNeighbors(Sample_subset, dims = 1:30)
Sample_subset <- FindClusters(Sample_subset, resolution = 0.8)
@attal-kush Your questions are so comprehensive and I am also curious if there is a practical way to analyse the subsetted cells.
Btw, regarding DE analysis in your question 1, according to https://github.com/satijalab/seurat/issues/1836#issuecomment-513632072, it says that both RNA and SCT assay could be used for DE analysis if my understanding is correct.
it makes no sense to me the not to use the integrated assay on every downstream analysis. isn't the whole point of integration to remove batch effects?
I tried this two ways
and reading this issue I only got more confused.
Hi @attal-kush ,
I have also been working on the single cell dataset and there are several times that i need to subcluster a proportion cell type. I have tried several approaches before and i think they may be helpful:
Hi @attal-kush ,
I have also been working on the single cell dataset and there are several times that i need to subcluster a proportion cell type. I have tried several approaches before and i think they may be helpful:
- Choose a subset of cells, and use the integration assay to Run PCA, umap, findneighbors and findclusters to do subclustering. The pro of this approach is that it is fast and easy. However there are a few times that i found some genes that are primary markers for one certain subtype of the cells i want to sub clustering do not exist in the integration assay, which may lead to some problems.
- Choose a subset of cells, and then split by samples and then re-run the integration steps (select integration features, find anchors and integrate data). The pro of this approach is that I use this method to solve the problem in the previous approach and now i have the genes that are primary markers for the cell sub types. However i do not believe this is the correct approach to do integration so i did not choose this method.
I hope this helps.
which do you use for DE and heatmap?
did the Seurat team ever respond to this..?
I think the proper way is to subset before integration as in Smillie et al. 2019 as referred to by @tilofreiwald. Note that @timoast from the Seurat team recommended otherwise, although I never seen an explanation why would this not best way to go.
While I did not test the above, I tested running FindVariableFeatures() (or not), and I recommend re-running FindVariableFeatures(). Note, that tested this on one data set only so far.
FindVariableFeatures() it is basically a zoom-in on the original (before subset) UMAP.Thus not much is gained.

FindVariableFeatures(): the neurons and corresponding progenitors (7,8) are linkedAlso, cells previously occurring as cluster outliers from cl7 found their way to the corresponding clusters.

Note that overall, the major structure is conserved, the effect may be particular to this data set.
Hi @vertesy ,
I used the first way as @Zha0rong described for re-clustering of subset cells, choosing a subset and then use the integration assay to Run PCA, umap, findneighbors and findclusters to do subclustering. It works, however, for some types of cells, not very well. So I guess FindVariableFeatures of the subset cells should be tried. But I am not sure which assay should be used for FindVariableFeatures of the subset cells, RNA, SCT, or Integrated? I did integration with SCTransform.
Thank you.
Most helpful comment
@MediciPrime That looks correct to me, though your
resolution=0.2parameter is quite low.I would also like to know the recommended way of doing this.