Seurat: Comparing clusters of two different populations sequenced separately

Created on 20 Dec 2017 · 17 comments · Source: satijalab/seurat

Hello all,
I am trying to compare a few clusters of interest from two different cell populations that were sequenced separately on the 10X platform. I am worried that I am not being stringent enough in removing batch-to-batch variation. This is what I am doing:

  1. Set up run 1 (call it POP1): create the Seurat object, filter, and scale using nUMI and %mito; then run PCA, JackStraw, FindClusters, and finally tSNE.
  2. Set up run 2 (call it POP2) with the same parameters as above.
  3. Subset the clusters I want to compare from POP1 (cluster1_POP1) and POP2 (cluster4_POP2).
  4. Use the MergeSeurat function to merge cluster1_POP1 and cluster4_POP2, then ScaleData using nUMI and %mito, followed by PCA, JackStraw, FindClusters, and tSNE.
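For context, steps 1 and 2 in Seurat v2 look roughly like the sketch below. This is a minimal outline only: the data directory, gene prefix, and all cutoff values are placeholders from the standard v2 tutorial, not recommendations for this data.

```r
library(Seurat)

# Step 1: set up POP1 from 10X output (path is a placeholder)
pop1.data <- Read10X(data.dir = "pop1/")
pop1 <- CreateSeuratObject(raw.data = pop1.data, project = "POP1",
                           min.cells = 3, min.genes = 200)

# Compute %mito and filter (thresholds are illustrative)
mito.genes <- grep("^MT-", rownames(pop1@data), value = TRUE)
percent.mito <- Matrix::colSums(pop1@raw.data[mito.genes, ]) /
  Matrix::colSums(pop1@raw.data)
pop1 <- AddMetaData(pop1, metadata = percent.mito, col.name = "percent.mito")
pop1 <- FilterCells(pop1, subset.names = c("nGene", "percent.mito"),
                    low.thresholds = c(200, -Inf),
                    high.thresholds = c(2500, 0.05))

# Normalize, find variable genes, scale (regressing nUMI and %mito)
pop1 <- NormalizeData(pop1)
pop1 <- FindVariableGenes(pop1, do.plot = FALSE)
pop1 <- ScaleData(pop1, vars.to.regress = c("nUMI", "percent.mito"))

# PCA, JackStraw, clustering, tSNE
pop1 <- RunPCA(pop1, pc.genes = pop1@var.genes)
pop1 <- JackStraw(pop1, num.replicate = 100)
pop1 <- FindClusters(pop1, dims.use = 1:10, resolution = 0.6)
pop1 <- RunTSNE(pop1, dims.use = 1:10)
```

Step 2 is the same with the POP2 data directory and project name.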

My question to the community: is this process enough to remove technical variation arising from the two separate runs?

All 17 comments

Hi,

In my opinion, what you are doing so far is not removing batch effects. Did you try scaling for orig.ident, or for any other column in your object@meta.data slot in which each group of cells (i.e. cells from cluster1_POP1 and cells from cluster4_POP2) has a unique level?
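As a quick sanity check (assuming the merged object is called `pop_merged`, as later in this thread), you can tabulate the candidate column to confirm that each batch has its own level:

```r
# Count cells per level of orig.ident; each batch should appear
# as a separate level with a plausible cell count
table(pop_merged@meta.data$orig.ident)
```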

Best,
Leon

@leonfodoulian

Thanks - that's what I thought: there is no step where I am actually removing batch effects.

To scale for orig.ident, will I just use the ScaleData function and regress out "orig.ident"?

e.g.,

ScaleData(object = pop_merged, vars.to.regress = c("orig.ident"))

where pop_merged is the merged object of cluster1_POP1 and cluster4_POP2.

I am pretty novice when it comes to R coding, hence the potentially inane question.

Yes. You can also regress out nUMI and percent.mito. Just make sure that your cells from cluster1_POP1 have a unique level (i.e. identity) in pop_merged@meta.data$orig.ident compared to cluster4_POP2. Otherwise, you can create that column as follows:

# Define 2 character vectors of cell names for each of the clusters
# cluster1_POP1_cells and cluster4_POP2_cells 

# Create a new meta.data column corresponding to batch of cells
pop_merged@meta.data$batch <- ifelse(rownames(pop_merged@meta.data) %in% cluster1_POP1_cells, "POP1",
                                     ifelse(rownames(pop_merged@meta.data) %in% cluster4_POP2_cells, "POP2",
                                            NA))

# Regress out batch, nUMI and percent.mito
pop_merged <- ScaleData(object = pop_merged, vars.to.regress = c("batch", "nUMI", "percent.mito"))
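Before running ScaleData, it may be worth confirming that every cell actually landed in one of the two batches. This check is my addition, not part of Leon's snippet:

```r
# Any NA here means a cell name was in neither cluster1_POP1_cells
# nor cluster4_POP2_cells, so the batch label is incomplete
stopifnot(!any(is.na(pop_merged@meta.data$batch)))
table(pop_merged@meta.data$batch)
```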

Best,
Leon

Thanks Leon.

I always set a unique project ID, which gets saved under "orig.ident", whenever I create a new Seurat object. Because I am merging/comparing subsets of the two already analyzed Seurat objects, they bring their orig.ident along with them, so I do have a unique orig.ident value for each batch.

I appreciate your help and will redo the merge, scaling the data on orig.ident in addition to nUMI/percent.mito.

Just another quick question - I am guessing that I should calculate the variable genes in the merged object "after" running ScaleData, is that right? Or does it not make any difference?

--
Thanks,
Jaymin.

Variable genes are not computed from the scaled data. However, to speed up the analysis, you can feed the variable genes to ScaleData(), since these genes will be used for downstream analysis anyway (i.e. to compute PCA, tSNE and clusters). But make sure to compute the variable genes on your merged Seurat object.

Best,
Leon

Great! Thank you, as always.

I forgot to paste the code snippet:

pop_merged <- ScaleData(object = pop_merged, genes.use = pop_merged@var.genes, vars.to.regress = c("batch", "nUMI", "percent.mito"))

Best,
Leon

This is a very interesting thread as I am trying to do something similar with (gulp) eight datasets.

Question: If I merge datasets iteratively (e.g. MergeSeurat(data1_subpop2, data2_subpop4) followed by MergeSeurat(FirstMergedSet, data3_subpop6)), should I be regressing after each merge or only once I have merged all eight datasets?

Alternatively, can I simplify this by skipping the filtering/scaling, PCA calculations, JackStraw, FindClusters, and tSNE steps for the individual samples and just run these processes on the final merged dataset, including orig.ident as a regression factor? I've tried something similar before (albeit regressing on orig.ident) and the results look surprisingly good for most of the cell type clusters/subclusters, but I am not 100% sure this removes all batch effects.

Hello @dandepledge ,

There is no need to regress after each merge, because your object@scale.data slot will be replaced at each run of ScaleData().

Regarding your second question, if you are interested in performing an integrated analysis for all 8 datasets, you can simply perform the analysis on the final merged Seurat object. Indeed, for batch correction, you should regress on orig.ident, or on a similar variable in your final merged object's @meta.data slot. This approach, as you mentioned, does not remove all batch effects. For more details, you can refer to issue #187, and read the following bioRxiv paper.
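Assuming the eight subsetted Seurat v2 objects are held in a list (object names here are hypothetical, matching Dan's examples), the merge-iteratively-then-regress-once approach could be sketched as:

```r
# Hypothetical list of the eight subsetted objects
object.list <- list(data1_subpop2, data2_subpop4, data3_subpop6,
                    data4_subpop1, data5_subpop3, data6_subpop5,
                    data7_subpop2, data8_subpop4)

# Merge iteratively; no ScaleData() is needed between merges
merged <- Reduce(function(a, b) MergeSeurat(object1 = a, object2 = b),
                 object.list)

# Find variable genes, then regress once, on the final merged object
merged <- FindVariableGenes(merged, do.plot = FALSE)
merged <- ScaleData(merged, genes.use = merged@var.genes,
                    vars.to.regress = c("orig.ident", "nUMI", "percent.mito"))
```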

Best,
Leon

Ok great - I think I have finally wrapped my head around this.

Thanks for getting back to me so quickly!

Dan

Dear Leon

Regarding your second question, if you are interested in performing an integrated analysis for all 8 datasets, you can simply perform the analysis on the final merged Seurat object. Indeed, for batch correction, you should regress on orig.ident, or on a similar variable in your final merged object's @meta.data slot. This approach, as you mentioned, does not remove all batch effects. For more details, you can refer to issue #187, and read the following bioRxiv paper.

Just a question to improve my understanding: when regressing on orig.ident, aren't you also regressing out a lot of biological information? If batch1 to batch8 are different biological entities, and have therefore each been sequenced separately (batch1...8), isn't this a perfect confounder? Regressing it out would address the batch effect but also remove the biological differences.

Cheers,
Charles

Hi Charles,

I have never tried combining datasets from different biological samples, but I think that in this case you shouldn't regress on batch. At least, this is what I have seen people do when performing such analyses. Maybe others have more experience with this and can give a more thorough answer to your question.

Best,
Leon

Hi Leon,
Along the same lines: if I were combining two different biological samples (different populations), should I exclude the step that looks at the variance explained by low-dimensional CCA vs PCA, since this step is primarily meant to exclude rare cell populations? (I am looking to retain rare populations found in both sets in the merged output.)

Where data is a Seurat object:
data<- SubsetData(data, subset.name = "var.ratio.pca", accept.low = 0.5)

Best,
Anu

Hi Anu,

I think it would be a better idea to open a new issue on this GitHub repository for your question. However, I have to mention that I am a user, not the developer of the CCA algorithm, so a more suitable answer may be provided by @andrewwbutler and @satijalab.

To my understanding, CCA should be applied to datasets from the same biological sample that are difficult to integrate due to batch effects, so I am not sure to what extent you can integrate data from different biological samples. However, if I were in your place, I would first analyse each dataset separately and try to identify these rare cell types based on cluster-specific marker genes.
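In Seurat v2, that per-dataset check might look something like the snippet below. The cluster identity and min.pct threshold are placeholders for illustration:

```r
# Marker genes for a candidate rare cluster (here, cluster 5) in one
# dataset, expressed in at least 25% of cells in that cluster
rare.markers <- FindMarkers(object = data, ident.1 = 5, min.pct = 0.25)
head(rare.markers)
```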

Best,
Leon

Question: If I merge datasets iteratively (e.g. MergeSeurat(data1_subpop2, data2_subpop4) followed by MergeSeurat(FirstMergedSet, data3_subpop6)), should I be regressing after each merge or only once I have merged all eight datasets?

There is no need to regress after each merge, because your object@scale.data slot will be replaced at each run of ScaleData().

Is this true for the log-normalization step too? I noticed that MergeSeurat() performs log-normalization automatically – if merging objects iteratively, is the log-normalized @data slot replaced every time?

Thanks!

Hi @carmensandoval,

Yes, that is the case too. Check the Seurat::MergeSeurat() code below for more details.

https://github.com/satijalab/seurat/blob/65b77a9480281ef9ab1aa8816f7c781752092c18/R/interaction.R#L47-L146
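If re-normalizing at every merge is a concern, MergeSeurat() in Seurat v2 exposes a do.normalize flag (check the signature in your installed version), so log-normalization can be deferred to the final object:

```r
# Merge without normalizing at each step, then normalize once at the end
merged <- MergeSeurat(object1 = data1_subpop2, object2 = data2_subpop4,
                      do.normalize = FALSE)
merged <- MergeSeurat(object1 = merged, object2 = data3_subpop6,
                      do.normalize = FALSE)

# Log-normalize once, on the final merged object
merged <- NormalizeData(merged, normalization.method = "LogNormalize",
                        scale.factor = 10000)
```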

Best,
Leon

Thank you, Leon!
