Seurat: about NormalizeData, ScaleData, PCA, CCA ..

Created on 17 Nov 2018 · 4 Comments · Source: satijalab/seurat

Dear Seurat authors and contributors,

As I have just started reading the Seurat documentation for scRNA-seq, I would appreciate your answers and insights on the following:

1. After NormalizeData(), why is ScaleData() needed?

2. Do FindVariableGenes(), RunPCA() and FindClusters() work on Normalized_Data or on Scaled_Data?

3. Is ScaleData() absolutely needed in scRNA-seq analysis?

4. Does RunCCA() work on Normalized_Data or on Scaled_Data of each sample?

thanks a lot,

-- bogdan

Most helpful comment

1. After NormalizeData(), why is ScaleData() needed?

NormalizeData() only accounts for the sequencing depth of each cell: each count is divided by the cell's total counts, multiplied by 10000, and then log-transformed. ScaleData() zero-centres and scales each gene (see ?ScaleData). Scaling, i.e. (x - mean) / sd, brings the genes into the same range; otherwise the huge differences in range between genes make it hard to compare expression across genes. Scaling is routine for improving clustering and other analyses. (You may also like to look at the scale() function in R.)
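To make the two steps concrete, here is a minimal base R sketch of what they do conceptually. It is not Seurat's own code; the toy count matrix and the variable names are made up for illustration, and the log transform is assumed to be log1p.

```r
# Toy count matrix: rows are genes, columns are cells (values are made up).
counts <- matrix(
  c(10, 20,  30,
     0,  5,  15,
    90, 75, 255),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), c("cell1", "cell2", "cell3"))
)

# Normalization: per cell, divide by the cell's total counts, multiply by
# a scale factor (10000 here), then log-transform.
scale_factor <- 10000
normalized <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)

# Scaling: per gene, subtract the mean and divide by the standard deviation.
# scale() works column-wise, so transpose to put genes in columns, then back.
scaled <- t(scale(t(normalized)))

round(rowMeans(scaled), 10)  # each gene now has mean 0
apply(scaled, 1, sd)         # and standard deviation 1
```

Note that normalization works down the columns (per cell), while scaling works along the rows (per gene).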

In the recent versions of Seurat, the ScaleData function is also used to regress out unwanted variables.
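As far as I understand it, regressing out a variable (the vars.to.regress argument of ScaleData()) amounts to fitting a model of each gene against the unwanted variable and keeping the residuals before scaling. The following is only a conceptual sketch in base R with simulated data for a single gene; n_umi and gene are hypothetical names, and Seurat's actual implementation may differ in its details.

```r
set.seed(42)
n_cells <- 100

# Hypothetical unwanted covariate, e.g. number of molecules detected per cell.
n_umi <- runif(n_cells, min = 1000, max = 5000)

# Hypothetical normalized expression of one gene that partly tracks n_umi.
gene <- 0.001 * n_umi + rnorm(n_cells, sd = 0.5)

# Fit a linear model and keep the residuals: the part of the expression
# that is not explained by the covariate.
residuals_gene <- resid(lm(gene ~ n_umi))

# Centre and scale the residuals, as in the scaling step.
scaled_gene <- as.numeric(scale(residuals_gene))

cor(gene, n_umi)         # clearly positive before regression
cor(scaled_gene, n_umi)  # essentially zero afterwards
```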

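2. Do FindVariableGenes(), RunPCA() and FindClusters() work on Normalized_Data or on Scaled_Data?

RunPCA() works on Scaled_Data, and FindClusters() works on the PCA result computed from it. FindVariableGenes() is the exception: it uses Normalized_Data. Scaling matters for PCA and clustering because, as I said, it facilitates the comparison across the genes, e.g.: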

g1 10 20 30 40 50
g2 20 40 60 80 100

Although g2 has double the expression of g1, their _pattern_ of expression is the same, and scaling will "normalize" their expression so that they cluster together.
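You can check the g1/g2 example directly with the base R scale() function. This is just a quick sketch of the example above, using no Seurat functions:

```r
g1 <- c(10, 20, 30, 40, 50)
g2 <- c(20, 40, 60, 80, 100)

# scale() returns a one-column matrix; as.numeric() turns it into a vector.
z1 <- as.numeric(scale(g1))
z2 <- as.numeric(scale(g2))

all.equal(z1, z2)    # the two genes have the same z-scores
dist(rbind(g1, g2))  # large Euclidean distance before scaling
dist(rbind(z1, z2))  # essentially zero after scaling
```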

3. Is ScaleData() absolutely needed in scRNA-seq analysis?

Scaling is not specific to scRNA-seq. It is an important step in many machine learning / dimensionality reduction algorithms that compare distances between features. If you don't scale, a feature with a large range of variation might dominate or bias your analysis (because it will produce large distances). Scaling "normalizes" these large differences in variation among the features.

I think you are confusing normalization with scaling. Normalization "normalizes" within each cell for differences in sequencing depth / mRNA throughput. Scaling "normalizes" across the sample for differences in the _range_ of variation of gene expression.
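To illustrate how an unscaled feature can dominate, here is a small sketch using base R prcomp() on simulated data (not Seurat's RunPCA(), and not real expression values). The two "genes" are hypothetical: they have a similar relative spread but very different ranges.

```r
set.seed(1)
n_cells <- 200

# Two hypothetical genes with very different ranges of variation.
low  <- rnorm(n_cells, mean = 1,   sd = 0.5)
high <- rnorm(n_cells, mean = 100, sd = 50)
expr <- cbind(low = low, high = high)

# PCA without and with scaling of the features.
pca_raw    <- prcomp(expr, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(expr, center = TRUE, scale. = TRUE)

# Loadings of the first principal component.
pca_raw$rotation[, "PC1"]     # almost entirely the "high" gene
pca_scaled$rotation[, "PC1"]  # both genes contribute with equal magnitude
```

Without scaling, the first component is essentially just the gene with the larger range; with scaling, neither gene dominates.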

4. Does RunCCA() work on Normalized_Data or on Scaled_Data of each sample?

?RunCCA gives you the answer

RunCCA(object, object2, group1, group2, group.by, num.cc = 20, genes.use,
  scale.data = TRUE, rescale.groups = FALSE, ...)

As you can see, scale.data = TRUE by default, so it works on the scaled data.

All 4 comments

Hi,

I am not part of the team, but I may be able to answer some of the questions based on my experience.

Without running NormalizeData(), running FindVariableGenes will throw an error:
Error in seq.int(rx[1L], rx[2L], length.out = nb) : 'to' must be a finite number

Without running ScaleData(), running RunPCA will throw this error:
Error in GetAssayData(object, assay.type = assay.type, slot = "scale.data") : Object@scale.data has not been set. Run ScaleData() and then retry.

Now this will also happen even if NormalizeData() and FindVariableGenes are run.

So I guess FindVariableGenes uses normalized data and RunPCA uses scaled data.

I am sorry that I am not sure about CCA.

Thank you for your comments; yes, if the authors can advise us on these, it would be great.

Dear Santosh, thank you. Very helpful to understand the statistical design of the algorithm.
