Do the module scores from AddModuleScore() have any specific meaning? I understand that it's the difference between the average expression levels of each gene set and random control genes. Positive numbers indicate higher expression level than random. However, is it possible to assign any meaning to the actual units? For example, would any cutoff besides 0 make any sense or would it be arbitrary?
A positive score would suggest that this module of genes is expressed in a particular cell more highly than would be expected, given the average expression of this module across the population.
However, there is no physical meaning to the actual scores (they are unit-less as you suggest).
Hi, on a somewhat different note, it appears that log normalized, but unscaled (non batch corrected) data is used for this function. Should there be any effects of batch on the score that comes out?
There was actually a related question raised earlier (https://github.com/satijalab/seurat/issues/62) with the following response:
In general, we use object@scale.data for functions that identify structure in the data, such as dimensionality reduction, as this will tend to give lowly and highly expressed genes equal weight. Values in object@scale.data can therefore be negative, while values in object@data are >=0.
For FindMarkers and AverageExpression, we want to either discover DE genes or compute in silico cluster averages, so using object@scale.data would be inappropriate.
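To make the difference between the two slots concrete, here is a minimal sketch in plain Python (not Seurat code). It assumes log-normalization means log1p(count / cell total * 10000) and that scaling is a simple per-gene z-score; the real ScaleData does more than this, and the function names `log_normalize` and `scale_genes` are made up for illustration.

```python
import math
from statistics import mean, stdev

def log_normalize(counts, scale_factor=10_000):
    """counts: list of cells, each a list of per-gene raw counts."""
    out = []
    for cell in counts:
        total = sum(cell)
        # log1p of a non-negative number is non-negative
        out.append([math.log1p(c / total * scale_factor) for c in cell])
    return out

def scale_genes(data):
    """Center and scale each gene (column) across cells: a plain z-score."""
    n_genes = len(data[0])
    scaled = [[0.0] * n_genes for _ in data]
    for j in range(n_genes):
        col = [cell[j] for cell in data]
        mu, sd = mean(col), stdev(col)
        for i, v in enumerate(col):
            scaled[i][j] = (v - mu) / sd if sd > 0 else 0.0
    return scaled

# Toy example: 3 cells x 3 genes (hypothetical counts)
counts = [[10, 0, 5], [0, 20, 5], [5, 5, 5]]
data = log_normalize(counts)   # all values >= 0
scaled = scale_genes(data)     # centered per gene, so some values are negative
```

Averaging the scaled values of a gene across cells gives roughly zero by construction, which is why averages and DE tests are computed on the unscaled values instead.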
Hi @igordot, the source code references an article (Tirosh 2016, _"Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq"_) from which the calculation appears to be derived. In the methods they describe the score as follows:
MITF and AXL expression programs and cell scores
The top 100 MITF-correlated genes across the entire set of malignant cells were defined as the
MITF program, and their average relative expression as the MITF-program cell score. The
average expression of the top 100 genes that negatively correlate with the MITF program scores
were defined as the AXL program and used to define AXL program cell score. To decrease the
effect that the quality and complexity of each cell's data might have on its MITF/AXL scores we
defined control gene-sets and their average relative expression as control scores, for both the
MITF and AXL programs. These control cell scores were subtracted from the respective
MITF/AXL cell scores. The control gene-sets were defined by first binning all analyzed genes
into 25 bins of aggregate expression levels and then, for each gene in the MITF/AXL gene-set,
randomly selecting 100 genes from the same expression bin as that gene. In this way, a control
gene-sets have a comparable distribution of expression levels to that of the MITF/AXL gene-set
and the control gene set is 100-fold larger, such that its average expression is analogous to
averaging over 100 randomly-selected gene-sets of the same size as the MITF/AXL gene-set. To
calculate significance of the changes in AXL and MITF programs upon relapse, we defined the
expression log2-ratio between matched pre- and post- samples for all AXL and MITF program
genes (Fig. 3D).
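The procedure quoted above can be sketched roughly as follows. This is plain Python written from the methods text, not the Seurat or Tirosh implementation. The function name `module_scores`, its parameters, the equal-sized rank-based binning, and the exclusion of module genes from the control pool are all assumptions of this sketch.

```python
import random
from statistics import mean

def module_scores(expr, gene_set, n_bins=25, n_ctrl=100, seed=0):
    """expr: dict gene -> list of per-cell expression values.
    Returns one score per cell: mean(module genes) - mean(control genes)."""
    rng = random.Random(seed)
    module = set(gene_set)
    n_cells = len(next(iter(expr.values())))

    # Rank genes by average expression and cut the ranking into n_bins bins.
    ranked = sorted(expr, key=lambda g: mean(expr[g]))
    bin_of = {g: i * n_bins // len(ranked) for i, g in enumerate(ranked)}
    members = {}
    for g, b in bin_of.items():
        members.setdefault(b, []).append(g)

    # For each module gene, draw up to n_ctrl control genes from its own bin
    # (module genes are left out of the pool; that is a choice of this sketch).
    control = []
    for g in gene_set:
        pool = [c for c in members[bin_of[g]] if c not in module]
        control.extend(rng.sample(pool, min(n_ctrl, len(pool))))

    scores = []
    for cell in range(n_cells):
        module_avg = mean(expr[g][cell] for g in gene_set)
        control_avg = mean(expr[g][cell] for g in control)
        scores.append(module_avg - control_avg)
    return scores
```

Because the result is a difference of two averages of whatever values go in, it has no units of its own, and its magnitude depends on the normalization, the binning, and which control genes happen to be drawn.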
A positive score would suggest that this module of genes is expressed in a particular cell more highly than would be expected, given the average expression of this module across the population.
However, there is no physical meaning to the actual scores (they are unit-less as you suggest).
But are these absolute scores (meaning one can compare the score strength between projects)?
Thanks