Seurat: Underscores in library names affect cell identity grouping

Created on 26 Mar 2018  ·  2 Comments  ·  Source: satijalab/seurat

When creating a Seurat object with CreateSeuratObject, I found that the identity class assigned to each cell is affected by underscores (_) in the library names.

Turns out those classes are set in the object@ident slot by the following code:

https://github.com/satijalab/seurat/blob/cfc2f29b448cd92204099264d3b6cf9737a82e85/R/preprocessing.R#L97

It is possible to adjust the names.field or names.delim parameters, but if the names contain a varying number of underscores, no single setting gives the right grouping:

mycolnames <- c("lib-1_ACTG","lib-2_ATCC","lib-2_redone_ATTT")

factor(x = unlist(x = lapply(
    X = mycolnames,
    FUN = ExtractField,
    field = 1,
    delim = "_" 
)))
[1] lib-1 lib-2 lib-2
Levels: lib-1 lib-2
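
To see why no single field setting helps here, the splitting step can be reproduced in base R. This is only a sketch: `extract_field()` is a hypothetical, simplified stand-in for Seurat's ExtractField (an assumption about its behaviour, not the package's actual code). It splits a name on the delimiter and keeps the requested field(s).

```r
mycolnames <- c("lib-1_ACTG", "lib-2_ATCC", "lib-2_redone_ATTT")

# Hypothetical stand-in for Seurat's ExtractField(): split on a fixed
# delimiter and keep the requested field(s), re-joined with the delimiter.
extract_field <- function(string, field = 1, delim = "_") {
  parts <- strsplit(string, split = delim, fixed = TRUE)[[1]]
  paste(parts[field], collapse = delim)
}

# Field 1 alone merges "lib-2" and "lib-2_redone" into one group.
field1 <- vapply(mycolnames, extract_field, character(1),
                 field = 1, USE.NAMES = FALSE)
print(field1)

# Fields 1 and 2 keep "lib-2_redone" but drag the barcode into the other names.
field12 <- vapply(mycolnames, extract_field, character(1),
                  field = 1:2, USE.NAMES = FALSE)
print(field12)
```

Either choice mislabels at least one library, because the number of fields before the barcode differs between names.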

Maybe that's by design, but since I use underscores a lot in my file names, it would be convenient to have an option to ignore them.
A possible workaround is to overwrite the grouping using a regular expression that splits only at the last underscore in each name:

object@ident  <- factor(stringr::str_replace(colnames(object@data),"_[^_]+$",""))
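
The regular expression can be checked on the example names without a Seurat object. The sketch below uses base R's `sub()` as an assumed drop-in for `stringr::str_replace()` (both replace only the first match); it is an illustration, not part of the original workaround.

```r
mycolnames <- c("lib-1_ACTG", "lib-2_ATCC", "lib-2_redone_ATTT")

# "_[^_]+$" matches the final underscore plus everything after it,
# so only the trailing barcode is removed from each name.
groups <- factor(sub(pattern = "_[^_]+$", replacement = "", x = mycolnames))

print(levels(groups))
print(table(groups))
```

Unlike splitting on a fixed field number, anchoring the pattern at the end of the string keeps "lib-2" and "lib-2_redone" as separate groups.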

Just wanted to share this in case someone has the same issue.

Most helpful comment

Hi @seb-mueller,

You may already know this, but looking at your code prompted me to mention it in case it is useful. Rather than overwriting the identity of your cells directly, I would suggest you first create a column in your object@meta.data slot storing the identities of your cells, and then pass the name of that column to the id argument of the Seurat::SetAllIdent() function. This keeps those identities available for later use. Otherwise, if the object@ident slot gets overwritten (e.g. through Seurat::FindClusters()), you will simply lose this information.

# Create Seurat object
>object <- CreateSeuratObject(raw.data = data)

# Randomly sample 'a', 'b' and 'c' as new identities for the cells
>new.ident <- factor(x = sample(x = letters[1:3], size = ncol(object@data), replace = TRUE))

# Replace 'object@ident' with the new identities
>object@ident <- new.ident
>head(object@ident)
[1] b a b c c c
Levels: a b c

# Check identities of cells in 'object@meta.data'
>head(x = object@meta.data, 1) # 'orig.ident' stores the original identities of the cells
               nGene     nUMI orig.ident
LFHT2_ROW01_01  8873 171329.9      LFHT2

As you can see, overwriting object@ident will not add this information to object@meta.data.

# Store 'new.ident' in 'object@meta.data$my.ident'
>object@meta.data$my.ident <- new.ident
>head(x = object@meta.data, 1)
               nGene     nUMI orig.ident my.ident
LFHT2_ROW01_01  8873 171329.9      LFHT2        a

# Set the identities of the cells to the levels stored in 'object@meta.data$my.ident'
>object <- SetAllIdent(object = object, id = "my.ident")
>head(object@ident)
LFHT2_ROW01_01 LFHT2_ROW01_02 LFHT2_ROW01_03 LFHT2_ROW01_04 
             a              a              b              b 
LFHT2_ROW01_05 LFHT2_ROW01_06 
             b              c 
Levels: a b c

Another way of doing it would be to use the Seurat::StashIdent() function after having overwritten the object@ident slot, the way you did it.

# Replace 'object@ident' with new identities
>object@ident <- new.ident

# Stash the cell identities to the 'my.ident' column in 'object@meta.data'
>object <- StashIdent(object = object, save.name = "my.ident")
>head(x = object@meta.data, 1)
               nGene     nUMI orig.ident my.ident
LFHT2_ROW01_01  8873 171329.9      LFHT2        a

Best,
Leon

Thanks a lot for that, @leonfodoulian!
In fact I wasn't aware of SetAllIdent, so this is exactly what I was looking for.
I've wrapped this up into a minimal workflow in case anyone else runs into a similar issue.

library(dplyr)
library(stringr)  # provides str_replace()
colnames(mymatrix)
# [1] "lib-1_ACTG"        "lib-2_ATCC"        "lib-2_redone_ATTT"

metaData <- data.frame(cellNames = colnames(mymatrix)) %>%
  mutate(samples = factor(str_replace(cellNames,"_[^_]*$",""))) %>%
  mutate(barcode = factor(str_replace(cellNames,".+_","")))
rownames(metaData) <- metaData$cellNames
print(metaData)
#                           cellNames      samples barcode
# lib-1_ACTG               lib-1_ACTG        lib-1    ACTG
# lib-2_ATCC               lib-2_ATCC        lib-2    ATCC
# lib-2_redone_ATTT lib-2_redone_ATTT lib-2_redone    ATTT

object <- CreateSeuratObject(raw.data = mymatrix, meta.data = metaData)
object <- SetAllIdent(object = object, id = "samples")
object@meta.data$orig.ident <- object@meta.data$samples  # orig.ident has to be overwritten for some reason as well
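
The metadata table from this workflow can also be built and sanity-checked in base R before it is passed to CreateSeuratObject. This is a sketch under assumptions: `cell_names` and `meta` are illustrative names, and `sub()` replaces `str_replace()`.

```r
cell_names <- c("lib-1_ACTG", "lib-2_ATCC", "lib-2_redone_ATTT")

# Build the metadata table with base R only (sub() instead of str_replace()).
meta <- data.frame(
  samples = factor(sub("_[^_]*$", "", cell_names)),  # drop the last field
  barcode = sub(".+_", "", cell_names),              # greedy match: keep the last field
  row.names = cell_names
)

# Sanity checks: one row per cell, rows named after the cells, and the two
# columns together reproduce the original names.
stopifnot(
  nrow(meta) == length(cell_names),
  identical(rownames(meta), cell_names),
  identical(paste(meta$samples, meta$barcode, sep = "_"), cell_names)
)

print(meta)
```

The round-trip check is a cheap way to confirm that the two regular expressions split every name at the same underscore.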

Best,
Seb
