Seurat: ReadH5AD replaces categorical variables with numbers

Created on 7 May 2019 · 8 Comments · Source: satijalab/seurat

Hi!
I just tried to convert an AnnData/scanpy object to Seurat using Seurat v3's ReadH5AD().

I didn't get any warnings; however, all my categorical metadata was replaced with numbers (e.g. my disease variable with levels "disease" and "control" was replaced with 0 and 1; my cell types with 0, 1, 2, 3, ...).

Any idea what's the issue here?
Thanks!

just FYI:
I still have Seurat V2 installed. The old convert approach works totally fine.

more-information-needed

Most helpful comment

I think there is a group called uns in the new AnnData format that acts as a dictionary, mapping the numeric indices stored in the metadata to their category labels. Seurat does not read this group, so only the indices come through. The function below is probably not the best solution, but at least it lets me read the metadata properly.

ReadH5AD_2 <- function(h5.path) {
  # Read the object with Seurat, then read the uns group separately with rhdf5
  data <- Seurat::ReadH5AD(h5.path)
  uns <- rhdf5::h5read(h5.path, "uns")

  # Seurat replaces "_" with "." in column names, so normalize the uns names to match
  names(uns) <- gsub("_", ".", names(uns))
  metadata <- lapply(colnames(data@meta.data), function(meta.name) {
    uns.name <- paste0(meta.name, ".categories")
    if (uns.name %in% names(uns)) {
      # Categorical column: map the zero-based indices to their labels
      uns.array <- as.character(uns[[uns.name]])
      meta.index <- as.numeric(data@meta.data[[meta.name]]) + 1
      return(uns.array[meta.index])
    } else {
      # No matching categories entry: keep the column unchanged
      return(data@meta.data[[meta.name]])
    }
  })
  metadata <- as.data.frame(metadata, stringsAsFactors = FALSE)
  colnames(metadata) <- colnames(data@meta.data)
  rownames(metadata) <- colnames(data)
  data@meta.data <- metadata
  return(data)
}

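Basically, I mapped the indices in the metadata returned by ReadH5AD to the labels stored in uns. There are two tricky parts: Seurat replaces _ with . in the column names, so the uns names have to be normalized the same way before matching, and uns uses zero-based indices while R is one-based, so each index needs +1.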
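To make the two adjustments concrete, here is a minimal base-R sketch that needs neither Seurat nor an h5ad file. The `meta` and `uns` objects are made-up stand-ins for what ReadH5AD and the uns group would return, and `decode_column` is a hypothetical helper name. Treating negative codes as missing values is an assumption on my side, not something confirmed in this thread.

```r
# Made-up stand-ins: numeric codes as they appear after ReadH5AD,
# and category labels as they might be stored in the uns group.
meta <- data.frame(
  disease = c(0, 1, 1, 0),
  cell.type = c(2, 0, 1, 2)
)
uns <- list(
  disease_categories = c("control", "disease"),
  cell_type_categories = c("B cell", "T cell", "hepatocyte")
)

# Normalize the uns names the same way Seurat changes the column names.
names(uns) <- gsub("_", ".", names(uns))

# Hypothetical helper: turn zero-based codes into their labels.
decode_column <- function(codes, categories) {
  index <- as.integer(codes) + 1L   # zero-based codes -> one-based R indices
  index[index < 1L] <- NA_integer_  # assumption: negative codes mean "missing"
  as.character(categories)[index]
}

decoded <- meta
for (name in colnames(meta)) {
  key <- paste0(name, ".categories")
  if (key %in% names(uns)) {
    decoded[[name]] <- decode_column(meta[[name]], uns[[key]])
  }
}
print(decoded)
```

Without the guard on negative codes, an index of 0 would be silently dropped by R and the decoded column would come out shorter than the number of cells.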

So I tested that function on the sample data from @CyrilLagger

data <- ReadH5AD_2("~/Downloads/adata_small_test.h5ad")
str(data@meta.data)
# 'data.frame':   1019 obs. of  12 variables:
#  $ age                   : chr  "30m" "30m" "30m" "30m" ...
#  $ cell                  : chr  "10X_P1_16_AGCTCCTCAGACTCGC" "10X_P2_10_CCGGTAGAGGAATTAC" "10X_P2_10_CGTGTCTGTAGCCTCG" "10X_P2_10_CTAGTGAGTGATGCCC" ...
#  $ cell.ontology.class   : chr  "hepatocyte" "hepatocyte" "hepatocyte" "hepatocyte" ...
#  $ cell.ontology.id      : chr  "NA" "NA" "NA" "NA" ...
#  $ free.annotation       : chr  "Hepatocyte (Pericentral and Periportal)" "Hepatocyte (Midlobular)" "Hepatocyte (Midlobular)" "Hepatocyte (Midlobular)" ...
#  $ method                : chr  "droplet" "droplet" "droplet" "droplet" ...
#  $ mouse.id              : chr  "30-M-3" "30-M-5" "30-M-5" "30-M-5" ...
#  $ nFeatures_RNA         : num  2787 1223 2616 2220 1289 ...
#  $ sex                   : chr  "male" "male" "male" "male" ...
#  $ subtissue             : chr  "LIVER" "LIVER_HEP" "LIVER_HEP" "LIVER_HEP" ...
#  $ tissue                : chr  "Liver" "Liver" "Liver" "Liver" ...
#  $ tissue.free.annotation: chr  "Liver" "Liver" "Liver" "Liver" ...
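
The function returns the decoded columns as plain character vectors. If factors are preferred, a variant along these lines could keep the category order from the file as the level order. This is only a sketch: `codes_to_factor` is a hypothetical helper, and the codes and labels below are invented for illustration.

```r
# Invented example: zero-based codes and the category labels in stored order.
codes <- c(2, 0, 1, 2)
categories <- c("3m", "18m", "30m")

# Hypothetical helper: build a factor whose levels follow the stored order
# instead of the alphabetical order that factor() would pick by default.
codes_to_factor <- function(codes, categories) {
  labels <- as.character(categories)
  index <- as.integer(codes) + 1L   # zero-based codes -> one-based R indices
  index[index < 1L] <- NA_integer_  # assumption: negative codes mean "missing"
  factor(labels[index], levels = labels)
}

age <- codes_to_factor(codes, categories)
print(levels(age))
print(table(age))
```

Passing `levels` explicitly also keeps categories that happen to have no cells, which a bare `factor()` call on the decoded strings would drop.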

All 8 comments

I also have this issue with Seurat V3

Hi @jcschupp,

I'm unable to replicate this issue; could you provide your h5ad file (or a downsampled version of it), that I could use for testing?

Hi,
I am having the same issue. It seems to be due to the hdf5r::h5file routine.
If I use

hfile <- hdf5r::h5file(filename = file, mode = 'r')
obs <- hfile[['obs']][]

obs contains numbers instead of categorical variables.

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

Hello,

I am having the same problem and wondering if there is some way to fix it. I am using Seurat_3.1.1.

@mojaveazure
I am using the h5ad files from "Tabula Muris Senis" freely available on figshare here:
https://figshare.com/projects/Tabula_Muris_Senis/64982. Any of those files would be good for testing.

Thanks!

I have the same issue. Is there any way to fix this issue?

I am also running into this issue with Seurat_3.1.2, from an H5AD file created in scanpy 1.4.4.post1

All the strings in the metadata were converted to numbers
