Datasets: module 'tensorflow_datasets.core.features' has no attribute 'text'

Created on 27 Oct 2020 · 5 Comments · Source: tensorflow/datasets

tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_en, target_vocab_size=2**13)  # Reducing the number of words: if our dataset contains about 200,000 unique words, we have reduced it to an assumed 2^13 distinct words
tokenizer_fr = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_fr, target_vocab_size=2**13)

I'm facing this error:
AttributeError Traceback (most recent call last)
in ()
      1 # tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
      2 #     corpus_en, target_vocab_size=2**13) #Reducing the number of words if our dataset contains about 200,000 unique words, we have reduced it to an assumed 2^13 words(distinct)
----> 3 tokenizer_fr = tfds.features.text.SubwordTextEncoder.build_from_corpus(
      4     corpus_fr, target_vocab_size=2**13)

AttributeError: module 'tensorflow_datasets.core.features' has no attribute 'text'

bug


lopeselio

All 5 comments

SubwordTextEncoder is now deprecated. Look here
Use tensorflow_text instead.

PrattJena on 27 Oct 2020

Indeed, the API is deprecated. Please use tfds.deprecated.text or update your code to use tensorflow_text https://www.tensorflow.org/datasets/api_docs/python/tfds/deprecated/text

Conchylicultor on 27 Oct 2020


tensorflow_text doesn't have SubwordTextEncoder

haifengkao on 9 Nov 2020

tensorflow_text has a few options for subword tokenization, such as BertTokenizer and WordpieceTokenizer.

I believe they should make it easier to create a vocabulary, though, maybe with something like: vocab = tf_text.build_vocab(ds, **options). But that's more an issue for TF Text.

Conchylicultor on 9 Nov 2020

Just use "tfds.deprecated.text.SubwordTextEncoder.build_from_corpus" instead of "tfds.features.text.SubwordTextEncoder.build_from_corpus", and the problem is solved.
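
For code that has to run against both older and newer tensorflow_datasets releases, the rename discussed in this thread could be wrapped in a small lookup. This is only a sketch: resolve_text_module is a hypothetical helper, not part of tensorflow_datasets, and the stand-in objects below merely mimic the two module layouts so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def resolve_text_module(tfds_module):
    """Return the namespace that exposes SubwordTextEncoder.

    Hypothetical helper (not part of tensorflow_datasets): it prefers
    tfds.deprecated.text and falls back to the older tfds.features.text.
    """
    for parent_name in ("deprecated", "features"):
        parent = getattr(tfds_module, parent_name, None)
        text = getattr(parent, "text", None)
        if text is not None and hasattr(text, "SubwordTextEncoder"):
            return text
    raise AttributeError(
        "SubwordTextEncoder not found under tfds.deprecated.text "
        "or tfds.features.text"
    )


# Stand-in objects that mimic the two layouts (assumption: for illustration only).
_encoder = type("SubwordTextEncoder", (), {})
new_layout = SimpleNamespace(
    deprecated=SimpleNamespace(text=SimpleNamespace(SubwordTextEncoder=_encoder)),
    features=SimpleNamespace(),  # no 'text' attribute, as in the error above
)
old_layout = SimpleNamespace(
    features=SimpleNamespace(text=SimpleNamespace(SubwordTextEncoder=_encoder)),
)

assert resolve_text_module(new_layout).SubwordTextEncoder is _encoder
assert resolve_text_module(old_layout).SubwordTextEncoder is _encoder

# With the real library (assumes tensorflow_datasets is installed and
# corpus_en is the corpus from the original post):
#   import tensorflow_datasets as tfds
#   text = resolve_text_module(tfds)
#   tokenizer_en = text.SubwordTextEncoder.build_from_corpus(
#       corpus_en, target_vocab_size=2**13)
```

If only a recent release needs to be supported, calling tfds.deprecated.text directly, as suggested above, is simpler than a fallback like this.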