import tensorflow_datasets as tfds

# Cap the subword vocabulary at roughly 2**13 entries, e.g. down from
# about 200,000 distinct words in the dataset.
tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_en, target_vocab_size=2**13)
tokenizer_fr = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_fr, target_vocab_size=2**13)
I am facing this error:
AttributeError Traceback (most recent call last)
1 # tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
2 # corpus_en, target_vocab_size=2**13) #Reducing the number of words if our dataset contains about 200,000 unique words, we have reduced it to an assumed 2^13 words(distinct)
----> 3 tokenizer_fr = tfds.features.text.SubwordTextEncoder.build_from_corpus(
4 corpus_fr, target_vocab_size=2**13)
AttributeError: module 'tensorflow_datasets.core.features' has no attribute 'text'
SubwordTextEncoder is now deprecated. Look here
Use tensorflow_text instead.
Indeed, the API is deprecated. Please use tfds.deprecated.text (https://www.tensorflow.org/datasets/api_docs/python/tfds/deprecated/text) or update your code to use tensorflow_text.
tensorflow_text doesn't have SubwordTextEncoder
tensorflow_text has a few options for subword tokenization, such as BertTokenizer and WordpieceTokenizer.
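For readers unfamiliar with what a WordPiece-style tokenizer does with a vocabulary, here is a minimal pure-Python sketch of greedy longest-match-first splitting. It is only an illustration of the idea, not tensorflow_text's implementation; the function name `wordpiece_tokenize`, the toy vocabulary, and the `##` / `[UNK]` conventions are assumptions chosen for the example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subwords by greedy longest-match-first lookup.

    Illustrative sketch only; `vocab` is a plain set of subword strings.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece of the vocabulary fits here: give up on the whole word.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary (hypothetical, for demonstration only).
toy_vocab = {"token", "##izer", "##s"}
pieces = wordpiece_tokenize("tokenizers", toy_vocab)
unknown = wordpiece_tokenize("xyz", toy_vocab)
```

With this toy vocabulary, "tokenizers" is split into three pieces, and a word with no matching pieces collapses to the unknown token.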
I do think they should make it easier to build a vocabulary, though, perhaps with something like vocab = tf_text.build_vocab(ds, **options). But that is more of an issue for TF Text.
Just use "tfds.deprecated.text.SubwordTextEncoder.build_from_corpus" instead of "tfds.features.text.SubwordTextEncoder.build_from_corpus", and the problem is solved.