Deepspeech: Problem with SWC corpus script

Created on 9 Dec 2019  ·  20 Comments  ·  Source: mozilla/DeepSpeech

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (our builds, or upstream TensorFlow): Yes
  • TensorFlow version (use command below): b'v1.13.1-0-g6612da8951' 1.13.1
  • Python version: 3.5
  • Bazel version (if compiling from source): 0.19.2
  • GCC/Compiler version (if compiling from source): 5.4.0
  • CUDA/cuDNN version: 10.0.130
  • GPU model and memory: Quadro RTX 6000, 72GB

Hello Team,

I am trying to use import_swc.py (under bin) to preprocess the SWC corpus. I used the following command:

DeepSpeech/bin/import_swc.py . --language german --normalize --german_alphabet ../../../dependencies/alphabet.txt

But when I train the DeepSpeech model, the training loss is always infinite. Please advise how to resolve this issue. The logs are below:

WARNING:tensorflow:From /home/LTLab.lan/agarwal/python-environments/env/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.

WARNING:tensorflow:From /home/LTLab.lan/agarwal/python-environments/env/lib/python3.5/site-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/LTLab.lan/agarwal/python-environments/env/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
I Initializing variables...
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:18:08 | Steps: 1845 | Loss: inf
Epoch 0 | Validation | Elapsed Time: 0:00:36 | Steps: 139 | Loss: 270.188871 | Dataset: ../german-speech-corpus/delete/swc/dev_swc.csv
I Saved new best validating model with loss 270.188871 to: /home/LTLab.lan/agarwal/.local/share/deepspeech/checkpoints/best_dev-1845
Epoch 1 |   Training | Elapsed Time: 0:17:52 | Steps: 1845 | Loss: inf
Epoch 1 | Validation | Elapsed Time: 0:00:35 | Steps: 139 | Loss: 227.384010 | Dataset: ../german-speech-corpus/delete/swc/dev_swc.csv
WARNING:tensorflow:From /home/LTLab.lan/agarwal/python-environments/env/lib/python3.5/site-packages/tensorflow/python/training/saver.py:966: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
I Saved new best validating model with loss 227.384010 to: /home/LTLab.lan/agarwal/.local/share/deepspeech/checkpoints/best_dev-3690
Epoch 2 |   Training | Elapsed Time: 0:17:52 | Steps: 1845 | Loss: inf
Epoch 2 | Validation | Elapsed Time: 0:00:35 | Steps: 139 | Loss: 218.371178 | Dataset: ../german-speech-corpus/delete/swc/dev_swc.csv
I Saved new best validating model with loss 218.371178 to: /home/LTLab.lan/agarwal/.local/share/deepspeech/checkpoints/best_dev-5535
Epoch 3 |   Training | Elapsed Time: 0:17:53 | Steps: 1845 | Loss: inf
Epoch 3 | Validation | Elapsed Time: 0:00:35 | Steps: 139 | Loss: 322.072106 | Dataset: ../german-speech-corpus/delete/swc/dev_swc.csv
WARNING:tensorflow:From /home/LTLab.lan/agarwal/python-environments/env/lib/python3.5/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
I Early stop triggered as (for last 4 steps) validation loss: 322.072106 with standard deviation: 22.604229 and mean: 238.648019
I FINISHED optimization in 1:14:16.207693
I Restored variables from best validation checkpoint at /home/LTLab.lan/agarwal/.local/share/deepspeech/checkpoints/best_dev-5535, step 5535
Testing model on ../german-speech-corpus/delete/swc/test_swc.csv
Test epoch | Steps: 412 | Elapsed Time: 0:08:00
WARNING:tensorflow:From /home/LTLab.lan/agarwal/python-environments/env/lib/python3.5/site-packages/tensorflow/python/tools/freeze_graph.py:232: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.convert_variables_to_constants
WARNING:tensorflow:From /home/LTLab.lan/agarwal/python-environments/env/lib/python3.5/site-packages/tensorflow/python/framework/graph_util_impl.py:245: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.extract_sub_graph
Test on ../german-speech-corpus/delete/swc/test_swc.csv - WER: 0.984189, CER: 0.952155, loss: 221.439163
--------------------------------------------------------------------------------
WER: 3.000000, CER: 1.833333, loss: 90.893661
 - src: "wurden"
 - res: "in den hundert"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.789474, loss: 41.020634
 - src: "umweltveränderungen"
 - res: "um ein"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 1.200000, loss: 77.087273
 - src: "array"
 - res: "er ende"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 2.000000, loss: 86.086899
 - src: "sex"
 - res: "in den "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 1.100000, loss: 120.730904
 - src: "siebzehnte"
 - res: "es unendlich"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 1.250000, loss: 157.400894
 - src: "monotherapie"
 - res: "die eeeeeeeeeeeee"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 4.250000, loss: 191.515320
 - src: "doch"
 - res: "es hunderttausende"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 2.211713
 - src: "an"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 2.343423
 - src: "mit"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 2.612154
 - src: "auf"
 - res: ""
--------------------------------------------------------------------------------
I Exporting the model...
I Models exported at ../models

All 20 comments

@AASHISHAG Please provide more context:

  • Was the import done right/is it complete?
  • What are your training parameters?
  • What does your alphabet look like (or are you performing a UTF-8 based training)?
  • Are you training a German model from scratch or fine-tuning an English model into a German one (recommended)?

Thank you @tilmankamp for the reply. Below are the detailed steps, logs and attached alphabet file,

  1. Download and pre-process data:
    Status: Complete

Command:
DeepSpeech/bin/import_swc.py . --language german --normalize --german_alphabet ../../../dependencies/alphabet.txt
Logs:

Progress |#####################################################| 100% completed
Progress |#####################################################| 100% completed
Progress |#####################################################| 100% completed
Progress |#####################################################| 100% completed
Progress |#####################################################| 100% completed
Progress |#####################################################| 100% completed
Progress |#####################################################| 100% completed
No archive "./SWC_German.tar" - downloading...
Extracting "./SWC_German.tar"...
Converting and joining source audio files...
Collecting samples...
Skipped samples:
 - missing timestamps: 106908
 - illegal character: 346
 - too short to decode: 125
 - substitution rule: 52
 - validation: 31
Sub-set "train" with 132897 samples (duration: 238.57 h)
Sub-set "dev" with 15116 samples (duration: 26.20 h)
Sub-set "test" with 14858 samples (duration: 25.76 h)
Creating sample directories...
Splitting audio files...
Writing "/media/data/LTLab.lan/agarwal/german-speech-corpus/delete/swc/german-train.csv"...
Writing "/media/data/LTLab.lan/agarwal/german-speech-corpus/delete/swc/german-dev.csv"...
Writing "/media/data/LTLab.lan/agarwal/german-speech-corpus/delete/swc/german-test.csv"...
Removing intermediate files in "./german"...
Progress |#####################################################| 100% completed
  2. Training and parameters used:
./DeepSpeech.py --train_files ../german-speech-corpus/delete/swc/train_swc.csv --dev_files ../german-speech-corpus/delete/swc/dev_swc.csv --test_files ../german-speech-corpus/delete/swc/test_swc.csv --alphabet_config_path ../dependencies/alphabet.txt --lm_trie_path ../dependencies/trie --lm_binary_path ../dependencies/lm.binary --test_batch_size 36 --train_batch_size 24 --dev_batch_size 36 --epochs 75 --learning_rate 0.0001 --dropout_rate 0.30 --export_dir ../models

  3. Alphabet: Please note that I replaced ß with ss after pre-processing was complete, and I trained the language model after the same replacement. Therefore you won't find ß in my alphabet.txt.
    alphabet.txt

  4. I am training my model from scratch.

Here you can find my train_swc.csv for reference: https://drive.google.com/file/d/1jGhFlODniKVWUMx05YaQApbWWIHPxTyu/view?usp=sharing

Note: I used the same settings for training on other German corpora, namely Tuda-De, Voxforge, Mozilla Common Voice and M-Ailabs, and didn't face this infinite loss issue.

Your import looks correct.
If you start a training from scratch with hyperparameters that are too aggressive, you can under certain circumstances get infinite losses, particularly at the beginning. To overcome this, you could for example lower your --learning_rate.
With similar settings I fine-tuned a from-scratch English model (trained using the German alphabet + apostrophe) into a German one using a training set consisting of _CV-de, TUDA, SWC-de_ and _M-AILABS-de_ and got no "infs" at all.
A better place for discussing this kind of question is our Discourse forum.

I was a bit too fast with my answer (I actually missed your note at the end). It just came to my mind that, of all the German importers, the SWC one needed the most "correction" work (such as numbers and currencies).
I wonder why these messages are not showing up in your log.
I'll do the same run as yours on my end and try to reproduce it and/or collect a set of files to correct further or simply exclude.

Thank you for reopening the issue.

Also, strangely, when I train the model on TEST or DEV, the infinite loss does not show up. It seems there are still some problematic files in the TRAIN set. I don't know how I can filter those out.

I am attaching my TRAIN, DEV and TEST for reference.

Train: https://drive.google.com/file/d/1jGhFlODniKVWUMx05YaQApbWWIHPxTyu/view?usp=sharing
Dev: https://drive.google.com/file/d/17idPGY7NemzEZmDSGMh1NIPrRPAdIlMe/view?usp=sharing
Test: https://drive.google.com/file/d/1tK-rGxba2Iks8iGs0goyrsZYyKJMguMM/view?usp=sharing
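
One way to pre-screen a training CSV for such files is to check whether each transcript can be aligned to its audio at all. CTC loss becomes infinite when a clip has fewer feature frames than the transcript needs, and every repeated neighbouring character costs one extra frame for the separating blank. The following is only a minimal sketch, not part of DeepSpeech: it assumes the CSV columns wav_filename, wav_filesize and transcript, 16 kHz 16-bit mono WAV files with a 44-byte header, and a 20 ms feature step. All constants and function names are assumptions for illustration, and a clip that passes this check can still cause an infinite loss for other reasons.

```python
import csv
import io

# All of these constants are assumptions about the audio and feature setup.
SAMPLE_RATE = 16000        # samples per second
BYTES_PER_SAMPLE = 2       # 16-bit PCM, mono
WAV_HEADER_BYTES = 44      # canonical WAV header size
FRAME_STEP_MS = 20         # step between feature frames
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_STEP_MS // 1000


def min_ctc_frames(transcript):
    """Frames CTC needs at minimum: one per character, plus one blank
    between every pair of identical neighbouring characters."""
    repeats = sum(1 for a, b in zip(transcript, transcript[1:]) if a == b)
    return len(transcript) + repeats


def available_frames(wav_filesize):
    """Rough estimate of the number of feature frames from the file size."""
    payload = max(wav_filesize - WAV_HEADER_BYTES, 0)
    return (payload // BYTES_PER_SAMPLE) // SAMPLES_PER_FRAME


def suspicious_rows(csv_text):
    """Return the file names whose audio looks too short for the transcript."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["wav_filename"]
        for row in reader
        if available_frames(int(row["wav_filesize"])) < min_ctc_frames(row["transcript"])
    ]
```

Reading a real file would be `suspicious_rows(open(path, encoding="utf-8").read())`; the flagged names are candidates to listen to or exclude, not confirmed culprits.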

I now tried to reproduce it on my end, but without success (did a couple of runs):

[...]
[2019-12-19 12:45:02] [worker 0] + python -u DeepSpeech.py --alphabet_config_path [...]/de/alphabet.txt --lm_binary_path [...]/languages/german/german-lm.binary --lm_trie_path [...]/languages/german/german-lm.trie --train_files [...]/SWC/german-train.csv --dev_files [...]/SWC/german-dev.csv --test_files [...]/SWC/german-test.csv --feature_cache [...]/swc-feature-cache --train_batch_size 24 --dev_batch_size 36 --test_batch_size 36 --learning_rate 0.0001 --dropout_rate 0.30 --epochs 1 --noearly_stop --checkpoint_dir [...]/keep --summary_dir [...]/summaries
[...]
[2019-12-19 12:45:46] [worker 0] I Initializing variables...
[2019-12-19 12:45:49] [worker 0] I STARTING Optimization
[2019-12-19 12:45:49] [worker 0] Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
[2019-12-19 12:46:06] [worker 0] Epoch 0 |   Training | Elapsed Time: 0:00:17 | Steps: 1 | Loss: 25.703697
[...]
[2019-12-19 12:56:51] [worker 0] Epoch 0 |   Training | Elapsed Time: 0:11:02 | Steps: 691 | Loss: 133.371081
[2019-12-19 12:56:51] [worker 0] Epoch 0 |   Training | Elapsed Time: 0:11:02 | Steps: 691 | Loss: 133.371081
[2019-12-19 12:56:59] [worker 0] Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: [...]/SWC/german-dev.csv
[...]

I had some issues on other runs - but those were batch-size OOMs.

There are some differences left between your and my setup:

  • SW environment (especially TensorFlow: We use 1.14.0 now)
  • DeepSpeech version (I tested with current master)
  • ß-replacement (haven't done that on my end)

Thank you for pointing out the differences. I have now switched to master code, and I was able to find the file causing infinite loss. On removing it, I don't encounter the infinite loss issue.

Use tf.where in 2.0, which has the same broadcast rule as np.where
I Initializing variables...
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
The following files caused an infinite (or NaN) loss: /media/data/LTLab.lan/agarwal/german-speech-corpus/delete/swc/german-train/sample-072692.wav

I would like to point out some issues that I encountered while browsing the transcripts. (I was not able to verify them against the original transcripts, as the processed wav file names differ from the original ones.)

  1. Some transcripts have spaces between the characters. Example: "c h i p" in the line below.
    german-train/sample-000006.wav,319724,von zwei tausend zwölf bis zwei tausend dreizehn gab es mit derc h i ppower play eine neuauflage des magazins an der zahlreiche
    https://drive.google.com/file/d/1IG4gno1WXaImBliX41a320YB2LO4Vj71/view?usp=sharing

  2. Some transcripts don't match the audio file exactly.
    german-train/sample-032501.wav,208364,der inklination i die die orientierung der rotationsachse
    https://drive.google.com/file/d/1FQIbkUVk4JAt3kqrSv6k9G-V1SFKVmb5/view?usp=sharing
    german-train/sample-045211.wav,25004,g g
    https://drive.google.com/file/d/1mOhdeJwDoqSta5VHTHOUPCLRzdPy5qM-/view?usp=sharing

Is it possible to find/correct these transcripts?

I still wonder why I wasn't able to reproduce it. Could you provide the transcript of your .../swc/german-train/sample-072692.wav file?

Regarding your observations:

  1. That's right - so far I thought those were only provided (by the source) in case of abbreviations that should be spoken by spelling out each letter. Idea: Merging single-letter sequences that exceed a length of three.
  2. I only see one way to get them excluded: Doing a complete inference run on all of them and then listening to the worst WER ones one-by-one and excluding them. A handy tool for this is missing.
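
The first idea above (merging single-letter sequences longer than three) could be sketched as follows. This is an illustration only, not code from import_swc.py; the function name and the regular expression are assumptions, and "letter" here means any Unicode letter so that umlauts are covered.

```python
import re

# [^\W\d_] matches one Unicode letter (a word character that is neither
# a digit nor an underscore). The pattern matches runs of at least four
# single letters, each separated by exactly one space.
LETTER = r"[^\W\d_]"
SPELLED_RUN = re.compile(r"\b(?:%s ){3,}%s\b" % (LETTER, LETTER))


def merge_spelled_runs(transcript):
    """Join runs of more than three space-separated single letters."""
    return SPELLED_RUN.sub(lambda m: m.group(0).replace(" ", ""), transcript)
```

For example, `merge_spelled_runs("mit der c h i p power play")` gives `"mit der chip power play"`, while shorter runs such as "g g" are left alone. Note that it would not repair the sample-000006 line quoted earlier, because there the letters are glued to the neighbouring words ("derc h i ppower").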

Regarding the loss-exploding file: I forgot that you provided your CSVs, so I will look it up myself.
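
For the second suggestion (listening to the worst-WER samples first), the ranking step could look roughly like this once an inference run has produced a reference and a result per file. It is a sketch under assumptions: the (wav_filename, src, res) tuple layout and the function names are made up for illustration, and WER is taken as word-level edit distance divided by the number of reference words, which is consistent with the values in the test report at the top of this issue.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate relative to the reference length (at least one word)."""
    return word_errors(reference, hypothesis) / max(len(reference.split()), 1)


def worst_first(samples):
    """Sort (wav_filename, src, res) tuples by descending WER."""
    return sorted(samples, key=lambda s: wer(s[1], s[2]), reverse=True)
```

With the report's first entry, `wer("wurden", "in den hundert")` is 3.0, so that sample would be near the top of the list to check by ear.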

Thank you @tilmankamp for the above suggestions.

Also, today I was looking at random pre-processed transcripts. I found that a few transcripts don't match the audio. I am not sure how many there are.

Pre-processed transcript: german-dev/sample-014858.wav, schallplatten
Audio: https://drive.google.com/file/d/1_MZ3Vm6Yv-Q9kwpqT2lOjJVfmgCMwhwZ/view?usp=sharing

Pre-processed transcript: german-dev/sample-014820.wav, für
Audio: https://drive.google.com/open?id=1dYPAofaep7kWI8aEo22DtE6uP9WjIh1Q

Is there any way to retain the original naming convention? That would make it easy to verify the samples against the original transcripts.

@tilmankamp :

Happy Christmas and New Year!!

I have a question w.r.t. Tuda-De, M-Ailabs and SWC scripts. I read the paper _Common Voice: A Massively-Multilingual Speech Corpus_ released by the Deep Speech team. They have mentioned:

_"We made dataset splits (c.f. Table (2)) such that one speaker's recordings are only present in one data split. This allows us to make a fair evaluation of speaker generalization, but as a result, some training sets have very few speakers, making this an even more challenging scenario."_

I noticed that in SWC script you have used a "_speaker_" flag to identify the speakers and I assume that you are possibly splitting the overall data set in SWC into training, development and test partitions in such a way that speakers or sentences do not overlap across the different sets. Please confirm?

Could we also do the same with other datasets, i.e. Tuda-De and M-Ailabs? Also, it would be great if we could have the script for Voxforge. The current Voxforge script doesn't support the DE version.

@AASHISHAG Regarding your first comment: We should collect all wrong samples, and I'll keep this issue open for collecting them until we reach a point where it's worth bundling a first PR as an extension of the following lines (where a None as the second element will just drop matching statements): https://github.com/mozilla/DeepSpeech/blob/85a61a3ab74aa28a08723236ddab740c7a9fa1e3/bin/import_swc.py#L44-L57
Just for keeping track: Your (first) sample causing the infinite loss is the one with the transcript "ssss".
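
To illustrate the kind of rule list described here, pairs of a pattern and a replacement where a None replacement drops the whole sample, here is a minimal sketch. It is not the importer's actual code: the RULES entries and the apply_rules function are hypothetical and only show the mechanism.

```python
import re

# Hypothetical rules for illustration: (compiled pattern, replacement).
# A replacement of None means "drop every sample that matches".
RULES = [
    (re.compile(r"\bssss\b"), None),   # the transcript that caused the infinite loss
    (re.compile(r"&"), " und "),       # example of an ordinary substitution
]


def apply_rules(transcript, rules=RULES):
    """Return the corrected transcript, or None if the sample should be dropped."""
    for pattern, replacement in rules:
        if pattern.search(transcript):
            if replacement is None:
                return None
            transcript = pattern.sub(replacement, transcript)
    # Collapse whitespace that substitutions may have introduced.
    return " ".join(transcript.split())
```

Collected bad samples could then be turned into additional drop rules without touching the rest of the import logic.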

@AASHISHAG

I noticed that in SWC script you have used a "speaker" flag to identify the speakers and I assume that you are possibly splitting the overall data set in SWC into training, development and test partitions in such a way that speakers or sentences do not overlap across the different sets. Please confirm?

Confirmed (for the speakers).

#2625 is for adding the article name and the speaker to the CSV columns for debugging. This will let you verify that each speaker is restricted to one set. It also allows excluding "unknown" speakers (in case an unknown speaker is actually just an unidentified existing one).

Be aware: There is no "sentence overlap" check, as the importer assumes that Wikipedia articles do not share identical sentences.
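
Once a speaker column is available in the CSVs, the verification could be as simple as the sketch below. The column name "speaker", the function name, and the dict-of-rows input are assumptions for illustration; rows would typically come from csv.DictReader.

```python
from collections import defaultdict


def values_in_multiple_sets(rows_by_set, column="speaker"):
    """Map each value of `column` that occurs in more than one sub-set
    to the sorted names of the sub-sets it occurs in."""
    sets_by_value = defaultdict(set)
    for set_name, rows in rows_by_set.items():
        for row in rows:
            sets_by_value[row[column]].add(set_name)
    return {value: sorted(names)
            for value, names in sets_by_value.items()
            if len(names) > 1}
```

An empty result means the split is disjoint for that column. Passing `column="transcript"` gives the sentence-overlap check that the importer itself does not perform.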

@AASHISHAG

Could we also do the same with other datasets, i.e. Tuda-De and M-Ailabs? Also, it would be great if we could have the script for Voxforge. The current Voxforge script doesn't support the DE version.

The TUDA importer just reproduces their split.
From the TUDA README:

Test / Dev: includes recordings for the test and dev set. These sentences only occur once, there is no overlap with sentences in Train and the Test / Dev recordings were conducted with a different set of speakers. Each sentence in Test / Dev is unique, i.e. just recorded once by one speaker.

Just to be aware: Each sentence in the TUDA train set is repeated about five times - one time per microphone/angle. Allowing for selecting just one of them during import would be a great contribution.
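
A rough sketch of what "selecting just one of them" could mean at the CSV level is shown here. It is an assumption-laden illustration, not the TUDA importer: it treats rows with an identical transcript as recordings of the same sentence and keeps the first one, so it would also collapse the same sentence read by different speakers. A real implementation would more likely group by the corpus's own recording identifiers.

```python
def one_recording_per_sentence(rows):
    """Keep only the first row for each distinct transcript, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row["transcript"] not in seen:
            seen.add(row["transcript"])
            kept.append(row)
    return kept
```

Which of the roughly five recordings to keep (for example a specific microphone) is a separate decision that this sketch does not address.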

Regarding M-Ailabs: The split is not speaker based yet.

Having Voxforge-DE and speakers guaranteed to occur in just one sub-set would indeed be great.

The TUDA importer just reproduces their split.
From the TUDA README:

Test / Dev: includes recordings for the test and dev set. These sentences only occur once, there is no overlap with sentences in Train and the Test / Dev recordings were conducted with a different set of speakers. Each sentence in Test / Dev is unique, i.e. just recorded once by one speaker.

I tried to verify it and found some duplicates in Train, Dev and Test. It would be helpful if you can point me to the source of the README from where you got this text.

If you imported TUDA, you should find the README under <import-dir>/german-speechdata-package-v2/README.
The containing archive's URL is constructed like this:
https://github.com/mozilla/DeepSpeech/blob/85a61a3ab74aa28a08723236ddab740c7a9fa1e3/bin/import_tuda.py#L27-L29
Result: http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v2.tar.gz

@AASHISHAG Regarding your first comment: We should collect all wrong samples and I'll keep this issue up for collecting them till we reached a point where it's worth bundling a first PR as an extension of the following lines (where the secondary None will just drop matching statements):

https://github.com/mozilla/DeepSpeech/blob/85a61a3ab74aa28a08723236ddab740c7a9fa1e3/bin/import_swc.py#L44-L57

Just for keeping track: Your (first) loss infinity causing one is the one with transcript "ssss".

  1. This could be an addition to the SWC script (taken from the Kaldi Tuda-De project):

https://github.com/uhh-lt/kaldi-tuda-de/blob/master/s5_r2/local/prepare_swc_german_wavscp.py#L15

  2. The last commit here says "Extra words from SWC". These are probably the commonly occurring symbols in SWC.
    https://github.com/uhh-lt/kaldi-tuda-de/blob/master/s5_r2/local/extra_words.txt

@AASHISHAG
Regarding 1: I'll take some of them for the filter rules - thanks!
Regarding 2: Looks like the vocabulary.

@tilmankamp Can this be closed?

Yes, as this has been more of a discussion than a specific issue.
