Deepspeech: Adapting engine to any Custom Language

Created on 3 Jul 2017  ·  54 Comments  ·  Source: mozilla/DeepSpeech

I was wondering what kinds of modifications would be needed to use this engine for languages other than English (apart from a new language model and a new words.txt file)?
In particular, I am interested in whether it could be used with a Cyrillic script, because of this: "data in the transcripts must match the [a-z ]+ regex", and if so, how hard it would be to adapt.
I think I could circumvent this problem by creating a translator that converts the text from a Cyrillic script to the [a-z ]+ format, but it would be preferable if it could use a Cyrillic script directly.

Thanks in advance

question

All 54 comments

Does this mean that the engine cannot be adapted to other languages at the moment?

It can be adapted to other languages. For example the original Deep Speech 2 paper[1] covered English and Mandarin.

I think capturing all the code changes needed to support Cyrillic here is a bit much. However, none of the changes are hard: basically, one changes the number of "softmax output units" to the number of Cyrillic characters and then makes several follow-on changes. The Deep Speech 2 paper[1] talks a bit about how they went from English to Mandarin.

I forgot to update the issue, but I managed to adapt it to Cyrillic. I didn't adapt it directly; I just mapped every Cyrillic character from Unicode to ASCII symbols and changed the range of characters in util/text.py so that it is no longer [a-z] (I needed more than 26 symbols). Some preprocessing of the text is needed to map it from Unicode to ASCII, but it works. I also created a custom language model with KenLM.

The problem I am facing now is how to test it properly. I don't really have a large enough dataset at the moment. I have some 3k wavs, but they are only about 2-3s long (basically less than 3h of audio) and most of them are the pronunciation of a single word. It works and gives nice results for some of them, but I am still not convinced. What is the minimal number of audio files (or total length) that I need to test the model properly? An estimate would be nice.

I would like to have some idea about how good the model will be before I start collecting large amounts of audio.

Obtaining data sets is painful. What language are you targeting?

On data size, to really match something like Google or Amazon you need at least 10k hours of speech, or more. (I know, depressing.)

However, a couple of hundred hours, like the TED data set, can create something useful but not perfect.

I am from Macedonia, so I am targeting Macedonian.
I know that it is painful, but it is what it is. Besides, I am not really trying to match Google or Amazon, since even they don't have speech recognition for my language. Basically, anything that works decently should be useful :)

Anyway, thanks for your response

Macedonian, cool! I wish I had a speech data set to give to you!

If you can think of any way Mozilla can help you get more speech data, just let us know.

I already know where to get at least 200h of audio files with transcription, but the problem is that most of the files are minutes long (some may even go up to an hour).

Is it even worth it to try training with such long files or is segmentation a must?

Since I am pretty new with this stuff, is there any way that segmentation can be done automatically?

@istojan Automatic segmentation is usually bound to a language. However, you might be able to get away with some mix of VAD and manual work to segment the audio.
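To make the VAD suggestion a bit more concrete, here is a minimal sketch of an energy-based splitter. It is only an illustration, not something from DeepSpeech: the function name `split_on_silence`, the frame length, and the threshold values are all hypothetical, and a real VAD is considerably more robust. It marks a frame as speech when its mean absolute amplitude exceeds a threshold and cuts a segment once enough quiet frames follow.

```python
def split_on_silence(samples, frame_len=160, threshold=500.0, min_silence_frames=3):
    """Return (start, end) sample ranges of speech, cut at long quiet runs.

    All parameter defaults are illustrative assumptions, not tuned values.
    """
    segments = []
    start = None      # first sample of the segment currently being built
    quiet_run = 0     # consecutive quiet frames seen inside that segment
    n_frames = len(samples) // frame_len
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        energy = sum(abs(s) for s in frame) / float(frame_len)
        if energy >= threshold:
            if start is None:
                start = f * frame_len
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence_frames:
                # Close the segment at the start of the quiet run.
                segments.append((start, (f + 1 - quiet_run) * frame_len))
                start = None
                quiet_run = 0
    if start is not None:
        # Audio ended while still inside a segment; drop trailing quiet frames.
        segments.append((start, (n_frames - quiet_run) * frame_len))
    return segments
```

The returned ranges would still need manual review against the transcript, since nothing here aligns text to audio.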

@istojan When you get a Macedonian model up and running and want to distribute it to the world, we'd be happy to help host the model for you, for example by providing S3 storage so others can download it.

Hi istojan.
Could you send your mods for non-ASCII characters?
I need them for French.
I'm working on it too.
Thanks

@elpimous @istojan What would be ideal is if you could combine efforts to resolve #741. Then all future users would be able to switch languages with ease.

@kdavis
Of course. I'll post results and mods

@elpimous First, sorry for the late reply. I was reading your idea in 1 and it's kind of different from what I do. You can use characters other than [a-z] by making a change in util/text.py. You need to change line 10.
For example, my language has 31 characters, so I changed line 10 to FIRST_INDEX = 90 (you can set a lower value if you need more characters). This allows me to use all ASCII characters with a value > 90. After that, I have another script that maps my alphabet to characters 91 through 121. I use this script to map my .csv files and my words.txt file. Additionally, before you create the language model, you also need to map the entire text corpus you use. After all this, to keep the output of the reports readable, I updated the __str__ function of the Sample object to this:

def __str__(self):
    return 'WER: %f, loss: %f, mean edit distance: %f\n - src: "%s"\n - res: "%s"' % (
        self.wer, self.loss, self.mean_edit_distance,
        change_line_to_unicode(self.src), change_line_to_unicode(self.res))

Here the change_line_to_unicode function is a custom function I wrote for the reverse mapping (from the ASCII values 91-121 back to my language).
There were some other errors I needed to fix, mostly concerned with printing Unicode characters in Python 2 (mostly connected with the str() function, which doesn't work well with Unicode), but I can't remember all the changes I made. Most of them were pretty straightforward and easy to find from the error messages printed when running DeepSpeech.py. Additionally, these changes still work on a version of the engine I pulled around the 5th of June (I was quite busy recently and couldn't find the time to update everything with the newest changes made to the engine).

Now, your idea about using multiple characters to map characters with accents might work too (I'm not very familiar with how accented characters in French are pronounced), but don't forget to change the language model as well.

There might be some way to use this unicode to ascii mapping without changing the language model, where right before you score a sentence on the language model in util/spell.py at the function log_probability you call the change_line_to_unicode function on the sentence, but I didn't test that idea and not so certain it will work. You can certainly try that if you have a ready language model (or don't want to train another one)

Honestly, I know that this is not a really clean solution and has a lot of potential problems, but it works for me at the moment. I might try to find something cleaner and easier in the future, but don't have the time at the moment. There would be a lot of edge cases in trying to implement a solution that will work with all Unicode characters.

The network doesn't know what characters are, it just learns the probabilities of pieces of audio matching one of N different classes. In our case, the number N of classes happens to be 29 (n_character in the code), which is the 26 letters of the English alphabet plus three indices reserved for space, apostrophe and the CTC blank label (although we currently strip apostrophes from the source texts). If your alphabet has more letters, all you have to do is increase the number of classes accordingly, and possibly update text.py to map the Unicode characters to indices between 0 and n_character. Then no special pre and post-processing of the text is needed, and the network will learn your alphabet "natively".
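As a rough illustration of the class-count arithmetic above, here is a small sketch. The helper `count_classes` is hypothetical (it is not part of DeepSpeech), and the Macedonian alphabet string and the choice of extra symbols for it are assumptions made for the example.

```python
def count_classes(alphabet, extra_symbols=(" ", "'")):
    """Number of output classes: alphabet letters + extra symbols + CTC blank."""
    symbols = list(extra_symbols) + [c for c in alphabet if c not in extra_symbols]
    if len(symbols) != len(set(symbols)):
        raise ValueError("alphabet contains duplicate characters")
    return len(symbols) + 1  # +1 for the CTC blank label


ENGLISH = "abcdefghijklmnopqrstuvwxyz"
# Assumed 31-letter Macedonian alphabet, used here only as an example.
MACEDONIAN = "абвгдѓежзѕијклљмнњопрстќуфхцчџш"

print(count_classes(ENGLISH))                          # 26 + space + apostrophe + blank = 29
print(count_classes(MACEDONIAN, extra_symbols=(" ",)))  # 31 + space + blank
```

The point is only that the softmax size follows from the alphabet, whatever the script.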

Yes, while I was writing the previous comment I took a closer look at text.py and realized that I had wasted my time with multiple mappings. Realizing this from the start would have saved me a lot of time, but it is what it is.

Thanks anyway

P.S. I will leave my previous answer just for context, although it is pretty useless now.

@reuben @istojan
Hi friends,
Well, for my small French model, I must add é è ê ç à â ô ù û î - ' (12 characters) to a-z (26 characters).

Could you confirm the following?

In DeepSpeech.py:
I just have to replace (L.260) n_character = 29 with n_character = 41 (29+12).

And in text.py:
add a line like this for each new character: result = original.replace("u'\xe9'", "A").replace("u'\xe8'", "B")...

This doesn't work and produces errors.

The apostrophe is already included in the 29 we currently use; all you have to do is change the code in text.py to not remove it. As for the other text.py changes, you need to map those characters to integer values in the range 0-39, for example by making é be 27, è be 28, etc. You also have to change ndarray_to_text accordingly, otherwise the results in the reports will be garbled.
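A minimal sketch of what such a mapping and its inverse could look like. The table layout (space at 0, a-z at 1-26, é at 27, è at 28, and so on, with the hyphen and apostrophe at the end) is an assumption chosen to match the numbers mentioned above, and `text_to_indices` / `indices_to_text` are illustrative names, not the actual functions in text.py.

```python
# Hypothetical label table: index 0 is space, 1-26 are a-z, then the extras.
LABELS = [" "] + list("abcdefghijklmnopqrstuvwxyz") + list("éèêàâçûùîô") + ["-", "'"]
CHAR_TO_INDEX = {char: index for index, char in enumerate(LABELS)}


def text_to_indices(text):
    """Encode a transcript as class indices; raises KeyError on unknown characters."""
    return [CHAR_TO_INDEX[char] for char in text]


def indices_to_text(indices):
    """Inverse of text_to_indices, so src/res lines in reports stay readable."""
    return "".join(LABELS[index] for index in indices)


encoded = text_to_indices("été")
print(encoded)                   # [27, 20, 27]
print(indices_to_text(encoded))  # été
```

With this table the labels occupy 0-38, leaving 39 for the CTC blank, which gives the 0-39 range.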

@reuben
The function ndarray_to_text isn't used in text.py. Can you confirm that I must call it to change the output reports?

here is my function for remapping special characters :

ascii_converter.py

import numpy as np

########################################################################################
#
#   A simple routine for some non-ASCII conversion.
#   I'm limited to [a-z] letters
#   letter 'é' is seen as 99, 73   so need to remap 73 and erase 99
#   for use in DEEPSPEECH project
#
########################################################################################

# Index produced by text.py for each special character -> new class index
REMAP = {
    137: 27,  # 'é'
    136: 28,  # 'è'
    138: 29,  # 'ê'
    128: 30,  # 'à'
    130: 31,  # 'â'
    135: 32,  # 'ç'
    155: 33,  # 'û'
    153: 34,  # 'ù'
    142: 35,  # 'î'
    148: 36,  # 'ô'
    -51: 37,  # '-'
}


def read(data):
    """Remap special-character indices in place and return the same array."""
    result = data
    for position, value in enumerate(result):
        if int(value) in REMAP:
            result[position] = REMAP[int(value)]
    return result

I import the function in text.py:
import util.ascii_converter as AC

And call it here :

    # Map characters into indices
    result = np.asarray([SPACE_INDEX if xt == SPACE_TOKEN else ord(xt) - FIRST_INDEX for xt in result])
    # [99 73  3 15 21 20  5]

#*********************************  remapping function  ****************************************
    result = AC.read(result)  # see ascii_converter.py in util/
#***********************************************************************************************

It works nicely, but I'm not sure it's the best way to do it! If anyone has something better, I'll take it!

Thanks @reuben @istojan

Here is a result for 1200 Epochs of nearly 800 sentences (2-3s each)

I Test of Epoch 734 - WER: 0.683075, loss: 76.4165649414, mean edit distance: 0.260379
I --------------------------------------------------------------------------------
I WER: 0.166667, loss: 3.501996, mean edit distance: 0.030303
I  - src: "peux�tu boire et manger et dormir"
I  - res: "peux�tu boir et manger et dormir"
I --------------------------------------------------------------------------------
I WER: 0.250000, loss: 2.776680, mean edit distance: 0.041667
I  - src: "alfred as�tu des loisirs"
I  - res: "alfred as�tu des losirs"
I --------------------------------------------------------------------------------
I WER: 0.250000, loss: 13.145912, mean edit distance: 0.093750
I  - src: "je ne suis pas d accord avec toi"
I  - res: "je ne suis as accord avec toi"
I --------------------------------------------------------------------------------
I WER: 0.300000, loss: 19.457064, mean edit distance: 0.125000
I  - src: "je me demande si tu es d accord avec moi"
I  - res: "je me demende e si tu es accord avec moi"
I --------------------------------------------------------------------------------
I WER: 0.428571, loss: 23.876453, mean edit distance: 0.156250
I  - src: "est ce que tu aimes les sciences"
I  - res: "est ce que tu amns les ci encces"
I --------------------------------------------------------------------------------
I WER: 0.500000, loss: 4.260559, mean edit distance: 0.074074
I  - src: "diriges�toi vers la cuisine"
I  - res: "diriges�toi vers lar cuisie"
I --------------------------------------------------------------------------------
I WER: 0.500000, loss: 5.317667, mean edit distance: 0.088235
I  - src: "peux�tu te promener dans la maison"
I  - res: "peux�tu te prome aer dan la maison"
I --------------------------------------------------------------------------------
I WER: 0.750000, loss: 8.183546, mean edit distance: 0.115385
I  - src: "alfred donnes�moi la m{t{o"
I  - res: "alfred donne� moi la m{nt{o"
I --------------------------------------------------------------------------------
I WER: 1.000000, loss: 18.071476, mean edit distance: 0.200000
I  - src: "connais�tu axel"
I  - res: "connfis�tu axelle"
I --------------------------------------------------------------------------------
I WER: 1.000000, loss: 19.676701, mean edit distance: 0.192308
I  - src: "prom|nes�toi dans le salon"
I  - res: "rom|nes�toi dsnle saloin"
I --------------------------------------------------------------------------------

I'm working without any LM or words.txt.

What would be the benefits of an LM and words.txt?

My objective: create a small model (nearly 400 sentences) to interact vocally with a robot, using a chatbot.

Could you help me understand how the LM and words.txt are used here?

Ok, I can answer some of my own questions:
I created a KenLM binary model and a words.txt.
They both seem to improve WER.

Test of Epoch 734 - WER: 0.884683, loss: 76.5306167603, mean edit distance: 0.278231
I --------------------------------------------------------------------------------
I WER: 0.125000, loss: 10.322077, mean edit distance: 0.156250
I  - src: "je ne suis pas d accord avec toi"
I  - res: "je ne suis pas accord avec toi"
I --------------------------------------------------------------------------------
I WER: 0.428571, loss: 24.085228, mean edit distance: 0.156250
I  - src: "est ce que tu aimes les sciences"
I  - res: "est ce que tu aime les si elles"
I --------------------------------------------------------------------------------
I WER: 0.500000, loss: 3.111674, mean edit distance: 0.041667
I  - src: "alfred as�tu des loisirs"
I  - res: "alfred as tu des loisirs"
I --------------------------------------------------------------------------------
I WER: 0.500000, loss: 6.814869, mean edit distance: 0.090909
I  - src: "peux�tu boire et manger et dormir"
I  - res: "peux tu quoi et manger et dormir"
I --------------------------------------------------------------------------------
I WER: 0.500000, loss: 11.425927, mean edit distance: 0.074074
I  - src: "diriges�toi vers la cuisine"
I  - res: "diriges toi vers la cuisine"
I --------------------------------------------------------------------------------
I WER: 0.600000, loss: 21.864613, mean edit distance: 0.115385
I  - src: "es�tu capable d avoir faim"
I  - res: "es tu capable avoir faim"
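For readers comparing the per-sample numbers in these reports, here is a sketch of how a word error rate of this kind can be computed: the word-level edit distance divided by the number of words in the source. This is an assumption about the metric rather than a copy of the DeepSpeech code, and the function names are illustrative, but it does reproduce the values in the log lines above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(previous[j] + 1,               # deletion
                               current[j - 1] + 1,            # insertion
                               previous[j - 1] + (r != h)))   # substitution or match
        previous = current
    return previous[-1]


def word_error_rate(src, res):
    """Word-level edit distance divided by the number of source words."""
    ref_words = src.split()
    return edit_distance(ref_words, res.split()) / float(len(ref_words))


print(word_error_rate("je ne suis pas d accord avec toi",
                      "je ne suis pas accord avec toi"))  # 0.125
```

Because the denominator is the source length, a single dropped word in a short sentence already costs a lot, which is part of why short test utterances give such noisy WER figures.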

Hi all.
I'm finishing some mods to support any Unicode/UTF-8 alphabet, but I will certainly need tests on different languages!
DeepSpeech assigns the real ASCII value to each character. That's easy, but it rules out all special characters.

I use a different process:

You fill a text file "alphabet.txt" with your own alphabet, on a single line, with commas between the characters.
A function, imported in text.py, reads the number of characters from the alphabet line, assigns the "n_character" value, and replaces each character in a sentence with its position in the alphabet line.

Some mods in text.py and spell.py too (for full Unicode).
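A minimal sketch of the alphabet-file idea described above, under a few assumptions that are not stated in the post: lines starting with '#' are treated as comments, a space is encoded as 0, and positions are 1-based. `load_alphabet` and `encode` are hypothetical names, and an in-memory file stands in for alphabet.txt.

```python
import io


def load_alphabet(handle):
    """Return the characters from the first non-comment, comma-separated line."""
    for line in handle:
        line = line.strip()
        if line and not line.startswith("#"):
            return line.split(",")
    raise ValueError("no alphabet line found")


def encode(sentence, characters):
    """Replace each character by its 1-based alphabet position (0 for space)."""
    positions = {char: i for i, char in enumerate(characters, start=1)}
    return [0 if char == " " else positions[char] for char in sentence]


# In-memory stand-in for alphabet.txt (contents are illustrative).
characters = load_alphabet(io.StringIO(u"a,b,c,é,è\n"))
print(len(characters))             # 5
print(encode(u"bé a", characters))  # [2, 4, 0, 1]
```

One consequence of the comma-separated format is that a comma itself cannot be part of the alphabet.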

a small test :

I FINISHED Optimization - training time: 0:00:10
I Test of Epoch 10 - WER: 0.500000, loss: 11.0500001907, mean edit distance: 0.181818
I --------------------------------------------------------------------------------
I WER: 0.500000, loss: 11.050000, mean edit distance: 0.181818
I  - src: "écoutès moi"
I  - result: "écotès moi"
I --------------------------------------------------------------------------------
I Exporting the model...
I Restored checkpoint at training epoch 11

A question: why does the src/result block not appear when I increase the number of epochs?
The WER/loss/mean line appears every time.

At 10 epochs, both the DeepSpeech class Sample() and class Epoch() are executed:

I Test of Epoch 10 - WER: 0.500000, loss: 22.4760360718, mean edit distance: 0.636364
I --------------------------------------------------------------------------------
I WER: 0.500000, loss: 22.476036, mean edit distance: 0.636364
I  - src: "écoutès moi"

At 30 epochs, only the DeepSpeech class Epoch() is executed,
and no error is returned!

I Test of Epoch 30 - WER: 0.000000, loss: 1.17223775387, mean edit distance: 0.000000
I Exporting the model...

Ok.
I think it was due to too little training material!

Strange:
the result in the test report is correct,
I --------------------------------------------------------------------------------
I WER: 0.166667, loss: 0.038866, mean edit distance: 0.000000
I  - src: "cet engin spécial a plusieurs hélices"
I  - result: "te engin spécial a plusieurs hélices"
I --------------------------------------------------------------------------------

And when testing the model via client.py, it returns:
cet engin sp{cial a plusieurs h{lices

??

@elpimous Great work! If we can assist in any way, just let us know how we can help!

DeepSpeech with special characters is ready:
https://github.com/elpimous/alfred_robot/tree/master/Deepspeech
Follow the readme.md.
Post issues here.

TODO bug: testing the model replaces some special characters with others (easy to work around with a .replace('','')).

Hi, I need a little help!
I'm working on DeepSpeech with special characters and,
as you can see, training and testing work nicely.

However, when the model is tested outside of training, in client.py for example, special characters are replaced incorrectly!
E.g.: 'hélices' --> h{lices (the test report returns the right result, see previous posts)
Why?
My process is simple: replace 'é' by '5', and turn '5' back into 'é'.
It works for all 26 basic characters, so why not for special characters?

@elpimous I'm not an expert on this part of the code, but when you call client.py, inference occurs through a Python wrapping of a C++ interface to our inference engine. (See the native_client directory.)

This C++ code in turn does character conversion the "old-fashioned" way. See deepspeech.cc:112. So to move completely from plain ASCII codes to what you're doing, one would also have to change the C++ code.

I hope this helps. If not I'll try to dig a bit deeper for you.

Thanks Kdavis. I think it's exactly that! However, I don't know C++, only a bit of Python...

part of deepspeech.cc, in native_client :

  // Output is an array of shape (1, n_results, result_length).
  // In this case, n_results is also equal to 1.
  auto output_mapped = outputs[0].tensor<int64, 3>();
  int length = output_mapped.dimension(2) + 1;
  char* output = (char*)malloc(sizeof(char) * length);
  // i = character value
  for (int i = 0; i < length - 1; i++) {
    int64 character = output_mapped(0, 0, i);
    // I recover values in order, replace 0 to space, and character
    output[i] = (character ==  0) ? ' ' : (character + 'a' - 1);
    // assign to output[i] the value of character. need to modify this ": (character + 'a' - 1)"
    // in python, I pass the 'character value' in a convert function, and it returns the final character
  }
  output[length - 1] = '\0';
  return output;

Here is my how-to in Python:

import io
import os

ALPHABET_PATH = os.path.join(os.getcwd(), 'data', 'alphabet', 'alphabet.txt')

try:
    # io.open decodes the file as UTF-8 on both Python 2 and Python 3
    with io.open(ALPHABET_PATH, encoding='utf8') as alphabet:
        # read your alphabet file characters (line 2 of the file)
        characters = alphabet.readlines()[1]
    # can be useful in case of a line return after the last alphabet letter
    characters = characters.replace('\n', '')
    # split into a list of characters
    characters = characters.split(',')
    # number of characters in your personal alphabet
    characters_numbers = len(characters)
except (IOError, IndexError):
    print("error1")
    raise


def read(letter):
    """Convert a letter to its 1-based alphabet position (ex: a,b,c,d -> c=3)."""
    if letter in characters:
        letter = characters.index(letter) + 1
    else:
        print("error2")
    return letter


def write(data):
    """Convert a 1-based alphabet position back to the corresponding letter."""
    return characters[data - 1]

Could anyone find the time to integrate the write() Python function into deepspeech.cc?
Sorry guys, but I can't do it in C++ (my knowledge is too poor!)

I'll try, but I won't succeed for a very, very long time, lol!

I can try to take a look at it over the weekend.

One problem that will likely arise is that C++ support for unicode can be rather spotty.

Oh ! I didn't know that.. Hope it will work
Thanks Kdavis

@kdavis-mozilla
Hi Kelly. Did you have any success?

It's not a Unicode issue; it's just that @elpimous's work defines a new way of handling chars and the native_client/deepspeech.cc decoding code is not aware of it. His alphabet is:

# Enter here your personal alphabet, without any punctuation-space. See alphabet_readme.txt #
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,é,è

And the code at https://github.com/elpimous/alfred_robot/blob/master/Deepspeech/util/alphabet_converter.py#L40 basically builds an array of chars, so it is close to what we do with the ASCII computation. But deepspeech.cc has no knowledge of that array: it decodes a label by adding it to 'a' - 1. Because 'é' comes right after 'z' in his alphabet, it gets label 27, and when the model decodes what should be an 'é' the result is the ASCII code of the character after 'z', which is '{'.
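The mismatch can be reproduced in a few lines of Python. This is only a sketch: `encode` follows the 1-based alphabet positions described in the thread, and `decode_ascii_style` mimics the `(character == 0) ? ' ' : (character + 'a' - 1)` expression from deepspeech.cc rather than calling the native client.

```python
ALPHABET = "a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,é,è".split(",")


def encode(text):
    """1-based position in the custom alphabet, 0 for space."""
    return [0 if c == " " else ALPHABET.index(c) + 1 for c in text]


def decode_ascii_style(indices):
    """Decode labels the way the native client does, ignoring the alphabet."""
    return "".join(" " if i == 0 else chr(i + ord("a") - 1) for i in indices)


print(decode_ascii_style(encode("hélices")))   # h{lices
print(decode_ascii_style(encode("promènes")))  # prom|nes
```

The same arithmetic explains the `{` and `|` characters in the earlier test reports.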

@lissyx Yes unicode would fix the mapping from integer to character, obviate any need for alphabet.txt, and allow for a "," to also be used :-) I think it's a good idea.

@lissyx @elpimous I think the "problem" is we'd then have to introduce an embedding matrix like Deep Speech and Deep Speech 2 do for Mandarin. Not hard to do, but something that, nonetheless, must be done.

Unicode doesn't really fix this since we need the labels to be a contiguous sequence starting at 0, and for French for example you'd want to use ranges of Unicode code points that aren't adjacent. We need some sort of embedding matrix in all cases that aren't just plain ASCII.

Yes: for example, a-z and é are 1-26 and 233.
But sure, a solution will appear...

@elpimous We talked a bit more about this in our weekly meeting and we're going to follow, more or less, your suggestion. @reuben is going to integrate this into his current work on a deeper integration of the language model into our code.

So, soon switching languages should be relatively easy.

YESS..

@reuben
Hi. Will I be able to test a language-switching DeepSpeech soon?

@elpimous
I used your branch for another language and it works well, even though mine is totally different and based on the Arabic writing system. I too received mixed-up results when using client.py, but found out that I could remap each character again after evaluation, so now I get evaluation results in my writing system. I too don't understand C++ and it scares me! But with this remapping it just works, at least until this implementation supports changing languages natively.
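A sketch of what such a remapping step could look like, assuming the client decodes label i as the ASCII character 'a' + i - 1 and that the custom alphabet is available as a Python list. The function name `remap_client_output` is hypothetical, and the approach only works while every label decodes to a single character.

```python
def remap_client_output(text, alphabet):
    """Map ASCII-decoded client output back onto the custom alphabet."""
    fixed = []
    for char in text:
        index = ord(char) - ord("a")  # 0-based alphabet position implied by the decoder
        if char != " " and 0 <= index < len(alphabet):
            fixed.append(alphabet[index])
        else:
            fixed.append(char)  # spaces and anything out of range pass through
    return "".join(fixed)


ALPHABET = "a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,é,è".split(",")
print(remap_client_output("cet engin sp{cial a plusieurs h{lices", ALPHABET))
# cet engin spécial a plusieurs hélices
```

For an alphabet that does not start with a-z, the same lookup applies; only the list passed in changes.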

@elpimous Reuben has an alpha version, which is performing very well. However, he needs to further test the code before we integrate it in to master. I'd guess this will happen within the next one to two weeks.

@roxima What language are you training on?

@kdavis-mozilla Working on Persian, also targeting Kurdish.
Will @reuben share alpha version so that we could also check it out before being merged into master?

@kdavis-mozilla I also noticed something, maybe unusual. When resuming training after it has been cancelled, the train loss tends to increase, somewhat drastically, for some epochs before going down again. Is this normal? By the way, I have not tuned the parameters to my own needs, as I just wanted to see whether the solution by @elpimous works for me, and it seems to.

@roxima "Working on Persian, also targeting Kurdish" Awesome! Made my day!

@kdavis-mozilla Thanks to you, main developers, and my gratitude to @elpimous. There is always just the problem of GPU access. Currently limited by my friend's GT740M but I'm still satisfied until one day I get more resources.

@elpimous @roxima Yep! I've been working on the decoder lately, and it requires the more significant changes to support custom alphabets, but now that that part is almost done, adapting the other parts of the code is easier. I'll apply the changes to the Python and Native Client code and land that separately from the decoder work, since there's no reason they have to land together.

@roxima Thanks for your nice feedback !! Happy ! (even if my github isn't easy to read lol)
@reuben Thanks for your work ! impatient to test. (thanks to all the team too !)

@roxima I'm Persian, and I want to train DeepSpeech for the Persian language.
Are you still working on it?
How is the quality of the results?

@reuben
@kdavis-mozilla
Hello! I'm the co-founder of a B2B voice assistant service. I think we have enough data to build Russian language support. How can we add support for Russian in DeepSpeech?

@alejandro-pnz Quick question: How many hours of transcribed audio do you have?

Right now we have about 50 hours, and we expect to double that every month.

This will not be enough to train an engine from scratch.

To give a feel for scale. We trained our v0.2.0 model on about 3700 hours of English, and our model, in truth, needs more training data.

Okay, I've got it. I hope to get back later to it after getting enough data.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
