ChatterBot: UbuntuCorpusTrainer fails on non-existent "data/ubuntu_dialogs" directory after extraction.

Created on 9 Apr 2017 · 8 Comments · Source: gunthercox/ChatterBot

Using the example from ChatterBot/examples/ubuntu_corpus_training_example.py, the ubuntu_dialogs.tgz archive is downloaded and extracted successfully. However, the bot doesn't seem to pick responses from it after .train(), and it also seems to want to extract the archive on every run. I've tried completely removing the data directory and re-training, but nothing I've tried has helped. See below for details:

ubuntu_corpus_training_example.py with comments removed:

from chatterbot import ChatBot
import logging

logging.basicConfig(level=logging.INFO)

chatbot = ChatBot(
    'Example Bot',
    trainer='chatterbot.trainers.UbuntuCorpusTrainer'
)

chatbot.train()

response = chatbot.get_response('How are you doing today?')
print(response)

The contents of the data directory:

drwxr-xr-x    4 dmi29  1226777485   136B  9 Apr 14:56 .
drwxr-xr-x    8 dmi29  1226777485   272B  9 Apr 17:45 ..
drwx------  352 dmi29  1226777485    12K 26 Apr  2015 dialogs
-rw-r--r--    1 dmi29  1226777485   527M  9 Apr 14:56 ubuntu_dialogs.tgz

Execution:

Extracting dialogs/10/9995.tsv
Extracting dialogs/10/9996.tsv
Extracting dialogs/10/9997.tsv
Extracting dialogs/10/9998.tsv
Extracting dialogs/10/9999.tsv
... etc ...
INFO:chatterbot.trainers:File extraction complete
INFO:chatterbot.adapters:Recieved input statement: How are you doing today?
INFO:chatterbot.adapters:"How are you doing today?" is a known statement
INFO:chatterbot.adapters:No statements have known responses. Choosing a random response to return.
INFO:chatterbot.adapters:Using "How are you doing today?" as a close match to "How are you doing today?"
INFO:chatterbot.adapters:No response to "How are you doing today?" found. Selecting a random response.
INFO:chatterbot.adapters:BestMatch selected "How are you doing today?" as a response with a confidence of 0
INFO:chatterbot.adapters:NoKnowledgeAdapter selected "How are you doing today?" as a response with a confidence of 0
How are you doing today?

bug

monokal

👍1

All 8 comments

Having dug a little deeper, it seems the issue is that the code at https://github.com/gunthercox/ChatterBot/blob/master/chatterbot/trainers.py#L216 expects to find the Ubuntu dialogs in data/ubuntu_dialogs, whereas the extraction process places them in data/dialogs. After renaming the directory to data/ubuntu_dialogs, the dialogs seem to be getting read:

[2017-04-11 10:49:02,396][InfraBot][DEBUG] Loaded config: {'general': {'name': 'DELboy'}, 'training': {'readonly': True, 'trainer': 'chatterbot.trainers.UbuntuCorpusTrainer'}}
/usr/local/lib/python3.6/site-packages/chatterbot/storage/jsonfile.py:26: UnsuitableForProductionWarning: The JsonFileStorageAdapter is not recommended for production environments.
  self.UnsuitableForProductionWarning
[2017-04-11 10:49:02,683][InfraBot][INFO] Training DELboy...
thats not using 3d glasses 4
i have not seen his second line 4
crank all the sliders up 4
or buy a preamp ;) 4
all up and i dont want to use a preamp on my eeepc 4
i'm pretty sure the speaker in an eepc will be fairly poor 4
on windows its much loader then on ubuntu 4
so i think its a software-problem 4
try a different driver or different driver settings 4
same deal, yours is quicker 4
i think so :) 4
... etc ...
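
The rename workaround described above could be scripted roughly as follows. This is only a sketch under the assumptions stated in this comment: that the archive extracts to data/dialogs and that the trainer looks in data/ubuntu_dialogs. The helper name ensure_ubuntu_dialogs_dir is hypothetical and is not part of ChatterBot.

```python
import os


def ensure_ubuntu_dialogs_dir(data_dir):
    """Rename <data_dir>/dialogs to <data_dir>/ubuntu_dialogs if needed.

    Hypothetical helper, not part of ChatterBot. It assumes the layout
    reported in this issue: extraction creates "dialogs", while the
    trainer expects "ubuntu_dialogs".
    """
    extracted = os.path.join(data_dir, "dialogs")
    expected = os.path.join(data_dir, "ubuntu_dialogs")

    # Only rename when the extracted directory exists and the expected
    # one does not, so repeated calls are harmless.
    if os.path.isdir(extracted) and not os.path.exists(expected):
        os.rename(extracted, expected)

    return expected


if __name__ == "__main__":
    # Run once after the archive has been extracted, before chatbot.train().
    print(ensure_ubuntu_dialogs_dir("data"))
```

This only works around the path mismatch locally; it does not change what the trainer itself does.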

monokal on 11 Apr 2017

👍4

@monokal Sorry for my late response! Thank you for the detailed bug report, I will aim to have this corrected in the next release.

gunthercox on 18 Apr 2017

👍4

Hey guys, just curious: have you been able to train on the data? After applying this patch, and another one regarding UTF-8 data formatting (only applicable in Python 2.7), I was able to start the training process. It was very slow, and after several minutes it appeared to have processed only a couple of megabytes of training data. Additionally, what is the time complexity of a search? At first glance it would appear that we search the entire space for a best match.

coreyauger on 8 Jun 2017

@coreyauger You are correct, the primary issue is that the chat bot needs to search the entire space for each response. I am working on a solution that uses caching as well as additional filtering for optimization, but this is still largely an unsolved problem as far as I am aware, because there isn't an efficient way to do NLP operations at the database level.
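
To illustrate why searching the entire space is costly, here is a minimal sketch of a linear best-match scan. It is not ChatterBot's actual implementation: the best_match function and the use of difflib.SequenceMatcher as the similarity measure are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def best_match(query, known_statements):
    """Return (statement, score) for the closest known statement.

    Illustrative only. Every stored statement is compared against the
    query, so a single lookup grows linearly with the number of
    statements, before counting the cost of each comparison.
    """
    best, best_score = None, 0.0
    for candidate in known_statements:
        score = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


statements = [
    "How are you doing today?",
    "try a different driver or different driver settings",
    "crank all the sliders up",
]
match, score = best_match("how are you doing today", statements)
print(match, round(score, 3))
```

With a corpus the size of the Ubuntu dialogs, repeating this scan for every input is what makes responses slow.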

gunthercox on 9 Jun 2017

Thanks for the quick response. So I would be right in assuming that it is currently unfeasible to experiment with the entire ubuntu corpus? Thanks again :)

coreyauger on 9 Jun 2017

At the current time I would recommend avoiding training a chat bot with the entire Ubuntu corpus. This is a problem that I'm working to resolve but I don't have an ideal solution yet.

gunthercox on 9 Jun 2017

👍1

I have some ideas on how you could accomplish this. If you are interested, I can write them up in more detail and share them with you.

Also, it is pretty cool that your project is mentioned in the upcoming book:
https://www.manning.com/books/natural-language-processing-in-action

Which is how I found it to begin with :)

Thanks again!

coreyauger on 9 Jun 2017

@coreyauger No problem, I would be happy to hear any ideas you have.

gunthercox on 9 Jun 2017
