ChatterBot: Chinese corpus UnicodeDecodeError

Created on 26 Dec 2016  ·  9 Comments  ·  Source: gunthercox/ChatterBot

I use Windows 10 and Python 2.7.
This is my file, test.py:

# -*- coding: utf-8 -*-
from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer
chinesebot = ChatBot("Training Example")
chinesebot.set_trainer(ChatterBotCorpusTrainer)
chinesebot.train("chatterbot.corpus.chinese")
chinesebot.get_response("早上好,你好吗?")

Then I run python test.py and get this error:

F:\AnacondaWork\lib\site-packages\chatterbot\storage\jsonfile.py:30: UnsuitableForProductionWarning: The J
 not recommended for production environments.                                                             
  self.UnsuitableForProductionWarning                                                                     
[nltk_data] Downloading package stopwords to                                                              
[nltk_data]     C:\Users\80920\AppData\Roaming\nltk_data...                                               
[nltk_data]   Package stopwords is already up-to-date!                                                    
[nltk_data] Downloading package wordnet to                                                                
[nltk_data]     C:\Users\80920\AppData\Roaming\nltk_data...                                               
[nltk_data]   Package wordnet is already up-to-date!                                                      
[nltk_data] Downloading package punkt to                                                                  
[nltk_data]     C:\Users\80920\AppData\Roaming\nltk_data...                                               
[nltk_data]   Package punkt is already up-to-date!                                                        
[nltk_data] Downloading package vader_lexicon to                                                          
[nltk_data]     C:\Users\80920\AppData\Roaming\nltk_data...                                               
[nltk_data]   Package vader_lexicon is already up-to-date!                                                
Traceback (most recent call last):                                                                        
  File "test.py", line 9, in <module>                                                                     
    chinesebot.train("chatterbot.corpus.chinese")                                                         
  File "F:\AnacondaWork\lib\site-packages\chatterbot\trainers.py", line 117, in train                     
    trainer.train(pair)                                                                                   
  File "F:\AnacondaWork\lib\site-packages\chatterbot\trainers.py", line 82, in train                      
    statement = self.get_or_create(text)                                                                  
  File "F:\AnacondaWork\lib\site-packages\chatterbot\trainers.py", line 25, in get_or_create              
    statement = self.storage.find(statement_text)                                                         
  File "F:\AnacondaWork\lib\site-packages\chatterbot\storage\jsonfile.py", line 46, in find               
    values = self.database.data(key=statement_text)                                                       
  File "F:\AnacondaWork\lib\site-packages\jsondb\db.py", line 98, in data                                 
    return self._get_content(key)                                                                         
  File "F:\AnacondaWork\lib\site-packages\jsondb\db.py", line 52, in _get_content                         
    obj = self.read_data(self.path)                                                                       
  File "F:\AnacondaWork\lib\site-packages\jsondb\file_writer.py", line 15, in read_data                   
    obj = decode(content)                                                                                 
  File "F:\AnacondaWork\lib\site-packages\jsondb\compat.py", line 28, in decode                           
    return json_decode(value, encoding='utf-8')                                                           
  File "F:\AnacondaWork\lib\json\__init__.py", line 352, in loads                                         
    return cls(encoding=encoding, **kw).decode(s)                                                         
  File "F:\AnacondaWork\lib\json\decoder.py", line 364, in decode                                         
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())                                                     
  File "F:\AnacondaWork\lib\json\decoder.py", line 380, in raw_decode                                     
    obj, end = self.scan_once(s, idx)                                                                     
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd4 in position 0: invalid continuation byte          

How can I suppress console output such as [nltk_data] Downloading ...,
and how do I solve the error?
I can't find the same error in any other issue.
Thanks for your support.

bug

All 9 comments

@sunchenguang The NLTK downloader looks for the required data packages on your machine; if any of them is missing, it downloads it from the server.

I think Windows often can't convert Unicode characters properly. Have you seen the same issue on a Linux machine?

REF Link: https://wiki.python.org/moin/PrintFails
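
A minimal sketch (Python 3) of one way a message like the one in the traceback can arise. It assumes the text was stored in a legacy Windows code page such as GBK and later read back as UTF-8; the thread does not confirm that this is what happened here, so treat it as an illustration only.

```python
# Assumption: the stored bytes were written in GBK (a common code page on
# Chinese-language Windows systems) rather than UTF-8.
text = "早上好"

gbk_bytes = text.encode("gbk")      # legacy double-byte encoding
utf8_bytes = text.encode("utf-8")   # what a UTF-8 JSON reader expects

try:
    # Reading GBK bytes as UTF-8 fails on the very first byte.
    gbk_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(hex(gbk_bytes[0]))
print(error_message)

# Decoding with the matching codec recovers the original text.
recovered = gbk_bytes.decode("gbk")
```

If this is the cause, making sure every file involved (the script and the stored database) is really saved as UTF-8 should avoid the mismatch.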

@sunchenguang Try removing database.db and re-running your script with the modification below. It works fine on my machine.

```Diff
--- a/chatterbot/input/input_adapter.py
+++ b/chatterbot/input/input_adapter.py
@@ -19,14 +19,14 @@ class InputAdapter(Adapter):
         Return an existing statement object (if one exists).
         """
         input_statement = self.process_input(*args, **kwargs)
-        self.logger.info('Recieved input statement: {}'.format(input_statement.text))
+        self.logger.info('Recieved input statement: {!r}'.format(input_statement.text))
 
         existing_statement = self.chatbot.storage.find(input_statement.text)
 
```

I found the file at:

F:\AnacondaWork\Lib\site-packages\chatterbot\input\input_adapter.py

I modified the file as you did, but it doesn't work.
I used the same test.py file and the console shows the same error.

Did you remove the previous database.db file?

Yes, I did. I would like to try it on Python 3.

@vkosuri This issue also happens when training the bot with other Unicode languages such as Persian. More details can be found in this closed but unsolved bug.

The same code with the Chinese corpus works in Python 3.

Maybe Python 2 has an encode/decode bug. I'm just bypassing the problem.
@vkosuri
@onlydarkknight

@sunchenguang
Yeah, Python 2 has encode/decode issues when using ChatterBot. Python 3 is a good choice.
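
For illustration, here is a small sketch of reading and writing JSON that contains Chinese text with an explicit UTF-8 encoding, which is the kind of round trip the traceback fails on. The helper names `save_statements` and `load_statements` are hypothetical and are not part of ChatterBot or jsondb; the code is written for Python 3.

```python
import io
import json
import os
import tempfile


def save_statements(path, statements):
    """Write a dict to disk as UTF-8 JSON (hypothetical helper)."""
    with io.open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps the characters readable in the file.
        handle.write(json.dumps(statements, ensure_ascii=False))


def load_statements(path):
    """Read UTF-8 JSON back into a dict (hypothetical helper)."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return json.loads(handle.read())


# Round-trip a tiny example through a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
save_statements(demo_path, {"早上好": ["你好吗?"]})
restored = load_statements(demo_path)
print(restored)
```

Naming the encoding on both the write and the read side avoids relying on the platform default, which on Windows is often not UTF-8.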

@sunchenguang I also observed that this issue appears when training with an existing database.db. Try removing database.db and retraining your bot; it may work. Similar issue: https://github.com/gunthercox/ChatterBot/issues/567
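
A minimal sketch of the "remove database.db before retraining" step. The function name `remove_stale_database` is hypothetical, and the default path assumes the file sits in the directory the script is run from, as in the thread.

```python
import os


def remove_stale_database(path="database.db"):
    """Delete a leftover database file; return True if one was removed."""
    if os.path.exists(path):
        os.remove(path)
        return True
    return False


# Call this before creating the bot and training it again.
removed = remove_stale_database()
print("removed old database" if removed else "no old database found")
```

This only clears data written by earlier runs; it does not change how new data is encoded.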
