Transformers: CSV/JSON file format for examples/token-classification/run_ner.py

Created on 21 Nov 2020 · 7 comments · Source: huggingface/transformers

Environment info

  • transformers version: 3.5.0
  • Platform: Linux-3.10.0-1160.6.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core
  • Python version: 3.6.8
  • PyTorch version (GPU?): 1.6.0 (False)
  • Tensorflow version (GPU?): 2.3.1 (False)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help

@mfuntowicz, @stefan-it

Information

Model I am using (Bert, XLNet ...): XLM-R

The problem arises when using:

  • [x] the official example scripts: (give details below)

The task I am working on is:

  • [x] my own task or dataset: (give details below)

https://github.com/huggingface/transformers/tree/master/examples/token-classification

python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --train_file path_to_train_file \
  --validation_file path_to_validation_file \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval

I am trying to perform NER on a custom dataset. It's not clear what the format of path_to_train_file and path_to_validation_file should be. From the code, it seems the file format should be CSV or JSON. Could you please give more details on this so that I can format my dataset accordingly?

Thanks.

All 7 comments

Hi @ganeshjawahar, please have a look at the run_ner_old.py script! It should handle custom files.

Usage and more examples are documented here:

https://github.com/huggingface/transformers/tree/master/examples/token-classification#old-version-of-the-script

Thanks for the quick response. I'm able to make use of run_ner_old.py with my custom dataset. Is there similar documentation for using run_ner.py with a custom dataset?

P.S.: run_ner_old.py loads all examples into RAM, and that's a problem for me because my custom dataset is very large. I was thinking of getting around this by using run_ner.py, which uses the datasets library.

If you can provide a tiny example of the CSV or JSON format, that would be very helpful.

Ah, I see. An example of the JSON-based file format can be found here:

https://github.com/huggingface/transformers/blob/master/tests/fixtures/tests_samples/conll/sample.json
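For reference, here is a rough sketch of that one-JSON-object-per-line layout and of loading it with the generic json loader in the datasets library. The column names "tokens" and "ner_tags" below are illustrative assumptions, not necessarily the keys used in sample.json, so check the linked file for the names run_ner.py actually picks up.

# Minimal sketch (not the official sample): one JSON object per line, each with
# parallel lists of tokens and tags. The keys "tokens" and "ner_tags" are
# assumptions for illustration; check sample.json for the exact names.
import json
from datasets import load_dataset

examples = [
    {"tokens": ["EU", "rejects", "German", "call"], "ner_tags": ["B-ORG", "O", "B-MISC", "O"]},
    {"tokens": ["Peter", "Blackburn"], "ner_tags": ["B-PER", "I-PER"]},
]

with open("train.json", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")   # JSON Lines: one object per line

# The generic json loader reads this newline-delimited layout:
dataset = load_dataset("json", data_files={"train": "train.json"})
print(dataset["train"][0])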

Another possibility would be to write a custom recipe with the Hugging Face datasets library. Then you can run the run_ner.py script by passing the (local) path of your recipe to it. Just have a look at the CoNLL dataset/recipe:

https://github.com/huggingface/datasets/blob/master/datasets/conll2003/conll2003.py

You could use it as a template and modify it for your needs. A very rough sketch of such a recipe follows below.
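As a very rough sketch (not the official recipe), a minimal loading script along those lines could look like the following, assuming a simple whitespace-separated "token tag" file with blank lines between sentences; the class name, file names, and label set are placeholders:

# Minimal sketch of a custom datasets loading script, modelled on conll2003.py.
# The class name, file names, and label set are placeholders.
import datasets


class MyNerDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "tokens": datasets.Sequence(datasets.Value("string")),
                    "ner_tags": datasets.Sequence(
                        datasets.ClassLabel(names=["O", "B-ENT", "I-ENT"])
                    ),
                }
            )
        )

    def _split_generators(self, dl_manager):
        # Local files are hardcoded here for brevity; a real recipe might use dl_manager.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"filepath": "train.txt"}
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION, gen_kwargs={"filepath": "dev.txt"}
            ),
        ]

    def _generate_examples(self, filepath):
        # One "token tag" pair per line, blank line between sentences.
        guid, tokens, tags = 0, [], []
        with open(filepath, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    if tokens:
                        yield guid, {"tokens": tokens, "ner_tags": tags}
                        guid, tokens, tags = guid + 1, [], []
                else:
                    token, tag = line.split()
                    tokens.append(token)
                    tags.append(tag)
            if tokens:
                yield guid, {"tokens": tokens, "ner_tags": tags}

You would then point the script at the local path of this recipe instead of a hub dataset name, as suggested above.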

I think the JSON sample should be in the token-classification README for people trying to use run_ner.py from local files. Would you also be willing to provide a CSV sample? So far, I have found through trial, error, and code deciphering that:

  • The CSV needs to start with a header row of column names (not respecting this causes ValueError: External features info don't match the dataset)
  • The column separator should be a comma (,)
  • Text containing commas should be in double quotes (like this ",") to disambiguate columns
  • Literal double quotes should be escaped with \

Right now, my CSV file looks like this:

token,label
DC,M
##T,M
##N,M
##4,M
as,O
a,O
m,O
##od,O
##ifier,O
...

I get the following error:

File "projects/github/transformers/examples/token-classification/run_ner.py", line 221, in main
    if isinstance(features[label_column_name].feature, ClassLabel):
AttributeError: 'Value' object has no attribute 'feature'

Using the Python debugger, I've found that features[label_column_name] is Value(dtype='string', id=None), but I don't know whether this is expected behavior. I can only assume that it isn't, but I can't seem to figure out what else features[label_column_name] could or should be.

I'm pretty much stuck, and knowing if the issue comes from the structure of my CSV would be very helpful.
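For what it's worth, the isinstance check in that traceback hints that the script expects the label column to be a Sequence feature (one list of tags per example), whereas a one-token-per-row CSV yields a plain Value(dtype='string') column, which has no .feature attribute. A small sketch of the two shapes, with assumed column names, just to illustrate the difference:

# Sketch only: illustrating the feature shapes involved in that isinstance check.
# Column names ("token"/"label", "tokens"/"ner_tags") are placeholders.
from datasets import ClassLabel, Features, Sequence, Value

# What a one-token-per-row CSV produces: one string per row, so the label
# column is a plain Value and has no .feature attribute -> AttributeError.
csv_features = Features({"token": Value("string"), "label": Value("string")})

# What the check in run_ner.py appears to assume: each example is a whole
# sentence, so the label column is a Sequence (ideally of ClassLabel).
expected_features = Features(
    {
        "tokens": Sequence(Value("string")),
        "ner_tags": Sequence(ClassLabel(names=["O", "B-ENT", "I-ENT"])),
    }
)

print(type(csv_features["label"]))               # Value -> no .feature
print(expected_features["ner_tags"].feature)     # ClassLabel(...)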

Furthermore, I've tried formatting my data as closely as I could to the conll sample, but I get the following error:

json.decoder.JSONDecodeError: Extra data: line 2 column 1

After a little googling, it turns out that, as I suspected, a standard JSON file cannot contain multiple top-level JSON objects. So if the intended JSON format for run_ner.py requires one JSON object per sequence, but a JSON file can't contain more than one object, how can we get run_ner.py to work with several sequences in JSON mode?
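In case it helps others hitting the same JSONDecodeError: a plain json.load does refuse a file with several top-level objects, but a file with one JSON object per line (JSON Lines) can be parsed line by line, which appears to be the layout of the conll sample. A small sketch, assuming "train.json" contains one object per line:

# Sketch: the "Extra data" error comes from parsing the whole file as a single
# JSON document; a file with one object per line is parsed per line instead.
import json

with open("train.json") as f:          # assumed: one JSON object per line
    lines = [line for line in f.read().splitlines() if line.strip()]

try:
    json.loads("\n".join(lines))       # several top-level objects -> "Extra data"
except json.JSONDecodeError as e:
    print("single-document parse fails:", e)

examples = [json.loads(line) for line in lines]
print(len(examples), "sequences parsed")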

Exact same process/issue/errors as @gpiat. It would be very helpful if the format for the CSV option of run_ner.py were explicitly defined in the README. If there were a fully functional sample input for the CSV option, it would be much simpler to modify our custom data to match the sample than to write a custom recipe.
