spaCy: How to train the NER to recognize addresses

Created on 28 Nov 2017 · 3 Comments · Source: explosion/spaCy

I have a DB of 50 million places, street names, states, towns, countries.
What is the best way to train spaCy NER to recognize addresses with this DB?

training usage

All 3 comments

This depends on what exactly you want spaCy to recognise and what your data looks like – does your database contain only the places, or also texts with those entities in context? If you want spaCy to recognise those addresses in context, you also need training examples of those entities in context. The examples in context should also be similar to the data you later want to use the model on.

Ultimately, you'll have to experiment with different approaches and see what works best for your data. Here are some ideas if you only have the places and addresses, but no context:

  • Create sentence templates that are similar to the data you're looking to analyse, and randomly fill them in with entries from your database. For example, if your application needs to recognise addresses in email conversations, a template could look like: "Our office is located at [STREET] in [CITY]". If you're building a conversational application, you might want to use templates like "Find me directions to [ADDRESS]" or "Book me a flight to [CITY] via [CITY]." Based on these templates, you can then create training data for the entity recognizer.

  • Use the PhraseMatcher: create match patterns from the countries, cities etc. in your database and run the matcher over a large corpus of sentences. Then you can use the sentences containing matches to create training data that's closer to real-world examples.
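
The template idea from the first bullet can be sketched in plain Python. This is only an illustration under stated assumptions: the `PLACES` dictionary, the `TEMPLATES` list and the helpers `fill_template` and `make_examples` are hypothetical stand-ins for your database and your own templates, not part of spaCy. The output uses the `(text, {"entities": [(start, end, label)]})` tuple format with character offsets, which is the shape spaCy v2's training examples use; check the documentation for the version you're on.

```python
import random
import re

# Hypothetical stand-ins for the real database and for your own templates.
PLACES = {
    "STREET": ["12 Main Street", "5 Rue de Rivoli"],
    "CITY": ["Berlin", "San Francisco"],
}
TEMPLATES = [
    "Our office is located at [STREET] in [CITY]",
    "Book me a flight to [CITY] via [CITY].",
]

# Matches slots such as [STREET] or [CITY].
SLOT = re.compile(r"\[([A-Z]+)\]")


def fill_template(template, places, rng=random):
    """Fill every [LABEL] slot and record the character span of each filler."""
    text = ""
    entities = []
    last = 0
    for match in SLOT.finditer(template):
        label = match.group(1)
        value = rng.choice(places[label])
        text += template[last:match.start()]  # literal text before the slot
        start = len(text)
        text += value
        entities.append((start, len(text), label))
        last = match.end()
    text += template[last:]  # literal text after the final slot
    return text, {"entities": entities}


def make_examples(n, seed=0):
    """Build n annotated sentences from randomly chosen templates."""
    rng = random.Random(seed)
    return [fill_template(rng.choice(TEMPLATES), PLACES, rng) for _ in range(n)]


examples = make_examples(5)
```

Because the offsets are computed while the sentence is assembled, each recorded span always lines up with the inserted value, so `text[start:end]` gives back the database entry. With a real database you would sample entries rather than hold them all in memory, and you would want many more templates than this to avoid the model simply memorising the sentence frames.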

Given the size of your database, you'll likely end up with a very large training corpus as well. So you should also look into some tips and strategies for batching up your training examples and experimenting with different hyperparameters.
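
One common batching strategy is to shuffle the examples every epoch and feed them to the model in mini-batches whose size grows gradually. The sketch below shows that pattern in plain Python. The function names and the start, stop and factor values are illustrative assumptions, not recommended settings. spaCy v2's own training examples use `spacy.util.minibatch` together with `spacy.util.compounding` for the same purpose, so prefer those helpers if they are available in your version.

```python
import random


def compounding(start, stop, factor):
    """Yield batch sizes that grow geometrically from start and cap at stop."""
    size = start
    while True:
        yield min(size, stop)
        size *= factor


def minibatches(items, sizes):
    """Split items into consecutive batches, taking each length from `sizes`."""
    items = list(items)
    i = 0
    while i < len(items):
        n = max(1, int(next(sizes)))
        yield items[i:i + n]
        i += n


def epochs(examples, n_epochs, seed=0):
    """Yield one list of batches per epoch, reshuffling the data each time."""
    rng = random.Random(seed)
    data = list(examples)
    for _ in range(n_epochs):
        rng.shuffle(data)
        # Illustrative schedule: start at 4 examples, grow by 1.5x, cap at 32.
        yield list(minibatches(data, compounding(4.0, 32.0, 1.5)))


# Integers stand in for training examples here; in a real loop each batch
# would be passed to the model's update step.
batches = next(epochs(range(100), n_epochs=1))
```

Small early batches give the model frequent updates while the weights are still far off, and larger later batches make each pass over a big corpus cheaper. The dropout rate and number of epochs are the other obvious hyperparameters to vary.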

You'll also need to think about which labels you want the entity recognizer to learn – do you simply want to improve one of the built-in entity types like GPE and LOC, or do you want to add your own labels like STREET, STATE or CITY?

Here are some of the relevant sections in the documentation:

You also may want to take a look here: https://github.com/openeventdata/mordecai

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
