spacy-lookup: Named Entity Recognition based on dictionaries
spaCy v2.0 extension and pipeline component for adding Named Entities metadata to
Doc objects. Detects Named Entities using dictionaries. The extension sets custom
attributes such as ._.has_entities and ._.is_entity.
Named Entities are matched using the Python module
flashtext, which looks them up in the data provided by different dictionaries.
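To make the idea of dictionary-based matching concrete, here is a minimal standard-library sketch. It is not flashtext and not what spacy-lookup runs internally; the function name `find_keywords` and its return shape are assumptions made for illustration only. It finds keywords case-insensitively and prefers the longest keyword starting at a given word.

```python
import re


def find_keywords(text, keywords):
    """Return (keyword, start_char, end_char) for each dictionary hit.

    Hypothetical helper: a simplified stand-in for a keyword matcher.
    Words are compared as sequences, so punctuation between two words
    is ignored in this sketch.
    """
    # Index keywords by their lower-cased word sequence.
    index = {tuple(k.lower().split()): k for k in keywords}
    max_len = max((len(key) for key in index), default=0)

    # Collect every word together with its character offsets.
    words = [(m.group().lower(), m.start(), m.end())
             for m in re.finditer(r"\w+", text)]

    hits = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(w for w, _, _ in words[i:i + n])
            if candidate in index:
                hits.append((index[candidate], words[i][1], words[i + n - 1][2]))
                i += n  # skip past the matched words
                break
        else:
            i += 1  # no keyword starts at this word
    return hits


text = "I am a product manager for a java platform and python."
keywords = ["python", "product manager", "java platform"]
hits = find_keywords(text, keywords)
for keyword, start, end in hits:
    print(keyword, start, end)
```

The character offsets play the role that match spans play in a real pipeline: they say which stretch of the text each dictionary entry covers.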
Requires spaCy v2.0.16 or higher.
pip install spacy-lookup
First, you need to download a language model.
python -m spacy download en
Import the component and initialise it with the shared nlp object (i.e. an instance of Language), which is used to initialise flashtext with the shared vocab and create the match patterns. Then add the component anywhere in your pipeline.
    import spacy
    from spacy_lookup import Entity

    nlp = spacy.load('en')
    entity = Entity(keywords_list=['python', 'product manager', 'java platform'])
    nlp.add_pipe(entity, last=True)

    doc = nlp(u"I am a product manager for a java and python.")
    # Doc-level flag: at least one keyword was found
    assert doc._.has_entities == True
    # Token-level attributes: "I" is not an entity,
    # the merged "product manager" token is
    assert doc[0]._.is_entity == False
    assert doc[3]._.entity_desc == 'product manager'
    assert doc[3]._.is_entity == True

    print([(token.text, token._.canonical) for token in doc if token._.is_entity])
spacy-lookup only cares about the token text, so you can use it on a blank Language instance (it should work for all available languages!), or in a pipeline with a loaded model. If you're loading a model and your pipeline includes a tagger, parser and entity recognizer, make sure to add the entity component with last=True, so the spans are merged at the end of the pipeline.
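The effect of merging spans can be illustrated without spaCy. The sketch below is a plain-Python approximation, not spaCy's own merging code; `merge_spans` is a hypothetical helper. It shows how merging multi-word matches changes the token sequence and shifts token indices, which is one reason to run the merge after the other components have processed the original tokens.

```python
def merge_spans(tokens, spans):
    """Merge each (start, end) token span (end exclusive) into one token.

    Hypothetical helper for illustration; spans are assumed not to overlap.
    """
    ends = {start: end for start, end in spans}
    merged = []
    i = 0
    while i < len(tokens):
        end = ends.get(i)
        if end is not None:
            # Join the words of the span into a single token.
            merged.append(" ".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["I", "am", "a", "product", "manager", "for",
          "a", "java", "platform", "and", "python", "."]
# Token spans for "product manager" and "java platform".
merged = merge_spans(tokens, [(3, 5), (7, 9)])
print(merged)
```

After merging, the multi-word entity sits at a single index, and every later token moves one position to the left for each word that was absorbed.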
The extension sets attributes on the Token. You can change the attribute names on initialisation of the extension. For more details on custom components and attributes, see the processing pipelines documentation.
On initialisation of Entity, you can define the following settings:
    entity = Entity(nlp, keywords_list=['python', 'java platform'], label='ACME')
    nlp.add_pipe(entity)

    doc = nlp(u"I am a product manager for a java platform and python.")
    # is_entity is a Token attribute, so check it on the tokens
    assert any(token._.is_entity for token in doc)