Machine Learning, NLP, Programming, Python, spaCy

How To Extract Human Names Using Python spaCy

Introduction

We will go over How To Extract Human Names Using Python spaCy.

Did you know that spaCy supports multiple languages?

I will break this guide into the following sections:

  • What is spaCy and how it can help you find names in text
  • Go over example code step by step
  • Test it and see how it works with real examples

I have used this successfully in various projects and it works very well.

We will go point by point on getting you up and running in less than 5 minutes of work.

This is a complete guide and should cover all your questions on How To Extract Human Names Using Python spaCy.

All code associated with this can be found in the GitHub page here.

Why Is spaCy Useful For Named Entity Recognition

There are several reasons why you may want to use spaCy for named entity recognition. I will list them below so you can see whether they match your use case and whether spaCy may be helpful for your project. That said, if you came to this article I’m sure you have your own reasons or are just curious. If I missed something in this list, please let me know in the comments below; I would be curious to know.

  • It’s a complete framework that has all the code written for you to detect names in text using Python
  • It’s supported in Python which your code base may be already written in
  • It supports multiple languages
  • It has trained sample data so you may not have to do your own training
  • It’s a mature project and has a good track record
  • Named entity detection accuracy is very good, especially if you tokenize your text well
  • It lets you override named entity labels, for example when a detected name actually represents an organization/company instead. spaCy lets you relabel it effortlessly and update your code and models.
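As a quick illustration of that last point, here is a minimal sketch of relabeling an entity. It uses a blank English pipeline and a hand-built Span so it runs without any downloaded model; with a real trained model the span would come from the detected doc.ents instead, and the sample sentence is just an assumption for the demo.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline so no trained model download is needed for this sketch.
nlp = spacy.blank("en")
doc = nlp("Alex Smith works at Smith Consulting.")

# Suppose a model tagged tokens 0-2 ("Alex Smith") as a PERSON, but in
# our domain it is actually a company name: override the label to ORG.
doc.ents = [Span(doc, 0, 2, label="ORG")]

print([(ent.text, ent.label_) for ent in doc.ents])
```

Assigning to doc.ents like this is the documented way to overwrite the entity annotations on a parsed document.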

If you find yourself in one of the above categories, please read on as I will try to simplify the process of installing and using it.

Setting Up Environment For spaCy

We will start by setting up your environment to run spaCy with language packs. If you have already done this, you can either skim through this section or skip straight to the implementation.

Creating A Virtual Environment For spaCy

The first thing we need to do is set up a virtual environment so spaCy doesn’t conflict with our system Python packages. This keeps things clean and will let you quickly deploy it in containers and other installations later on if you need to.

We will be creating a virtual environment using the virtualenv command, but you can also use conda for this. All code examples can be found in the GitHub repo here and use Python 3.

$ virtualenv venv
created virtual environment CPython3.9.10.final.0-64 in 238ms
creator CPython3Posix(dest=/Users/alex/code/unbiased/python-extract-human-names-spacy/venv, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/alex/Library/Application Support/virtualenv)
added seed packages: pip==22.0.4, setuptools==60.9.3, wheel==0.37.1
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
$ source venv/bin/activate

Once the virtual environment is created and activated, we can proceed to the next step: adding the necessary packages for spaCy to work.

Installing Python Packages For spaCy

To simplify the process, I included a requirements file in the GitHub repo that you can use directly for the installation.

$ pip install -r requirements.txt
Collecting spacy
Downloading spacy-3.2.4.tar.gz (1.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 5.2 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... done
....
Created wheel for spacy: filename=spacy-3.2.4-cp39-cp39-macosx_12_0_arm64.whl size=6062668 sha256=c1000711155383e4a4a9878a3b1518607507fbb1665aa0cca6e2e5d68fb0e09c
Stored in directory: /Users/alex/Library/Caches/pip/wheels/ef/12/91/e0d58db0235263a450cc3a02e637f71125b80c02abfaa5b1a4
Successfully built spacy
Installing collected packages: wasabi, murmurhash, cymem, certifi, urllib3, typing-extensions, tqdm, spacy-loggers, spacy-legacy, smart-open, pyparsing, preshed, numpy, MarkupSafe, langcodes, idna, click, charset-normalizer, catalogue, typer, srsly, requests, pydantic, packaging, jinja2, blis, thinc, pathy, spacy
Successfully installed MarkupSafe-2.1.1 blis-0.7.7 catalogue-2.0.7 certifi-2021.10.8 charset-normalizer-2.0.12 click-8.0.4 cymem-2.0.6 idna-3.3 jinja2-3.1.1 langcodes-3.3.0 murmurhash-1.0.6 numpy-1.22.3 packaging-21.3 pathy-0.6.1 preshed-3.0.6 pydantic-1.8.2 pyparsing-3.0.7 requests-2.27.1 smart-open-5.2.1 spacy-3.2.4 spacy-legacy-3.0.9 spacy-loggers-1.0.2 srsly-2.4.2 thinc-8.0.15 tqdm-4.64.0 typer-0.4.1 typing-extensions-4.1.1 urllib3-1.26.9 wasabi-0.9.1

The main package defined above is spaCy, and installing it pulls in all of its dependencies, creating a long Python dependency list. If you want to pin those, you can use the following pip command to save the package versions and ensure compatibility with future installs.

$ pip freeze > requirements-deps.txt

Essentially this freeze lets you capture the versions of your dependencies in case they get updated and break backwards compatibility with your code. I always make a backup of these every time I start a project, even if I don’t end up using them, as a good practice so I can always go back to what I need without having to port code to newer versions.

Installing Language Packs For spaCy

The next step in the process is to install language packs for spaCy. As mentioned earlier, spaCy offers pre-trained models in several languages so you can get started with named entity recognition right away. This saves you from doing the training yourself, with the caveat that the models may not be tailored to your particular needs.

So if you have exceptions you want to add, you may need to extend the training models, or make a copy and adjust it to your needs.

In this case we will be installing the English and Spanish trained data which has been gathered and trained from web and news sources online.

Installing English Language Training Model For spaCy

We start with the English trained model, one of the core packages spaCy offers; in this case it was trained on web text. I find it a pretty good base to extend and train further later on.

As discussed earlier, this may not cover your exact needs, but I still recommend it for all the scenarios where you don’t want to do your own training.

To install it we simply invoke the download functionality of the spacy module in Python and specify the model name, which in this case is the English trained data model en_core_web_sm.

$ python -m spacy download en_core_web_sm
Collecting en-core-web-sm==3.2.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.9/13.9 MB 10.8 MB/s eta 0:00:00
....
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

This concludes the setup of the English language model. To ensure it works, call the spacy.load function in your Python code and check that it loads the English training model by name.
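A quick way to do that check, as a sketch: spacy.util.is_package reports whether a model package is installed, and spacy.load then returns the pipeline itself.

```python
import spacy

# Sanity check: is the English model package installed, and does it load?
if spacy.util.is_package("en_core_web_sm"):
    nlp = spacy.load("en_core_web_sm")
    print("Loaded components:", nlp.pipe_names)
else:
    print("en_core_web_sm is not installed yet")
```

If the model is missing, spacy.load would raise an OSError, which is why checking first gives a friendlier message.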

Installing Spanish Language Training Model For spaCy

As with the English pack, we will now install the Spanish pack. This one was gathered mainly from news sources, again scraped from online content. As with English, you can clone the trained model and make any modifications or improvements you want.

$ python -m spacy download es_core_news_sm
Collecting es-core-news-sm==3.2.0
Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.2.0/es_core_news_sm-3.2.0-py3-none-any.whl (14.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 10.2 MB/s eta 0:00:00
.....
Successfully installed es-core-news-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')

Find Names In Text Using spaCy

Code To Extract Human Names In Python spaCy

Now that we have our environment set up with two languages, we can proceed to writing the code that extracts names from text using Python spaCy. As a first example, we will see how to do this in English with the trained model we previously downloaded.

The sample code below will achieve this for us.

import spacy

english_nlp = spacy.load('en_core_web_sm')

text = '''
This is a sample text that contains the name Alex Smith who is one of the developers of this project.
You can also find the surname Jones here.
'''

spacy_parser = english_nlp(text)

for entity in spacy_parser.ents:
    print(f'Found: {entity.text} of type: {entity.label_}')

The things we need to note in the code above are:

  • We load the language model at the very top; this will be used as our NLP (Natural Language Processing) engine.
  • Then we have some sample text; this can come from any source you want, just save it as a string.
  • Then we initialize our English NLP spaCy engine with the text.
  • Finally we iterate over the detected entities to see what types it finds.
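Often you only want the person names rather than every entity type. Below is a minimal sketch of that filtering step; it uses a blank pipeline with an EntityRuler as a stand-in for en_core_web_sm so it runs without the downloaded model, and the sample names are assumptions for the demo. With the real model you would keep the english_nlp loader from above.

```python
import spacy

# Stand-in pipeline: an EntityRuler on a blank model tags our sample
# names, so this sketch runs without downloading en_core_web_sm.
nlp = spacy.blank("en")
nlp.add_pipe("entity_ruler").add_patterns([
    {"label": "PERSON", "pattern": "Alex Smith"},
    {"label": "ORG", "pattern": "Acme Corp"},
])

doc = nlp("Alex Smith signed a contract with Acme Corp.")

# Keep only the PERSON entities and drop organizations, dates, etc.
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(people)
```

The same list comprehension works unchanged on the output of the trained model.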

Testing Code To Extract Human Names In Python spaCy

Now that we covered how the code works let’s go ahead and execute this and see what it will output for us in terms of identifying names in the long string we provided.

$ python ./find_human_names_text_english.py
Found: Alex Smith of type: PERSON
Found: Jones of type: PERSON

As can be seen above, the code found two instances:

  • Alex Smith, which is of type PERSON, meaning it is a full name
  • Jones, which is just a surname and again of type PERSON

This concludes finding all the names in our string. Do note that sometimes spaCy may mislabel a person as an organization entity, in which case you would need to either tweak the tokenization or improve the trained model.

Since training is beyond the scope of this article and may be a topic of its own, I will link to the official page, which has some example code that may help you train your own models. You can find more information here.

Extract Names Using Python In Spanish

Similar to extracting names in English you can do so in Spanish. In this section we will go over this process and test it out to see how the output would look.

One thing to keep in mind here is that the Spanish training model we will use below is derived from online news articles rather than general web sources. While this may not affect you in most cases, since it is specific to a niche it may not be as accurate as a general web model like the English one.

To further improve this you can refer to the same training model link I provided earlier.

Code To Find Spanish Names In Python spaCy

import spacy

spanish_nlp = spacy.load('es_core_news_sm')

text = '''
El nombre de esta persona es John Smith y tiene una amiga llamada Joanna Williams.
'''

spacy_parser = spanish_nlp(text)

for entity in spacy_parser.ents:
    print(f'Found: {entity.text} of type: {entity.label_}')

The code is pretty much identical to the English variation we described earlier, the only difference being that it loads the Spanish trained model instead of the English one.

Testing Code To Find Spanish Names In Python spaCy

Again, testing the code is very similar to the English variant; the only difference is that a person is tagged PER rather than PERSON. In essence, however, both labels refer to a person’s name.

$ python ./find_human_names_text_spanish.py
Found: John Smith of type: PER
Found: Joanna Williams of type: PER
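Since the English model tags people as PERSON while the Spanish one uses PER, a small label set lets one extraction function serve both languages. The sketch below demonstrates this with a blank Spanish pipeline plus an EntityRuler standing in for es_core_news_sm, so it runs without the downloaded model; the function name and sample sentence are assumptions for the demo.

```python
import spacy

# English models use PERSON, Spanish models use PER; accept both.
PERSON_LABELS = {"PERSON", "PER"}

def extract_people(doc):
    """Return the text of every person entity, whichever model tagged it."""
    return [ent.text for ent in doc.ents if ent.label_ in PERSON_LABELS]

# Stand-in pipeline so the sketch runs without es_core_news_sm.
nlp = spacy.blank("es")
nlp.add_pipe("entity_ruler").add_patterns(
    [{"label": "PER", "pattern": "Joanna Williams"}]
)
doc = nlp("Tiene una amiga llamada Joanna Williams.")
print(extract_people(doc))
```

With the real models you would pass documents produced by en_core_web_sm or es_core_news_sm to the same function.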

As simple as that: without writing any boilerplate code or including a long list of names, we have demonstrated that finding names in both English and Spanish with the stock spaCy models is easy.

Extract Names From File Using spaCy

Extracting names from a file using Python is very similar to what we did before. Since Python provides simple functionality to read a file we simply need to open the file and save the output to a string.

Once the output is saved to a string we can use the code we demonstrated earlier and reference that string instead of the sample that we used.

with open('filename.txt', 'r') as input_file:
    text = input_file.read()

In the example above we simply read the contents of a file called filename.txt and save it into the same variable called text that we can later pass on to our English or Spanish NLP loader.
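Putting the two pieces together, here is a sketch: write sample text to a temporary file, read it back, and scan it for names. Again an EntityRuler on a blank pipeline stands in for the trained model so the snippet runs without downloading en_core_web_sm, and the temporary file takes the place of the hypothetical filename.txt.

```python
import os
import tempfile

import spacy

# Stand-in pipeline so the sketch runs without a downloaded model.
nlp = spacy.blank("en")
nlp.add_pipe("entity_ruler").add_patterns(
    [{"label": "PERSON", "pattern": "Alex Smith"}]
)

# Create a small text file to stand in for filename.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("A report written by Alex Smith last Tuesday.")
    path = f.name

# Read the file contents into a string, then run the extraction.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
os.unlink(path)

doc = nlp(text)
print([ent.text for ent in doc.ents])
```

In a real project you would simply open your own file and pass the string to the loaded English or Spanish model.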

Conclusion

We were able to successfully go over How To Extract Human Names Using Python spaCy. Hopefully I answered any questions you may have had and helped you get started on your Python name-finding project.

If you found this useful and you think it may have helped you please drop me a cheer below I would appreciate it.

If you have any questions or comments, please post them below or send me a note on my Twitter. I check periodically and try to answer them in the order they come in. Also, if you spot any mistakes, please let me know and I will update the article.

Which is your favorite way of finding names in Python text?

I personally use a combination of libraries such as spaCy and NLTK. In this guide I covered How To Extract Human Names Using Python spaCy; in an upcoming guide I will cover the same task using NLTK.

If you would like to dig deeper, you can visit the official spaCy documentation here.

If you would like to find more Python related articles please check the links below: