Find Names In Text Using JavaScript

Introduction

This article goes over how to find names in text using JavaScript.

Did you know that you can use JavaScript/NodeJS to find names in a text buffer?

We will break this down into the following sections:

  • Why Javascript is suitable for finding names in text
  • How to implement finding names using Javascript with an example
  • How to find names in a file using Javascript

I have used this successfully in various projects, and it works very well and has saved me a ton of trouble and time debugging things.

We will go point by point to get you up and running in less than 5 minutes. Some background programming knowledge in JavaScript is helpful if you want to fine-tune the code, but you don’t need it to find names in a text buffer.

This complete guide should cover all your questions on using Javascript to read names from a text file or string.

All the code and examples for this article can be found in the linked GitHub repo.

Why Use Javascript To Find Names in Text Or File

In this section, I’d like to cover some reasons why Javascript is useful for finding names in text/strings or files. This may or may not apply to your use case, so feel free to skip this section if you know what you are doing and want to get into the coding part.

  • Javascript allows you to find names in a text string programmatically so you can automate the whole process
  • Javascript lets you set conditions such as looking for specific names in your code
  • You can add post-processing steps, such as rejecting or approving files depending on whether they contain the names you are looking for
  • You can implement privacy compliance code to check whether any text/buffer contains personally identifiable information (GDPR/HIPAA/CCPA)
  • You can use this to look for specific files that have the data you need, which helps with data classification

The list above is by no means complete, but it explains why some people may want to use Javascript to find names in a text buffer programmatically.

How To Setup Javascript For Finding Names In Text

How To Install Node and Yarn

The first step is to set up the NodeJS environment that we will use to run our application. If you don’t have NodeJS and Yarn installed on your system, I recommend following their official installation guides first.

If you have those installed in your system you can check the versions that you have by running the following commands:

$ node -v
v18.7.0

$ yarn -v
1.22.19

To ensure compatibility with this guide, I recommend using at least the versions listed above. This will let you copy commands and code directly from the Git repo and this article.
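If you would rather check a version from code than by eye, a minimal sketch is shown below. The helper name meetsMinimumVersion is my own and is not part of NodeJS or Yarn; it assumes plain major.minor.patch version strings like the ones printed above.

```typescript
// Hypothetical helper: compares two "major.minor.patch" version strings.
// Assumes numeric components; a leading "v" (as printed by `node -v`) is ignored.
function meetsMinimumVersion(current: string, minimum: string): boolean {
    const toParts = (version: string): number[] =>
        version
            .replace(/^v/, "")
            .split(".")
            .map((part) => Number.parseInt(part, 10) || 0);

    const cur = toParts(current);
    const min = toParts(minimum);

    // Compare major, then minor, then patch; the first difference decides.
    for (let i = 0; i < 3; i++) {
        const a = cur[i] ?? 0;
        const b = min[i] ?? 0;
        if (a !== b) return a > b;
    }
    return true; // identical versions satisfy the minimum
}

console.log(meetsMinimumVersion("v18.7.0", "18.7.0"));  // true
console.log(meetsMinimumVersion("v16.20.2", "18.7.0")); // false
```

In a real script you could pass process.version as the first argument to check the running NodeJS version.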

How To Install NodeJS Library Dependencies For spaCy

The first step is to make a directory and initialize our Yarn project there. To do this we call the yarn init command and pass -2 so that it uses the version 2+ project structure (the output below shows Yarn 3.2.2 being downloaded), which is faster and can parallelize package installation.

$ yarn init -2
➤ YN0000: Retrieving https://repo.yarnpkg.com/3.2.2/packages/yarnpkg-cli/bin/yarn.js
➤ YN0000: Saving the new release in .yarn/releases/yarn-3.2.2.cjs
➤ YN0000: Done in 0s 330ms
{
  name: 'nodejs-spacy-extract-names-from-text',
  packageManager: 'yarn@3.2.2'
}

This initializes our repo and creates a baseline from which we can start installing the packages used in this project. The next step is to install the two packages we need:

  • TypeScript
  • The spaCy NodeJS library (spacy-nlp)

The command below installs both:

$ yarn add typescript spacy-nlp
➤ YN0000: ┌ Resolution step
....
➤ YN0000: └ Completed in 8s 638ms
➤ YN0000: ┌ Link step
➤ YN0000: └ Completed in 0s 270ms
➤ YN0000: Done with warnings in 11s 154ms

How To Install Development Dependencies For Typescript/NodeJS

As shown above, both package dependencies installed successfully. Now we can install the development dependencies for our app. The most important one is the Node type definitions (@types/node), so that code completion works in Visual Studio Code. We will also use ts-node, which runs TypeScript files directly without a separate compile step.

This can be seen in the command below:

$ yarn add -D @types/node ts-node
➤ YN0000: ┌ Resolution step
➤ YN0000: └ Completed in 2s 565ms
....
➤ YN0000: └ Completed in 0s 586ms
➤ YN0000: ┌ Link step
➤ YN0000: └ Completed
➤ YN0000: Done in 3s 179ms

How To Configure TSC For NodeJS

Finally, we need to initialize the TypeScript compiler configuration with some options that are good to have. The command below also defines our source and output directories.

$ yarn tsc --init --rootDir src --outDir ./bin --esModuleInterop --lib ES2022 --module commonjs --noImplicitAny true
Created a new tsconfig.json with:
  target: es2016
  module: commonjs
  lib: es2022
  outDir: ./bin
  rootDir: src
  strict: true
  esModuleInterop: true
  skipLibCheck: true
  forceConsistentCasingInFileNames: true

Notice that we also specified the TypeScript library version with --lib. At the time of writing the latest one is ES2022, so adjust accordingly depending on when you are reading this. The compile target defaults to ES2016, so the output is compatible with a wide range of JavaScript runtimes.

Now that all of our dependencies and libraries are installed we can proceed into implementing some code.

How To Install spaCy Python Dependencies

The NodeJS spaCy library relies on Python to work properly. More specifically, it needs the following packages:

  • Python spaCy
  • Pre-trained spaCy model

For the model we will use the English model that was pre-trained with web data. For the Python dependencies I have created a requirements file that’s included in the Github repo for this article which you can find linked below.

Let’s go ahead and execute the commands to install our Python dependencies to get spaCy NodeJS to work.

$ virtualenv venv
created virtual environment CPython3.9.13.final.0-64 in 151ms
  creator CPython3Posix(dest=/Users/alex/code/unbiased/nodejs-spacy-extract-names-from-text/venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/alex/Library/Application Support/virtualenv)
    added seed packages: pip==21.3.1, setuptools==58.3.0, wheel==0.37.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

$ source venv/bin/activate

$ pip install -r requirements.txt
Collecting socketIO-client
  Using cached socketIO-client-0.7.2.tar.gz (23 kB)
  Preparing metadata (setup.py) ... done
Collecting spacy
  Downloading spacy-3.4.1-cp39-cp39-macosx_11_0_arm64.whl (6.4 MB)
     |████████████████████████████████| 6.4 MB 2.9 MB/s
Collecting requests>=2.7.0
  Using cached requests-2.28.1-py3-none-any.whl (62 kB)
....
Installing collected packages: typing-extensions, numpy, murmurhash, cymem, click, catalogue, wasabi, urllib3, typer, srsly, smart-open, pyparsing, pydantic, preshed, MarkupSafe, idna, charset-normalizer, certifi, blis, websocket-client, tqdm, thinc, spacy-loggers, spacy-legacy, six, requests, pathy, packaging, langcodes, jinja2, spacy, socketIO-client
Successfully installed MarkupSafe-2.1.1 blis-0.7.8 catalogue-2.0.8 certifi-2022.6.15 charset-normalizer-2.1.1 click-8.1.3 cymem-2.0.6 idna-3.3 jinja2-3.1.2 langcodes-3.3.0 murmurhash-1.0.8 numpy-1.23.2 packaging-21.3 pathy-0.6.2 preshed-3.0.7 pydantic-1.9.2 pyparsing-3.0.9 requests-2.28.1 six-1.16.0 smart-open-5.2.1 socketIO-client-0.7.2 spacy-3.4.1 spacy-legacy-3.0.10 spacy-loggers-1.0.3 srsly-2.4.4 thinc-8.1.0 tqdm-4.64.0 typer-0.4.2 typing-extensions-4.3.0 urllib3-1.26.12 wasabi-0.10.1 websocket-client-1.3.3

$ python3 -m spacy download en_core_web_md
Collecting en-core-web-md==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.0/en_core_web_md-3.4.0-py3-none-any.whl (42.8 MB)
     |████████████████████████████████| 42.8 MB 3.5 MB/s
....
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')

As shown above we performed three operations:

  • Created and activated a virtual Python environment so our packages don’t get installed globally on the system
  • Installed the packages from the requirements file, as discussed earlier
  • Installed the pre-trained model that spaCy will use for our NLP code

How To Start NodeJS spaCy Server

Before implementing our code we need to set up the NodeJS spaCy server. This server is responsible for communicating with Python behind the scenes and doing all of our NLP work. After that, we will use the spaCy NodeJS client library below to do the parsing.

The first thing we need to do here is write a bit of code that starts the server. This server is simply a socket.io frontend that accepts connections, parses the requests, and executes them.

The code for this looks like this:

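// Load the spacy-nlp module, which provides the socket.io server
const NLP = require("spacy-nlp");
// Start the server on the port given in the IOPORT environment variable
const spacySocketIO = NLP.server({ port: process.env.IOPORT });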

What this does is basically two things:

  • It loads the spaCy node module that we will be using to start our server
  • It starts the server by invoking the helper function called server, which optionally takes a port to bind to. In our case we read this from an environment variable, as demonstrated below.
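Because process.env.IOPORT is undefined when the variable is not set, it can help to validate the value before starting the server. The sketch below is one way to do that. resolvePort is a hypothetical helper of my own (it is not part of spacy-nlp), and the fallback of 6466 is the default port mentioned in this article.

```typescript
// Hypothetical helper, not part of spacy-nlp: turns an environment value into a usable port.
const DEFAULT_PORT = 6466; // the default port this article says the client connects to

function resolvePort(raw: string | undefined, fallback: number = DEFAULT_PORT): number {
    // Unset or blank variable: use the fallback.
    if (raw === undefined || raw.trim() === "") return fallback;

    const port = Number(raw);
    // Reject non-integers and values outside the valid TCP port range.
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
        throw new Error(`Invalid IOPORT value: ${raw}`);
    }
    return port;
}

console.log(resolvePort("6466"));    // 6466
console.log(resolvePort(undefined)); // 6466
```

The result could then be passed to the server call, for example as NLP.server({ port: resolvePort(process.env.IOPORT) }).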

So the next step is to go ahead and execute our socket IO server so it can start.

$ IOPORT=6466 yarn ts-node nodejs-start-spacy.ts
[Wed Aug 30 2022 20:24:47 GMT-0500] INFO Starting poly-socketio server on port: 6466, expecting 1 IO clients
[Wed Aug 30 2022 20:24:47 GMT-0500] INFO Starting socketIO client for python3 at 6466
[Wed Aug 30 2022 20:24:48 GMT-0500] INFO cgkb-py oCePz1eLsZeJgNKMAAAA joined, 0 remains
[Wed Aug 30 2022 20:24:48 GMT-0500] INFO All 1 IO clients have joined
[Wed Aug 30 2022 20:24:53 GMT-0500] INFO global-client-js wWoKktQrBPHZJqF6AAAB joined, -1 remains
[Wed Aug 30 2022 20:24:53 GMT-0500] INFO All 1 IO clients have joined
[Wed Aug 30 2022 20:24:55 GMT-0500] INFO wWoKktQrBPHZJqF6AAAB left
[Wed Aug 30 2022 20:25:00 GMT-0500] INFO Exit: killed ioClient.js children

Notice above that I pass IOPORT as 6466 to the NodeJS spaCy server, which is the default port the client connects to. The output also shows that a client connected and successfully performed a request. This client runs the code we will go over below, which does named entity recognition for names.

How To Find Names In Text Using Javascript

The next step is to implement some code that will find names in a text/string buffer using JavaScript. We will be leveraging the spaCy library that we previously installed, which lets us do the following three things:

  • Connect to the socketIO server we previously created
  • Send in the named entity request
  • Parse the results and process to only show the names

The code that implements this is the following:

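// Sample text containing three names: Alex, Bob and Victor
const textToSearch = 'Alex walked down the street to find his cat Bob, which is friends with his dog Victor';

(async () => {
    // Load the spacy-nlp client, which talks to the socket.io server started earlier
    const spacyNLP = require("spacy-nlp");
    const nlp = spacyNLP.nlp;
    // Parse the text so each token is tagged with its named-entity type
    const result = await nlp.parse(textToSearch);
    console.log('Looking for names in buffer: ', textToSearch);
    // Print only the tokens tagged as PERSON
    result[0]['parse_list'].forEach((element: any) => {
        if (element.NE === 'PERSON')
            console.log('Found name: ', element.word);
    });
})();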

This code works as follows:

  • We first define a sample text in which we want to look up names. In the example above our sentence contains 3 names:
    • Alex
    • Bob
    • Victor
  • We load the spacy NLP library into memory
  • We acquire a spacy object that we will be using to perform the parsing
  • Then we invoke the important function, parse, which is responsible for steps such as:
    • Tokenizing
    • Word splitting
    • Classifying words into entities
  • Finally, we take the result and look for named entities of type PERSON, which is basically a name. Once we find one we print it out and continue until we exhaust our results
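The filtering step described above can be separated from the network call, which makes it easier to reuse and to try out without a running server. The sketch below assumes the result shape used in the code above (an array whose first element has a parse_list of tokens with word and NE fields). The interface and function names are my own, and the sample data is hand-written rather than real spaCy output.

```typescript
// Shapes inferred from the fields the article's code reads; any other fields
// returned by the library are intentionally left out.
interface ParsedToken {
    word: string;
    NE: string; // named-entity label such as "PERSON"; assumed empty for non-entities
}

interface ParsedText {
    parse_list: ParsedToken[];
}

// Hypothetical helper: collects the words labelled PERSON from every parsed text.
function extractPersonNames(results: ParsedText[]): string[] {
    const names: string[] = [];
    for (const text of results) {
        for (const token of text.parse_list) {
            if (token.NE === "PERSON") {
                names.push(token.word);
            }
        }
    }
    return names;
}

// Hand-written sample that mimics the assumed result shape (not real library output).
const sample: ParsedText[] = [
    {
        parse_list: [
            { word: "Alex", NE: "PERSON" },
            { word: "walked", NE: "" },
            { word: "to", NE: "" },
            { word: "find", NE: "" },
            { word: "Bob", NE: "PERSON" },
            { word: "and", NE: "" },
            { word: "Victor", NE: "PERSON" },
        ],
    },
];

console.log(extractPersonNames(sample));
```

Returning an array instead of printing inside the loop lets the caller decide what to do with the names, such as counting them or comparing them against a list.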

If we execute the code we will see the following output:

$ yarn ts-node nodejs-find-names-in-text.ts
Looking for names in buffer:  Alex walked down the street to find his cat Bob, which is friends with his dog Victor
Found name:  Alex
Found name:  Bob
Found name:  Victor

Executing the code above shows that it accurately identified all the names in the sample text. This demonstrates how to find names in a string/text buffer using JavaScript.
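The sample sentence only contains single-word names. With full names such as "John Smith", a token-by-token loop would presumably report each word on its own, since the result is a list of individual tokens. A possible refinement, sketched below under that assumption, joins consecutive PERSON tokens into one name. The function name and the sample tokens are hypothetical.

```typescript
// Token shape assumed from the fields used earlier in this article.
interface ParsedToken {
    word: string;
    NE: string; // named-entity label such as "PERSON"
}

// Hypothetical helper: joins runs of adjacent PERSON tokens into full names.
// Limitation: two different people mentioned back to back would be merged.
function groupPersonNames(tokens: ParsedToken[]): string[] {
    const names: string[] = [];
    let current: string[] = [];

    for (const token of tokens) {
        if (token.NE === "PERSON") {
            current.push(token.word);
        } else if (current.length > 0) {
            // A non-PERSON token ends the current run.
            names.push(current.join(" "));
            current = [];
        }
    }
    // Flush a name that ends exactly at the end of the text.
    if (current.length > 0) names.push(current.join(" "));
    return names;
}

// Hand-written tokens, not real library output.
const tokens: ParsedToken[] = [
    { word: "John", NE: "PERSON" },
    { word: "Smith", NE: "PERSON" },
    { word: "met", NE: "" },
    { word: "Alice", NE: "PERSON" },
    { word: ".", NE: "" },
];

console.log(groupPersonNames(tokens)); // [ 'John Smith', 'Alice' ]
```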

How To Extract Names From A File In Javascript

This section demonstrates how to extract names from a file using JavaScript and NodeJS. To do this we will again leverage the spaCy helper code we implemented, along with some extra code to open and read a file.

The code for this is shown below:

import * as fs from 'fs'; // needed for readFileSync
const textToSearch = fs.readFileSync('sample.txt', 'utf-8');

This works very much like the earlier example where we identified names in a text string. The difference is that we first read the file and convert its contents into a string buffer, and the rest of the script stays as before. If the file contents are very big, you will probably need to split them up before parsing. The easiest way to do this is to split on full stops (periods).

For more advanced cases, you would want to read and split the file incrementally, which avoids loading the entire file into memory and using up a lot of your system’s resources.
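One way to apply the full-stop idea is to split the text into sentences and then pack them into chunks below a size limit, so each parse request stays small. The sketch below is an assumption of mine rather than code from the repo. chunkBySentence and maxChars are hypothetical names, and splitting on every period is a simplification that will mis-split abbreviations such as "Dr. Smith" or numbers such as "3.14".

```typescript
// Hypothetical helper: splits text on full stops and groups the sentences
// into chunks of at most maxChars characters where possible.
function chunkBySentence(text: string, maxChars: number): string[] {
    // Each match is a run of non-period characters plus its closing period, if any.
    const matches: string[] = text.match(/[^.]+\.?/g) ?? [];
    const sentences = matches
        .map((sentence) => sentence.trim())
        .filter((sentence) => sentence.length > 0);

    const chunks: string[] = [];
    let current = "";
    for (const sentence of sentences) {
        const candidate = current === "" ? sentence : `${current} ${sentence}`;
        if (current === "" || candidate.length <= maxChars) {
            // A single sentence longer than maxChars is kept whole rather than cut.
            current = candidate;
        } else {
            chunks.push(current);
            current = sentence;
        }
    }
    if (current !== "") chunks.push(current);
    return chunks;
}

const text = "Alex walked home. Bob stayed. Victor slept.";
console.log(chunkBySentence(text, 30));
```

Each chunk could then be sent to the parser separately and the names from all chunks combined afterwards.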

The code above assumes you have a file called sample.txt in your running directory that contains a string that has names in it.
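Since readFileSync throws when the file is missing, a small guard can make that assumption explicit. The sketch below is only one way to handle it. readTextOrNull is a hypothetical helper name of my own, and it relies on the ENOENT error code that NodeJS reports for a missing file.

```typescript
import * as fs from "node:fs";

// Hypothetical helper: returns the file contents, or null when the file does not exist.
function readTextOrNull(path: string): string | null {
    try {
        return fs.readFileSync(path, "utf-8");
    } catch (err) {
        // ENOENT is the error code for "no such file or directory".
        if ((err as { code?: string }).code === "ENOENT") return null;
        throw err; // anything else (permissions, etc.) is still a real error
    }
}

const fileText = readTextOrNull("sample.txt");
if (fileText === null) {
    console.log("sample.txt was not found in the current directory");
} else {
    console.log(`Read ${fileText.length} characters from sample.txt`);
}
```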

Conclusion

We went over how to find names in text using JavaScript, with examples along the way. Hopefully, I answered any questions you may have and helped you get started on your quest to discover clear-text personally identifiable information in JavaScript.

Please drop me a cheer below if you found this helpful and think it may have helped you. I would appreciate it.

If you have any questions or comments, please post them below or send me a note on my Twitter. I check periodically and try to answer them in the order they come in. Also, if you have any corrections, please let me know, and I’ll update the article to fix any mistakes I made.

Would you consider using NodeJS to find names in a file?

I use this extensively in many projects when I want to do compliance checking against the latest standards. Personally identifiable information flying around unchecked may cause problems for everyone, so checking for it is worthwhile.
