Programming Across Disciplines

Named Entity Recognition (NER)

Overview

Entity extraction, also known as Named Entity Recognition (NER), is a set of techniques we can use in Python to locate named entities mentioned in unstructured text (phrases, sentences, paragraphs, articles, documents, etc.) and classify them into pre-defined categories. These categories can include the names of persons, organizations, and locations, as well as time expressions, quantities, monetary values, and percentages. Entity extraction is an important part of text processing: it helps in understanding text and extracting relevant information from it.

Concept: Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics that aims to enable computers to understand, interpret, and generate human language in a meaningful way. In Python, NLP involves using libraries and tools such as spaCy to process and analyze large amounts of text data. Typical tasks include sentiment analysis, language translation, named entity recognition, and chatbot development. For beginners in Python, exploring NLP means learning how to use these libraries to extract insights and patterns from text, automate tasks that involve natural language data, and build applications that interact with users in more natural and intuitive ways. Through NLP, Python programmers can bridge the gap between human communication and digital data processing in areas such as data analysis, web development, and artificial intelligence.

Concept: Named Entity Recognition (NER)

Named Entity Recognition (NER) is a key component of Natural Language Processing (NLP) that involves identifying key pieces of information (entities) in text and classifying them into predefined categories such as the names of people, organizations, locations, and dates. For beginners in Python, learning NER means exploring how to automatically scan entire articles or documents and highlight important information, which simplifies data extraction for analysis and helps automate data entry. Python libraries like spaCy provide easy-to-use tools for implementing NER, so you can quickly start experimenting with text analysis. With NER, you can build applications that process large volumes of text, which makes it a valuable skill for projects such as automated content tagging, improved search, and content recommendations based on extracted entities.

Concept: Named Entities

Named entities are specific pieces of information in a text that are recognized and assigned to predefined categories such as names of people, places, and organizations, dates, and monetary values. In Natural Language Processing (NLP) with Python, libraries such as spaCy can identify and classify these pieces of information automatically. This process, known as Named Entity Recognition (NER), is a fundamental step in extracting meaning from natural language data, and it makes applications like content classification, information retrieval, and data analysis more efficient. For beginners in Python-based NLP, named entity extraction is a core skill for analyzing and interpreting large amounts of textual data.
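To make the idea concrete, a named entity can be thought of as a pair: the matched text and its category label. The sketch below is only an illustration under that assumption. The `Entity` class and `group_by_label` helper are hypothetical names, the sample data is hand-written rather than produced by a model, and the labels (PERSON, GPE, DATE) follow the naming used by spaCy's English models.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One named entity: the matched text plus its category label."""
    text: str
    label: str


def group_by_label(entities):
    """Map each label to the list of entity texts carrying it, in input order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.label].append(entity.text)
    return dict(grouped)


# Hand-written sample data, not the output of any model.
sample = [
    Entity("Charles Dickens", "PERSON"),
    Entity("London", "GPE"),
    Entity("Paris", "GPE"),
    Entity("1859", "DATE"),
]

grouped = group_by_label(sample)
print(grouped)
```

Grouping by label like this is a common first step once entities have been extracted, for example to list every place mentioned in a document.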

The spaCy Library

spaCy is one of the most popular Python libraries for natural language processing (NLP). It is designed for production use, with an emphasis on speed and accuracy, which makes it well suited to large-scale information extraction. spaCy provides pre-trained models for multiple languages and supports tasks like tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and text classification. Its API is streamlined and intuitive, making it accessible to users who are new to NLP while remaining powerful enough for advanced users.
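As a small taste of the API, the sketch below tokenizes a sentence and reads entities from the resulting document. It assumes spaCy itself is installed but deliberately avoids a pre-trained model: it uses a blank English pipeline plus spaCy's rule-based `entity_ruler` component as a stand-in for statistical NER, so the two patterns are hand-written assumptions and the result is fully predictable. A pre-trained model would find entities without any patterns.

```python
import spacy

# A blank English pipeline: tokenizer only, no model download required.
nlp = spacy.blank("en")

# Rule-based stand-in for the statistical NER component of a trained model.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Charles Dickens"},
    {"label": "GPE", "pattern": "London"},
])

doc = nlp("Charles Dickens lived in London.")

# Every Doc is a sequence of tokens; entities are spans over those tokens.
tokens = [token.text for token in doc]
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(tokens)
print(entities)
```

The same `doc.ents`, `ent.text`, and `ent.label_` attributes are available when the document comes from a pre-trained pipeline such as `en_core_web_sm`.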

Google Colab

If you are using Google Colab, the spaCy library is already installed and available for use in any notebook, so you can go straight to the code examples below.

IDEs like Visual Studio Code, PyCharm, or Others

If you are using an IDE like Visual Studio Code, PyCharm, or others, you'll need to install the spaCy library before you can use it. The common approach is to use the pip package manager in the terminal. Open a terminal, install the library, and then download one of the language models:

# First use pip to install the library
pip install spacy
# Then install a language model. You can choose either the small model or large model, like this:
# Use the following if you want the small language model ...
python -m spacy download en_core_web_sm
# ... or the following if you want the large language model ...
python -m spacy download en_core_web_lg
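
To confirm that the installation worked, one option is to ask Python whether it can locate the packages. This is a minimal sketch that relies on the fact that spaCy models downloaded this way are installed as ordinary Python packages; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


# Report the library and both model packages mentioned above.
for name in ("spacy", "en_core_web_sm", "en_core_web_lg"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Only one of the two models needs to be reported as installed for the examples that follow.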

Official Documentation

For detailed documentation on spaCy, see the spaCy usage page.

Once you have spaCy and a language model installed, you can start using spaCy in your code. See the following section for some examples.

spaCy Examples

Code

import spacy
from spacy import displacy


def load_doc(file_path):
    """Read a UTF-8 text file and process it with the small English model."""
    nlp = spacy.load("en_core_web_sm")
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return nlp(text)


def perform_ner_and_visualize(file_path):
    """Highlight every named entity found in the file."""
    doc = load_doc(file_path)
    # jupyter=True displays the markup inline in a notebook such as Colab.
    displacy.render(doc, style='ent', jupyter=True)


def visualize_sentence_dependencies(file_path):
    """Draw the dependency parse of the first sentence only."""
    doc = load_doc(file_path)
    first_sentence = next(doc.sents)
    displacy.render(first_sentence, style='dep', jupyter=True, options={'distance': 100})


def visualize_all_sentence_dependencies(file_path):
    """Draw the dependency parse of every sentence, one diagram per sentence."""
    doc = load_doc(file_path)
    for sentence in doc.sents:
        displacy.render(sentence, style='dep', jupyter=True, options={'distance': 100})
        print("\n" + "-"*80 + "\n")  # visual separator between diagrams


if __name__ == "__main__":
    file_path = "twocities.txt"
    perform_ner_and_visualize(file_path)
    # visualize_sentence_dependencies(file_path)
    visualize_all_sentence_dependencies(file_path)

Output

*Figure 1: Result of Named Entity Recognition (NER) on the twocities.txt file*

*Figure 2: Result of Visualizing Sentence Dependency on the Title of the Book*

*Figure 3: Result of Visualizing Sentence Dependency on the First Sentence*

*Figure 4: Result of Visualizing Sentence Dependency on the Second Sentence*

*Figure 5: Result of Visualizing Sentence Dependency on the Third Sentence*
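
The dependency functions above pass `jupyter=True`, which displays the diagrams inline in a notebook. Outside a notebook, one option is to pass `jupyter=False` so that `displacy.render` returns the markup as a string, and then save it to a file that can be opened in a browser. The sketch below assumes that workflow. To stay runnable without a model download, it uses a blank pipeline with hand-written `entity_ruler` patterns in place of `en_core_web_sm`, and the sentence and the output file name `entities.html` are hypothetical.

```python
from pathlib import Path

import spacy
from spacy import displacy

# Blank pipeline with hand-written patterns, standing in for a trained model.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "London"},
    {"label": "GPE", "pattern": "Paris"},
])
doc = nlp("It was a tale of London and Paris.")

# jupyter=False makes render() return the markup instead of displaying it;
# page=True wraps the markup in a complete HTML page.
html_markup = displacy.render(doc, style="ent", page=True, jupyter=False)

output_path = Path("entities.html")  # hypothetical output file name
output_path.write_text(html_markup, encoding="utf-8")
print(f"Wrote {len(html_markup)} characters to {output_path}")
```

Opening the saved file in a browser shows the same highlighted-entity view that a notebook would display inline.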
