[Natural Language Processing] Dealing with Text Data: Text Pre-processing

We generate large amounts of text data these days, and applying machine learning algorithms to extract meaningful insights from it is a goal for most organizations. In its raw form, however, text data is not very useful; it needs to be processed before insights can be extracted.

Natural language processing (NLP) is an interdisciplinary domain lying at the intersection of linguistics and computer science (specifically, artificial intelligence). NLP deals with two types of data, text and audio (speech); in this article we will focus on text data.

We generate exabytes of data every day, and text makes up a significant portion of it. With easier access to interactive web applications and high-bandwidth connectivity, we have been producing more and more text in recent years: social media posts, emails, text messages, forum threads, blogs, and so on. Because this content is user-generated and largely unstructured, it requires pre-processing before it can be used for further operations.

Why do we use text-preprocessing?

Text data comes from diverse sources: websites, social platforms, emails, literature and more. Each source differs in language, style, formatting, grammar and other attributes. Additionally, text data tends to contain noise: punctuation, numbers, emoticons and abbreviations are some examples. This variety and disorder can hurt both the performance and the accuracy of NLP models.

Text pre-processing aims to minimize noise and inconsistencies in text data, making it more structured and uniform. This lets NLP models focus on the meaningful information in the text rather than get sidetracked, improving both their efficiency and their accuracy. In this article, we will discuss the fundamental text pre-processing steps that make text data more usable, pairing each concept with a Python implementation for better understanding.

1. HTML Tag removal

The web generates tons of text data, and this text often contains HTML tags. These tags add no value to the text itself; they only enable proper rendering in the browser.

We will use the unprocessed text below to cover all the use cases discussed in this article. (The text is inspired by one of the reviews in the Amazon food review dataset; I have modified it to cover all the use cases.)

Why is this $12 when the same product is available for $10 here?<br />http://www.domain.com/product/dp/B00004RBDY<br />I don’t understand reason behind two different prices.<br />The Victor M380 and M502 traps are unreal, /of course, — total fly genocide. Pretty stinky, but only right nearby.

Here, to remove the HTML tags, we will use BeautifulSoup, a Python library for pulling data out of HTML and XML files.

Python code:
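A minimal sketch of this step, assuming the sample text above is stored in raw_text:

```python
from bs4 import BeautifulSoup

raw_text = (
    "Why is this $12 when the same product is available for $10 here?<br />"
    "http://www.domain.com/product/dp/B00004RBDY<br />"
    "I don't understand reason behind two different prices.<br />"
    "The Victor M380 and M502 traps are unreal, /of course, total fly genocide. "
    "Pretty stinky, but only right nearby."
)

def remove_html_tags(text):
    # Parse the markup and keep only the visible text; the separator
    # stops adjacent fragments from running together once tags are gone.
    return BeautifulSoup(text, "html.parser").get_text(separator=" ")

text = remove_html_tags(raw_text)
print(text)
```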

Output:

Why is this $12 when the same product is available for $10 here? http://www.domain.com/product/dp/B00004RBDY  I don’t understand reason behind two different prices. The Victor M380 and M502 traps are unreal, /of course, — total fly genocide. Pretty stinky, but only right nearby.

2. URL removal

URLs (Uniform Resource Locators) in a text are references to locations on the web but carry little additional meaning for most NLP tasks. We therefore remove them as well, using the re library, which provides regular-expression matching operations.

We take our sample text and examine each token, removing any word or string that starts with http.

Python code:
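A sketch of this step, assuming the HTML-cleaned text from the previous step is stored in text; the regular expression drops any token beginning with http:// or https://:

```python
import re

text = (
    "Why is this $12 when the same product is available for $10 here? "
    "http://www.domain.com/product/dp/B00004RBDY "
    "I don't understand reason behind two different prices."
)

def remove_urls(text):
    # \S+ consumes the rest of the token after the scheme,
    # so the whole URL is removed in one match.
    return re.sub(r"https?://\S+", "", text)

text = remove_urls(text)
print(text)
```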

Output:

Why is this $12 when the same product is available for $10 here?   I don’t understand reason behind two different prices. The Victor M380 and M502 traps are unreal, /of course, — total fly genocide. Pretty stinky, but only right nearby.

3. Expand contracted words

In everyday verbal and written communication, many of us contract common phrases: "you are" becomes "you're". Expanding contractions back into their full form makes the text more uniform and can surface more insights.

Python code:
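One self-contained way to sketch this is with a small hand-built mapping; the entries below are illustrative only, and dedicated libraries (such as the contractions package) cover far more forms:

```python
import re

# A small illustrative mapping; real projects would use a far larger one.
CONTRACTION_MAP = {
    "don't": "do not",
    "won't": "will not",
    "can't": "cannot",
    "you're": "you are",
    "it's": "it is",
}

def expand_contractions(text):
    pattern = re.compile(
        "|".join(re.escape(c) for c in CONTRACTION_MAP), re.IGNORECASE
    )
    # Look up each matched contraction (lower-cased) in the mapping.
    return pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

text = expand_contractions("I don't understand reason behind two different prices.")
print(text)
```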

Output:

Why is this $12 when the same product is available for $10 here?   I do not understand reason behind two different prices. The Victor M380 and M502 traps are unreal, /of course, — total fly genocide. Pretty stinky, but only right nearby.

4. Removing special characters

Special characters like - (hyphen) or / (slash) don't add any value, so we generally remove them. Exactly which characters to remove depends on the use case: if we are performing a task where currency plays no role (sentiment analysis, for example), we also remove the $ sign or any other currency symbol.

Here, we will also remove words that contain a number, like M380 in our example.

Python code:
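A sketch of this step; the remove_number parameter and its behaviour are an assumed signature for illustration, not part of any standard library:

```python
import re

def remove_special_characters(text, remove_number=True):
    if remove_number:
        # Drop whole words that contain a digit, e.g. "M380" or "$12".
        text = re.sub(r"\S*\d\S*", "", text)
    # Keep only letters, digits and whitespace; everything else goes.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Collapse the runs of spaces left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

text = remove_special_characters(
    "Why is this $12 when the same product is available for $10 here? "
    "The Victor M380 and M502 traps are unreal, /of course, total fly genocide."
)
print(text)
```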

Output:

Why is this when the same product is available for here I do not understand reason behind two different prices The Victor and traps are unreal of course total fly genocide Pretty stinky but only right nearby

In the above code, if we want to treat numbers as special characters, we just need to set remove_number to True.

5. Stopword removal

Apart from URLs, HTML tags and special characters, there are words that add little for tasks such as sentiment analysis or text classification. Words like I, me, you and he increase the size of the text data but rarely improve results, so it is a good idea to remove them.

For this task, we can use a pre-defined stopword collection (e.g. from NLTK or another NLP library), or we can define our own set of stopwords based on the task at hand.

We will also convert all the text to lower case, since string comparison in Python is case-sensitive.

Python code:
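A sketch using a hand-defined stopword set (with NLTK you would instead load stopwords.words("english") after downloading the stopwords corpus). Note that "not" is deliberately kept out of the set, since negations carry signal for sentiment analysis:

```python
# A small custom stopword set for illustration; "not" is excluded
# on purpose because negations matter for sentiment analysis.
STOPWORDS = {
    "i", "me", "you", "he", "she", "it", "is", "this", "that", "the",
    "a", "an", "and", "or", "but", "of", "for", "to", "in", "on",
    "when", "same", "are", "do", "only", "why", "here",
}

def remove_stopwords(text):
    # Lower-case first so "The" and "the" hit the same set entry.
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)

text = remove_stopwords(
    "Why is this when the same product is available for here "
    "I do not understand reason behind two different prices"
)
print(text)
```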

Output:

product available not understand reason behind two different prices victor traps unreal course total fly genocide pretty stinky right nearby

6. Lemmatization

Now that we have removed all the "noise" from the text, it is time to normalize the data set. A word in a text may exist in multiple forms, like stop and stopped (past participle) or price and prices (plural). Text normalization converts these variations into the root form of the word.

For this task, we will use the WordNetLemmatizer from the NLTK library.

Python code:

Output:

product available not understand reason behind two different price victor trap unreal course total fly genocide pretty stinky right nearby

And we are done with the fundamental text pre-processing steps. Depending on your use case and the characters present in your text, you can add further steps such as emoji removal or converting accented characters to their plain equivalents (é to e).

Looking at the output, you may think it looks nonsensical, but once we train a machine learning model on it, it will make sense. For instance, in a classification task, the model would learn which words, or combinations of words, distinguish the classes.

Great news: we now have meaningful text for our NLP task. But wait! Computers don't understand text; they only process numbers. In the next article, we will learn to represent this text as numbers, a format computers can accept as input.

Once we figure out how to convert text into numbers, we can choose appropriate machine learning algorithms depending on our objective. But NLP is not just about applying algorithms; it requires a data transformation strategy and a strong data pipeline. We, at eInfochips, have end-to-end machine learning capabilities for NLP along with computer vision, deep learning, and anomaly detection. We help clients bring their machine learning projects to life with 360-degree expertise from data collection to deployment.
