Natural language processing (NLP) is an interdisciplinary domain at the intersection of linguistics and computer science (specifically artificial intelligence). NLP deals with two types of data, text and audio (speech); in this article we will focus on text data.
We generate exabytes of data every day, and text makes up a significant portion of it. With easy access to a powerful interactive web and high-bandwidth connectivity, we have been generating more and more text data in recent years: social media posts, emails, text messages, forum threads, blogs, and so on.
Given that this content is user generated and (largely) unstructured, it requires pre-processing before it can be used for further operations. In this article, we will discuss the fundamental text pre-processing steps that make text data more usable, pairing each concept with a Python implementation for better understanding.
1. HTML Tag removal
The web generates tons of text data, and this text often contains HTML tags. These tags exist only to enable proper browser rendering and add no value to the text itself.
We will use the unprocessed text below to cover all the use cases we want to look at in this article. (The text is inspired by one of the reviews in the Amazon food review dataset; I have modified it to cover all use cases.)
Why is this $12 when the same product is available for $10 here?<br />http://www.domain.com/product/dp/B00004RBDY<br />I don’t understand reason behind two different prices.<br />The Victor M380 and M502 traps are unreal, /of course, — total fly genocide. Pretty stinky, but only right nearby.
Here, to remove the HTML tags, we will use BeautifulSoup, a Python library for pulling data out of HTML and XML files.
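The article's original listing is not reproduced here, but a minimal sketch of this step might look as follows (the function name `remove_html_tags` is my own):

```python
from bs4 import BeautifulSoup

def remove_html_tags(text):
    """Parse the text as HTML and return only the visible content."""
    soup = BeautifulSoup(text, "html.parser")
    # get_text() drops the tags; the separator keeps adjacent words apart
    return soup.get_text(separator=" ")

raw = ("Why is this $12 when the same product is available for $10 here?"
       "<br />http://www.domain.com/product/dp/B00004RBDY<br />"
       "I don't understand reason behind two different prices.")
print(remove_html_tags(raw))
```

Running this on our sample text replaces each `<br />` tag with a space, producing the cleaned text shown below.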
Why is this $12 when the same product is available for $10 here? http://www.domain.com/product/dp/B00004RBDY
I don’t understand reason behind two different prices. The Victor M380 and M502 traps are unreal, /of course, — total fly genocide. Pretty stinky, but only right nearby.
2. URL removal
URLs (Uniform Resource Locators) in a text are references to locations on the web and rarely provide useful information for analysis. We therefore remove them as well, using the re library, which provides regular-expression matching operations.
We take our sample text and analyse each word, removing words or strings starting with http.
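A minimal sketch of this step, using re as described above (the helper name `remove_urls` is my own):

```python
import re

def remove_urls(text):
    # Remove any substring that starts with http (covers http:// and https://)
    cleaned = re.sub(r"http\S+", "", text)
    # Collapse the extra whitespace the removal leaves behind
    return re.sub(r"\s+", " ", cleaned).strip()

sample = ("Why is this $12 when the same product is available for $10 here? "
          "http://www.domain.com/product/dp/B00004RBDY "
          "I don't understand reason behind two different prices.")
print(remove_urls(sample))
```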
Why is this $12 when the same product is available for $10 here? I don’t understand reason behind two different prices. The Victor M380 and M502 traps are unreal, /of course, — total fly genocide. Pretty stinky, but only right nearby.
3. Expand contracted words
In everyday verbal and written communication, many of us contract common phrases: "you are" becomes "you're". Expanding contractions into their full form makes the text more uniform and easier to analyse.
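One simple way to sketch this step is a hand-rolled lookup table; dedicated libraries such as `contractions` offer far broader coverage. The table below is a hypothetical minimal example, not an exhaustive list:

```python
import re

CONTRACTION_MAP = {
    "don't": "do not",
    "i'm": "i am",
    "you're": "you are",
    "can't": "cannot",
    "won't": "will not",
}

def expand_contractions(text):
    # Normalize curly apostrophes so "don’t" matches "don't"
    text = text.replace("\u2019", "'")
    pattern = re.compile("|".join(re.escape(k) for k in CONTRACTION_MAP),
                         flags=re.IGNORECASE)
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTION_MAP.get(word.lower(), word)
        # Preserve an initial capital letter
        return expanded.capitalize() if word[0].isupper() else expanded
    return pattern.sub(replace, text)
```

Applied to our sample, this turns "I don't understand" into "I do not understand", as shown in the output below.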
Why is this $12 when the same product is available for $10 here? I do not understand reason behind two different prices. The Victor M380 and M502 traps are unreal, /of course, — total fly genocide. Pretty stinky, but only right nearby.
4. Removing special characters
Special characters like - (hyphen) or / (slash) don't add any value, so we generally remove them. Which characters are removed depends on the use case: if currency plays no role in the task (for example, sentiment analysis), we remove the $ or any other currency sign.
Here, we will also remove words that contain a number, like M380 in our example.
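A minimal sketch of this step might look as follows; the `remove_number` parameter name comes from the article's description, while the function name is my own:

```python
import re

def remove_special_characters(text, remove_number=False):
    # Replace anything that is not a letter, digit, or whitespace
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    if remove_number:
        # Drop any token that contains a digit (e.g. M380, 12)
        text = " ".join(w for w in text.split()
                        if not re.search(r"\d", w))
    # Collapse repeated whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

sample = "The Victor M380 and M502 traps are unreal, /of course!"
print(remove_special_characters(sample, remove_number=True))
```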
Why is this when the same product is available for here I do not understand reason behind two different prices The Victor and traps are unreal of course total fly genocide Pretty stinky but only right nearby
In the above code, if we want to treat numbers as special characters too, we just need to set remove_number to True.
5. Stopword removal
Apart from URLs, HTML tags, and special characters, some words are simply not needed for tasks such as sentiment analysis or text classification. Words like I, me, you, and he increase the size of the text data without improving results much, so it is a good idea to remove them.
For this task, we can use a pre-defined stopword collection (e.g. from NLTK or any other NLP library), or we can define our own set of stopwords based on the task.
We will also convert all the text to lower case, since string matching in Python is case sensitive.
product available not understand reason behind two different prices victor traps unreal course total fly genocide pretty stinky right nearby
Now that we have removed all the "noise" from the text, it is time to normalize the data set. A word may appear in multiple forms, like stop and stopped (past tense) or price and prices (plural). Text normalization converts these variations into the root form of the word.
For this task, we will use the WordNetLemmatizer from the NLTK library.
product available not understand reason behind two different price victor trap unreal course total fly genocide pretty stinky right nearby
And we are done with the fundamental text pre-processing. Depending on your use case and the characters present in your data set, you can apply further steps, such as emoji removal or converting accented characters to their plain equivalents (é to e).
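As one example of such an extra step, accented characters can be flattened with the standard library alone; this is a sketch of one common approach, not the only one (libraries like `unidecode` do this more thoroughly):

```python
import unicodedata

def strip_accents(text):
    # Decompose accented characters into base letter + combining mark,
    # then drop the combining marks
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd if not unicodedata.combining(c))
```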
Looking at the output, you may think it looks nonsensical, but once we train a machine learning model on it, it will make sense. For instance, in the case of classification, the model would learn which words or combinations of words distinguish the classes.
Great news: we now have meaningful text for our NLP task. But wait! Computers don't understand text; they only understand and process numbers. In the next article, we will learn to represent this text as numbers, the format that computers can accept as input.
Once we figure out how to convert text into numbers, we can choose appropriate machine learning algorithms depending on our objective. But NLP is not just about applying algorithms; it requires a data transformation strategy and a strong data pipeline. We at eInfochips have end-to-end machine learning capabilities for NLP along with computer vision, deep learning, and anomaly detection. We help clients bring their machine learning projects to life with 360-degree expertise from data collection to deployment.