
Clean Text for Natural Language Processing in Six Steps

Cleaning and preparing data is critical to effectively solving any problem in data science. So when I approached my first Natural Language Processing project, I wanted to make sure I could effectively prepare my data for the modeling that would come next.

It soon became clear to me that running a series of individual cells in my Jupyter Notebook was not going to be a sustainable solution, so I stepped back and looked into creating a function that would improve my data cleaning process for natural language data.

In this blog, I walk you through the process so that you can create your own Python function that, using Regular Expressions (RegEx), fulfills your data cleanliness needs.

Step 1 – Uniform Text

The first step is to convert all text to lowercase, removing any capital letters.
This is particularly helpful because the rest of the function can then treat all letters in the same format.

    # Make lowercase
    txt = txt.lower()
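
For instance, a quick check of the built-in str.lower method on a made-up string:

    'Hello World!'.lower()   # -> 'hello world!'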

Note: if you want to apply a sentiment analysis algorithm, you should first check whether it is case-sensitive (e.g. Flair) before applying this step.

Step 2 – Handle Format Characters

Now, before hopping on to regular expressions, make sure you import re in your code.
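
If it is not already among your imports, add:

    import re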

We then start by replacing special HTML characters (entities such as &amp;). These, in fact, might just add noise to your text without giving any insight.

To do so, we start by calling the re.sub() function, which has the following structure:

    re.sub(find, replace, text)

So first we pass the pattern we want to find, then the string we want to replace it with (here, an empty string), and finally the text this action will be performed on.
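
As a minimal illustration with made-up strings:

    re.sub('cat', 'dog', 'the cat sat')   # -> 'the dog sat'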

This same structure will be repeated throughout the entire function for each of our next steps.

To identify the HTML special characters, we first write the pattern as a raw string, using the r'' prefix so that Python leaves backslashes alone, and then look for sequences that start with & . We write it as \& to tell the program to interpret the special character as itself (aka escape it; strictly speaking & has no special meaning in RegEx, so the backslash is harmless rather than required).

The next element in our pattern tells the program to match any word characters following the & (that is what \w* does, the * meaning zero or more repetitions) and to include the sequence in the selection only if it ends with ;.

    # Handle formatting characters (e.g. &amp;)
    txt = re.sub(r'\&\w*;', '', txt)
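
For instance, on a made-up string containing an HTML entity:

    re.sub(r'\&\w*;', '', 'ham &amp; eggs')   # -> 'ham  eggs'

Note the double space left behind; step 3 will take care of that.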

Having done this, we can move on to removing other noise from our text!

Step 3 – Delete Whitespaces

Next, we will delete extra whitespace. It could be generated by missing values, by formatting, or simply by mistakes in our text.

As anticipated in step 2, we again use re.sub. The expression \s identifies a single whitespace character; adding \s+ after it (one or more further whitespace characters) selects runs of two or more consecutive whitespaces. We then replace each run with a single space and apply it to the text.

    # Delete whitespaces
    txt = re.sub(r'\s\s+', ' ', txt)
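
Continuing the earlier made-up example, the stray double space is collapsed:

    re.sub(r'\s\s+', ' ', 'ham  eggs')       # -> 'ham eggs'
    re.sub(r'\s\s+', ' ', 'too \t\n many')   # -> 'too many'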

Even though many tokenizers would handle this automatically, including this step in our data cleaning makes the data cleaner for potential further, different uses.

Step 4 – Exclude Hyperlinks

Also, when analyzing data coming from the web, posts and texts often include hyperlinks. These are isolated using the same method applied before.

After https we add a ?, which matches either 0 or 1 repetitions of the preceding character (the s), so both http and https are covered. Then, to add ://, we have to precede each slash with a backslash, writing it as \/ (escaping it; not strictly required in Python's re, but harmless). We then add a dot followed by *, matching any and all characters, and close with \/\w*, the final slash and the word following it.

The above steps will result in:

    # Exclude hyperlinks
    txt = re.sub(r'https?:\/\/.*\/\w*', '', txt)
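
As a hypothetical example:

    re.sub(r'https?:\/\/.*\/\w*', '', 'see https://example.com/page for info')
    # -> 'see  for info'

Keep in mind that .* is greedy: if a single string contains two links, everything between them will be removed as well.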

Suggestion: depending on the problem you are looking into, you might want to use such a regular expression to identify and isolate the extensions of linked websites, or domain names, and store them separately.

Step 5 – Handle punctuation and ‘s, ‘t, ‘ve

Contractions are common in colloquial language, but for NLP algorithms they might create unwanted noise. Any punctuation could also constitute noise, and removing it could give a better dataset to the models we plan to test later.

For this reason we use string.punctuation, a constant holding all ASCII punctuation characters, and call .replace('@', '') on it to drop @ from the set (useful, for example, to preserve Twitter handles). We build a RegEx character class from what remains and, once identified, replace each run of punctuation with a whitespace.

    # Handle punctuation and 's, 't, 've
    txt = re.sub(r'[' + string.punctuation.replace('@', '') + ']+', ' ', txt)

Notice that to use string.punctuation you need to make sure you have imported string into your notebook. You can do so by adding this to your imports:

    import string
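
As a made-up example, note how @ survives while other punctuation becomes a space:

    re.sub(r'[' + string.punctuation.replace('@', '') + ']+', ' ', "don't email @sam!")
    # -> 'don t email @sam '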

Step 6 – Remove Characters Outside the Basic Multilingual Plane

Finally, it can happen while scraping the web or using APIs that the text is still contaminated with characters we do not want, such as emoji. Strictly speaking, everything in a Python string is Unicode; what this step removes are characters outside the Basic Multilingual Plane, i.e. those with code points above U+FFFF (written \uFFFF in Python), which is where most emoji live. We proceed character by character through our corpus, checking whether each one falls at or below \uFFFF; if it does, it is added back to the corpus through the join function.

    # Remove characters outside the Basic Multilingual Plane
    txt = ''.join(l for l in txt if l <= '\uFFFF')
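
For instance, with a made-up string containing an emoji:

    ''.join(l for l in 'hello 😀 world' if l <= '\uFFFF')   # -> 'hello  world'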

All together

All the above-mentioned steps can now be combined into a single function, as follows:

    import re
    import string

    def cleaner(txt):
        """Clean a raw text string for NLP and return it."""
        # Make lowercase
        txt = txt.lower()

        # Handle formatting characters (e.g. &amp;)
        txt = re.sub(r'\&\w*;', '', txt)

        # Collapse runs of whitespace into a single space
        txt = re.sub(r'\s\s+', ' ', txt)

        # Exclude hyperlinks
        txt = re.sub(r'https?:\/\/.*\/\w*', '', txt)

        # Handle punctuation and 's, 't, 've (keeping @)
        txt = re.sub(r'[' + string.punctuation.replace('@', '') + ']+', ' ', txt)

        # Remove characters outside the Basic Multilingual Plane (e.g. emoji)
        txt = ''.join(l for l in txt if l <= '\uFFFF')

        return txt
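
A quick sanity check on a made-up string (the URL and entity are placeholders):

    cleaner('Hello &amp; WELCOME to https://example.com/home everyone!')
    # -> 'hello welcome to  everyone '

Note that the whitespace step runs before the link and punctuation steps, so a couple of stray spaces can survive the full pipeline; if that matters for your use case, you could repeat the \s\s+ substitution at the end of the function.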

To apply your data cleaning function to your data, you'll only need to use the .apply method on the column in which you are storing the text (assuming it lives in a pandas DataFrame).

    df['text_column'] = df['text_column'].apply(cleaner)

In conclusion, small functions can greatly benefit your data cleaning process, making sure your data is in the best possible shape to extract the most value from it.

