Cleaning and preparing data is critical to effectively solving any problem in data science. So when I approached my first Natural Language Processing project, I wanted to make sure I could effectively prepare my data for the modeling that was going to come next.
It soon became clear to me that running a series of individual cells in my Jupyter Notebook was not going to be a sustainable solution, so I stepped back and looked into creating a function that would improve my data cleaning process for Natural Language data.
In this blog, I walk you through the process, so that you can create your own Python function that, using Regular Expressions (RegEx), fulfills your data cleaning needs.
Step 1 – Uniform Text
The first step is to turn all text into lowercase, removing any capital letters.
This ensures that the rest of the function treats every letter in the same format.
# Make lowercase
txt = txt.lower()
Note: if you plan to apply a sentiment analysis algorithm, you should first check whether it is case-sensitive (e.g. Flair) before applying this step.
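For instance, on a made-up sample string:

```python
# Lowercasing flattens all capitalization in one call
txt = "Natural Language Processing ROCKS"
txt = txt.lower()
print(txt)  # natural language processing rocks
```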
Step 2 – Handle Format Characters
Now, to start working with Regular Expressions, make sure you import
re in your code.
We then start by replacing special HTML characters (entities). These might just add noise to your text without giving any insight.
To do so, we call the
re.sub() function, which has the following structure:
re.sub(find, replace, text)
First we pass the pattern we want to find, then the (here empty) string we want to replace it with, and finally the text this action will be performed on.
This same structure will be repeated throughout the entire function for each of our next steps.
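Before using it for cleaning, here is that structure on a toy sentence (made-up strings, unrelated to the cleaning steps):

```python
import re

# re.sub(find, replace, text): every match of the pattern
# in the text is replaced before the result is returned
result = re.sub(r'cats', 'dogs', 'I like cats and cats like me')
print(result)  # I like dogs and dogs like me
```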
To identify the HTML special characters, we first write our pattern as a raw string, using the
r'' prefix, and we look for sequences that start with
& . We write it as
\& to tell the program to interpret the special character as itself (aka escape a special character).
Next, we tell the program to match any word characters following the
& by using
\w* (the * meaning zero or more of them), and we close the pattern with
; so that only sequences ending in a semicolon are selected:
# Handle formatting characters (e.g. &amp;)
txt = re.sub(r'\&\w*;', '', txt)
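Applied to a made-up string containing the entity &amp;:

```python
import re

sample = 'Bread &amp; butter'
# The whole entity, from & to ;, is removed
cleaned = re.sub(r'\&\w*;', '', sample)
print(cleaned)  # 'Bread  butter'
```

Notice the double space left where the entity used to be; the whitespace step below will take care of that.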
Having done this, we can move on to removing other noise from our text!
Step 3 – Delete Whitespaces
Next, we will delete extra whitespace. It could be generated by missing values, by formatting, or simply by mistakes in our text.
As anticipated in Step 2, we again use
re.sub. The expression
\s identifies a single whitespace character, and appending
\s+ selects runs of two or more consecutive whitespaces. We then replace each run with a single space, and apply it to the text.
# Delete whitespaces
txt = re.sub(r'\s\s+', ' ', txt)
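For example, on a made-up string with runs of spaces, a tab, and a newline:

```python
import re

sample = 'too   many\t\nspaces'
# Each run of two or more whitespace characters collapses to one space
cleaned = re.sub(r'\s\s+', ' ', sample)
print(cleaned)  # 'too many spaces'
```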
Even though many tokenizers would handle this automatically, including this step makes our data cleaner for potential further and different uses.
Step 4 – Exclude Hyperlinks
Also, when analyzing data coming from the web, posts and texts will often include hyperlinks. These are isolated using the same methods applied before. We start with
https and add a
? that matches either 0 or 1 occurrences of the preceding character (the s), so both http and https links are caught. Then, to add
:// we have to precede each slash with a backslash
\ so it is read literally. We then add .* to match any characters that follow, and finally \/\w* to capture a final slash and the word after it.
The above steps will result in:
# Exclude hyperlinks
txt = re.sub(r'https?:\/\/.*\/\w*', '', txt)
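On a made-up string containing a link:

```python
import re

sample = 'great read https://example.com/article check it out'
# Matches from https:// through the last /word segment
cleaned = re.sub(r'https?:\/\/.*\/\w*', '', sample)
print(cleaned)  # 'great read  check it out'
```

One caveat worth knowing: because .* is greedy, a line containing two links would match from the first link through the last, swallowing the text between them; a non-greedy .*? would match each link separately.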
Suggestion: depending on the problem you are looking into, you might want to use such a regular expression to identify and isolate the extensions of linked websites, or domain names, and store them in a separate feature.
Step 5 – Handle punctuation and ‘s, ‘t, ‘ve
Contractions are common in colloquial language, but for NLP algorithms they might create unwanted noise. Any punctuation could also constitute noise, and removing it could give a cleaner dataset to the models we plan to test later.
For this reason, we use string.punctuation, a constant containing all ASCII punctuation characters, inside a character class; we also call .replace('@', '') on it so that @ (useful, for instance, for mentions) is kept. Every run of the remaining punctuation characters is then replaced with a whitespace.
# Handle punctuation and 's, 't, 've
txt = re.sub(r'[' + string.punctuation.replace('@', '') + ']+', ' ', txt)
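On a made-up string with a contraction, a comma, and a mention:

```python
import re
import string

sample = "can't wait to try this, @friend!"
# All punctuation except @ is swapped for a space,
# so the apostrophe, comma, and exclamation mark go,
# while the mention survives
cleaned = re.sub(r'[' + string.punctuation.replace('@', '') + ']+', ' ', sample)
print(cleaned)  # "can t wait to try this  @friend "
```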
Notice that to use string.punctuation you need to make sure you have imported
string into your notebook. You can do so by adding this to your imports:
import string
Step 6 – Remove characters outside Unicode
Finally, it can happen, while scraping the web or using APIs, that the text is still contaminated with characters outside the basic Unicode range. To get rid of them, we proceed character by character through our corpus, checking whether each one falls within the Basic Multilingual Plane (code points up to
\uFFFF); if it does, it is added back to the corpus through the join() method:
# Remove characters outside Unicode:
txt = ''.join(l for l in txt if l <= '\uFFFF')
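For instance, on a made-up string containing an emoji:

```python
# The rocket emoji (U+1F680) sits outside the Basic
# Multilingual Plane, so it is dropped; ASCII survives
sample = 'to the moon \U0001F680 and back'
cleaned = ''.join(l for l in sample if l <= '\uFFFF')
print(cleaned)  # 'to the moon  and back'
```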
All the above-mentioned steps can now be combined into a single function, as follows:
def cleaner(txt):
    """cleaner accepts a string"""
    # Make lowercase
    txt = txt.lower()
    # Handle formatting characters (e.g. &)
    txt = re.sub(r'\&\w*;', '', txt)
    # Delete whitespaces
    txt = re.sub(r'\s\s+', ' ', txt)
    # Exclude hyperlinks
    txt = re.sub(r'https?:\/\/.*\/\w*', '', txt)
    # Handle punctuation and 's, 't, 've
    txt = re.sub(r'[' + string.punctuation.replace('@', '') + ']+', ' ', txt)
    # Remove characters outside Unicode:
    txt = ''.join(l for l in txt if l <= '\uFFFF')
    return txt
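As a quick sanity check, we can run the assembled function on a made-up sample string (imports included so the snippet stands alone):

```python
import re
import string

def cleaner(txt):
    """cleaner accepts a string"""
    # Make lowercase
    txt = txt.lower()
    # Handle formatting characters (e.g. &)
    txt = re.sub(r'\&\w*;', '', txt)
    # Delete whitespaces
    txt = re.sub(r'\s\s+', ' ', txt)
    # Exclude hyperlinks
    txt = re.sub(r'https?:\/\/.*\/\w*', '', txt)
    # Handle punctuation and 's, 't, 've
    txt = re.sub(r'[' + string.punctuation.replace('@', '') + ']+', ' ', txt)
    # Remove characters outside Unicode:
    txt = ''.join(l for l in txt if l <= '\uFFFF')
    return txt

sample = 'LOVED this &amp; more: https://example.com/post'
result = cleaner(sample)
print(result)  # 'loved this more  '
```

Note that some stray spaces can remain at the end, since the whitespace step runs before punctuation and hyperlink removal; a final .strip() is an easy addition if that matters for your use case.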
To apply your data cleaning function to your data, you'll only need to use the
.apply method on the column in which you are storing the text.
df['text_column'] = df['text_column'].apply(cleaner)
To conclude: small functions can greatly benefit your data cleaning process. Make sure your data is in the best possible shape to extract the most value from it.