Raise your hand if you have tried doing some analytics with company names and it led to a visualization like this one
All 3 are the same company but because of improper formatting end up as different bars
Dirty data strikes again. The same company name can be written a bunch of different ways.
Not to mention the usual dirty data suspects
Let’s go step by step and use this list as an example
Pick one type of capitalization to follow. You can make everything UPPERCASE, lowercase or Title Case (only the first letter in each word is capitalized).
We prefer Title Case because it looks better and follows the conventions of proper nouns. But it’s really up to you.
In this step we are going to remove all non-alphanumeric characters such as commas (,) and dots (.)
We are also going to remove extra text such as (An Alphabet Co.) from Google Llc (An Alphabet Co.)
This whole step is a bit tricky and an inspection of your data is necessary to figure out which characters to remove. And what counts as extra text.
For example, Jeni’s Splendid Ice Creams is an ice cream company with an apostrophe (‘) in its name. You will not want to remove the apostrophe in this case.
Also depending on the company, things can be counted as extra text or not.
For example, both these company names have dashes ( — ) but the text after the dash in Johnson & Johnson is extra but not for Baskin — Robbins:
So you can’t reliably use the dash( — ) as the indicator for removing extra text in this case.
Generally, removing between parenthesis is fine in 99% of situations. But feel free to provide us counter-examples!
So after removing non-alphanumeric characters and extra text, our list now looks like this
This is the last step. We are now going to remove both the abbreviated and full legal entity suffixes.
You can also standardize the legal entity suffixes so each one appears consistently. For example instead of having a mix of Inc and Incorporated, all names only have Inc
This can be done in 2 ways:
NOTE: you have to list both the abbreviated and the full legal entity suffix, such as Inc vs. Incorporated
Once you have your list, go through all the company names and remove words that occur in your legal entity list. Or replace the words with one standard.
We have chosen to remove the suffixes. Which leads us to our final beautiful cleaned company list ready for Data Analysis
As you can see, this gets out of hand really fast. Especially if you have company names from around the globe.
And this is why we created www.cleanspreadsheets.com
You simply upload your spreadsheet, choose the column with your company names, and click clean.
Our algorithms then auto-magically do all the cleaning for you and you get this as the result. We break up the names and provide both full and abbreviated suffixes. We support all G20 countries
As always, feel free to let us know if you have any questions, comments or feedback at info@lovespreadsheets.com
Happy Data Cleaning!