Share


Clean Company Names from Your Spreadsheets

by Adhaar Sharma on 2021-04-24

Raise your hand if you have tried doing some analytics with company names and it led to a visualization like this one

clean_company_viz

All 3 are the same company but because of improper formatting end up as different bars


WHY DID THIS HAPPEN?!


Dirty data strikes again. The same company name can be written a bunch of different ways.

  1. Just the company name—Apple
  2. The company name with abbreviated legal entity suffix—Apple Inc
  3. The company name with full legal entity—Apple Incorporated

Not to mention the usual dirty data suspects

  1. Inconsistent capitalization—apple Inc
  2. Random commas and dots—Apple, Inc.


How do I clean these company names?

Let’s go step by step and use this list as an example



Step 1 — Standardize capitalization

Pick one type of capitalization to follow. You can make everything UPPERCASE, lowercase or Title Case (only the first letter in each word is capitalized).

We prefer Title Case because it looks better and follows the conventions of proper nouns. But it’s really up to you.



Step 2 — Remove all non-alphanumeric characters and extra text

In this step we are going to remove all non-alphanumeric characters such as commas (,) and dots (.)

We are also going to remove extra text such as (An Alphabet Co.) from Google Llc (An Alphabet Co.)

This whole step is a bit tricky and an inspection of your data is necessary to figure out which characters to remove. And what counts as extra text.

For example, Jeni’s Splendid Ice Creams is an ice cream company with an apostrophe (‘) in its name. You will not want to remove the apostrophe in this case.

Also depending on the company, things can be counted as extra text or not.

For example, both these company names have dashes ( — ) but the text after the dash in Johnson & Johnson is extra but not for Baskin — Robbins:

  1. Johnson & Johnson — a family company
  2. Baskin — Robbins

So you can’t reliably use the dash( — ) as the indicator for removing extra text in this case.

Generally, removing between parenthesis is fine in 99% of situations. But feel free to provide us counter-examples!

So after removing non-alphanumeric characters and extra text, our list now looks like this



Step 3 — Remove/standardize full and abbreviated legal entity suffixes

This is the last step. We are now going to remove both the abbreviated and full legal entity suffixes.

You can also standardize the legal entity suffixes so each one appears consistently. For example instead of having a mix of Inc and Incorporated, all names only have Inc

This can be done in 2 ways:

  1. Inspect your data and make a list of all the legal entity names you see — best for small data
  2. Note down all the possible legal entity names for the countries in which your companies are based — best for big data

NOTE: you have to list both the abbreviated and the full legal entity suffix, such as Inc vs. Incorporated

Once you have your list, go through all the company names and remove words that occur in your legal entity list. Or replace the words with one standard.

We have chosen to remove the suffixes. Which leads us to our final beautiful cleaned company list ready for Data Analysis





An Easier Way

As you can see, this gets out of hand really fast. Especially if you have company names from around the globe.

And this is why we created www.cleanspreadsheets.com

You simply upload your spreadsheet, choose the column with your company names, and click clean.

Our algorithms then auto-magically do all the cleaning for you and you get this as the result. We break up the names and provide both full and abbreviated suffixes. We support all G20 countries



As always, feel free to let us know if you have any questions, comments or feedback at info@lovespreadsheets.com

Happy Data Cleaning!