How to encode categorical variables ready for machine learning

Transform categorical variables into numerical values ready to be used as inputs for a machine learning model.

Machine learning algorithms generally only understand numerical inputs. This raises the question: what can we do with categorical variables?

The answer is to transform the categorical variables into numerical values.

There are a few ways to do this as you’ll see later on. But first, let’s take a look at an example with some sample data.

You can follow along by downloading the Jupyter Notebook and data from GitHub.

First, let’s import Pandas and load the data.

import pandas as pd

df = pd.read_csv("life_insurance_data.csv")

income_usd  property_status         has_life_insurance
31500       Owner with Mortgage     1
67500       Owner with Mortgage     1
77000       Owner Without Mortgage  0
56000       Owner Without Mortgage  1
26500       Renter                  0
60500       Renter                  0

As this is a small dataframe, it’s easy to see that property_status has categorical data with three distinct values.

df['property_status'].value_counts()

---
'Owner with Mortgage'       2
'Owner Without Mortgage'    2
'Renter'                    2

For larger dataframes, it can be useful to look at the dtypes attribute to see which columns have the object dtype, which usually indicates text.

df.dtypes

---
income_usd             int64
property_status       object
has_life_insurance     int64
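
With many columns, reading the dtypes output by eye gets tedious. As a minimal sketch, select_dtypes() can list the non-numeric columns directly. The df_demo dataframe below is a stand-in built from the sample rows above, not the CSV file itself, and excluding numeric columns is only one reasonable way to find candidates for encoding.

```python
import pandas as pd

# Stand-in for the article's data so the example runs on its own.
df_demo = pd.DataFrame({
    "income_usd": [31500, 67500, 77000, 56000, 26500, 60500],
    "property_status": [
        "Owner with Mortgage", "Owner with Mortgage",
        "Owner Without Mortgage", "Owner Without Mortgage",
        "Renter", "Renter",
    ],
    "has_life_insurance": [1, 1, 0, 1, 0, 0],
})

# Keep every column that is not numeric; these are the candidates for encoding.
categorical_cols = df_demo.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

Note that a numeric column can still be categorical in meaning (has_life_insurance is a 0/1 flag), so treat the result as a starting point.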

For convenience, let’s create a new dataframe df2 which only contains the property_status column:

df2 = df[['property_status']].copy()

📌 Remember:

Use two square brackets above to keep df2 as a dataframe. Otherwise, df2 will become a series.
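
To see the difference between the two bracket styles, here is a small sketch. The df_demo dataframe and the variable names are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "property_status": ["Renter", "Owner with Mortgage"],
    "income_usd": [26500, 31500],
})

as_frame = df_demo[["property_status"]]  # list of column names -> DataFrame
as_series = df_demo["property_status"]   # single column name -> Series

print(type(as_frame).__name__)
print(type(as_series).__name__)
```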

Label encoding

One way to map these categories to numbers is to use the map() method:

x = {'Owner with Mortgage':0, 'Owner Without Mortgage':1, 'Renter':2}

df2['property_status_numeric'] = df2['property_status'].map(x)

property_status         property_status_numeric
Owner with Mortgage     0
Owner with Mortgage     0
Owner Without Mortgage  1
Owner Without Mortgage  1
Renter                  2
Renter                  2

As you can see above, property_status has been mapped using the dictionary x: Owner with Mortgage to 0, Owner Without Mortgage to 1, and Renter to 2.
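
One thing to watch with map() is that any value missing from the dictionary becomes NaN rather than raising an error. The sketch below checks for that. The "Lodger" value is a made-up category that is not in the sample data.

```python
import pandas as pd

# "Lodger" is a hypothetical category that the mapping does not cover.
status = pd.Series(["Renter", "Owner with Mortgage", "Lodger"])
mapping = {"Owner with Mortgage": 0, "Owner Without Mortgage": 1, "Renter": 2}

encoded = status.map(mapping)

# Values without a dictionary entry come back as NaN, so list them explicitly.
unmapped = status[encoded.isna()].unique().tolist()
print(unmapped)
```

Checking for unmapped values like this is a cheap safeguard before passing the column to a model.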

This does what we need, but there’s an easier way.

You can get Pandas to do this automatically, which is very helpful if you have many different categories.

First, we need to change the property_status column to a category data type.

df2['property_status'] = df2["property_status"].astype('category')

df2.dtypes

---
property_status            category
property_status_numeric       int64

Then assign the cat.codes attribute to a new numeric column:

df2["property_status_numeric_2"] = df2["property_status"].cat.codes

property_status         property_status_numeric  property_status_numeric_2
Owner with Mortgage     0                        1
Owner with Mortgage     0                        1
Owner Without Mortgage  1                        0
Owner Without Mortgage  1                        0
Renter                  2                        2
Renter                  2                        2

The encoding in property_status_numeric_2 differs from the manual mapping we used for property_status_numeric. By default, Pandas numbers the categories in sorted order, and because uppercase letters sort before lowercase ones, Owner Without Mortgage (0) comes before Owner with Mortgage (1).
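
To check which code was assigned to which category, you can look at cat.categories, where the position of each label is its code. A minimal sketch, using a standalone series rather than df2:

```python
import pandas as pd

status = pd.Series([
    "Owner with Mortgage", "Owner Without Mortgage", "Renter",
]).astype("category")

# The position of each label in cat.categories is the code pandas assigns to it.
code_to_label = dict(enumerate(status.cat.categories))
print(code_to_label)
```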

If specifying the order in which categories are numbered is important in your mapping, then you can use CategoricalDtype:

from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=['Owner with Mortgage', 'Owner Without Mortgage', 'Renter'], ordered=True)

df2['property_status'] = df2["property_status"].astype(cat_type)
df2["property_status_numeric_3"] = df2["property_status"].cat.codes

This encodes property_status_numeric_3 in the order specified in the categories= list above:

property_status         property_status_numeric  property_status_numeric_2  property_status_numeric_3
Owner with Mortgage     0                        1                          0
Owner with Mortgage     0                        1                          0
Owner Without Mortgage  1                        0                          1
Owner Without Mortgage  1                        0                          1
Renter                  2                        2                          2
Renter                  2                        2                          2

Now that we’ve applied the ordering, we get the same result as when using the mapping.

One thing you may have noticed is that there’s no natural way to order the categories Owner with Mortgage, Owner Without Mortgage and Renter.

This could be a problem for your model, because it may treat the codes as magnitudes and place more weight on Renter just because it has a higher encoding.

In some cases, label encoding makes sense because the categories have a natural ranking. Education level is one example: a bachelor’s degree ranks above a high school diploma, so giving it a higher code reflects a real ordering.
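
As a sketch of how a ranked variable could be encoded, the example below applies an ordered CategoricalDtype to education levels. The three levels and the sample values are hypothetical and are not part of the insurance data.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical education levels, listed from lowest to highest.
levels = ["High school diploma", "Bachelor's degree", "Master's degree"]
education_type = CategoricalDtype(categories=levels, ordered=True)

education = pd.Series(
    ["Bachelor's degree", "High school diploma", "Master's degree"]
).astype(education_type)

# Codes follow the order of the levels list, so a higher code means a higher level.
education_codes = education.cat.codes.tolist()
print(education_codes)

# Ordered categories also support comparisons against a level.
above_high_school = (education > "High school diploma").tolist()
print(above_high_school)
```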

Even though you’ve seen above how to encode property_status using label encoding, this isn’t the recommended approach when there isn’t a natural ranking for the categories.

One-hot encoding

One-hot encoding is just a fancy name for creating dummy variables which have values 0 or 1 based on a categorical variable.

In the example below, animal has been encoded to the new columns animal_dog, animal_cat and animal_mouse:

animal  animal_dog  animal_cat  animal_mouse
dog     1           0           0
cat     0           1           0
mouse   0           0           1
dog     1           0           0
mouse   0           0           1
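
The animal table can be reproduced with a short sketch like the one below. The animals dataframe is built by hand from the rows above. Passing dtype=int is a choice made here to get 0/1 values, since pandas 2.0 and later return True/False by default.

```python
import pandas as pd

animals = pd.DataFrame({"animal": ["dog", "cat", "mouse", "dog", "mouse"]})

# dtype=int gives 0/1 values instead of the True/False default in pandas 2.0+.
encoded = pd.get_dummies(animals, columns=["animal"], dtype=int)

# get_dummies orders the new columns alphabetically; reorder to match the table.
encoded = encoded[["animal_dog", "animal_cat", "animal_mouse"]]
print(encoded)
```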

Back to our property_status example. To do this in Pandas, we can use the get_dummies() function:

df3 = df[['property_status']].copy()

df3['property_status_original'] = df3['property_status']
# dtype=int keeps the dummy columns as 0/1; pandas 2.0+ returns True/False by default
df3 = pd.get_dummies(df3, columns=['property_status'], prefix=['property_status'], dtype=int)

get_dummies() drops the original categorical column (property_status in our case). This is why I created property_status_original in the second line above.

property_status_original  property_status_Owner Without Mortgage  property_status_Owner with Mortgage  property_status_Renter
Owner with Mortgage       0                                       1                                    0
Owner with Mortgage       0                                       1                                    0
Owner Without Mortgage    1                                       0                                    0
Owner Without Mortgage    1                                       0                                    0
Renter                    0                                       0                                    1
Renter                    0                                       0                                    1

As you can see above, the property_status categories have now been encoded into the three new columns.
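
In practice you would usually encode the column inside the full dataframe, so the numeric columns stay alongside the new dummy columns. Here is a minimal sketch, assuming a df_demo dataframe built from the sample rows rather than the CSV file:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "income_usd": [31500, 67500, 77000, 56000, 26500, 60500],
    "property_status": [
        "Owner with Mortgage", "Owner with Mortgage",
        "Owner Without Mortgage", "Owner Without Mortgage",
        "Renter", "Renter",
    ],
    "has_life_insurance": [1, 1, 0, 1, 0, 0],
})

# Only the listed column is encoded; the other columns pass through unchanged.
model_input = pd.get_dummies(df_demo, columns=["property_status"], dtype=int)
print(model_input.columns.tolist())
```

The untouched columns come first, followed by one dummy column per category.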

Multicollinearity

In the above example, a customer can’t be both a renter and an owner.

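This means that the new columns are not independent. Every row has exactly one 1 across the three columns, so any one column can be worked out from the other two. For example, if we know that a customer is not an owner with or without a mortgage, then we also know they rent.

This is called perfect multicollinearity (sometimes known as the dummy variable trap), and it should be avoided when modelling using generalized linear models which are fitted using least squares or maximum likelihood.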

One way around this is to drop one of the dummy variables, property_status_Renter in this case. The dropped category then acts as the baseline that the remaining columns are compared against.
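
Dropping a dummy column could look like the sketch below, which shows two options. The status dataframe is a stand-in built from the sample rows. Note that drop_first=True removes the first category in sorted order (Owner Without Mortgage here), not Renter.

```python
import pandas as pd

status = pd.DataFrame({
    "property_status": [
        "Owner with Mortgage", "Owner with Mortgage",
        "Owner Without Mortgage", "Owner Without Mortgage",
        "Renter", "Renter",
    ]
})
dummies = pd.get_dummies(status, columns=["property_status"], dtype=int)

# Every row has exactly one 1, so the three columns always sum to 1.
row_sums = dummies.sum(axis=1).tolist()

# Option 1: drop the baseline category you choose.
reduced = dummies.drop(columns=["property_status_Renter"])

# Option 2: let pandas drop the first category in sorted order instead.
reduced_auto = pd.get_dummies(
    status, columns=["property_status"], dtype=int, drop_first=True
)

print(reduced.columns.tolist())
print(reduced_auto.columns.tolist())
```

Either option removes the redundancy. The choice only changes which category serves as the baseline.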

So there we have it, that’s how you can change a categorical variable into a numerical variable which can be used by machine learning models.