How to encode categorical variables ready for machine learning
Transform categorical variables into numerical values ready to be used as inputs for a machine learning model.
Machine learning algorithms generally only understand numerical inptus. This raises the question, what can we do with categorical variables?
The answer is to transform the categorical variables into numerical values.
There are a few ways to do this as you’ll see later on. But first, let’s take a look at an example with some sample data.
You can follow along by downloading the Jupyter Notebook and data from Github.
Fork on Github
First let’s import Pandas, and the data.
import pandas as pd
df = pd.read_csv("life_insurance_data.csv")
income |
property_status | has |
---|---|---|
31500 | Owner with Mortgage | 1 |
67500 | Owner with Mortgage | 1 |
77000 | Owner Without Mortgage | 0 |
56000 | Owner Without Mortgage | 1 |
26500 | Renter | 0 |
60500 | Renter | 0 |
As this is a small dataframe, it’s easy to see that property_status
has categorical data with three distinct values.
df['property_status'].value_counts()
---
'Owner with Mortgage' 2
'Owner Without Mortgage' 2
'Renter' 2
For larger dataframes, it can be useful to look at the dtypes
attribute to see which variables are classified as object
.
df.dtypes
---
income_usd int64
property_status object
has_life_insurance int64
For convenience, let’s create a new dataframe df2
which only contains the property_status
column:
df2 = df[['property_status']].copy()
📌 Remember:
Use two square brackets above to keep df2
as a dataframe. Otherwise, df2
will become a series.
Label encoding
One way to map these categories to numbers is to use the map()
method:
x = {'Owner with Mortgage':0, 'Owner Without Mortgage':1, 'Renter':2}
df2['property_status_numeric'] = df2['property_status'].map(x)
property_status | property |
---|---|
Owner with Mortgage | 0 |
Owner with Mortgage | 0 |
Owner Without Mortgage | 1 |
Owner Without Mortgage | 1 |
Renter | 2 |
Renter | 2 |
As you can see above, property_status
has been mapped like this:
- Owner with Mortgage ➡️ 0
- Owner without Mortgage ➡️ 1
- Renter ➡️ 2
This does what we need, but there’s an easier way.
You can get Pandas to do this automatically, which is very helpful if you have many different categories.
First, we need to change the property_status
column to a category
data type.
df2['property_status'] = df2["property_status"].astype('category')
df2.dtypes
---
property_status category
property_status_numeric int64
Then apply the cat.codes
attributes to a new numeric value column:
df2["property_status_numeric_2"] = df2["property_status"].cat.codes
property_status | property |
property |
---|---|---|
Owner with Mortgage | 0 | 1 |
Owner with Mortgage | 0 | 1 |
Owner Without Mortgage | 1 | 0 |
Owner Without Mortgage | 1 | 0 |
Renter | 2 | 2 |
Renter | 2 | 2 |
What you can see above is that the encoding has been done in column property_status_numeric_2
, but the mapping is different to the manual mapping we did in the previous step for property_status_numeric
.
If specifying the order in which categories are numbered is important in your mapping, then you can use CategoricalDtype
:
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=['Owner with Mortgage', 'Owner Without Mortgage', 'Renter'], ordered=True)
df2['property_status'] = df2["property_status"].astype(cat_type)
df2["property_status_numeric_3"] = df2["property_status"].cat.codes
Which maps property_status_numeric_3
in the order specified in the categories=
list above:
property_status | property |
property |
property |
---|---|---|---|
Owner with Mortgage | 0 | 1 | 0 |
Owner with Mortgage | 0 | 1 | 0 |
Owner Without Mortgage | 1 | 0 | 1 |
Owner Without Mortgage | 1 | 0 | 1 |
Renter | 2 | 2 | 2 |
Renter | 2 | 2 | 2 |
Now that we’ve applied the ordering, we get the same result as when using the mapping.
One thing you may have noticed is that there’s no natural way to order the categories Owner with Mortgage
, Owner without Mortgage
and Renter
.
This could potentially be a problem for your model as it could place more weight on Renter
just because it has a higher encoding.
In some cases, label encoding makes sense as there’s a natural ranking. For example, education level:
- High school graduate ➡️ 0
- Bachelor’s degree ➡️ 1
- Postgraduate degree ➡️ 2
They follow a natural ranking, where a bachelor’s degree is more valuable than a high school diploma.
Even though you’ve seen above how to classify property_status
using label encoding, this isn’t the recommended approach when there isn’t a natural ranking for the categories.
One-hot encoding
One-hot encoding is just a fancy name for creating dummy variables which have values 0
or 1
based a categorical variable.
In the example below, animal
has been encoded to the new columns animal_dog
, animal_cat
and animal_mouse
:
animal | animal |
animal |
animal |
---|---|---|---|
dog | 1 | 0 | 0 |
cat | 0 | 1 | 0 |
mouse | 0 | 0 | 1 |
dog | 1 | 0 | 0 |
mouse | 0 | 0 | 1 |
Back to our property_status
example. To do this in Pandas, we can use the get_dummies()
method:
df3 = df[['property_status']].copy()
df3['property_status_original'] = df3['property_status']
df3 = pd.get_dummies(df3, columns=['property_status'], prefix=['property_status'])
get_dummies()
drops the original classification column - property_status
in our case. This is why I created property_status_original
in the second line above.
property_status_original | property |
property |
property |
---|---|---|---|
Owner with Mortgage | 0 | 1 | 0 |
Owner with Mortgage | 0 | 1 | 0 |
Owner Without Mortgage | 1 | 0 | 0 |
Owner Without Mortgage | 1 | 0 | 0 |
Renter | 0 | 0 | 1 |
Renter | 0 | 0 | 1 |
As you can see above, the property_status
categories have now been encoded into the three new columns.
Multicollinearity
In the above example, a customer can’t be both a renter and an owner.
This means that there’s some negative correlation between the new columns. For example, if we know that a customer rents, then we also know they’re not an owner.
This is called multicollinearity, and it should be avoided when modelling using generalized linear models which are fitted using least squares or maximum likelihood.
One way around this is to drop one of the dummy variables - property_status_Renter
in this case.
So there we have it, that’s how you can change a categorical variable into a numerical variable which can be used by machine learning models.