How to split your data into train and test sets
Having data to test how your model performs is essential to give you confidence it works.
As with any data project, when developing a model, you want to check that everything works as expected. The data you use for developing a machine learning model can be split into train and test sets.
The train set is used for building the model and is generally around 70 to 80 percent of your initial data, while the test set makes up the remaining 20 to 30 percent.
Why do I need a test set?
You can use the test set to check how well your model performs.
You do this by feeding the input variables from the test set into your model and seeing what output it gives you. You can then compare this predicted output with the actual output.
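As a minimal sketch of that comparison, the snippet below counts how often a set of predictions matches the actual outputs. The two lists are made-up values for illustration only; they do not come from the article's dataset or from any particular model.

```python
# Hypothetical values for illustration only; not taken from the article's data.
actual = [0, 1, 0, 1, 1, 0]      # true outputs held back in the test set
predicted = [0, 1, 1, 1, 0, 0]   # what a model returned for the same rows

# Count the rows where the prediction matches the actual output.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

print(f"{correct} of {len(actual)} correct, accuracy = {accuracy:.2f}")
```

The share of matching rows (accuracy) is only one possible way to score a model, but it shows the idea: the closer the predictions are to the actual outputs, the better the model performs on data it has not seen.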
Let’s take a look at an example of how to split your data into train and test sets.
You can follow along by downloading the Jupyter Notebook and data from GitHub.
Step 1: Split input and output data
First, let’s import pandas and load the data.
import pandas as pd
df = pd.read_csv("life_insurance_data.csv")
income_usd | property_status | has_life_insurance |
---|---|---|
20500 | Owner with Mortgage | 0 |
31500 | Owner with Mortgage | 1 |
37000 | Owner with Mortgage | 0 |
⠇ | ⠇ | ⠇ |
71500 | Renter | 1 |
92000 | Renter | 0 |
93500 | Renter | 1 |
There are 30 rows in this dataframe.
From this data, we can model has_life_insurance using income_usd and property_status as inputs.
When we train a model, we always have to provide the input data (income_usd, property_status) and output data (has_life_insurance) separately.
The input dataframe is denoted using an upper case X:
X = df.drop(columns=['has_life_insurance'])
income_usd | property_status |
---|---|
20500 | Owner with Mortgage |
31500 | Owner with Mortgage |
37000 | Owner with Mortgage |
47000 | Owner with Mortgage |
⠇ | ⠇ |
And the output is denoted using a lower case y:
y = df['has_life_insurance']
has_life_insurance |
---|
0 |
1 |
0 |
1 |
⠇ |
This is fine if all we want to do is train a model.
But we also need to split the data into train and test sets, not just input and output sets.
Step 2: Split train and test data
For this, we can use train_test_split() from scikit-learn, which we first need to import:
from sklearn.model_selection import train_test_split
Then, call train_test_split(), specifying your input data X, your output data y, and how much of the original data you want in the test sample. In this case, we specify 20%:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
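train_test_split() returns a list, which we unpack into four objects. X_train and X_test are dataframes; y_train and y_test are Series, because y is a single column: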
X_train – input training data
X_test – input testing data
y_train – output training data
y_test – output testing data
X_train will now contain 80% of the input data, with the corresponding output data in y_train.
X_test will have the remaining 20% of the input data, with the corresponding output data in y_test. We can confirm this with a quick check of the number of records in each:
len(X_train)
---
24
len(y_train)
---
24
len(X_test)
---
6
len(y_test)
---
6
💡 Top Tip:
As the output of train_test_split() is a random sample, the records chosen will be different each time you run the code.
If you need the random sample to stay the same, you can add the random_state=n argument, where n is any integer you choose. As long as n stays the same, you’ll get the same random sample. For example:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
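To see the tip in action, here is a small sketch that calls train_test_split() twice with the same random_state and checks that both calls pick the same rows. The dataframe is a made-up 30-row stand-in that borrows two of the article's column names; the values are assumptions, not the real CSV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the article's 30-row dataset (values are illustrative).
df = pd.DataFrame({
    "income_usd": [20000 + 2500 * i for i in range(30)],
    "has_life_insurance": [i % 2 for i in range(30)],
})
X = df.drop(columns=["has_life_insurance"])
y = df["has_life_insurance"]

# Two separate calls with the same random_state select the same rows.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=21
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=21
)

# Compare the row labels chosen for the test set in each call.
same_rows = list(X_test_a.index) == list(X_test_b.index)
print(same_rows)
print(len(X_train_a), len(X_test_a))
```

Dropping random_state from either call would generally give a different set of rows on each run, which is why fixing it is handy when you want results you can reproduce.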
You can now fit your model using X_train and y_train, and then test it using X_test and y_test.
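As a rough end-to-end sketch of that last step, the example below fits a model on the train set and scores it on the test set. Several things here are assumptions rather than part of the article: the choice of DecisionTreeClassifier, the one-hot encoding of property_status with pd.get_dummies(), the use of accuracy as the score, and the made-up rows standing in for the real CSV.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up rows using the article's column names; not the real CSV.
df = pd.DataFrame({
    "income_usd": [20500 + 2500 * i for i in range(30)],
    "property_status": ["Owner with Mortgage"] * 15 + ["Renter"] * 15,
    "has_life_insurance": [0, 1] * 15,
})

# Models need numeric inputs, so one-hot encode the text column (an assumption).
X = pd.get_dummies(
    df.drop(columns=["has_life_insurance"]), columns=["property_status"]
)
y = df["has_life_insurance"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21
)

# Fit on the train set only (model choice is illustrative).
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Predict on the held-back test inputs and compare with the actual outputs.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)
```

Because the labels in this stand-in data are arbitrary, the score itself means nothing; the point is the order of operations: split first, fit on the train set, then evaluate on the test set.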