How to split your data into train and test sets

Having data to test how your model performs is essential to give you confidence it works.

12 April 2022

As with any data project, when developing a model, you want to check that everything works as expected. The data you use for developing a machine learning model can be split into train and test sets.

The train set is used for building the model and generally makes up around 70 to 80 percent of your initial data, while the test set makes up the remaining 20 to 30 percent.

Why do I need a test set?

You can use the test set to check how well your model performs.

You do this by feeding the input variables from the test set into your model and seeing what output it gives you. You can then compare this predicted output with the actual output.
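As a rough illustration of that comparison, here is a minimal sketch in plain Python. The values in actual and predicted are made up for this example and are not taken from the data used later in this article.

```python
# Hypothetical values for illustration only.
actual = [0, 1, 0, 1, 1, 0]     # true outputs from a test set
predicted = [0, 1, 1, 1, 0, 0]  # what a model predicted for the same rows

# Count the rows where the prediction matches the actual value.
correct = sum(a == p for a, p in zip(actual, predicted))

# Accuracy is the share of test rows the model got right.
accuracy = correct / len(actual)

print(f"{correct} of {len(actual)} correct, accuracy = {accuracy:.2f}")
# prints: 4 of 6 correct, accuracy = 0.67
```

Accuracy is only one way to score the comparison, but it shows the idea: the test set gives you known answers to check the predictions against.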

Let’s take a look at an example of how to split your data into train and test sets.

You can follow along by downloading the Jupyter Notebook and data from GitHub.

Step 1: Split input and output data

First, let’s import pandas and load the data.

import pandas as pd

df = pd.read_csv("life_insurance_data.csv")

income_usd	property_status	has_life_insurance
20500	Owner with Mortgage	0
31500	Owner with Mortgage	1
37000	Owner with Mortgage	0
⠇	⠇	⠇
71500	Renter	1
92000	Renter	0
93500	Renter	1

There are 30 rows in this dataframe.

From this data, we can model has_life_insurance using income_usd and property_status as inputs.

When we train a model, we always have to provide the input data (income_usd, property_status) and output data (has_life_insurance) separately.

The input dataframe is denoted using an upper case X:

X = df.drop(columns=['has_life_insurance'])

income_usd	property_status
20500	Owner with Mortgage
31500	Owner with Mortgage
37000	Owner with Mortgage
47000	Owner with Mortgage
⠇	⠇

And the output is denoted using a lower case y:

y = df['has_life_insurance']

has_life_insurance
0
1
0
1
⠇

This is fine if all we want to do is train a model.

But we also need to split the data into train and test sets, not just input and output sets.

Step 2: Split train and test data

For this we can use train_test_split(), which we first need to import:

from sklearn.model_selection import train_test_split

Then call train_test_split(), passing your input data X, your output data y, and the share of the original data you want in the test set. Here we specify 20%:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

train_test_split() returns four objects, which we unpack into:

X_train – input training data
X_test – input testing data
y_train – output training data
y_test – output testing data

X_train now contains 80% of the input data, and y_train contains the corresponding output data.

X_test contains the other 20% of the input data, and y_test contains the corresponding output data. We can confirm this with a quick check of the number of records in each:

len(X_train)
---
24

len(y_train)
---
24

len(X_test)
---
6

len(y_test)
---
6
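Beyond the row counts, you may also want to check that the inputs and outputs still line up after the split. The sketch below is one way to do that. It uses a stand-in dataframe with 30 rows of made-up values in place of life_insurance_data.csv, so the numbers are assumptions; only the column names come from the example above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with 30 rows; the values are made up for illustration.
df = pd.DataFrame({
    "income_usd": [20000 + 2500 * i for i in range(30)],
    "property_status": ["Owner with Mortgage"] * 15 + ["Renter"] * 15,
    "has_life_insurance": [i % 2 for i in range(30)],
})

X = df.drop(columns=["has_life_insurance"])
y = df["has_life_insurance"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Inputs and outputs are shuffled together, so their row labels still match.
print(X_train.index.equals(y_train.index))  # True
print(X_test.index.equals(y_test.index))    # True

# No row appears in both the train set and the test set.
overlap = set(X_train.index) & set(X_test.index)
print(overlap)  # set()
```

Because pandas keeps the original row labels through the split, comparing the indexes is a quick way to confirm that each input row is still paired with its own output.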

💡 Top Tip:

Because train_test_split() draws a random sample, the records chosen will be different each time you run the code.

If you need the random sample to stay the same, you can add the random_state=n argument, where n is any integer. As long as n and the data stay the same, you’ll get the same random sample. For example:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
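To see the effect of random_state, here is a small sketch that runs the same split twice and compares the results. The X and y below are made-up stand-ins, not the article's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in inputs and outputs with 30 rows; values are made up.
X = pd.DataFrame({"income_usd": [20000 + 2500 * i for i in range(30)]})
y = pd.Series([i % 2 for i in range(30)], name="has_life_insurance")

# Run the same split twice with the same random_state.
first = train_test_split(X, y, test_size=0.2, random_state=21)
second = train_test_split(X, y, test_size=0.2, random_state=21)

# Element 1 of each result is X_test; compare which rows were picked.
same = list(first[1].index) == list(second[1].index)
print(same)  # True
```

If you drop random_state from both calls, the two test sets will usually contain different rows.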

You can now fit your model using X_train and y_train, and then test it using X_test and y_test.
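As a minimal sketch of that last step, the code below fits a model on the train set and scores it on the test set. Several things here are assumptions made for illustration: the article does not name a model, so a scikit-learn DecisionTreeClassifier is used as a stand-in; the dataframe holds made-up values rather than life_insurance_data.csv; and property_status is one-hot encoded with pd.get_dummies() because the model needs numeric inputs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with 30 rows; the values are made up for illustration.
incomes = [20000 + 2500 * i for i in range(30)]
df = pd.DataFrame({
    "income_usd": incomes,
    "property_status": ["Owner with Mortgage"] * 15 + ["Renter"] * 15,
    "has_life_insurance": [1 if income > 55000 else 0 for income in incomes],
})

# Turn the text column into numeric 0/1 columns so the model can use it.
X = pd.get_dummies(
    df.drop(columns=["has_life_insurance"]),
    columns=["property_status"],
    dtype=int,
)
y = df["has_life_insurance"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21
)

# Fit on the train set only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Predict on the test inputs, then compare against the known test outputs.
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)  # share of test rows predicted correctly

print(predictions)
print(accuracy)
```

The model never sees X_test or y_test during fitting, which is what makes the score a fair check of how it handles new data.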