How to check the accuracy of your classification model

Use accuracy score, confusion matrix, and F1-score to check how accurate your classification model is.

In the previous post, we built a decision tree model with scikit-learn. It attempts to predict which customers have life insurance based on their income and property status.

Step 7 of the development process is to check the accuracy of the model, which is what we’ll look at here.

You can follow along by downloading the Jupyter Notebook and data from Github.

We already split our data into train and test sets before fitting the model, so we can now use the test set to see how well the model performs.
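
If you’re joining at this step, a typical scikit-learn split looks like the sketch below. The variable names X and y, the test_size, and the random_state are assumptions for illustration and may differ from the previous post’s notebook.

from sklearn.model_selection import train_test_split

# X stands for the inputs (income_usd, property_status) and y for the
# has_life_insurance flag; test_size and random_state are illustrative
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)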

For demonstration purposes, the output test set y_test has 6 records. These are the actual output values which correspond to the input test set X_test. Let’s take a look:

y_test

---
   has_life_insurance
1                   0
2                   0
3                   1
4                   1
5                   0
6                   1

Now we want to see what the model predicted for the test input X_test. That is, what the model predicted for has_life_insurance given its inputs income_usd and property_status. We can do this using predict():

y_predicted = model.predict(X_test)

y_predicted

---
   has_life_insurance
1                   0
2                   0
3                   0
4                   1
5                   0
6                   1

A quick glance at the data shows that the model predicted 5 out of 6 cases correctly, with only row 3 being incorrectly classified.
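
If you’d rather not eyeball two separate outputs, one option is to line up the actual and predicted values in a single DataFrame. This is just a convenience sketch; the column names are for illustration.

import numpy as np
import pandas as pd

# Put actual and predicted values side by side for a quick visual check
comparison = pd.DataFrame({
    'actual': np.ravel(y_test),
    'predicted': np.ravel(y_predicted),
})
comparison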

Accuracy score

Accuracy score is the number of correct predictions divided by the total number of predictions.

It’s an intuitive measure that’s easy to understand. In fact, we’ve already calculated it above when we said that the model predicted 5 out of 6, or 83.3% of cases correctly.

An easy way to calculate this in Python is with accuracy_score().

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_predicted)
round(score, 4)

---
0.8333

A low accuracy score is a sign that there are some issues with the model. You may want to increase your sample size or retrain using a different algorithm.

Confusion matrix

A confusion matrix shows you where any differences between the predictions and the actual values are coming from, i.e. whether the model is misclassifying in certain areas.

Let’s take a look at our example:

import pandas as pd
from sklearn.metrics import confusion_matrix

# Label the rows (actual) and columns (predicted) for readability
matrix = pd.DataFrame(
    confusion_matrix(y_test, y_predicted),
    index=['actual: 0', 'actual: 1'],
    columns=['predicted: 0', 'predicted: 1']
)
matrix

---
           predicted: 0  predicted: 1
actual: 0             3             0
actual: 1             1             2

The confusion matrix above shows how many observations fall into each combination of predicted and actual classification. The cells on the main diagonal (top-left and bottom-right) are where the model predicted the has_life_insurance flag correctly.

Each cell in the matrix can be classified like this:

            predicted: 0      predicted: 1
actual: 0   True negative     False positive
actual: 1   False negative    True positive

If there are a high number of false negatives or false positives then you can focus your attention on fixing those cases in your model.
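
If you want the four counts as plain numbers rather than a table, you can unpack the confusion matrix directly; for a binary classifier, scikit-learn lays the flattened matrix out in the order TN, FP, FN, TP.

from sklearn.metrics import confusion_matrix

# For a binary classifier, ravel() returns the counts in the order
# TN, FP, FN, TP; with our example this gives 3, 0, 1, 2
tn, fp, fn, tp = confusion_matrix(y_test, y_predicted).ravel()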

F1-score

The F1-score for a model is calculated using precision and recall.

For a binary classifier like the one we have here, the precision, recall, and F1-score can be calculated like this.

Precision

Precision is the percentage of positive predictions which were correct.

I.e. true positives as a percentage of all the positive predictions - column predicted = 1 in the confusion matrix.

Precision = True positive / (True positive + False positive)
          = 2 / (2 + 0)
          = 1.00
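
You don’t have to do this by hand; scikit-learn’s precision_score() gives the same number (rounded here to match the calculation above).

from sklearn.metrics import precision_score

# Precision for the positive class (has_life_insurance = 1)
round(precision_score(y_test, y_predicted), 2)

---
1.0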

Recall

Recall is the percentage of actual positive values which were predicted correctly.

I.e. the percentage of true positives in the row where actual = 1.

Recall = True positive / (True positive + False negative)
       = 2 / (2 + 1)
       = 0.67
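
Again, scikit-learn’s recall_score() does this calculation for you (rounded to match the figure above).

from sklearn.metrics import recall_score

# Recall for the positive class (has_life_insurance = 1)
round(recall_score(y_test, y_predicted), 2)

---
0.67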

F1-score

The F1-score is the harmonic mean of precision and recall. It ranges from 0 to 1, with 1 being the best and 0 the worst.

F1-score = 2 * (Precision * Recall) / (Precision + Recall)
         = 2 * (1.00 * 0.67) / (1.00 + 0.67)
         = 0.80
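
The same result comes straight from scikit-learn’s f1_score() (rounded to match the calculation above).

from sklearn.metrics import f1_score

# F1-score for the positive class (has_life_insurance = 1)
round(f1_score(y_test, y_predicted), 2)

---
0.8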

For a model which classifies into more than one category, the precision, recall, and F1-score of each class can be calculated. Then an average of these F1-scores can be used as a score for the entire model - more on this below.

Classification report

If all these calculations seem laborious, don’t worry! These all get calculated for you with classification_report().

from sklearn.metrics import classification_report
print(classification_report(y_test, y_predicted))

---
              precision    recall  f1-score   support

           0       0.75      1.00      0.86         3
           1       1.00      0.67      0.80         3

    accuracy                           0.83         6
   macro avg       0.88      0.83      0.83         6
weighted avg       0.88      0.83      0.83         6

In our case, where there are only two output classes, 0 and 1, read the row labelled 1 (the positive class) for the model’s overall precision, recall, and F1-score.

You can see that the model has an F1-score of 0.8, which is what we calculated manually above.

If we had a multi-class model (i.e. more output options than just 0 and 1), then you could use the weighted average of each class’s F1-score as a score for the model. This takes into account the number of actual observations for each class, which is shown under the support column. This is calculated for you in the weighted avg row.
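
If you only want that single weighted figure rather than the full report, f1_score() accepts an average='weighted' argument; for our example it matches the 0.83 shown in the weighted avg row.

from sklearn.metrics import f1_score

# Average of the per-class F1-scores, weighted by each class's support
round(f1_score(y_test, y_predicted, average='weighted'), 2)

---
0.83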

Accuracy score vs. F1-score

One of the benefits of using accuracy score is that it’s easy to interpret. If a model predicts 95% of the classifications correctly, then the accuracy score will be 95%.

However, this can be a problem for cases where the model is predicting something that actually happens 95% of the time.

For example, predicting whether a patient is healthy when 95% of patients are healthy. If the model simply says that every patient is healthy, regardless of the inputs, it would still have a high accuracy score of 95%, as it would be correct 95% of the time!

This is where F1-score comes in useful, as it takes into account how the data is distributed between true/false positives and negatives. Therefore, it’s a good idea to use the F1-score when you can see a large imbalance between the classes on the confusion matrix.
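
Here’s a quick sketch of that healthy-patient example, using made-up labels where 95 out of 100 patients are healthy (1) and a “model” that predicts healthy for everyone:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up data: 95 healthy patients (1) and 5 who are not (0)
y_actual = np.array([1] * 95 + [0] * 5)

# A useless "model" that predicts healthy for every patient
y_always_healthy = np.ones(100, dtype=int)

print(accuracy_score(y_actual, y_always_healthy))           # 0.95 - looks great
print(f1_score(y_actual, y_always_healthy, pos_label=0))    # 0.0 - the unhealthy class is never found
# (scikit-learn warns that precision is ill-defined for class 0 here and reports it as 0.0)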