What is Cross-Validation in Machine Learning and how to implement it?

The problem with a single train/validation split in Machine Learning is that it gives only a limited indication of how the learner will generalize to unseen data, because the estimate depends on which data points happen to land in the validation set. This is where Cross-Validation comes into the picture. This article covers the basic concepts of Cross-Validation in Machine Learning.

What is Cross-Validation?

For any model in Machine Learning, it is considered as a best practice if the model is tested with an independent data set. Normally, any prediction model works on a known data set which is also known as the training set.

But in a real-life scenario, the model will be tested for its efficiency and accuracy with an altogether different and unique data set. Under those circumstances, you'd want your model to perform at least on par with the efficiency that it shows for the training set. Testing the model on data it was not trained on, to check that it is fit to work with future data, is known as cross-validation in Machine Learning.

We can also call it a technique for assessing how a statistical model generalizes to an independent data set. Now that we know what cross-validation stands for, let us try to understand cross-validation in simple terms.

The basic purpose of cross-validation is to assess how the model will perform with an unknown data set. For instance, imagine you are trying to score into an empty goal. It looks pretty easy, and you could even score from a considerable distance. But the real test starts when there is a goalkeeper and a bunch of defenders. That's why you need to be tested in a real match, facing all the heat, and still score the goal.

Similarly, cross-validation repeatedly tests a statistical model on data it was not trained on, so that we can check whether it keeps its efficiency on other unknown data sets.

Types Of Cross-Validation

There are two types of cross-validation techniques in Machine Learning.

Exhaustive Cross-Validation – This method tests the model on every possible way of dividing the original data set into training and validation sets. Example: Leave-p-out Cross-Validation, Leave-one-out Cross-Validation.
Non-Exhaustive Cross-Validation – In this method, the original data set is not separated into all the possible permutations and combinations. Example: K-fold Cross-Validation, Holdout Method.

Let’s get into more details about various types of cross-validation in Machine Learning.

K-Fold Cross-Validation

In Machine Learning, there is rarely enough data to train the model. If we also set aside some part of the data for validation, the model is more likely to overfit the smaller training set. It may also fail to recognize a dominant pattern if not enough data is provided for the training phase.

By reducing the data, we also face the risk of reduced accuracy due to the error induced by bias. To overcome this problem, we need a method that would provide ample data for training and some data for testing. K-fold Cross-validation does exactly that.

How does it work?

In this cross-validation technique, the data is divided into k subsets (folds). We take one subset from the bunch and treat it as the validation set for the model. We use the remaining k-1 subsets for training the model. This is repeated k times, with a different subset held out each time.

The error estimate is averaged over all k trials to get the overall effectiveness of the model. Each of the k subsets is used as the validation set exactly once and is part of the training set in the other k-1 trials. This significantly reduces the error induced by bias, since most of the data is used for fitting. It also reduces the variance of the estimate, since every data point is eventually used for validation.
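The procedure above can be sketched by hand with NumPy. This is a minimal illustration rather than a reference implementation: the helper name `k_fold_indices`, the toy data, and the straight-line model are all assumptions made for the example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        val_idx = folds[i]
        # The other k-1 folds together form the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Toy regression data: y is roughly 2x plus noise (made up for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=30)
y = 2.0 * X + rng.normal(0, 1, size=30)

fold_errors = []
for train_idx, val_idx in k_fold_indices(len(X), k=5):
    # Fit a straight line on the k-1 training folds
    slope, intercept = np.polyfit(X[train_idx], y[train_idx], deg=1)
    # Mean squared error on the held-out fold
    preds = slope * X[val_idx] + intercept
    fold_errors.append(np.mean((preds - y[val_idx]) ** 2))

# The cross-validation estimate is the average of the k fold errors
cv_error = float(np.mean(fold_errors))
print("per-fold MSE:", np.round(fold_errors, 3))
print("averaged CV error:", round(cv_error, 3))
```

Every sample lands in exactly one validation fold, so each data point contributes to the error estimate once.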

Stratified K-fold Cross-Validation

In this technique, a slight change is made to k-fold Cross-Validation: each fold is built to contain approximately the same percentage of samples of each target class as the whole set. In the case of regression problems, the mean response value is kept approximately equal across all the folds.

In some cases, there is a large imbalance in the response variable. Let us understand this with an example. In a house pricing problem, the prices of some houses can be much higher than those of other houses. Likewise, in classification problems, there may be many more negative samples than positive samples. To tackle this imbalance, we follow the stratified k-fold Cross-Validation technique in Machine Learning.
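As a small sketch of the classification case, scikit-learn's `StratifiedKFold` can be run on an imbalanced label vector to see how the positives are spread over the folds. The labels and placeholder features here are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 16 negatives and 4 positives (made up for illustration)
y = np.array([0] * 16 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, not used by the split

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
positives_per_fold = []
for train_idx, val_idx in skf.split(X, y):
    # Count how many positive samples ended up in this validation fold
    positives_per_fold.append(int(y[val_idx].sum()))
    print("validation size:", len(val_idx), "positives:", positives_per_fold[-1])
```

With 4 positives and 4 folds, each validation fold receives one positive, which mirrors the 20% positive rate of the full set.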

Holdout Method

This is the simplest cross-validation method of all. In this method, we randomly assign data points to two data sets, a training set and a holdout (validation) set. The split ratio is a free choice, although the training set is usually the larger of the two.

The basic idea is to remove a part of your training set and use it to get predictions from the model that is trained on the rest of the data. Because the model is evaluated on a single split, the estimate suffers from high variance: a different random split can give a noticeably different result, so it may be misleading.
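A holdout split can be sketched with scikit-learn's `train_test_split`. The 70/30 ratio and the seeds below are illustrative assumptions, not recommendations from the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.array([0.10, 0.22, 0.31, 0.43, 0.52, 0.63, 0.72, 0.85, 0.92, 0.99])

# Hold out 30% of the points; the ratio is an illustrative choice
train, holdout = train_test_split(data, test_size=0.3, random_state=1)
print("train:", train)
print("holdout:", holdout)

# A different seed gives a different split, which is why a single
# holdout estimate can change noticeably from run to run
holdout_means = [
    train_test_split(data, test_size=0.3, random_state=seed)[1].mean()
    for seed in range(5)
]
print("holdout means across 5 seeds:", np.round(holdout_means, 3))
```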

Leave-p-out Cross-Validation

In this approach, p data points are left out of the training data. Let’s say there are m data points in the data set, then m-p data points are used for the training phase. And the p data points are kept as the validation set.

This technique is rather exhaustive because the above process is repeated for all the possible combinations in the original data set. To check the overall effectiveness of the model, the error is averaged for all the trials.

It quickly becomes computationally infeasible, since the model needs to be trained and validated once for each of the C(m, p) possible combinations, a number that grows very fast for large data sets and for p greater than 1.
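To get a feel for how fast the number of trials grows, here is a short sketch using scikit-learn's `LeavePOut` together with `math.comb`. The data set sizes are arbitrary values picked for illustration.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

data = np.arange(6)  # m = 6 toy data points
lpo = LeavePOut(p=2)

# Number of train/validation splits equals "m choose p"
n_splits = lpo.get_n_splits(data)
print("splits for m=6, p=2:", n_splits)

# The count explodes for larger m and p, so we only compute it here
# instead of actually enumerating the splits
for m, p in [(100, 1), (100, 2), (100, 5)]:
    print(f"m={m}, p={p}: {comb(m, p)} model fits")
```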

Leave-one-out Cross-Validation

This method is the special case of Leave-p-out Cross-Validation with p = 1. Only m models need to be trained, one for each data point, which saves a lot of time compared with larger values of p.

If the data set is very large, it can still take a lot of time. But it remains quicker than Leave-p-out Cross-Validation with p greater than 1.
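A minimal sketch with scikit-learn's `LeaveOneOut` shows that the number of splits equals the number of data points. The four values below are arbitrary sample data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

data = np.array([0.10, 0.22, 0.31, 0.43])

loo = LeaveOneOut()
splits = list(loo.split(data))
for train_idx, val_idx in splits:
    # Each split holds out exactly one data point for validation
    print("train:", data[train_idx], "validation:", data[val_idx])

print("number of splits:", len(splits))
```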

Now that we have discussed the different types of Cross-Validation techniques, let us take a look at the Cross-Validation API.

Cross-Validation API

We do not have to implement Cross-Validation manually. The Scikit-Learn library in Python provides a simple implementation that will split the data accordingly. It offers Cross-Validation iterators for each of the Cross-Validation strategies:

k-fold Cross-Validation: KFold() scikit-learn class
Leave-one-out Cross-Validation: LeaveOneOut() scikit-learn class
Leave-p-out Cross-Validation: LeavePOut() scikit-learn class
Stratified K-Fold Cross-Validation: StratifiedKFold() scikit-learn class

For example, let us use KFold in Python to create training and validation sets.

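from numpy import array
from sklearn.model_selection import KFold

# sample data
data = array([0.10, 0.22, 0.31, 0.43, 0.52, 0.63, 0.72, 0.85, 0.92, 0.99])
# split into 3 folds; shuffle with a fixed seed so the result is reproducible
# (recent scikit-learn versions require these as keyword arguments)
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
# enumerate the splits; split() yields index arrays for each fold
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))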

Similarly, we can choose other cross-validation iterators depending upon the requirement and the type of data. Now let us try to understand how we can calculate the model’s bias and variance.

How To Measure Model’s Bias-Variance

If we do k-fold cross-validation, we will get k different estimation errors, one per fold. In an ideal situation, these errors would all be zero, but it is highly unlikely to get such results. To get an estimate of the bias, we take the average of all the estimation errors.

To calculate the model’s variance, we take the standard deviation of all the errors. If we get a low value of standard deviation it means that our model does not vary a lot with different sets of training data.

The focus should be to maintain a balance between the bias and the variance of the model. This can be achieved by reducing the variance to the minimum and controlling the bias. This trade-off usually results in making better predictive models.
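The mean and standard deviation of the fold errors described above can be computed with scikit-learn's `cross_val_score`. This is a sketch under assumptions: the synthetic data from `make_regression`, the linear model, and mean squared error as the error measure are all choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
# scikit-learn reports negated MSE so that higher is better; flip the sign back
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
)
errors = -scores

mean_error = errors.mean()  # average error across folds (the bias estimate above)
std_error = errors.std()    # spread across folds (the variance estimate above)
print("fold errors:", np.round(errors, 2))
print("mean error:", round(float(mean_error), 2))
print("std of errors:", round(float(std_error), 2))
```

A small standard deviation relative to the mean suggests the model behaves similarly regardless of which folds it was trained on.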

But there are a few limitations with Cross-Validation as well. Let us take a look at various limitations with Cross-Validation.

Limitations Of Cross-Validation

The following are a few limitations faced by Cross-Validation:

In an ideal situation, Cross-Validation will produce optimum results. But in the case of inconsistent data, the results may vary drastically. It is quite uncertain what kind of data the model will encounter.
Predictive modeling often has to cope with data that evolves over time, which can change the training and the validation sets drastically.
The results may vary depending upon the features of the data set. Let us say we build a predictive model to detect an ailment in a person and we train it on a specific subset of the population. That subset may differ from the general population, causing inconsistency and reduced efficiency.

Cross-Validation Applications

Besides its main use in detecting Overfitting and Underfitting in a Machine Learning model, there are several other applications of Cross-Validation, listed below:

We can use it to compare the performances of a set of predictive modeling procedures.
Cross-Validation excels in the field of medical research.
It can be used in meta-analysis, since a lot of data analysts are already using cross-validation.

This brings us to the end of this article where we have learned Cross-Validation in Machine Learning. I hope you are clear with all that has been shared with you in this tutorial.
