Ridge, Lasso And ElasticNet
- Himanshu Sachdeva
- Sep 24, 2020
- 5 min read

A regression model should be as simple as possible, but no simpler. Simpler models generalize better and need fewer training samples to train effectively than more complex ones.
However, linear regression only minimizes the error (e.g. the mean squared error in ordinary least squares) without taking the complexity of the model into account, which often results in arbitrarily complex coefficients. Let's look at the two equations below:
Y = x + 2
Y = 56.79 + 84.37x
If both equations produce similar results, which one do we choose for our model? The first one, as it will generalize far better than the second, more complex equation. The first equation is also a simpler representation and takes less space in memory, just a few bits. The next question is: how simple (or complex) should a model be? Let's explore that a little.
The problem with a model that is too simple is that it doesn't learn the pattern well. As a result, it is too naïve to produce meaningful results. This is called underfitting. It's like a student who only learns the basic concepts and does not attempt to understand how to apply them. We can see in the first graph how a simple linear equation misses the trend in the data. Underfitting is easier to detect, as the model performs poorly on the training data itself; adding more training data will not fix it, because the model is not expressive enough to begin with.

On the other hand, complex models have a tendency to memorize the data and thus overfit. Overfitting means that instead of learning the pattern, the model fits through all the data points, resulting in a wavy curve, as seen in the third graph. Though it gives very good accuracy on the training data, it fails miserably on unseen test data. It is like a student who memorizes the answers to all the questions but does not understand the concepts: the student will do very well when asked the same questions, but will be unable to answer questions (s)he hasn't seen before.
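To make this concrete, here is a minimal sketch (with invented data and NumPy's polyfit; the numbers and degrees are illustrative, not from this post) comparing polynomial fits of different degrees:

```python
import numpy as np

# Invented data: a noisy quadratic trend, 20 training points
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 1 + 2 * x - 3 * x**2 + 0.3 * rng.normal(size=x.size)
x_test = np.linspace(0, 1, 200)
y_test = 1 + 2 * x_test - 3 * x_test**2

# Degree 1 tends to underfit, degree 2 is about right,
# degree 10 tends to overfit: low training error, higher test error.
for degree in [1, 2, 10]:
    coefs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))
```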
The solution to overfitting is regularization: adding a penalty on the model's coefficients to create an optimally complex model, i.e. a model that is as simple as possible while still performing well on the training data.
Through regularization, we make a deliberate attempt to bring down the complexity and strike the delicate balance between keeping the model simple and not making it too naïve to be of any use. The model is then less likely to fit the noise in the training data and will generalize better.
In regularized regression, the objective function has two parts: the error term and the regularization term.
The regularization term can be:
- the sum of squares of the coefficients (Ridge regression), or
- the sum of absolute values of the coefficients (Lasso regression).
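Written out (a sketch of the standard objectives, with β for the coefficients and λ for the penalty weight):

```latex
% Ridge: squared-error term plus an L2 penalty on the coefficients
\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

% Lasso: squared-error term plus an L1 penalty on the coefficients
\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
```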
Ridge Regression
In Ridge regression, we add a penalty term equal to the sum of the squares of the coefficients (the L2 term), controlled by a hyperparameter λ (more about hyperparameters in a separate blog). If λ is zero, the objective reduces to plain OLS; if λ > 0, it constrains the coefficients. As we increase λ, this constraint pushes the coefficients towards zero. This lowers the variance of the model (each individual variable has a smaller effect on the prediction) at the cost of a small increase in bias.
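As a quick illustration (a sketch using scikit-learn's Ridge on made-up data; alpha plays the role of λ here), the coefficients shrink as the penalty grows:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data: y depends on two correlated features plus noise (invented example)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)   # make the features collinear
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)

# Larger alpha (the lambda in the text) shrinks the coefficients towards zero
for alpha in [0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)
```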

In summary,
- It shrinks the parameters, so it is often used to deal with multicollinearity.
- It reduces model complexity through coefficient shrinkage.
- Disadvantage of Ridge Regression: it decreases the complexity of a model but does not reduce the number of variables, since it never drives a coefficient exactly to zero, only shrinks it. Hence, this model is not good for feature reduction.
Lasso Regression
Lasso stands for Least Absolute Shrinkage and Selection Operator. It adds a penalty term to the cost function equal to the sum of the absolute values of the coefficients. As a coefficient grows away from 0, this term penalizes the model, pushing it to keep coefficients small in order to reduce the loss. The difference between ridge and lasso regression is that lasso tends to drive some coefficients exactly to zero, whereas ridge never sets a coefficient to exactly zero.

- It uses the L1 regularization technique.
- It is generally used when we have a large number of features, because it automatically performs feature selection (see the sketch after this list).
Disadvantages of Lasso Regression:
- Lasso sometimes struggles with certain types of data. If the number of predictors (p) is greater than the number of observations (n), lasso will select at most n predictors as non-zero, even if all predictors are relevant.
- If there are two or more highly collinear variables, lasso tends to select one of them somewhat arbitrarily, which is not good for the interpretation of the data.
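Here is a minimal sketch of that selection behaviour (scikit-learn's Lasso on invented data where only the first two of ten features actually matter); most coefficients come out exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Invented data: 100 samples, 10 features, but only the first 2 affect y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# With a moderate penalty, most coefficients are driven exactly to zero,
# which is how lasso performs feature selection.
model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)
```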
Let us visualise how L1 and L2 regularization work using the contours they produce, and understand why lasso can set coefficients exactly to 0 while ridge cannot.

In L1, where we use absolute values, the regularization term forms a diamond with corners on the axes, as shown in the left graph above. As a result, there are places where the error contours (the ellipses) first touch the regularization region at a corner on an axis, which sets the corresponding coefficient exactly to 0.
In L2, by contrast, the regularization region is a circle, since the penalty is a squared function. In this case the error contours typically touch the circle very close to an axis, but never exactly on it, so the coefficients get very close to 0 but never become exactly 0.
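Equivalently, the two penalties can be written as constraints on the coefficients (a standard reformulation; t is the coefficient "budget"), which is what the diamond and circle in the graphs represent:

```latex
% Lasso: diamond-shaped constraint region (corners lie on the axes)
\min_{\beta} \sum_i (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad |\beta_1| + |\beta_2| \le t

% Ridge: circular constraint region (no corners, so exact zeros are rare)
\min_{\beta} \sum_i (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \beta_1^2 + \beta_2^2 \le t
```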
Elastic Net Regression
Now that we have a basic understanding of ridge and lasso regression, let's think of an example where we have a large dataset with 10,000 features, in which some of the independent features are correlated with other independent features. Which regression should we use, Ridge or Lasso?
If we apply ridge regression to it, it will retain all of the features but shrink the coefficients. The problem is that the model will still remain complex, as all 10,000 features are kept, which may lead to poor model performance.
The main problem with lasso regression is that when we have correlated variables, it retains only one of them and sets the other correlated variables to zero. That can lead to a loss of information, resulting in lower accuracy of our model. So what is the solution to this problem?
There is another type of regression, known as elastic net regression, which is basically a hybrid of ridge and lasso regression. Similar to the lasso, the elastic net simultaneously does automatic variable selection and continuous shrinkage, and it can select groups of correlated variables. It is like a stretchable fishing net that retains ‘all the big fish’.
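In one common parameterization (a sketch; libraries such as scikit-learn express the same idea through an overall strength alpha and a mixing ratio l1_ratio), the elastic net penalty is simply a weighted mix of the two:

```latex
% Elastic net penalty: a weighted combination of the L1 (lasso) and L2 (ridge) terms
\lambda_1 \sum_{j=1}^{p} |\beta_j| \;+\; \lambda_2 \sum_{j=1}^{p} \beta_j^2
```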

So if you know elastic net, you can recover both Ridge and Lasso by tuning its parameters, since it uses both the L1 and L2 penalty terms.
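A minimal sketch with scikit-learn's ElasticNet on the same kind of made-up data as before, varying the mixing ratio:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Invented data: two correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)

# l1_ratio controls the mix of penalties: values near 1.0 behave like lasso,
# values near 0.0 behave like ridge; alpha is the overall penalty strength.
for l1_ratio in [0.1, 0.5, 0.9]:
    enet = ElasticNet(alpha=0.5, l1_ratio=l1_ratio).fit(X, y)
    print(l1_ratio, enet.coef_)
```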

To summarise, the goal of a good machine learning model is to generalize well from the training data, so that it can make reliable predictions on data it has never seen.
There are three popular regularization techniques, each aiming to shrink the coefficients and keep the model from getting too complex:
- Ridge Regression, which penalizes the sum of squared coefficients (L2 penalty).
- Lasso Regression, which penalizes the sum of absolute values of the coefficients (L1 penalty).
- Elastic Net, a convex combination of the Ridge and Lasso penalties.