Linear Regression is a good starting point for understanding the concept. But one can easily guess that a straight line is too simple to model most real data. Polynomial regression was the next approach for researchers, and as computational power developed, people started trying it out.
As the name suggests, Polynomial Regression is based on a polynomial hypothesis function rather than a linear hypothesis. The process of regression thus involves identification of multiple weights rather than just two.
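As a minimal sketch of what a polynomial hypothesis looks like, the snippet below evaluates w0 + w1*x + w2*x^2 + ... for a list of weights. The function name `polynomial_hypothesis` and the example weights are made up for illustration; a linear hypothesis is simply the special case with two weights.

```python
def polynomial_hypothesis(weights, x):
    """Evaluate w0 + w1*x + w2*x**2 + ... using Horner's rule."""
    result = 0.0
    # Walk from the highest-order weight down to w0.
    for w in reversed(weights):
        result = result * x + w
    return result


# Two weights: the linear hypothesis 1 + 2x, evaluated at x = 3.
linear = polynomial_hypothesis([1.0, 2.0], 3.0)

# Four weights: the cubic hypothesis 1 + 2x + 0x^2 + 0.5x^3, evaluated at x = 3.
cubic = polynomial_hypothesis([1.0, 2.0, 0.0, 0.5], 3.0)

print(linear, cubic)
```

Regression then means finding the list of weights that makes this function match the training data as closely as possible.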
The process of regression requires identifying the minimum value of the cost function. At a minimum, the derivative of the cost function is 0, so when the derivative is 0 or near 0, we consider that point to be the minimum. But we have one major problem in polynomial regression: as the polynomial order increases, the cost function gets more and more complicated.
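The idea of stopping when the derivative is near 0 can be sketched with a small gradient descent loop. This is only an illustration under stated assumptions: the cost is taken to be the mean squared error J = 1/(2m) * sum((h(x) - y)^2), and the data, learning rate, and tolerance are made-up values, not something prescribed above.

```python
def predict(weights, x):
    """Polynomial hypothesis: w0 + w1*x + w2*x**2 + ..."""
    return sum(w * x ** j for j, w in enumerate(weights))


def gradient(weights, xs, ys):
    """Partial derivatives of J = 1/(2m) * sum((h(x) - y)**2) per weight."""
    m = len(xs)
    grads = [0.0] * len(weights)
    for x, y in zip(xs, ys):
        error = predict(weights, x) - y
        for j in range(len(weights)):
            grads[j] += error * x ** j / m
    return grads


def gradient_descent(xs, ys, degree, learning_rate=0.5,
                     tolerance=1e-10, max_steps=10000):
    weights = [0.0] * (degree + 1)
    for _ in range(max_steps):
        grads = gradient(weights, xs, ys)
        # Stop once every partial derivative is close enough to 0.
        if max(abs(g) for g in grads) < tolerance:
            break
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights


# Toy data generated from y = 1 + 2x + 3x^2 (assumed for illustration).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]

fitted = gradient_descent(xs, ys, degree=2)
print(fitted)  # should be close to [1, 2, 3]
```

Keeping the inputs in a small range such as [-1, 1] matters here: powers of large inputs make the gradients very uneven and force a much smaller learning rate.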
As the complexity and order of the cost function increase, we can have multiple points where the derivative is 0. Only one of these is the real (global) minimum. The others are called local minima, because the cost function there is less than the values around it, but not less than all the values. If gradient descent lands on one of these local minima, we get into trouble!
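To see how the starting point decides which minimum gradient descent reaches, here is a small sketch on a one-dimensional function. The function f(x) = x^4 - 3x^2 + x is a hypothetical stand-in chosen only because it has two minima; it is not the cost function of any particular regression model, and the learning rate and step count are arbitrary.

```python
def f(x):
    """A toy function with one global and one local minimum."""
    return x ** 4 - 3 * x ** 2 + x


def f_prime(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def descend(start, learning_rate=0.01, steps=5000):
    """Plain gradient descent on f from a given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x


from_right = descend(2.0)   # settles in the minimum on the positive side
from_left = descend(-2.0)   # settles in the minimum on the negative side

# Both points have a derivative near 0, but only one is the global minimum.
print(from_right, f(from_right))
print(from_left, f(from_left))
```

Both runs stop at a point where the derivative is near 0, yet the value of the function differs. Nothing in the descent itself signals that the run from the right ended in the shallower of the two minima.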
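There are different ways to implement the cost function, the error values, and the gradient descent itself in order to reduce the possibility of getting trapped in a local minimum. For example, a plain squared-error cost stays convex in the weights, because the polynomial hypothesis is still linear in them. But we should always be careful of this problem.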
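Overfitting has been a major problem in supervised learning. If the model underfits the data, you can identify it rather quickly, because the error is high even on the training data. But if it overfits the data, the situation can be really tricky: the training error looks good, and the problem only shows up on data the model has not seen.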
In polynomial regression, the possibility of overfitting is much higher when any particular weight is too large, or when the weights of the higher-order terms are too large.
Regularization is one of the common approaches to avoid overfitting. It adds a penalty on the weights to the cost function, which prevents any particular weight from growing too large. There are two main types of regularization, L1 and L2, based on how the weights are penalized:
In L1 regularization, the weights are penalized based on the sum of their absolute values (L1 because it uses the values themselves and not their squares, as in L2). L1 regularization tends to produce a sparse model by driving the weights that are not so important to zero. This may not result in a model as accurate as L2, but the computational performance is a lot better, since fewer weights remain.
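The L1 penalty and its tendency toward sparsity can be sketched as follows. The names `l1_penalty` and `soft_threshold`, the strength `lam`, and the example weights are all assumptions for illustration. Soft-thresholding is one common way an L1 penalty is applied during optimization, shown here as a single step rather than a full training loop.

```python
def l1_penalty(weights, lam):
    """Penalty term added to the cost: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def soft_threshold(w, threshold):
    """Shrink a weight toward zero; weights inside the threshold become 0."""
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0


weights = [3.0, -0.05, 0.02, -1.5]
penalty = l1_penalty(weights, lam=0.1)

# One shrinkage step: the two small weights are discarded entirely.
shrunk = [soft_threshold(w, 0.1) for w in weights]
print(penalty, shrunk)
```

The large weights lose a fixed amount, while the small ones are set exactly to zero, which is where the sparse model comes from.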
In L2 regularization, the weights are penalized based on the sum of their squares rather than their absolute values (L2 because it uses the squares rather than the values). L2 regularization tends to produce a dense model by lowering the weights without removing any of them. Hence the model is often more accurate, but the computational performance is not as good.
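A matching sketch for the L2 penalty is shown below. The names `l2_penalty` and `l2_shrink`, the strength `lam`, the learning rate, and the example weights are again illustrative assumptions. The step applies only the gradient of the penalty term, 2 * lam * w, to show its effect in isolation.

```python
def l2_penalty(weights, lam):
    """Penalty term added to the cost: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l2_shrink(w, lam, learning_rate):
    """One gradient step on the penalty alone; its gradient is 2*lam*w."""
    return w - learning_rate * 2.0 * lam * w


weights = [3.0, -0.05, 0.02, -1.5]
penalty = l2_penalty(weights, lam=0.1)

# One shrinkage step: every weight is scaled down by the same factor.
shrunk = [l2_shrink(w, lam=0.1, learning_rate=0.5) for w in weights]
print(penalty, shrunk)
```

Because the shrinkage is proportional to the weight itself, large weights are pulled down the most and small weights get smaller but do not reach zero, which is why the resulting model stays dense.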