Linear Regression


Linear Regression is the most basic form of Regression. It is too simple to be of much use in modern machine learning applications, but it is the best way to understand the underlying concepts. This is how it works:

Consider a very simple world, where a person's salary depends only on years of experience and performance. If someone wants to understand this relation, this is what one would do:

Essentially, it involves five steps:

  1. Collect information from across the industry - samples of salary, years of experience, and performance measured in terms of successful deliveries.
  2. Try to identify the relation in this data.
  3. Test this relation with some more data and make the required changes.
  4. Repeat step 3 till you are satisfied with the results.
  5. Conclude that the relation is established and use it going forward for making predictions about the expected salary.

Now let's look at this process in more detail. Assume you have successfully gathered the required data - that is the first step. Linear Regression then helps you with steps 2, 3 and 4. You start by assuming some 'Linear' relation:

Salary = a * experience + b * performance + c

Here, Salary is the output; Experience and Performance are the inputs; a, b and c are the parameters. Now the question boils down to identifying the values of a, b and c. We know there is some relation between the inputs and the output, but identifying that relation through logical deduction alone is nearly impossible - it would require exhaustive analysis. Regression helps you identify this relation - not through logic, but by learning.
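
As a rough sketch, this hypothesis can be written as a small Python function. The names below (predict_salary, experience, performance) are purely illustrative; they are not part of any standard library.

def predict_salary(experience, performance, a, b, c):
    # Linear hypothesis: salary = a * experience + b * performance + c
    return a * experience + b * performance + c

# For example, with a=5, b=4, c=8, 3 years of experience and 2 deliveries:
# predict_salary(3, 2, 5, 4, 8) returns 31.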

Learning begins with a hypothesis. You propose a particular solution to the problem - based on what you already know, or just a random guess. Naturally, a hypothesis based on previous knowledge speeds things up. But when you know nothing about the topic, a random guess is not bad either. A better word for this guess is hypothesis. So, we start with some hypothesis:

Salary = a0 * experience + b0 * performance + c0

Here, a0, b0 and c0 are arbitrary random numbers that we start with. We choose random numbers rather than zeros because that gives us a more relevant starting point. Of course, this is not the correct solution. It is just a hypothesis - one that we will verify and refine as we move on. In fact, a real-life problem will almost never have a perfect solution in the form of a numerical equation. We call a solution good if its error is minimal, or perhaps just below a given threshold - good enough for our purpose.

Now, we test this hypothesis. There are many ways to test a hypothesis. Brute force is the easiest to understand, so we will start with that. We evaluate each record in the 'Training Set' to identify the error, and then compute the 'Mean Square Error' - a measure of how good or bad our hypothesis is. The Cost of a hypothesis is a cumulative function of these errors. It could be a simple numeric mean; as we progress to more advanced algorithms, the derivation of the cost gets more and more complex.

Cost = Sum((a0 * e(i) + b0 * p(i) + c0 - s(i))**2) / N

Here, e(i), p(i) and s(i) are the experience, performance and salary of the i-th training record, and N is the number of samples. This is the most basic formula for the error. As you go along, you will come across more meaningful and complicated ways of calculating it.
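
To make the cost formula concrete, here is a minimal sketch in Python. The training set is invented purely for illustration, and mean_square_error is just an illustrative helper name.

# Hypothetical training set - the numbers are invented for illustration only.
experience  = [2.0, 5.0, 7.0, 10.0]     # e(i): years of experience
performance = [3.0, 4.0, 6.0, 8.0]      # p(i): successful deliveries
salary      = [30.0, 49.0, 67.0, 90.0]  # s(i): observed salary

def mean_square_error(a, b, c, e, p, s):
    # Cost = Sum((a * e(i) + b * p(i) + c - s(i))**2) / N
    n = len(s)
    return sum((a * e[i] + b * p[i] + c - s[i]) ** 2 for i in range(n)) / n

# Cost of an arbitrary starting hypothesis (a0, b0, c0):
print(mean_square_error(1.0, 1.0, 0.0, experience, performance, salary))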

This error tells us how bad our hypothesis is. If we are extremely lucky, it could be 0 right away. But Murphy's law says that will not be the case. So we have to adjust our parameters - a, b, c - to a1, b1, c1. We go through this correction again to get a2, b2, c2, and so on... till we get values of a, b, c such that the error is within the threshold we need - and the problem is solved! That sounds too simple. But how do we get a1, b1, c1 or a2, b2, c2...? The error tells us that our hypothesis is not correct. Not just that - it also guides us with the direction and amount of change required for each parameter. This is identified using partial derivatives. Remember high school calculus? Machine Learning is full of calculus and linear algebra. Check out this Calculus Tutorial if you would like a refresher.

The derivative shows us the direction in which we need to move, and by how much. The higher the derivative, the further away we are from the ideal, and the larger the steps we need to take towards it. The ideal point is where the derivatives are 0 - that is where the error is at its minimum possible level. The mean square error is extremely high when the values of a, b, c are very low, and also when they are very high; so the optimum lies somewhere in between. Since our hypothesis is a linear equation, the mean square error is a quadratic function - with a single minimum. That simplifies the task of finding it. We calculate the partial derivatives of the error expression with respect to a, b and c - call them da, db and dc. Based on these, we pick the next values:

a1 = a0 - alpha * da
b1 = b0 - alpha * db
c1 = c0 - alpha * dc

Here alpha is the learning rate. It is some positive number that we choose based on our experience with machine learning. There are many such Hyperparameters in the machine learning domain, and choosing them well is very important - hyperparameters often make the difference between successful and unsuccessful machine learning projects. The partial derivative of f(x, y, z) with respect to x can be approximated as

(f(x + d, y, z) - f(x - d, y, z)) / (2 * d)    # for very small values of d.
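
As a small illustration of this central-difference formula, here is a Python sketch with a made-up function f; nothing in it is specific to our regression problem.

def partial_derivative(f, x, y, z, wrt="x", d=1e-6):
    # Central difference: (f(. + d, ...) - f(. - d, ...)) / (2 * d)
    if wrt == "x":
        return (f(x + d, y, z) - f(x - d, y, z)) / (2 * d)
    if wrt == "y":
        return (f(x, y + d, z) - f(x, y - d, z)) / (2 * d)
    return (f(x, y, z + d) - f(x, y, z - d)) / (2 * d)

# Example: f(x, y, z) = x**2 + y * z, so df/dx = 2x
f = lambda x, y, z: x ** 2 + y * z
print(partial_derivative(f, 2.0, 3.0, 4.0, wrt="x"))  # roughly 4.0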

Applying this to our cost function (or differentiating it directly), we get:

da = 2 * sum( (a0 * e(i) + b0 * p(i) + c0 - s(i)) * e(i) ) / N
db = 2 * sum( (a0 * e(i) + b0 * p(i) + c0 - s(i)) * p(i) ) / N
dc = 2 * sum( (a0 * e(i) + b0 * p(i) + c0 - s(i)) ) / N

Thus, at iteration k,

a(k) = a(k-1) - alpha * 2 * sum( (a(k-1) * e(i) + b(k-1) * p(i) + c(k-1) - s(i)) * e(i) ) / N

and b(k), c(k) are updated in the same way using db and dc.
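
Here is a minimal Python sketch of one such correction step, using the invented training set from the earlier sketch; the learning rate alpha is an arbitrary illustrative choice.

# Hypothetical training set (same invented numbers as before).
experience  = [2.0, 5.0, 7.0, 10.0]
performance = [3.0, 4.0, 6.0, 8.0]
salary      = [30.0, 49.0, 67.0, 90.0]

def gradients(a, b, c, e, p, s):
    # Partial derivatives of the mean square error with respect to a, b, c.
    n = len(s)
    errors = [a * e[i] + b * p[i] + c - s[i] for i in range(n)]
    da = 2 * sum(errors[i] * e[i] for i in range(n)) / n
    db = 2 * sum(errors[i] * p[i] for i in range(n)) / n
    dc = 2 * sum(errors) / n
    return da, db, dc

# One correction step: (a0, b0, c0) -> (a1, b1, c1).
alpha = 0.005                       # arbitrary small learning rate
a0, b0, c0 = 1.0, 1.0, 0.0          # starting hypothesis
da, db, dc = gradients(a0, b0, c0, experience, performance, salary)
a1, b1, c1 = a0 - alpha * da, b0 - alpha * db, c0 - alpha * dc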

The process of predicting the salaries based on the current set of parameters is called forward propagation, and the process of correcting the parameters based on the error is called backward propagation. One complete cycle of forward and backward propagation is called an epoch. Each epoch should take us closer to the optimum, provided the hyperparameters are correctly chosen. This process is called Regression, and since we used a linear hypothesis, we call it Linear Regression. As we train the parameters over several epochs, the error can drop below the threshold. But this is only the beginning. We need to do a lot more to get things working well - we can run into problems like overfitting, underfitting or getting stuck in a local optimum.
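
Putting the pieces together, a complete training loop might look like the sketch below. Everything in it - the data, the starting values, the learning rate, the threshold and the epoch count - is an arbitrary illustrative choice, and real data would normally be scaled first.

def train(e, p, s, alpha=0.005, epochs=10000, threshold=1.0):
    # Repeat forward propagation (predict and measure the cost) and
    # backward propagation (adjust a, b, c) for a number of epochs,
    # stopping early if the cost drops below the threshold.
    n = len(s)
    a, b, c = 0.5, 0.5, 0.0              # arbitrary starting hypothesis
    cost = None
    for epoch in range(epochs):
        # Forward propagation: errors and cost of the current hypothesis.
        errors = [a * e[i] + b * p[i] + c - s[i] for i in range(n)]
        cost = sum(err ** 2 for err in errors) / n
        if cost < threshold:
            break
        # Backward propagation: move each parameter against its derivative.
        da = 2 * sum(errors[i] * e[i] for i in range(n)) / n
        db = 2 * sum(errors[i] * p[i] for i in range(n)) / n
        dc = 2 * sum(errors) / n
        a, b, c = a - alpha * da, b - alpha * db, c - alpha * dc
    return a, b, c, cost

# Hypothetical training set - numbers invented purely for illustration.
experience  = [2.0, 5.0, 7.0, 10.0]
performance = [3.0, 4.0, 6.0, 8.0]
salary      = [30.0, 49.0, 67.0, 90.0]
print(train(experience, performance, salary))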

Obviously, very few real-life applications are simple enough to fit a straight line. Most of them need a lot more than this. For these, we have Polynomial Regression, Logistic Regression, Neural Networks and many other types of input-output relations that can be used as a model for the hypothesis. But the core concept of regression remains the same. Of course, the calculations get more and more complex along with the hypothesis.