1. Cost Function in Linear Regression: A Clear Guide

In computer science and mathematics, the cost function for linear regression, also called the loss function or objective function, is the function used to quantify the difference between the values predicted by the model and the actual values.

In the previous article, we looked at linear regression in detail, with sales prediction and an automobile consulting company case study as the goal. You can follow this in my previous article on Linear Regression using Python with an automobile company case study.

Welcome to the module on the Cost function.

Consider a situation in which you are trying to solve a classification problem, i.e. to classify data into categories. Suppose the data pertains to the weight and height of two different categories of fish, denoted by red and blue points in the scatter plot below.

In fact, all three classifications have high accuracy, but the third is the best solution: it classifies all the points perfectly, and its line lies almost exactly between the two groups.

This is where the concept of a cost function comes in. Algorithms leverage a cost function to reach an optimal solution. The aim of this module is to understand how to minimize or maximize a cost function using an algorithm.

What is a Cost Function?

In machine learning, the cost function is a mathematical function that measures the performance of the model. In other words, it quantifies the difference between the predicted output and the actual output of the model.

Let’s say we want to predict the salary of a person based on their experience. The table below is just made-up data.

Now let’s make a scatter plot of these data points. We need to fit a straight line through them that is the best-fit line. In the diagram below, take the point (6, 6) and consider the given straight line.

The line is Y = mX + c. At a point Xi we have a value Yi, which comes from the data set, and a predicted value Ypred = mXi + c. We would now like to define a cost function based on the difference between Yi and Ypred, namely (Yi - Ypred)² (remember the residual and RSS).

This is what we would like to minimize over all the points in the data set: we take this squared-error term, sum it over all the data points, and minimize the sum, which is:

Σ (Yi - (mXi + c))², summed over all data points i.

This gives us a cost function that we would like to minimize. To put it in perspective: using this equation, we want to find the ‘m’ and ‘c’ for which the sum above is minimum, because that gives us the best-fit line (the straight line for which the error is minimum).
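To make the idea concrete, here is a minimal sketch that evaluates the sum of squared errors for two candidate lines. The data values and the function name `sum_squared_errors` are made up for this illustration and are not the values from the table above.

```python
# Hypothetical experience/salary pairs, invented for this sketch
# (not the values from the article's table).
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 3, 2, 5, 4, 6]


def sum_squared_errors(m, c, xs, ys):
    """Sum of (Yi - Ypred)^2 over all points, with Ypred = m * Xi + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Two candidate lines: a flat line at the mean of ys, and the line Y = X.
flat_line = sum_squared_errors(0.0, 3.5, xs, ys)
diagonal = sum_squared_errors(1.0, 0.0, xs, ys)

print(flat_line, diagonal)  # 17.5 4.0
```

The line with the smaller value fits these points better. Finding the best-fit line means finding the m and c that make this value as small as possible.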

Now the question is how to minimize this. Very simply: recall your high school math (differentiation).

For example, with one variable.

With two variables.
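As a sketch of both cases, the functions below are chosen here purely for illustration and are not necessarily the ones used in the original figures. In each case we differentiate, set the derivative to zero, and solve.

```latex
% One variable (illustrative function)
f(x) = x^2 - 4x + 7
\quad\Rightarrow\quad
\frac{df}{dx} = 2x - 4 = 0
\quad\Rightarrow\quad
x = 2, \qquad f(2) = 3

% Two variables (illustrative function)
g(x, y) = x^2 + y^2 - 2x - 6y + 12

\frac{\partial g}{\partial x} = 2x - 2 = 0 \;\Rightarrow\; x = 1,
\qquad
\frac{\partial g}{\partial y} = 2y - 6 = 0 \;\Rightarrow\; y = 3,
\qquad
g(1, 3) = 2
```

With two variables we take one partial derivative per variable, which gives one equation per unknown.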

So basically, what we have done is find the values that minimize the given function. Now let's turn to our own equation.

If we differentiate the cost function with respect to (w.r.t.) m and c and set each derivative to zero, we get two linear equations; check the calculation below.
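Written out, the differentiation step looks roughly like this. The symbol J for the cost function and n for the number of data points are naming assumptions made for this sketch.

```latex
J(m, c) = \sum_{i=1}^{n} \bigl( Y_i - (m X_i + c) \bigr)^2

\frac{\partial J}{\partial m}
  = -2 \sum_{i=1}^{n} X_i \bigl( Y_i - m X_i - c \bigr) = 0
\qquad
\frac{\partial J}{\partial c}
  = -2 \sum_{i=1}^{n} \bigl( Y_i - m X_i - c \bigr) = 0

% Rearranged into two linear equations in m and c:
m \sum_{i=1}^{n} X_i^2 + c \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} X_i Y_i
\qquad
m \sum_{i=1}^{n} X_i + n\,c = \sum_{i=1}^{n} Y_i
```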

Now check the implementation below, where we plug in our data points and calculate.
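A minimal sketch of such a calculation is shown here. It solves the two linear equations obtained by setting the derivatives w.r.t. m and c to zero, using Cramer's rule. The data values and the function name `fit_line` are hypothetical, invented for this example rather than taken from the article's table.

```python
# Hypothetical experience/salary pairs, invented for this sketch
# (not the values from the article's table).
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 3, 2, 5, 4, 6]


def fit_line(xs, ys):
    """Solve the two linear equations for m and c.

    m * sum(x^2) + c * sum(x) = sum(x * y)
    m * sum(x)   + c * n      = sum(y)
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Determinant of the 2x2 system; zero when all x values are equal.
    det = sum_xx * n - sum_x * sum_x
    if det == 0:
        raise ValueError("cannot fit a line: all x values are identical")

    # Cramer's rule for the two unknowns.
    m = (sum_xy * n - sum_x * sum_y) / det
    c = (sum_xx * sum_y - sum_x * sum_xy) / det
    return m, c


m, c = fit_line(xs, ys)
print(round(m, 4), round(c, 4))  # 0.8857 0.4
```

For these made-up points the fitted line is roughly Y = 0.886X + 0.4.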

So, we managed to solve for m and c and find the straight line that fits our data points. Now, if we put the values of m and c into the equation below, we get the regression line.

When fitting a straight line, the cost function was the sum of squared errors, but the cost function varies from algorithm to algorithm. To minimize the sum of squared errors and find the optimal m and c, we differentiated it w.r.t. the parameters m and c and then solved the resulting linear equations to obtain the values of m and c. In most cases, you will have to minimize the cost function.

Differentiate the function w.r.t. the parameter and equate the derivative to 0.

For minimization: the second derivative at that point should be greater than 0.
For maximization: the second derivative at that point should be less than 0.
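The second-derivative check can be sketched numerically. The snippet below estimates the second derivative with a central finite difference. The helper names, step size, tolerance, and the two example functions are all assumptions made for this illustration.

```python
def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of the second derivative of f at x."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


def classify_stationary_point(f, x, tol=1e-6):
    """Classify a point where the first derivative is already known to be 0."""
    d2 = second_derivative(f, x)
    if d2 > tol:
        return "minimum"
    if d2 < -tol:
        return "maximum"
    return "inconclusive"  # second derivative is (numerically) zero


def bowl(x):
    return (x - 2) ** 2 + 3   # first derivative is 0 at x = 2


def hill(x):
    return -(x - 2) ** 2 + 3  # first derivative is 0 at x = 2


print(classify_stationary_point(bowl, 2.0))  # minimum
print(classify_stationary_point(hill, 2.0))  # maximum
```

When the second derivative is zero, this test alone does not decide the case, which is why the sketch returns "inconclusive" instead of guessing.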

Types of minimization or maximization

Constrained
Unconstrained

I will not go into detail about constrained minimization and maximization, since it is not used much in machine learning apart from SVM (support vector machines). For more detail about constrained optimization, you can follow this link.

But I will give you some intuition about constrained and unconstrained optimization problems.

Say you go out with your friends after a long time, but everyone has a budget constraint of 1000 Rs. You want to have maximum fun, but within that budget, so you want to maximize something subject to a constraint. This is a constrained maximization problem.

Similarly, in an unconstrained problem you just want to minimize or maximize the output, with no constraints involved. The problem of minimizing the sum of squared errors (RSS), which we have been discussing, places no constraint on the parameters m and c that we are trying to estimate, so it is an unconstrained minimization problem.

A constrained minimization problem imposes conditions or restrictions on the range of values that the parameters can take.

Let’s get an intuition about constrained and unconstrained problems. For example, for a given function (see the image below), suppose there is a constraint that x can only take values greater than or equal to b. The unconstrained minimum lies at x = a = 0, but x cannot take that value because of the constraint, so the minimum of the cost function is attained at x = b.

So, with this constraint, the minimum value the cost function can reach for the given equation is 4 (four), whereas without the constraint it would be 0 (zero).
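The original figure is not reproduced here, so the sketch below assumes, purely for illustration, the cost function f(x) = x² with the constraint x ≥ 2. That choice is consistent with the values 4 and 0 quoted above, but it is a guess and not necessarily the article's actual function. The function names are also hypothetical.

```python
def cost(x):
    """Assumed cost function for this sketch: f(x) = x**2."""
    return x ** 2


def constrained_minimizer(lower_bound):
    """Minimizer of x**2 subject to x >= lower_bound.

    x**2 decreases until x = 0 and increases afterwards, so the best
    feasible point is 0 when 0 is allowed, otherwise the bound itself.
    """
    return max(0.0, lower_bound)


x_free = 0.0                          # unconstrained: f'(x) = 2x = 0
x_bound = constrained_minimizer(2.0)  # constrained: x >= 2 (assumed b = 2)

print(cost(x_free), cost(x_bound))  # 0.0 4.0
```

The constraint removes the point where the derivative is zero from the allowed range, so the best remaining choice sits on the boundary.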

In this way, we have two possible solutions, depending on whether the problem is constrained or unconstrained.

We saw an example of optimization using differentiation. There are two ways to go about unconstrained optimization.

Differentiation
Using Gradient descent

We will look at gradient descent in the next blog. For now, that is pretty much it about the cost function.

Recall

Recall that the equation of the line fitted to the data in linear regression is given as:

y = β0 + β1x

Here β0 is the intercept of the fitted line and β1 is the coefficient of the independent variable x. As discussed above, we can similarly calculate the values of β0 and β1 through differentiation.
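For reference, setting the derivatives of the sum of squared errors w.r.t. β0 and β1 to zero and solving gives the standard closed form below. The bar notation for the sample means and n for the number of data points are notational assumptions introduced here.

```latex
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
               {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\beta_0 = \bar{y} - \beta_1 \bar{x},
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
```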

OK, that’s it, we are done now. If you have any questions or suggestions, please feel free to comment. I’ll come up with more Machine Learning and Data Engineering topics soon.
