
Cost Function in Logistic Regression

In machine learning, the cost function is a mathematical function that measures the performance of a model. In other words, it quantifies the difference between the model's predicted output and the actual output.

We spent a good amount of time understanding the decision boundary in the previous blog, Logistic Regression for Machine Learning using Python, including how to overcome the problem of the sharp, step-like curve by working with probabilities.

In the logistic regression model, the output of the classifier lies between 0 and 1.

Figure: the logistic regression hypothesis, hθ(x) = g(θᵀx), whose output lies between 0 and 1.

To establish this hypothesis, we use the sigmoid function, also called the logistic function.

Fig-1: the sigmoid (logistic) function, g(z) = 1 / (1 + e^(−z)).
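As a quick illustration, here is a minimal NumPy sketch of the sigmoid function (the code and names here are my own, not from the original post):

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs map near 0, large positive inputs near 1.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approx [0.0067, 0.5, 0.9933]
```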

So let's fit the parameters θ of the logistic regression model.

Likelihood Function

Let's say we have a dataset X with m data points. Logistic regression models the probability of the outcome as below.

Fig-2: the probability model, P(yᵢ = 1 | xᵢ; θ) = hθ(xᵢ) and P(yᵢ = 0 | xᵢ; θ) = 1 − hθ(xᵢ).

This follows from the basic probability rule: if the probability of the success event is P, then the probability of the failure event is (1 − P). That is what the two cases of yᵢ indicate above.

This can be combined into a single form as below.

Fig-3: the likelihood of a single data point, P(yᵢ | xᵢ; θ) = hθ(xᵢ)^yᵢ · (1 − hθ(xᵢ))^(1 − yᵢ).

This asks: what is the probability of observing yᵢ for a given xᵢ, i.e. P(yᵢ | xᵢ)?

The likelihood of the entire dataset X is the product of the likelihoods of the individual data points. For a given event such as a coin toss with outcomes H or T: if the probability of H is P, then the probability of T is (1 − P).

So the likelihood of these events, taken together over the whole dataset, is:

Fig-4: the logistic regression likelihood, L(θ) = ∏ᵢ hθ(xᵢ)^yᵢ · (1 − hθ(xᵢ))^(1 − yᵢ).
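As a hedged sketch (the toy data and function names below are my own, purely for illustration), this is how that product likelihood could be evaluated:

```python
import numpy as np

def likelihood(theta, X, y):
    """Product over all points of p^y * (1 - p)^(1 - y), with p = sigmoid(X @ theta)."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return np.prod(p**y * (1.0 - p)**(1.0 - y))

# Toy dataset: one feature plus a bias column, binary labels.
X = np.array([[0.5, 1.0], [1.5, 1.0], [-1.0, 1.0], [2.0, 1.0]])
y = np.array([1, 1, 0, 1])
print(likelihood(np.array([1.0, -0.5]), X, y))
```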

Now the principle of maximum likelihood says we need to find the parameters θ that maximize this likelihood L(θ). Recall the odds and log-odds.

Fig-5: odds and log-odds, where odds = p / (1 − p) and log-odds = log(p / (1 − p)).

So, as we can see, after taking the log we end up with a linear equation.

log(p / (1 − p)) = θᵀx

So, in order to get the parameters θ of the hypothesis, we can either maximize the likelihood or minimize the cost function.

Now we can take the log of the logistic regression likelihood equation above, which converts the product into a sum and connects it back to the log-odds.

Fig-6: maximum likelihood estimation. The log-likelihood is ℓ(θ) = Σᵢ [ yᵢ log hθ(xᵢ) + (1 − yᵢ) log(1 − hθ(xᵢ)) ].

MLE stands for maximum likelihood estimation.
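In code, taking the log simply turns the product into a sum, which is numerically much more stable. A sketch under the same toy assumptions as before:

```python
import numpy as np

def log_likelihood(theta, X, y):
    """Sum over all points of y*log(p) + (1 - y)*log(1 - p)."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# exp(log_likelihood(...)) recovers the product likelihood from before.
```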

What is Cost Function?

As defined above, the cost function measures how far the model's predicted outputs are from the actual outputs.

I would recommend first checking this blog on The Intuition Behind Cost Function.

In logistic regression, we create a decision boundary, and this gives us a better sense of what the logistic regression function is computing.

Fig-7: the logistic regression decision boundary.

As we know, the cost function for linear regression is the residual sum of squares.

J(θ) = Σᵢ (hθ(xᵢ) − yᵢ)²  (the residual sum of squares)

We can also write it as below, taking half of the squared error averaged over the m observations.

Fig-8: J(θ) = (1/2m) Σᵢ (hθ(xᵢ) − yᵢ)², shown with a convex cost curve (linear hypothesis) and a non-convex one (sigmoid hypothesis).

As we can see, in logistic regression hθ(x) is nonlinear (the sigmoid function), while for linear regression the cost function is convex, with a single global minimum. With the nonlinear hypothesis, the squared-error cost can have multiple local minima rather than one global minimum.

So, to overcome this problem of local minima and obtain the global minimum, we can define a new cost function, taking the same reference we saw in the likelihood.

This cost function is known as the cross-entropy loss, or log loss. Cross-entropy loss measures the performance of a classification model whose output is a probability between 0 and 1.

So the cost function is as below.

Cost(hθ(x), y) = −log(hθ(x)) if y = 1, and −log(1 − hθ(x)) if y = 0.
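To see the intuition, a confident correct prediction costs almost nothing, while a confident wrong prediction is penalized heavily. A small illustrative computation (mine, not from the post):

```python
import numpy as np

# Cost for a single prediction h when the true label is y = 1: -log(h)
for h in [0.99, 0.5, 0.01]:
    print(f"h = {h:.2f}  cost = {-np.log(h):.3f}")
# h = 0.99 -> 0.010 (confident and correct: tiny cost)
# h = 0.01 -> 4.605 (confident and wrong: large cost)
```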

Now we can put this expression into the cost function of Fig-8.

Fig-9: the relation between the cost function and the likelihood function.

As we can see in Fig-9, L(θ) is the log-likelihood function, so we can establish a relationship between the cost function and the log-likelihood function. You can check out Maximum likelihood estimation in detail.

Maximizing L(θ) is equivalent to minimizing −L(θ), and using the average cost over all data points, our cost function becomes:

J(θ) = −(1/m) Σᵢ [ yᵢ log hθ(xᵢ) + (1 − yᵢ) log(1 − hθ(xᵢ)) ]  (the logistic regression cost function)
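One way this cost function could be written in NumPy (a sketch; the small epsilon clip is my addition to avoid taking log(0)):

```python
import numpy as np

def cost(theta, X, y, eps=1e-12):
    """Average cross-entropy: J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    h = np.clip(h, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)) / m
```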

Choosing this cost function is a great idea for logistic regression: maximum likelihood estimation is a standard statistical idea for finding efficient parameter estimates for different models, and this cost function is also convex in nature.

Gradient Descent

Now we can minimize this cost function using gradient descent. The main goal of gradient descent is to minimize the cost value, i.e. min J(θ).


Now, to minimize our cost function, we need to run the gradient descent update on each parameter:

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ  (repeated until convergence, updating all j simultaneously)

Gradient descent is an optimization algorithm used to find the values of the parameters. To compute the gradient, we iterate through the data points with the current parameter values θ and compute the partial derivatives; for this cost function the partial derivative works out to ∂J/∂θⱼ = (1/m) Σᵢ (hθ(xᵢ) − yᵢ) xᵢⱼ.
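Putting it together, here is a minimal batch gradient-descent sketch (the learning rate and iteration count are arbitrary illustrative choices), using the vectorized form of that gradient, (1/m) Xᵀ(h − y):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Minimize the logistic regression cost J(theta) by batch gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))  # current predictions
        grad = X.T @ (h - y) / m              # dJ/dtheta
        theta -= lr * grad                    # simultaneous update of all theta_j
    return theta
```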

Additional reading on Gradient descent

Gradient Descent for Logistic Regression Simplified – Step-by-Step Visual Guide


OK, that's it, we are done now. If you have any questions or suggestions, please feel free to comment. I'll come up with more machine learning and data engineering topics soon. Please comment and subscribe if you like my work; any suggestions are welcome and appreciated.

You can subscribe to my YouTube channel
