We have covered a good amount of time in understanding the decision boundary. Check out previous blog Logistic Regression for Machine Learning using Python. And how to overcome this problem of the sharp curve, with probability.
In Logistic regression model the value of classier lies between 0 to 1.
So to establish the hypothesis we also found the Sigmoid function or Logistic function.
So let’s fit the parameter θ for the logistic regression.
So let say we have datasets X with m data-points. Now the logistic regression says, that the probability of the outcome can be modeled as bellow.
Based on the probability rule. If the success event probability is P than fail event would be (1-P). That’s how the Yi indicates above.
This can be combine into single form as bellow.
Which means, what is the probability of Xi occurring for given Yi value P(x|y).
The likelihood of the entire datasets X is the product of an individual data point. Which means forgiven event (coin toss) H or T. If H probability is P then T probability is (1-P).
So, Likelihood of these two event is.
Now the principle of maximum likelihood says. we need to find the probability that maximizes the likelihood P(X|Y). Recall the odds and log-odds.
So as we can see now. After taking a log we can end up with linear equation.
So in order to get the parameter θ of hypothesis. We can either maximize the likelihood or minimize the cost function.
Now we can take a log from the above logistic regression likelihood equation. Which will normalize the equation into log-odds.
I would recommend first check this blog on The Intuition Behind Cost Function.
In logistic regression, we create a decision boundary. And this will give us a better seance of, what logistic regression function is computing.
As we know the cost function for linear regression is residual sum of square.
We can also write as bellow. Taking the half of the observation.
As we can see in logistic regression the H(x) is nonlinear (Sigmoid function). And for linear regression, the cost function is convex in nature. For linear regression, it has only one global minimum. In nonlinear, there is a possibility of multiple local minima rather the one global minima.
So to overcome this problem of local minima. And to obtain global minima, we can define new cost function. We will take the same reference as we saw in Likelihood.
Cross entropy loss or log loss or logistic regression cost function. cross entropy loss measure the performance of the classification model. And the output is probability value between 0 to 1
So the cost function is as bellow.
Now we can put this expression into Cost function Fig-8.
As we can see L(θ) is a log-likelihood function in Fig-9. So we can establish a relation between Cost function and Log-Likelihood function. You can check out Maximum likelihood estimation in detail.
Maximization of L(θ) is equivalent to min of -L(θ), and using average cost over all data point, out cost function would be.
Choosing this cost function is a great idea for logistic regression. Because Maximum likelihood estimation is an idea in statistics to finds efficient parameter data for different models. And it has also the properties that are convex in nature.
Now we can reduce this cost function using gradient descent. The main goal of Gradient descent is to minimize the cost value. i.e. min J(θ).
Now to minimize our cost function we need to run the gradient descent function on each parameter i.e.
Additional reading on Gradient descent
Gradient descent is an optimization algorithm used to find the values of the parameters. To solve for the gradient, we iterate through our data points using our new m and b values and compute the partial derivatives.
OK, that’s it, we are done now. If you have any questions or suggestions, please feel free to reach out to me. I’ll come up with more Machine Learning topic soon.