**In** the last article, we saw the cost function in detail: **Y = β0 + β1x**, where **β0** is the intercept of the fitted line and **β1** is the coefficient for the independent variable **x**. The challenge now is how to find β0 and β1 during the training procedure: β0 is the intercept and β1 is the slope of the straight line.

In this context, we need to talk about the optimization method that will help us learn **β0** and **β1**. There are various methods to find optimal parameters, but in this section we will talk about **gradient descent**, which is an iterative method.

As we can see in **Fig-1**, we have a bunch of data points. We want to fit a model (**red line**); during training we want to find the optimal **β0, β1**, given the inputs {Xi, Yi}. These are our objectives. We will achieve them through an optimization method called **Gradient Descent.** But before jumping into that, we need one intermediate step called the **Cost Function**.

To find the best β0, β1, we need to minimize the cost function over all data points.
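For concreteness, the cost we are minimizing can be written in a few lines of Python (a minimal sketch, assuming the mean-squared-error cost from the previous article; the names are illustrative):

```python
# Mean squared error cost for a candidate line y = b0 + b1 * x
def mse_cost(b0, b1, xs, ys):
    n = len(ys)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / n

xs, ys = [1, 2, 3], [2, 4, 6]
print(mse_cost(0, 2, xs, ys))  # → 0.0 (perfect fit)
print(mse_cost(0, 1, xs, ys))  # larger cost for a worse fit
```

The better the line fits the data, the smaller this number gets; gradient descent is simply a procedure for driving it down.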

### How to minimize the cost function

There are two types of optimization methods:

- Closed-form solution
- Iterative form solution

I covered the closed-form solution in my previous blog, The Intuition Behind Cost Function. It is nothing but high-school math (**differentiation**): to find the optimal minimum, take the derivative and equate it to 0.
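As a sketch of what that calculus gives us: setting the derivatives of the MSE cost to zero and solving yields the familiar closed-form formulas for simple linear regression (the function name here is illustrative):

```python
# Closed-form least squares for y = b0 + b1 * x:
# differentiate the MSE cost w.r.t. b0 and b1, set both derivatives to zero,
# and solve the resulting pair of linear equations.
def closed_form_fit(xs, ys):
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    b1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
         / sum((x - x_mean) ** 2 for x in xs)
    b0 = y_mean - b1 * x_mean
    return b0, b1

print(closed_form_fit([1, 2, 3], [2, 4, 6]))  # → (0.0, 2.0)
```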

The iterative form solution also has two parts:

- First-order – also referred to as gradient descent
- Second-order – Newton's method

So, gradient descent is a first-order iterative solution. To compute the optimal parameters, we apply gradient descent to the cost function.
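In code, a single first-order update is just one line: move the parameter against the gradient, scaled by a learning rate (a generic sketch, not specific to any one cost function):

```python
# One generic first-order update: theta_new = theta - eta * dJ/dtheta
def gradient_step(theta, grad, eta):
    return theta - eta * grad

# For J(theta) = theta**2 the gradient at theta = 10 is 2 * 10 = 20
print(gradient_step(10.0, 20.0, 0.1))  # → 8.0
```

Repeating this step is all gradient descent does; the rest of the article is about choosing the direction (the gradient) and the step size (η) well.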

#### Gradient Descent Intuition

Gradient descent is one of the most popular algorithms to perform optimization. Gradient descent is an iterative method of optimization of an objective function, in our case the cost function.

Consider a person **(A)** in Fig-5 who is walking down a hill, and whose destination is point B. He cannot see point B from point A. So the possible actions are to go upward or downward, and he might take a big step or a small step toward his destination.

**Which way to go and how big a step to take.** These are the two things you need to know to reach point B.

The gradient descent algorithm will help you make this decision. Now let's introduce a calculus term called the derivative. It is all about slope: the slope of the graph at a particular point. Follow this link.

Now let's understand the math in the example below. Don't get confused: **β0, β1 are the same as θ0, θ1**.

We have the function **Y=x^2** and will start at the point **θ0** = 10; now we take a step in the negative (green) direction. The learning rate is **η = 0.1**, and the gradient comes out to 20 (the derivative 2x evaluated at x = 10), as we can see below.

To compute θ1, the update equation looks like this.

Now we have moved in the direction where the cost function becomes smaller, and we arrive at position **8**. If we keep taking these steps, we will reach the minimum.
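These steps can be reproduced numerically for Y = x², starting at θ = 10 with η = 0.1 (the derivative of x² is 2x):

```python
# Repeated gradient descent steps on Y = x**2, starting at theta = 10
theta, eta = 10.0, 0.1
trace = [theta]
for _ in range(4):
    grad = 2 * theta              # derivative of x**2 at the current theta
    theta = theta - eta * grad    # step against the gradient
    trace.append(theta)
print(trace)  # roughly [10, 8, 6.4, 5.12, 4.096]
```

Each step shrinks θ by the same factor (1 − 2η), so the iterates slide steadily toward the minimum at 0.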

### Context

We may all be wondering why we compute with this complex technique. Why can't we use a **closed-form solution** to calculate **θ0, θ1** and reach the solution directly? With gradient descent, we also have to choose the **η** parameter (learning rate).

To understand the context, suppose we have a one-dimensional function, which is a very easy case. Here, applying gradient descent takes a lot of effort to reach the local minimum, whereas with the closed form we only have to differentiate, equate to zero, and get **θ**. In two dimensions, if we differentiate with respect to the two parameters **θ1**, **θ2**, we get 2 linear equations. We can solve those 2 linear equations and get **θ1**, **θ2**. Simple.

But if you add more parameters (n dimensions), the equations after differentiation are often not linear, so we cannot solve them directly. So, **closed-form solutions** work in simple cases like one or two dimensions. In complex cases, an iterative solution (gradient descent) will reach the local minimum.

To summarize: though gradient descent looks complicated for a **one-dimensional** function, it is easier to compute the optimal minimum with gradient descent for **higher-dimensional** functions. We will use it in Logistic Regression and even for Neural Networks.

### Learning Rate

So far we have seen that the gradient is a vector-valued function: the gradient vector has both a direction and a magnitude. Gradient descent multiplies the gradient by **η** (the learning rate) to determine the next point.

To compute θ1, as we can see in (**Fig-7**), the gradient is multiplied by **η** to get the next point.

Now we will build the intuition for a small learning rate versus a large learning rate.

- What if **η** is big?
- What if **η** is small?

As we can see in **Fig-8**, we have chosen 3 different learning rates: **η**=0.4, **η**=0.1, and **η**=2, respectively.

#### Observation when **η**=0.4

When **η**=0.4, in the first iteration θ1 = 2, a big jump from **10 to 2**. Then 0.4, 0.8, -0.56, -0.1 for the 2nd, 3rd, 4th, and 5th iterations respectively. So, the first iteration took a long jump, but after that the steps became conservative and it reached the local minimum. Although at some points we overshoot the optimal point and need multiple iterations to reach the local minimum, overall we can say this is a good learning rate to start with.

#### Observation when **η**=0.1

When **η**=0.1, as we can see, in the 1st iteration we reach 8 from 10, and then 6.4, 5.12, 4.096, and so on. This shows that over the iterations it converges to the local minimum without oscillating. With this observation, we can say **η**=0.1 is the best of these learning rates for this cost function.

#### Observation when **η**=2

When **η**=2, which is a large learning rate compared to the other two, in the first iteration we get -30: starting from 10, we jump to the point -30, a big jump. Similarly, in each following iteration we get 90, -270, and 810 in the 2nd, 3rd, and 4th iterations respectively.

So with a larger learning rate, we can miss the optimal point, and gradient descent fails to reach the local minimum. As we can see, over each iteration θ becomes larger in magnitude and diverges away from the local minimum.
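The three behaviours can be checked numerically on Y = x², starting from θ = 10 (a sketch; the η = 0.4 values in the figure may come from a slightly different cost function, so only the qualitative behaviour is compared):

```python
# Compare three learning rates on Y = x**2, starting from theta = 10
def descend(eta, theta=10.0, steps=4):
    for _ in range(steps):
        theta = theta - eta * (2 * theta)  # gradient of x**2 is 2x
    return theta

print(descend(0.1))  # converges steadily toward 0
print(descend(0.4))  # converges quickly toward 0
print(descend(2.0))  # diverges: the magnitude grows every step
```

For this function each step multiplies θ by (1 − 2η), so convergence requires |1 − 2η| < 1; at η = 2 that factor is −3, which is exactly the 10 → −30 → 90 → −270 → 810 sequence above.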

### Observation Summary

So the learning rate is a parameter that needs to be adapted to the problem at hand. When **η** is too small, convergence is too slow, and when **η** is too large, the iterates can oscillate or diverge.

To summarize, a large value of the learning rate may cause your solution to oscillate, and you may skip the optimal solution (the global minimum). So it is good practice to choose a small learning rate and move slowly in the direction of the negative gradient.

## Gradient Descent in Python

So far we have seen how gradient descent works in terms of equations. We took simple 1D and 2D cost functions and calculated **θ0, θ1**, and so on, following multiple iterations to reach an optimal solution.

We will start off by implementing gradient descent for simple linear regression and move forward to perform multiple regression using **gradient descent in Python.**

Follow the Python code for gradient descent at the GitHub link, with the housing dataset.

```
# Implement the gradient descent function.
# Takes in X, y, current m and c (both initialized to 0), number of iterations, learning rate.
# Returns a DataFrame with the updated m, c, and the cost at each iteration.
import pandas as pd

def gradient(X, y, m_current=0, c_current=0, iters=1000, learning_rate=0.01):
    N = float(len(y))
    gd_df = pd.DataFrame(columns=['m_current', 'c_current', 'cost'])
    for i in range(iters):
        y_current = (m_current * X) + c_current               # current predictions
        cost = sum([data**2 for data in (y - y_current)]) / N # mean squared error
        m_gradient = -(2/N) * sum(X * (y - y_current))        # dJ/dm
        c_gradient = -(2/N) * sum(y - y_current)              # dJ/dc
        m_current = m_current - (learning_rate * m_gradient)
        c_current = c_current - (learning_rate * c_gradient)
        gd_df.loc[i] = [m_current, c_current, cost]
    return gd_df
```

The ‘learning_rate’ variable controls the steps we take in a downward direction in each iteration. If ‘learning_rate’ is too low, the algorithm may take longer to reach the minimum value. On the other hand, if it is high, the algorithm may overstep the minimum value.

The ‘m_gradient’ variable corresponds to **∂J/∂m**, and ‘c_gradient’ corresponds to **∂J/∂c**. Moreover, m_current and c_current are the values being updated at each step.
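As a quick sanity check, the function can be run on a tiny synthetic dataset where the true line is known (the data below is made up for illustration and is not the housing dataset; the function is repeated so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# gradient() as defined above, repeated so this snippet is self-contained
def gradient(X, y, m_current=0, c_current=0, iters=1000, learning_rate=0.01):
    N = float(len(y))
    gd_df = pd.DataFrame(columns=['m_current', 'c_current', 'cost'])
    for i in range(iters):
        y_current = (m_current * X) + c_current
        cost = sum([d ** 2 for d in (y - y_current)]) / N
        m_gradient = -(2 / N) * sum(X * (y - y_current))
        c_gradient = -(2 / N) * sum(y - y_current)
        m_current = m_current - (learning_rate * m_gradient)
        c_current = c_current - (learning_rate * c_gradient)
        gd_df.loc[i] = [m_current, c_current, cost]
    return gd_df

X = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * X + 1.0                       # true slope m = 3, intercept c = 1
result = gradient(X, y, iters=2000, learning_rate=0.05)
print(result.iloc[-1])                  # m_current ≈ 3, c_current ≈ 1, cost ≈ 0
```

With enough iterations and a sensible learning rate, the final row recovers the slope and intercept the data was generated with.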

### Footnotes:

Gradient descent is an optimization algorithm used to find the values of the parameters. To take each step, we iterate through our data points and compute the partial derivatives at the current m and c values.

**OK**, that's it, we are done for now. If you have any questions or suggestions, please feel free to reach out to me. I'll come up with more machine learning topics soon.