How Neural Networks Learn using Gradient Descent

In this article, we are going to learn how gradient descent helps neural networks learn to make predictions.

Daksh Bhatnagar
10 min read · Nov 5, 2022

Introduction

Gradient descent is a popular optimization algorithm for training machine learning models and neural networks. These models learn from training data over time, and the cost function acts as a barometer within gradient descent, gauging the model's accuracy after each iteration of parameter updates.

The model will continue to adjust its parameters until the cost function is close to or equal to zero (in practice, until it stops decreasing meaningfully), at which point training stops.

For each type of machine learning problem (regression or classification), you must compute a loss, which is simply a number that indicates how well or poorly the model predicts. The lower the value of the loss function, the better the model.
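To make the idea of a loss concrete, the short sketch below computes the mean squared error for two sets of predictions against the same targets. The arrays and the helper name `mse` are made up for illustration and do not come from the article's notebook.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical targets and two sets of predictions (illustrative values only)
y_true = [3.0, 5.0, 7.0, 9.0]
close_predictions = [2.5, 5.5, 7.0, 8.5]
poor_predictions = [0.0, 9.0, 2.0, 15.0]

loss_close = mse(y_true, close_predictions)
loss_poor = mse(y_true, poor_predictions)

print("Loss of the closer predictions:", loss_close)
print("Loss of the poorer predictions:", loss_poor)
```

The predictions that sit closer to the targets produce the smaller number, which is all a loss needs to do for gradient descent to have something to minimize.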

How does gradient descent work?

You may recall the slope-intercept equation of a line: y = mx + b, where m denotes the slope and b is the intercept on the y-axis.

The formula for calculating the slope (m) and the intercept (b)
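The figure itself is not reproduced here. Assuming it showed the usual least-squares formulas for a line fitted to n points (x_i, y_i), they can be written as follows, where x-bar and y-bar are the means of the inputs and the targets:

```latex
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
```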

You may also remember from math class how the mean squared error formula measures the error between the actual output (y) and the predicted output (y-hat) when fitting a line to a scatter plot.

The gradient descent technique is similar in behavior, but it is based on a convex function, as seen in the first figure above.

Best fit line plot for regression. The x-axis shows the targets (actual values) and the y-axis shows the predictions made by the model.

The starting point is simply an arbitrary place from which we begin to assess performance. From that starting point, we find the derivative (the slope, i.e. the rate of change in y for a given change in x) and use a tangent line to see how steep the slope is.

The slope determines the updates to the parameters, i.e. the weights and bias. The slope will be steeper at the starting point, but as the parameters are updated it should gradually flatten until it reaches the lowest point on the curve, known as the point of convergence.
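The idea can be sketched in a few lines of plain Python. The function f(x) = (x - 3)², the starting point, and the step size are assumptions chosen for illustration; the sketch records the absolute slope at every step to show it shrinking as x approaches the minimum.

```python
def f(x):
    """A simple convex function with its minimum at x = 3."""
    return (x - 3.0) ** 2

def df(x):
    """Derivative (slope of the tangent line) of f at x."""
    return 2.0 * (x - 3.0)

x = 10.0             # arbitrary starting point
learning_rate = 0.1  # step size, chosen for illustration
slopes = []

for step in range(50):
    slope = df(x)
    slopes.append(abs(slope))
    x = x - learning_rate * slope  # move against the slope

print("Final x:", round(x, 4))
print("First slope:", slopes[0], "Last slope:", slopes[-1])
```

Each update moves x a little closer to 3, and because the slope is proportional to the distance from the minimum, the steps become smaller on their own as the curve flattens.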

The goal of gradient descent, like finding the line of best fit in linear regression, is to minimize the cost function, i.e. the error between the predicted and actual values. To do this, it needs two things: a direction and a learning rate.

These factors determine the partial derivative calculations of subsequent iterations, allowing the algorithm to gradually arrive at the local or global minimum (i.e. the point of convergence). More information about these components is provided below:

  • Learning rate (also referred to as step size or alpha): the size of the steps taken to reach the minimum. It is usually a small value, and it is evaluated and updated based on the behavior of the cost function. A high learning rate results in larger steps but risks overshooting the minimum. Conversely, a low learning rate takes smaller, more precise steps, but it needs more iterations, and therefore more time and computation, to reach the minimum.
How different learning rates impact training
  • The cost (or loss) function: measures the difference, or error, between the actual y and the predicted y at the current position. This feedback lets the model adjust its parameters to reduce the error and move toward the local or global minimum. The model keeps iterating in the direction of steepest descent (the negative gradient) until the cost function is close to its minimum, at which point learning stops.
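The effect of the learning rate can be illustrated with a small experiment. The sketch below runs the same number of steps on the convex cost f(x) = x² with three learning rates; the cost function, the starting point, and the three rates are assumptions picked to show the slow, well-behaved, and overshooting cases.

```python
def gradient(x):
    """Derivative of the convex cost f(x) = x**2."""
    return 2.0 * x

def run_descent(learning_rate, start=5.0, steps=20):
    """Return the distance from the minimum (x = 0) after a fixed number of steps."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return abs(x)

small = run_descent(0.01)     # tiny steps: still far from the minimum
moderate = run_descent(0.3)   # reasonable steps: very close to the minimum
large = run_descent(1.1)      # steps too big: overshoots every time and diverges

print("learning rate 0.01 -> distance from minimum:", small)
print("learning rate 0.3  -> distance from minimum:", moderate)
print("learning rate 1.1  -> distance from minimum:", large)
```

With the largest rate every step jumps past the minimum and lands further away than before, which is the overshooting behavior described above.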

Challenges with Gradient Descent

While gradient descent is the most common approach for optimization problems, it does come with its own set of challenges. Some of them include:

  • Local minima and saddle points: For convex problems, gradient descent can find the global minimum with ease, but for nonconvex problems it can struggle to find the global minimum, where the model achieves the best results. When the slope of the cost function is at or close to zero, the model stops learning. Besides the global minimum, this can happen at local minima and saddle points. A local minimum mimics the shape of a global minimum, with the slope of the cost function increasing on either side of the current point. A saddle point is named after a horse’s saddle: the surface curves upward along one direction and downward along another, so the point is a minimum in one direction and a maximum in the other. Noisy gradients can help the optimizer escape local minima and saddle points.
Diagram showing local and global minima
  • Vanishing gradients: This occurs when the gradient is too small. As we move backwards through the network during backpropagation, the gradient keeps shrinking, so the earlier layers learn more slowly than the later layers. When this happens, the weight updates become so small as to be insignificant, and the algorithm effectively stops learning.
  • Exploding gradients: This happens when the gradient is too large, creating an unstable model. The weights grow too large and are eventually represented as NaN. One suggested remedy is to use a dimensionality reduction technique, which can help reduce the complexity of the model.
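Another widely used remedy for exploding gradients, which is not covered in the list above, is gradient clipping: if the gradient's norm exceeds a threshold, the gradient is rescaled before the update. The sketch below is a minimal NumPy illustration; the helper name `clip_by_norm`, the threshold, and the example gradients are assumptions.

```python
import numpy as np

def clip_by_norm(gradient, max_norm):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(gradient)
    if norm > max_norm:
        return gradient * (max_norm / norm)
    return gradient

# Hypothetical gradients: one exploding, one already small
exploding = np.array([300.0, -400.0])  # L2 norm is 500
small = np.array([0.3, -0.4])          # L2 norm is 0.5

clipped = clip_by_norm(exploding, max_norm=1.0)
untouched = clip_by_norm(small, max_norm=1.0)

print("Clipped gradient:", clipped, "norm:", np.linalg.norm(clipped))
print("Small gradient is left as is:", untouched)
```

Clipping keeps the direction of the gradient and only limits its length, so the update still points downhill but can no longer throw the weights far away in a single step.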

Gradient Descent for Regression

For gradient descent to work and for the loss to reach the global minimum, it is crucial to choose the loss function wisely. We will use the Mean Squared Error, which is easy to interpret.

Differentiating the base term (y - y-hat)² with the power rule of calculus gives 2(y - y-hat). Then, by the chain rule, (y - y-hat) is differentiated with respect to both b and W to get the equations below:

Differentiation equation for calculating gradients wrt b and W
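The equation image is not reproduced here. The following is a reconstruction that agrees with the `loss_slope_b` and `loss_slope_m` lines in the code of this section, assuming the prediction is y-hat = W x + b and the loss is averaged over N samples:

```latex
\begin{aligned}
L(W, b) &= \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,
\qquad \hat{y}_i = W x_i + b \\
\frac{\partial L}{\partial b} &= -\frac{2}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i) \\
\frac{\partial L}{\partial W} &= -\frac{2}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)\, x_i
\end{aligned}
```

The minus sign comes from differentiating -y-hat inside the square, and the extra x_i in the second gradient comes from differentiating W x_i with respect to W.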

Now the update rule becomes:

The parameter update rule for weights and biases
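In symbols, and assuming α denotes the learning rate (`learning_rate` in the code) and L the Mean Squared Error, the rule can be written as:

```latex
\begin{aligned}
W &\leftarrow W - \alpha \, \frac{\partial L}{\partial W} \\
b &\leftarrow b - \alpha \, \frac{\partial L}{\partial b}
\end{aligned}
```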

The rule tells us in which direction to move along the loss curve so that we eventually reach the global minimum. Below is the code for gradient descent in the case of regression:

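import numpy as np
from sklearn.metrics import r2_score  # r2_score is assumed to be scikit-learn's

# Yields consecutive (previous, current) loss pairs for the early-stopping check
def EarlyStopping(loss):
    for i in range(1, len(loss)):
        yield (loss[i-1], loss[i])

# Number of batches for a given batch size
def batch_size(batchsize, X):
    batches = X.shape[0] // batchsize
    return batches

# One gradient descent step for regression
# (learning_rate is read from the global scope, as set in the training script)
def regression_gradient_descent(X_train, y_train, m, b):
    N = X_train.shape[0]  # number of samples in the data that was passed in
    yhat = np.dot(X_train, m) + b
    MSE = np.sum((y_train - yhat)**2) / N
    r_squared = r2_score(y_train, yhat)
    loss_slope_b = -(2/N) * np.sum(y_train - yhat)  # gradient w.r.t. b, the intercept
    loss_slope_m = -(2/N) * np.dot(y_train - yhat, X_train)  # gradient w.r.t. m, the slope
    m = m - (learning_rate * loss_slope_m)
    b = b - (learning_rate * loss_slope_b)
    return m, b, MSE, r_squared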

The R-squared value indicates how well our model fits the data, and the Mean Squared Error (MSE) tells us how far off our predictions are from the actual values. The code below runs the training loop and evaluates the model on both the training and test sets.

# X, X_train, y_train, X_test and y_test are assumed to be NumPy arrays prepared beforehand
np.random.seed(0)
N = X.shape[0]
learning_rate = 0.3
decay_rate = 0.01
LR = []
ValidationLoss = []
Trainingloss = []
batchsize = 50
Intercept = []
Slope = []
m = np.ones(X.shape[1])  # initial value of the slope
b = 1  # initial value of the intercept
print('The initial Value of w and b are', m, b)
batches = batch_size(batchsize, X)

for i in range(1000):
    for j in range(batches):
        # Update the params only at certain intervals in an epoch
        if j % batchsize == 0:
            if i > 0:
                # Decay the learning rate after the first epoch
                learning_rate = learning_rate / (1 + decay_rate)
            m, b, MSE, r_squared = regression_gradient_descent(X_train, y_train, m, b)
            # Evaluate on the test set; the updated parameters returned here are discarded
            _, _, MSE_test, r_squared_test = regression_gradient_descent(X_test, y_test, m, b)

    Intercept.append(b)
    Slope.append(m)
    Trainingloss.append(MSE)
    ValidationLoss.append(MSE_test)
    LR.append(learning_rate)

    if i % 20 == 0:
        print('===> Epoch: ', i, ' Loss: ', MSE, ' Val Loss: ', MSE_test,
              ' R-Squared:', round(r_squared, 4), ' Val R-Squared: ', round(r_squared_test, 4))

    # Early stopping: stop once the validation loss improves by less than 1e-6
    if any(prev - curr < 1e-6 for prev, curr in EarlyStopping(ValidationLoss)):
        print('-- Early Stopping at Epoch', i, 'with Val Loss', np.around(MSE_test, 5),
              'and Val R-Squared', np.around(r_squared_test, 5), '--')
        break
Regression Loss and Best Fit Line Plots
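Since the notebook's dataset is not included here, the following self-contained sketch applies the same MSE gradients to synthetic data. The data-generating line y = 3x - 2, the noise level, the learning rate, and the number of epochs are all assumptions made for illustration; this is not the article's training script.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3x - 2 plus a little noise
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * X[:, 0] - 2.0 + rng.normal(0.0, 0.05, size=200)

m = np.zeros(X.shape[1])  # initial slope
b = 0.0                   # initial intercept
learning_rate = 0.3
losses = []

for epoch in range(200):
    y_hat = X @ m + b
    error = y - y_hat
    losses.append(float(np.mean(error ** 2)))
    grad_m = -(2.0 / len(y)) * (X.T @ error)  # derivative of MSE w.r.t. the slope
    grad_b = -(2.0 / len(y)) * np.sum(error)  # derivative of MSE w.r.t. the intercept
    m = m - learning_rate * grad_m
    b = b - learning_rate * grad_b

print("Learned slope:", m, "Learned intercept:", b)
print("First loss:", losses[0], "Last loss:", losses[-1])
```

The learned slope and intercept should land close to the values used to generate the data, and the recorded losses give the kind of curve that a regression loss plot would show.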

Gradient Descent for Classification

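We first pass the linear output of the model to the sigmoid function, which converts it into a probability between 0 and 1. The probability is then thresholded (at 0.5 in the code below) to classify each sample as either 0 or 1. The error of these probabilities is measured with the Log Loss, whose formula is as follows: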

The loss function for classification gradient descent
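The formula image is not reproduced here. A reconstruction that matches the `sigmoid` and `cost` functions in the code of this section, assuming N samples, labels y_i in {0, 1}, weights w, and bias b, is:

```latex
\begin{aligned}
\hat{y}_i &= \sigma(b + w^\top x_i) = \frac{1}{1 + e^{-(b + w^\top x_i)}} \\
L(w, b) &= -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
\end{aligned}
```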

The formula is also referred to as `Binary Cross Entropy`. It comes from maximum likelihood estimation: the likelihood multiplies the predicted probabilities of all samples, but the resulting product becomes too small to work with. To address this, we take the log of the probabilities, which turns the product into a sum. This creates another problem, since the log of a number between 0 and 1 is negative.

To address this, we put a negative sign in front so that the loss is positive and can be minimized. The formula needs two terms, one for the positive class (y = 1) and one for the negative class (y = 0), so that it handles both outcomes. Since there is no closed-form solution for the parameters that minimize this loss, we must rely on gradient descent to find the values that result in the lowest loss.

We take the derivative of the log loss and use it to update our weights to get to the minimum.

Differentiation Equation for calculating gradients for classification
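The derivation image is not reproduced here. Assuming the log loss is averaged over N samples and y-hat is the sigmoid output, the derivation ends in the following gradients, which agree with `dw` and `db` in the `update_params` function of this section (there they are written with y_hat - y, which absorbs the minus sign):

```latex
\begin{aligned}
\frac{\partial L}{\partial w} &= -\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)\, x_i \\
\frac{\partial L}{\partial b} &= -\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)
\end{aligned}
```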

The parameter update rule then becomes the following. The addition sign appears because of the negative sign in the final equation of the derivation: subtracting a negative gradient is the same as adding.

Weight Update Rule
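Written out, and assuming α is the learning rate, N the number of samples, and y-hat the sigmoid output, the rule reads:

```latex
\begin{aligned}
w &\leftarrow w + \frac{\alpha}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)\, x_i \\
b &\leftarrow b + \frac{\alpha}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)
\end{aligned}
```

This is the same update as `w - alpha * dw` and `b - alpha * db` in the code, where `dw` and `db` are computed from y_hat - y instead of y - y_hat.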

Below is the code for gradient descent for classification:

import random
import numpy as np

# Initializing random parameters
def initialize_betas(dim):
    b = random.randint(0, 1)
    w = np.random.rand(dim)
    return b, w

# Sigmoid function applied to the linear output b + X·w
def sigmoid(b, w, X_new):
    Z = b + np.matmul(X_new, w)
    return 1.0 / (1 + np.exp(-Z))

# Cost calculation (log loss)
def cost(y, y_hat):
    return -np.sum(np.dot(y.T, np.log(y_hat)) + np.dot((1 - y).T, np.log(1 - y_hat))) / len(y)

# Updating the parameters with one gradient descent step
def update_params(b, w, y, y_hat, X_new, alpha):
    db = np.sum(y_hat - y) / len(y)
    b = b - (alpha * db)
    dw = np.dot((y_hat - y), X_new) / len(y)
    w = w - (alpha * dw)
    return b, w

We will now begin the training with the code below. To keep the code organized, the functions above can be moved into a custom module and imported from there.

# X, X_train, y_train, X_test and y_test are assumed to be NumPy arrays prepared beforehand
np.random.seed(0)
train_costs = []
val_costs = []
Accuracy = []
Val_accuracy = []
batchsize = 10
learning_rate = 0.001
decay_rate = 0.01
b, w = initialize_betas(X.shape[1])
print('The initial Value of w and b are', w, b)
batches = batch_size(batchsize, X)

for i in range(2000):
    for j in range(batches):
        # Update the params only at certain intervals in an epoch
        if j % batchsize == 0:
            if i > 0:
                # Decay the learning rate after the first epoch
                learning_rate = learning_rate / (1 + decay_rate)

            # Training step on the training data
            y_hat = sigmoid(b, w, X_train)
            current_cost = cost(y_train, y_hat)
            b, w = update_params(b, w, y_train, y_hat, X_train, learning_rate)
            y_pred = (y_hat > 0.5).astype(int)
            accuracy = round(np.mean(y_pred == y_train) * 100, 4)

            # Evaluation on the test data (the parameters are not updated here,
            # so the test set never influences training)
            y_hat_test = sigmoid(b, w, X_test)
            current_cost_test = cost(y_test, y_hat_test)
            y_pred_test = (y_hat_test > 0.5).astype(int)
            accuracy_test = round(np.mean(y_pred_test == y_test) * 100, 4)

    Accuracy.append(accuracy)
    Val_accuracy.append(accuracy_test)
    train_costs.append(current_cost)
    val_costs.append(current_cost_test)

    if i % 100 == 0:
        print('===> Epoch:', i, ' Val Loss: ', round(current_cost_test, 3), ' Accuracy: ', accuracy,
              ' Val Accuracy ', accuracy_test, ' Learning Rate ', learning_rate)

    # Early stopping: stop once the training loss improves by less than 1e-6
    if any(prev - curr < 1e-6 for prev, curr in EarlyStopping(train_costs)):
        print('-- Early Stopping at Epoch', i, 'with Val Loss', np.around(current_cost_test, 5),
              'and Val Accuracy', np.around(accuracy_test, 4), '--')
        break

You are welcome to adjust the learning rate, batch size, and number of epochs in order to improve your results. The following graphs suggest that training converged toward the optimal solution.

Training plots displaying the training performance, classification
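For readers without access to the notebook's data, the sketch below runs the same log-loss gradient updates on a synthetic two-feature dataset. The data, the helper names, the learning rate, and the number of epochs are assumptions chosen for illustration; the probabilities are clipped inside the loss only to keep the logarithm well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the label is 1 when x1 + x2 > 0
X = rng.normal(0.0, 1.0, size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    """Map the linear output to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross entropy averaged over the samples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

w = np.zeros(X.shape[1])
b = 0.0
alpha = 0.5  # learning rate
costs = []

for epoch in range(500):
    y_hat = sigmoid(X @ w + b)
    costs.append(log_loss(y, y_hat))
    dw = X.T @ (y_hat - y) / len(y)   # gradient of the log loss w.r.t. the weights
    db = np.sum(y_hat - y) / len(y)   # gradient of the log loss w.r.t. the bias
    w = w - alpha * dw
    b = b - alpha * db

predictions = (sigmoid(X @ w + b) > 0.5).astype(float)
accuracy = float(np.mean(predictions == y))

print("First cost:", costs[0], "Last cost:", costs[-1])
print("Training accuracy:", accuracy)
```

The first recorded cost equals log(2), the loss of a model that predicts 0.5 for every sample, and it falls from there as the weights line up with the true decision boundary.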

Conclusion

  1. Gradient descent is a simple optimization procedure that you can use with most machine learning algorithms.
  2. The algorithm is computationally efficient and works in a parameter space of any dimension.
  3. Given the right data preprocessing and hyperparameter tuning, it is likely to reach the global minimum in fewer iterations.

To see the full code, please click here for the Kaggle notebook. Thanks for taking the time to go through the article. If you found the article useful, please be sure to clap and follow for more.

Further Reading

1. Logistic Regression Using Gradient Descent https://bit.ly/3UGv1JL
2. Linear Regression Using Gradient Descent https://bit.ly/2OwHopq
3. Curse Of Dimensionality https://bit.ly/3RerRtZ
4. Gradient Descent — Deep Dive https://bit.ly/3xR8rog
