# Maths behind training a neural network using Gradient Descent

Well, recently I tried to contribute to a neural network project and, I won't lie, I sucked miserably. But it is never too late to learn something, and I would be pretty upset if I got another chance to contribute to something as sexy as a neural network and sucked once again. So here I go, learning neural networks.

So what does a neural network model look like? It is a composition of a bunch of layers, and every layer contains a bunch of neurons. The topmost layer is called the output layer and the bottommost layer is called the input layer. Data enters at the input layer and propagates up through the network to produce the output. Between these two sit one or more hidden layers. How many hidden layers should you have? That is a modeling problem, and there are many things to consider when designing a network. In a fully connected network, every neuron is connected to every neuron of the next layer. Every node/neuron has its own activation function, every edge has its own weight, and every layer has its own bias. So at the end of the day, one way or another, every neuron contributes to the output.

When we talk about training a neural network, we basically mean that we want to set the values of these weight and bias parameters so that for every input we get the correct result. How do we do that? That's what this blog is all about.
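To make the layer structure concrete, here is a minimal sketch of a forward pass in plain Python. The sigmoid hidden activation, the softmax output, and the tiny hand-picked weights are assumptions for illustration only; the names `affine` and `forward` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(a):
    # Subtract the max before exponentiating for numerical stability.
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, b, h):
    """Pre-activation a = b + W h for one layer (W is a list of rows)."""
    return [b_i + sum(w_ij * h_j for w_ij, h_j in zip(row, h))
            for row, b_i in zip(W, b)]

def forward(layers, x):
    """layers is a list of (W, b) pairs, ordered from input to output."""
    h = x
    for W, b in layers[:-1]:          # hidden layers (assumed sigmoid)
        h = [sigmoid(a) for a in affine(W, b, h)]
    W_out, b_out = layers[-1]         # output layer (assumed softmax)
    return softmax(affine(W_out, b_out, h))

# Toy network: 2 inputs -> 3 hidden units -> 2 output classes.
layers = [
    ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.2, -0.5, 0.3], [-0.3, 0.4, 0.1]], [0.0, 0.0]),
]
probs = forward(layers, [1.0, 2.0])
```

Every hidden unit feeds every output unit through `affine`, which is the "every neuron contributes to the output" idea in code form.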

So what can we do? We can go for empirical risk minimization! Basically, we transform the training problem into an optimization problem where we minimize a loss function of the predicted output and the desired output. To save paper, or to impress academics, we put it this way:

$$\arg\min_{\Theta} \frac{1}{T} \sum_{t} l\big(f(x^{(t)}; \Theta),\, y^{(t)}\big) + \lambda\, \Omega(\Theta)$$

where f() is some function that predicts the output, l() is some loss function of the predicted output and the real output, T is the number of training examples, and $\lambda$ controls how much the regularizer counts. Ω() is a regularization function that takes Θ, the weights we are learning; it penalizes parameter values we don't want to end up with. The loss function needs to be smooth, because it is hard to optimize a non-smooth function.
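As a rough sketch of that objective, the function below averages a loss over the training set and adds a weighted regularizer. The toy linear model, squared loss, and L2 penalty are stand-ins I picked to keep the arithmetic easy to check by hand; they are not the network or loss used in the rest of this post.

```python
def regularized_risk(predict, loss, regularizer, theta, data, lam):
    """(1/T) * sum_t loss(predict(x_t, theta), y_t) + lam * regularizer(theta)."""
    T = len(data)
    average_loss = sum(loss(predict(x, theta), y) for x, y in data) / T
    return average_loss + lam * regularizer(theta)

# Hypothetical toy ingredients, for illustration only.
def predict(x, theta):
    w, b = theta
    return w * x + b

def squared_loss(prediction, target):
    return (prediction - target) ** 2

def l2_penalty(theta):
    w, _ = theta
    return w ** 2          # penalize the weight, not the bias

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # generated by y = 2x + 1

good = regularized_risk(predict, squared_loss, l2_penalty, (2.0, 1.0), data, lam=0.1)
bad = regularized_risk(predict, squared_loss, l2_penalty, (1.0, 1.0), data, lam=0.1)
```

Here `good` is pure regularization cost (the fit is exact), while `bad` pays mostly for prediction error, so the optimizer would prefer the first set of parameters.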

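From the optimization literature, we can use stochastic gradient descent (SGD) to solve this. Our parameters are $\Theta = [W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)}]$, where L is the number of hidden layers. So what do we do? For N iterations, for every training example $(x^{(t)}, y^{(t)})$, we compute

$$\Delta = -\nabla_{\Theta}\, l\big(f(x^{(t)}; \Theta),\, y^{(t)}\big) - \lambda\, \nabla_{\Theta}\, \Omega(\Theta)$$

and take a step $\Theta \leftarrow \Theta + \alpha \Delta$, where $\alpha$ is the learning rate.

For classification, the output $f(x)_c$ estimates the probability of class c, and the loss is the negative log-likelihood $l(f(x), y) = -\log f(x)_y$. This l() is also known as the cross-entropy in information theory.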
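The SGD loop can be sketched in a few lines: for each training example, step against the gradient of the loss plus the weighted gradient of the regularizer. To keep it short I use a hypothetical one-weight linear model with squared loss instead of a full network; the function names, the learning rate `alpha`, and the regularization weight `lam` are illustrative choices.

```python
import random

def sgd(theta, data, grad_loss, grad_reg, alpha, lam, n_epochs, seed=0):
    """Plain SGD: one parameter update per training example."""
    rng = random.Random(seed)
    theta = list(theta)
    data = list(data)
    for _ in range(n_epochs):
        rng.shuffle(data)                      # visit examples in random order
        for x, y in data:
            g_loss = grad_loss(theta, x, y)
            g_reg = grad_reg(theta)
            # Delta = -grad(loss) - lam * grad(regularizer)
            delta = [-gl - lam * gr for gl, gr in zip(g_loss, g_reg)]
            # theta <- theta + alpha * Delta
            theta = [t + alpha * d for t, d in zip(theta, delta)]
    return theta

# Hypothetical toy model: prediction = w * x + b, squared loss, L2 on w.
def grad_squared_loss(theta, x, y):
    w, b = theta
    error = (w * x + b) - y
    return [2.0 * error * x, 2.0 * error]

def grad_l2(theta):
    w, _ = theta
    return [2.0 * w, 0.0]

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # y = 2x + 1
w, b = sgd([0.0, 0.0], data, grad_squared_loss, grad_l2,
           alpha=0.05, lam=0.0, n_epochs=300)
```

With `lam=0.0` the regularizer is switched off, so the loop should recover the generating line; a positive `lam` would pull `w` toward zero.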

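Now let's discuss the gradients. The partial derivative of our negative log-likelihood with respect to the output for class c is

$$\frac{\partial}{\partial f(x)_c} -\log f(x)_y = \frac{-1_{(y=c)}}{f(x)_y}$$

The indicator $1_{(y=c)}$ is not really part of the derivative itself; we add it as a filter. When y is not c it is 0, so it filters out everything except the term $f(x)_c$ with $c = y$.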

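So the gradient with respect to the whole output (probability) vector is

$$\nabla_{f(x)} -\log f(x)_y = \frac{-e(y)}{f(x)_y}$$

where $e(y)$ is the one-hot vector with a 1 at position y. Now we replace f() with a softmax function, which normalizes the exponential of each output pre-activation by the sum of the exponentials over all classes:

$$f(x)_c = \frac{\exp\big(a^{(L+1)}(x)_c\big)}{\sum_{c'} \exp\big(a^{(L+1)}(x)_{c'}\big)}$$

If we apply this to our previous equation, we get the partial derivative with respect to the output pre-activation:

$$\frac{\partial}{\partial a^{(L+1)}(x)_c} -\log f(x)_y = -\big(1_{(y=c)} - f(x)_c\big)$$

Going back to f(x) as a vector, we get:

$$\nabla_{a^{(L+1)}(x)} -\log f(x)_y = -\big(e(y) - f(x)\big) \qquad \text{(i)}$$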
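A quick way to convince yourself of equation (i) is to compare it against a finite-difference estimate. The sketch below does that for one arbitrary pre-activation vector; the values of `a`, `y`, and the step `eps` are made up for the check.

```python
import math

def softmax(a):
    m = max(a)                                  # stabilize the exponentials
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def neg_log_likelihood(a, y):
    return -math.log(softmax(a)[y])

def grad_wrt_preactivation(a, y):
    """Equation (i): -(e(y) - f(x))."""
    f = softmax(a)
    return [-((1.0 if c == y else 0.0) - f[c]) for c in range(len(a))]

def numerical_grad(a, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grads = []
    for c in range(len(a)):
        up = list(a)
        up[c] += eps
        down = list(a)
        down[c] -= eps
        diff = neg_log_likelihood(up, y) - neg_log_likelihood(down, y)
        grads.append(diff / (2 * eps))
    return grads

a = [1.0, -0.5, 2.0]   # arbitrary output pre-activations
y = 2                  # arbitrary target class
analytic = grad_wrt_preactivation(a, y)
numeric = numerical_grad(a, y)
max_gap = max(abs(p - q) for p, q in zip(analytic, numeric))
```

The two gradients should agree up to finite-difference error, and the analytic one sums to zero because both $e(y)$ and $f(x)$ sum to one.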

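We will also need the gradients for the hidden layers of the neural network. If we worked out the gradient for every neuron by hand, we would grow old solving this. So we use the chain rule:

$$\frac{\partial p(a)}{\partial a} = \sum_i \frac{\partial p(a)}{\partial q_i(a)} \frac{\partial q_i(a)}{\partial a}$$

where a is the activation of a unit, p is the loss function, and the $q_i$ are the pre-activations in the layer above.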

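If we are at the k'th layer, we are interested in the gradients with respect to its units.

The partial derivative at the j'th hidden unit of the k'th hidden layer, with respect to the activation of that hidden layer, is

$$\frac{\partial}{\partial h^{(k)}(x)_j} -\log f(x)_y = \sum_i \frac{\partial -\log f(x)_y}{\partial a^{(k+1)}(x)_i} \frac{\partial a^{(k+1)}(x)_i}{\partial h^{(k)}(x)_j} = \sum_i \frac{\partial -\log f(x)_y}{\partial a^{(k+1)}(x)_i}\, W^{(k+1)}_{i,j}$$

which is basically because $a^{(k)}(x) = b^{(k)} + W^{(k)} h^{(k-1)}(x)$, so each pre-activation in layer k+1 depends on $h^{(k)}(x)_j$ only through the weight $W^{(k+1)}_{i,j}$.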

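Partial derivatives of the biases:

$$\frac{\partial}{\partial b^{(k)}_i} -\log f(x)_y = \frac{\partial -\log f(x)_y}{\partial a^{(k)}(x)_i}$$

Putting the pieces together, backpropagation goes like this (g is the hidden activation function and $h^{(0)}(x) = x$):

• Compute the output gradient (before activation): $\nabla_{a^{(L+1)}(x)} -\log f(x)_y = -(e(y) - f(x))$

• For k from L+1 down to 1:

• Compute the gradients of the hidden layer parameters: $\nabla_{W^{(k)}} -\log f(x)_y = \big(\nabla_{a^{(k)}(x)} -\log f(x)_y\big)\, h^{(k-1)}(x)^{\top}$ and $\nabla_{b^{(k)}} -\log f(x)_y = \nabla_{a^{(k)}(x)} -\log f(x)_y$.

• Compute the gradient of the hidden layer below: $\nabla_{h^{(k-1)}(x)} -\log f(x)_y = W^{(k)\top} \big(\nabla_{a^{(k)}(x)} -\log f(x)_y\big)$

• Compute the gradient of the hidden layer below (before activation): $\nabla_{a^{(k-1)}(x)} -\log f(x)_y = \big(\nabla_{h^{(k-1)}(x)} -\log f(x)_y\big) \odot g'\big(a^{(k-1)}(x)\big)$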
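The backward pass can be sketched in plain Python as a loop from the output layer down to the first layer. I assume sigmoid hidden units (so the activation derivative is h(1 - h)) and a softmax output; the toy weights and all function names are hypothetical. One weight gradient is compared against a finite-difference estimate at the end.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

def affine(W, b, h):
    return [b_i + sum(w * v for w, v in zip(row, h)) for row, b_i in zip(W, b)]

def forward(layers, x):
    """Return the list of layer inputs hs (hs[0] = x) and the output f(x)."""
    hs = [x]
    for W, b in layers[:-1]:
        hs.append([sigmoid(a) for a in affine(W, b, hs[-1])])
    W, b = layers[-1]
    return hs, softmax(affine(W, b, hs[-1]))

def loss(layers, x, y):
    return -math.log(forward(layers, x)[1][y])

def backward(layers, x, y):
    """Gradients of -log f(x)_y as a list of (grad_W, grad_b), one per layer."""
    hs, f = forward(layers, x)
    # Output gradient before activation: -(e(y) - f(x)).
    grad_a = [f[c] - (1.0 if c == y else 0.0) for c in range(len(f))]
    grads = [None] * len(layers)
    for k in range(len(layers) - 1, -1, -1):
        W, _ = layers[k]
        h_below = hs[k]
        # Parameter gradients: outer product for W, grad_a itself for b.
        grads[k] = ([[g * h for h in h_below] for g in grad_a], list(grad_a))
        if k > 0:
            # Gradient of the layer below: W^T grad_a.
            grad_h = [sum(W[i][j] * grad_a[i] for i in range(len(W)))
                      for j in range(len(h_below))]
            # Before activation: multiply by sigmoid'(a) = h * (1 - h).
            grad_a = [gh * h * (1.0 - h) for gh, h in zip(grad_h, h_below)]
    return grads

layers = [
    ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.2, -0.5, 0.3], [-0.3, 0.4, 0.1]], [0.05, -0.05]),
]
x, y = [1.0, 2.0], 1
grads = backward(layers, x, y)

# Finite-difference check on one first-layer weight.
eps = 1e-6
layers[0][0][2][1] += eps
loss_up = loss(layers, x, y)
layers[0][0][2][1] -= 2 * eps
loss_down = loss(layers, x, y)
layers[0][0][2][1] += eps                      # restore the weight
numeric = (loss_up - loss_down) / (2 * eps)
analytic = grads[0][0][2][1]
```

Note that the loop reuses `grad_a` from the layer above at every step, which is exactly what saves us from deriving each neuron's gradient separately.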

How should we initialize our parameters? Well, it looks like it would take a lot of insight to write about that properly in this blog, but I can say that there is a paper by Glorot and Bengio, published in 2010, that suggests drawing each weight as $W^{(k)}_{i,j} \sim \text{Uniform}(-b, b)$ with $b = \frac{\sqrt{6}}{\sqrt{H_k + H_{k-1}}}$, where $H_k$ is the number of units in layer k.
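Here is a small sketch of that initialization rule using only the standard library. The function name `glorot_uniform`, the layer sizes, and the fixed seed are my own choices for the example.

```python
import math
import random

def glorot_uniform(n_out, n_in, rng):
    """Sample an n_out x n_in weight matrix from Uniform(-b, b),
    with b = sqrt(6) / sqrt(n_out + n_in)."""
    bound = math.sqrt(6.0) / math.sqrt(n_out + n_in)
    return [[rng.uniform(-bound, bound) for _ in range(n_in)]
            for _ in range(n_out)]

rng = random.Random(0)                 # fixed seed so the run is repeatable
W = glorot_uniform(n_out=4, n_in=3, rng=rng)
bound = math.sqrt(6.0) / math.sqrt(4 + 3)
```

The bound shrinks as the layers get wider, which keeps the initial pre-activations from growing with the number of incoming connections.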

Thanks:
1) Martin Thoma, for the inspiration
2) Hugo Larochelle for the tutorial