Activation Functions - Tanh vs. Sigmoid
On my Deep Learning journey, I started wondering why the tanh activation function often performs better than the sigmoid function. The sigmoid function famously suffers from the vanishing gradient problem, so why does tanh seem to escape the same fate (is that even true?).
I decided to plot common activation functions and their derivatives over a small input range to gain a better insight into what they are up to. I wrote a simple Python script to help visualise the functions using matplotlib (see Github):
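A minimal sketch along those lines (not the exact script on Github, just an illustration using numpy and matplotlib) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)

# Each entry maps a name to (function, derivative).
activations = {
    "sigmoid": (lambda z: 1 / (1 + np.exp(-z)),
                lambda z: (1 / (1 + np.exp(-z))) * (1 - 1 / (1 + np.exp(-z)))),
    "tanh": (np.tanh,
             lambda z: 1 - np.tanh(z) ** 2),
    "relu": (lambda z: np.maximum(0.0, z),
             lambda z: (z > 0).astype(float)),  # derivative undefined at 0; 0 is used here
    "gaussian": (lambda z: np.exp(-z ** 2),
                 lambda z: -2 * z * np.exp(-z ** 2)),
}

fig, (ax_f, ax_d) = plt.subplots(1, 2, figsize=(10, 4))
for name, (f, df) in activations.items():
    ax_f.plot(x, f(x), label=name)
    ax_d.plot(x, df(x), label=name)
ax_f.set_title("Activation functions")
ax_d.set_title("Derivatives")
for ax in (ax_f, ax_d):
    ax.legend()
    ax.grid(True)
plt.show()
```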
Let's plot the function outputs over the specified input range to see what's going on (including a Gaussian for fun). Note that the derivative of the rectified linear unit (relu) is undefined at 0.
The derivative of the tanh activation function has a stronger signal than that of the sigmoid function. Could this help maintain momentum during gradient descent? What else could be going on here?
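One concrete way to quantify that "stronger signal": the derivative of tanh peaks at 1.0 (at x = 0), while the derivative of sigmoid peaks at only 0.25. A quick check of that claim, using the standard closed-form derivatives:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.linspace(-5, 5, 11)

# sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)),  tanh'(x) = 1 - tanh(x)^2
d_sigmoid = sigmoid(x) * (1 - sigmoid(x))
d_tanh = 1 - np.tanh(x) ** 2

print("max sigmoid':", d_sigmoid.max())  # 0.25, at x = 0
print("max tanh'   :", d_tanh.max())     # 1.0,  at x = 0
```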
Incidentally, sigmoid(x) is equivalent to (1 + tanh(x / 2)) / 2.
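That identity is easy to check numerically (here using scipy's expit as the sigmoid):

```python
import numpy as np
from scipy.special import expit  # the logistic sigmoid

x = np.linspace(-10, 10, 1001)
assert np.allclose(expit(x), (1 + np.tanh(x / 2)) / 2)
```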
Softmax and Cross-Entropy
The output from a densely connected NN layer is generally passed to an activation function, such that output = activation(dot(input, weights) + bias). The last dense layer is typically connected to a softmax function that turns the output values into probabilities. During training we want to find the model weights and biases, and we can do that by minimising the average cross entropy. The probability values from the softmax function, together with the target labels, are used to calculate the cross entropy for each training example. This is the topic of another post, but a simple code sample can be found here (softmax_cross_entropy.py).
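As a rough illustration (this is not the linked softmax_cross_entropy.py, just a hand-rolled numpy sketch), the softmax and cross-entropy calculations for a single sample look something like this:

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating;
    # this does not change the resulting probabilities.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def cross_entropy(probs, target_index):
    # Negative log-probability assigned to the true class.
    return -np.log(probs[target_index])

# Example: raw outputs (logits) from the last dense layer for one sample.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)          # roughly [0.66, 0.24, 0.10]
loss = cross_entropy(probs, 0)   # target label is class 0
print(probs, loss)
```

Averaging this loss over a batch of examples gives the quantity that gradient descent actually minimises.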
Further information:
Activation functions in Tensorflow
Activations with plots on Wikipedia