Learning LSTMs

Parag Chaudhari
5 min read · Aug 17, 2020

When we are born we know nothing; we are blank slates. Through various teachings and experiences we pick up useful skills and talents throughout our lives.

Artificial intelligence is an attempt to replicate this in computers, and one of its most successful tools is the deep neural network, able to learn complex data and patterns and produce valuable inferences: object detection and classification, or predicting land and stock prices, to name a few. These days deep neural networks are everywhere, your closest access point being your home Alexa device or your mobile phone.

An issue with deep neural networks comes up when we want to do predictive tasks based on events that are related to each other. Say we want to build a system that predicts twists in a movie. A traditional neural network has no way of determining what events will come next based on current and past events, because it has no mechanism to retain memory.

Recurrent Neural networks

What makes RNNs different from their traditional counterparts is their ability to retain information. Continuing our movie example, an RNN is capable of remembering previous frames and can use them to predict information about the current frame.

Wikipedia describes an RNN as “a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence.”

They do this by looping information back into the network, allowing it to persist.

All Rnn’s consist of repeating units having the same but a simple structure.The above node is an rnn node consists of a tanh function

RNNs, however, suffer from a huge problem feared by all data scientists and machine learning engineers: vanishing gradients. This hampers the model's ability to learn from long data sequences.

Enter Long Short Term Memory networks

Long Short-Term Memory networks, LSTMs for short, are an implementation of RNNs capable of remembering long-term information. They solve the vanishing gradient problem mentioned earlier by controlling what information to forget, thanks to their forget gate. I recommend reading Colah's blog linked below for an in-depth explanation.
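Before dissecting the internals, it may help to see what using an LSTM looks like in practice. Below is a small PyTorch sketch; the batch, sequence and layer sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: batches of 4 sequences, 10 time steps, 8 features each.
lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)

x = torch.randn(4, 10, 8)        # (batch, time, features)
output, (h_n, c_n) = lstm(x)     # output holds the hidden state at every step
print(output.shape)              # torch.Size([4, 10, 32])
print(h_n.shape, c_n.shape)      # final hidden and cell states: [1, 4, 32] each
```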

LSTMs are similar to RNNs in the way they operate, but have a more complicated internal structure. Instead of one layer, there are four, interacting in very interesting ways.

Now I am sure the above figure is giving you a headache; it did the same to me when I first saw it. Let's break it down and understand it piece by piece.

The building blocks of an LSTM

The core concepts here are the cell state and the gates. The cell state is responsible for remembering information and transferring it down the chain. The gates are different mathematical functions (layers) which determine what to do with incoming data and how to manipulate the existing cell state.

Forget Gate

As with all memories, we must have a mechanism for old ones to be forgotten so that new ones can exist; the forget gate performs this task. This gate decides how much information is to be retained by the cell. Information coming from the previous cell is passed through a sigmoid function, which maps the values to between zero and one, zero meaning completely forgotten and one meaning completely remembered.
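In code, the forget gate is just a sigmoid applied to the previous hidden state and the current input. The weights and sizes below are hypothetical placeholders, shown only to illustrate the shape of the computation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes and untrained weights, just to show the computation.
hidden_size, input_size = 4, 3
W_f = np.random.randn(hidden_size, hidden_size + input_size) * 0.1
b_f = np.zeros(hidden_size)

h_prev = np.random.randn(hidden_size)  # hidden state from the previous cell
x_t = np.random.randn(input_size)      # current input

# f_t has one value in (0, 1) per memory slot:
# near 0 means "forget this slot", near 1 means "keep it".
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
```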

Input gate

We have covered handling old memories, but what if new ones come in? That is exactly what the input gate does. It takes the hidden state from the previous cell along with the current input and passes them through a sigmoid and a tanh function. The sigmoid determines the importance of the incoming information, while the tanh squishes it to between -1 and 1 to regulate it. This regulation prevents values from growing too large, which is bad for any neural network.
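A rough sketch of the input gate, again with hypothetical weights and sizes: the sigmoid branch scores the importance of the new information, and the tanh branch produces the candidate values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3  # hypothetical sizes
W_i = np.random.randn(hidden_size, hidden_size + input_size) * 0.1
W_c = np.random.randn(hidden_size, hidden_size + input_size) * 0.1
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)

h_prev = np.random.randn(hidden_size)
x_t = np.random.randn(input_size)
z = np.concatenate([h_prev, x_t])

i_t = sigmoid(W_i @ z + b_i)      # importance of the incoming information
c_tilde = np.tanh(W_c @ z + b_c)  # candidate values, squashed to (-1, 1)
```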

Cell state

Once the input gate is done with its number crunching, we can update the cell state, which is nothing but the cell's memory. The previous cell state is multiplied by the forget vector, which sets how much of the old memory should be kept, and the candidate values from the input gate, scaled by its sigmoid output, are added on top.
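A toy numerical sketch of the update, with made-up values for the gate outputs: the forget gate scales down the old memory and the input gate decides how much of the candidate to write in.

```python
import numpy as np

# Made-up gate outputs for a cell with 4 memory slots.
f_t = np.array([0.9, 0.1, 0.5, 1.0])       # forget gate: how much old memory to keep
i_t = np.array([0.2, 0.8, 0.5, 0.0])       # input gate: how much candidate to write
c_tilde = np.array([0.7, -0.3, 0.1, 0.9])  # candidate memory from the input gate
c_prev = np.array([1.0, -1.0, 0.5, 0.2])   # previous cell state

# New memory = old memory scaled by the forget gate,
# plus the candidate scaled by the input gate.
c_t = f_t * c_prev + i_t * c_tilde
```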

And finally, after all that number crunching, we have the output. The output gate passes the previous hidden state and current input through a sigmoid and multiplies the result by the tanh of the new cell state to produce the new hidden state. The cell state and hidden state are then passed on, either to yet another LSTM cell like this one or to a densely connected layer.
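Finally, a sketch of how the new hidden state is produced from the output gate and the updated cell state, once more with hypothetical weights and sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3  # hypothetical sizes
W_o = np.random.randn(hidden_size, hidden_size + input_size) * 0.1
b_o = np.zeros(hidden_size)

h_prev = np.random.randn(hidden_size)
x_t = np.random.randn(input_size)
c_t = np.random.randn(hidden_size)  # the updated cell state from the step above

o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)  # output gate
h_t = o_t * np.tanh(c_t)  # new hidden state, passed to the next cell or a dense layer
```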

Conclusion

So that's that about LSTMs. I avoided getting into the mathematics of it all, as it gets pretty scary, but if you are up for the challenge I recommend reading the articles below, which dive deep into the seas of mathematics and calculus.

References

Understanding LSTM Networks, colah's blog: https://colah.github.io/posts/2015-08-Understanding-LSTMs/