🧠 The Mind That Remembers: Understanding LSTMs from the Ground Up
How the mathematics of memory changed machine learning forever
1. Historical Context: The Forgetting Problem
In 1997, a pair of German researchers—Sepp Hochreiter and Jürgen Schmidhuber—addressed a long-standing problem in the evolution of neural networks: the vanishing gradient. In traditional RNNs, when trying to learn long-term dependencies, the gradients used for training either vanished (became too small) or exploded (became too large) as they were backpropagated through time.
This meant RNNs could not remember information over long sequences. LSTMs were designed as a response to this limitation.
“We introduce a novel, efficient, gradient-based method that can learn to bridge very long time lags…”
— Hochreiter & Schmidhuber (1997)
2. High-Level Intuition: A Neural Memory Cell
At the heart of the LSTM lies a cell state—a kind of memory stream—and gates that learn how to write to, read from, or erase parts of this stream. These gates act like valves that control the flow of information:
Forget Gate: What should we discard?
Input Gate: What new information should we store?
Output Gate: What should we output at this time step?
Think of it as a dynamical system governed by these learned gates.
3. Mathematical Formulation of LSTM
Let’s walk through the equations and their intuitive meaning. At each time step t, we operate on the input vector xₜ, the previous hidden state hₜ₋₁, and the previous cell state cₜ₋₁.
Let:
σ = sigmoid activation function
tanh = hyperbolic tangent function
· = element-wise (Hadamard) product when applied to two vectors; W · [hₜ₋₁, xₜ] denotes matrix-vector multiplication
W and b = weight matrices and bias vectors
🔲 Forget Gate
Decides what to discard from the cell state.
fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)
Output: A vector with values between 0 (completely forget) and 1 (completely keep)
🔲 Input Gate
Decides what new information to store.
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c)
iₜ controls how much of the candidate c̃ₜ to write into memory.
🔲 Cell State Update
Combines forget and input updates.
cₜ = fₜ · cₜ₋₁ + iₜ · c̃ₜ
A linear memory track—the core of long-term memory.
🔲 Output Gate
Decides what part of the memory to output.
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)
hₜ = oₜ · tanh(cₜ)
The output vector hₜ is also the hidden state carried forward to the next time step (it appears as hₜ₋₁ in the equations above).
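Putting the four equation groups together, here is a minimal NumPy sketch of a single LSTM time step. The function and variable names (lstm_step, n_in, n_h) are illustrative choices, not part of any library API, and the random weights exist only to make the example runnable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM time step; each W_* has shape (n_h, n_h + n_in), each b_* shape (n_h,)."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde       # additive cell state update
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state / output
    return h_t, c_t

# Toy dimensions: 3-dimensional input, 4-dimensional hidden/cell state.
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
Ws = [0.1 * rng.standard_normal((n_h, n_h + n_in)) for _ in range(4)]
bs = [np.zeros(n_h) for _ in range(4)]
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.standard_normal((5, n_in)):     # run a 5-step sequence
    h, c = lstm_step(x, h, c, Ws[0], bs[0], Ws[1], bs[1], Ws[2], bs[2], Ws[3], bs[3])
print(h.shape, c.shape)                      # (4,) (4,)
```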
4. Memory and Gradient Flow: Why It Works
Traditional RNNs suffer from multiplicative decay of gradients. But in LSTMs:
The cell state cₜ has an additive update, not a multiplicative one.
Gradients can flow linearly through time via the forget gate.
This architecture mitigates vanishing gradients by design.
This is the core design pattern: Linear error flow through internal memory + multiplicative gates for control.
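To make this concrete, here is a tiny back-of-the-envelope illustration with toy numbers of my own (not from the paper): along the cell state, ∂cₜ/∂cₜ₋₁ = fₜ, so the gradient reaching T steps back scales like a product of forget-gate values, which the network can learn to keep close to 1. A plain RNN's gradient instead scales like a product of recurrent Jacobian factors that are typically below 1.

```python
# Toy comparison of gradient decay over T time steps (illustrative numbers only).
T = 100

# Plain RNN: each backprop step multiplies by roughly |w_rec * tanh'(.)|, often < 1.
rnn_factor = 0.9
print(f"RNN gradient scale after {T} steps:  {rnn_factor ** T:.2e}")   # ~2.7e-05, vanished

# LSTM cell state: dc_t/dc_(t-1) = f_t, and a learned forget gate can stay near 1.
forget_gate = 0.99
print(f"LSTM gradient scale after {T} steps: {forget_gate ** T:.2e}")  # ~3.7e-01, still usable
```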
5. Design Patterns and Variants
📦 Patterns
Gate Composition: All gates share the same input [hₜ₋₁, xₜ]
Parameter Sharing: The same weights are reused at every time step
Gradient Highway: The cell state cₜ acts as a highway with minimal interference
🔀 Variants
GRU (Gated Recurrent Unit): A simplification with fewer gates
Peephole Connections: Allow gates to look at the cell state
Bidirectional LSTM: Uses both past and future context
Stacked LSTM: Layers multiple LSTM cells for richer representation
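If you work in PyTorch, most of these variants are one constructor argument away (peephole connections are the exception: torch.nn.LSTM does not implement them). A quick sketch, with shapes chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 50, 16)   # (batch, seq_len, input_size) with batch_first=True

# Vanilla single-layer LSTM
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
out, (h_n, c_n) = lstm(x)
print(out.shape)             # torch.Size([8, 50, 32])

# GRU: fewer gates, no separate cell state
gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
out, h_n = gru(x)

# Stacked + bidirectional LSTM: two layers, past and future context
bi_lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2,
                  bidirectional=True, batch_first=True)
out, (h_n, c_n) = bi_lstm(x)
print(out.shape)             # torch.Size([8, 50, 64]), i.e. 2 directions x 32 units
```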
6. Visual Intuition
Picture this:
A conveyor belt (cell state) runs through time.
At each step, three robots (gates) inspect and modify the belt:
One removes useless info (forget gate),
One adds useful patterns (input gate),
One outputs the best summary for now (output gate).
This modular, gate-based machinery is what gives LSTM its power.
7. Use Cases
Speech Recognition: Acoustic context spanning many frames is needed to resolve phonemes and words.
Language Modeling: Understanding syntax over multiple tokens.
Anomaly Detection: Learning long-term normal patterns.
Financial Time Series: Capturing long-range dependencies.
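As a small, hypothetical illustration of the last use case, the sketch below wires nn.LSTM into a one-step-ahead forecaster for a univariate series; the class name, window length, and hidden size are arbitrary choices for the example, not a recommended configuration.

```python
import torch
import torch.nn as nn

class NextStepForecaster(nn.Module):
    """Toy model: LSTM encoder + linear head for one-step-ahead prediction."""
    def __init__(self, n_features: int = 1, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                # x: (batch, window, n_features)
        out, _ = self.lstm(x)            # out: (batch, window, hidden_size)
        return self.head(out[:, -1, :])  # predict from the last hidden state

model = NextStepForecaster()
window = torch.randn(32, 120, 1)         # 32 series, 120 past observations each
prediction = model(window)
print(prediction.shape)                  # torch.Size([32, 1])
```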
8. The Legacy: A Gateway to Modern Architectures
While LSTMs have been partially eclipsed by Transformers, their gating mechanisms live on in:
Gated attention models
Memory-augmented neural networks
Temporal Fusion Transformers
Understanding LSTMs gives you a foundation to understand attention, memory networks, and sequence modeling more broadly.
🐝 Reference Reading
Hochreiter & Schmidhuber (1997): Long Short-Term Memory
Colah’s Blog on LSTMs – Illustrated
PyTorch LSTM Docs