Training Deep Neural Networks with the Adam Optimizer
08 Apr, 2025


The Adam optimizer, short for Adaptive Moment Estimation, is a popular optimization algorithm in machine learning and deep learning. It combines the benefits of two other extensions of stochastic gradient descent (SGD): AdaGrad and RMSProp. Adam computes adaptive learning rates for each parameter by using estimates of first and second moments of the gradients. 

 

How Adam Optimizer Works 

Adam uses momentum by maintaining an exponentially decaying average of past gradients (first moment) and past squared gradients (second moment). This helps smooth the optimization process and accelerate convergence. The algorithm updates the parameters using the following steps: 

  • Compute the gradients of the loss function with respect to the parameters. 

  • Update biased first moment estimate: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ 

  • Update biased second raw moment estimate: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$ 

  • Compute bias-corrected first moment estimate: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$

  • Compute bias-corrected second raw moment estimate: $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

  • Update parameters: $\theta_t = \theta_{t-1} - \frac{\alpha \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$

Here, $\beta_1$ and $\beta_2$ are hyperparameters that control the decay rates of these moving averages (the original paper suggests 0.9 and 0.999), $\alpha$ is the learning rate, and $\epsilon$ is a small constant (e.g. $10^{-8}$) that prevents division by zero.
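
To make these steps concrete, here is a minimal NumPy sketch of a single Adam update. The function name adam_step and the toy objective are illustrative choices rather than part of any library; the hyperparameter defaults follow the values commonly used in practice.

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to the parameter vector theta at step t (t starts at 1)."""
    # Biased first and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction (matters most in the first steps, when m and v start at zero)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example: minimize f(theta) = theta^2 for a 2-dimensional parameter vector
theta = np.array([1.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 10001):
    grad = 2 * theta                      # gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                              # both entries end up close to 0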

 

Improvements Over Stochastic Gradient Descent 

Adam improves upon SGD by adapting the learning rate for each parameter individually, which helps with sparse gradients and noisy data. It also combines the benefits of momentum (which accelerates convergence) with adaptive learning rates (which handle parameters at different scales).
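
In Keras, this difference is visible in how the two optimizers are constructed: SGD takes a single global learning rate (optionally with momentum), while Adam additionally exposes the decay rates and $\epsilon$ from the update rule above. The learning rates and decay rates below are the library defaults; momentum=0.9 is simply a commonly used choice shown for comparison.

import tensorflow as tf

# Plain SGD with momentum: one global learning rate shared by every parameter
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Adam: momentum plus per-parameter adaptive step sizes,
# controlled by the beta_1, beta_2 and epsilon hyperparameters above
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)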

 

Compiling a Neural Network Model with Adam Optimizer 

To compile a neural network model with the Adam optimizer, you can use libraries like TensorFlow or Keras. Here’s an example of how to do it: 

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Number of input features (placeholder; set this to match your data)
input_dim = 10

# Define the neural network model
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(input_dim,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

# Compile the model with the Adam optimizer
model.compile(optimizer='adam', loss='mean_squared_error')
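
The string 'adam' uses Keras's default hyperparameters. To set the learning rate or the decay rates explicitly, pass an optimizer instance instead; the learning rate below is only an illustrative value.

# Equivalent compile call with an explicit Adam instance and a custom learning rate
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
              loss='mean_squared_error')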
 

 

Example of a Neural Network (NN) Model

Here’s a complete example of a neural network model with hidden layers, trained with the Adam optimizer, using Mean Squared Error (MSE) as the loss function and ReLU as the activation function in the hidden layers:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Synthetic training data (placeholders; replace with your own dataset)
input_dim = 10
X_train = np.random.rand(1000, input_dim)
y_train = np.random.rand(1000, 1)

# Define the neural network model
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(input_dim,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

# Compile the model with the Adam optimizer
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32)
 

In this example, the model consists of an input layer, two hidden layers with 64 and 32 neurons respectively, and an output layer with a single neuron. The ReLU activation function is used in the hidden layers, and the Adam optimizer is used to update the model parameters. The Mean Squared Error (MSE) is used as the loss function to measure the performance of the model during training. 
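
Once training finishes, the same model object can be evaluated and used for prediction with the standard Keras calls; the held-out data below is synthetic, mirroring the placeholder training data in the example.

# Held-out data (synthetic here, only to illustrate the calls)
X_test = np.random.rand(200, input_dim)
y_test = np.random.rand(200, 1)

# Evaluate the trained model and generate predictions
test_mse = model.evaluate(X_test, y_test, verbose=0)
predictions = model.predict(X_test)
print("Test MSE:", test_mse)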

 

Adam Optimizer in Natural Language Processing 

The Adam optimizer is extensively used in natural language processing (NLP) tasks. For instance, it is commonly employed to train large language models like BERT and GPT variants. These models benefit from Adam's ability to handle large datasets and high-dimensional parameter spaces efficiently[1][2]. Here’s an example of using Adam in an NLP task: 

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Vocabulary and sequence settings (placeholders; adjust for your dataset)
vocab_size = 10000
embedding_dim = 100
max_length = 50

# Synthetic padded token sequences and binary labels (replace with real data)
X_train = np.random.randint(0, vocab_size, size=(1000, max_length))
y_train = np.random.randint(0, 2, size=(1000, 1))

# Define the NLP model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))

# Compile the model with the Adam optimizer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=64)
  

In this example, the model uses an embedding layer followed by two LSTM layers and a dense output layer with a sigmoid activation. The Adam optimizer trains this model efficiently for tasks such as binary text classification.
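
The integer sequences in X_train above are assumed to come from a tokenization step. A hedged sketch of that step, using Keras's TextVectorization layer on a couple of made-up sentences, could look like this:

# Turn raw text into padded integer sequences (toy sentences for illustration)
raw_texts = tf.constant([
    "the movie was great",
    "the plot was boring and slow",
])
labels = np.array([[1], [0]])          # 1 = positive sentiment, 0 = negative

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,             # matches the Embedding layer's vocabulary size
    output_sequence_length=max_length  # pad/truncate every sequence to max_length tokens
)
vectorizer.adapt(raw_texts)            # build the vocabulary from the corpus
X_text = vectorizer(raw_texts)         # shape: (num_examples, max_length)

# These sequences can be fed to model.fit in place of the synthetic X_train above
model.fit(X_text, labels, epochs=2, batch_size=2)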

 
