Cross Entropy Loss in Machine Learning
07 Apr, 2025

Overview

Cross entropy loss measures the difference between two probability distributions: the predicted probability distribution and the actual distribution (often represented as a one-hot encoded vector). It is commonly used as the loss function in classification tasks, quantifying how well the predicted probabilities match the true labels. The formula for cross entropy loss is:

$$L = -\sum_{i} y_i \log(p_i)$$ 

where $y_i$ is the actual label (0 or 1) and $p_i$ is the predicted probability for class $i$.
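
As a quick illustration, the following NumPy snippet is a minimal sketch of this formula for a single sample; the label and probability values are made up.

```python
import numpy as np

# One-hot encoded true label for a 3-class problem (class 1 is the correct class).
y = np.array([0.0, 1.0, 0.0])

# Predicted probability distribution from some model (made-up values).
p = np.array([0.2, 0.7, 0.1])

# Cross entropy loss: L = -sum_i y_i * log(p_i)
loss = -np.sum(y * np.log(p))
print(loss)  # ~0.357, i.e. -log(0.7)
```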

 

Convolutional Neural Networks (CNNs) and Cross Entropy Loss

Convolutional Neural Networks (CNNs) are a type of deep learning model designed for structured grid data, such as images, and are particularly effective for image and video recognition tasks. They consist of convolutional layers that apply convolution operations to extract features and patterns, along with pooling layers and fully connected layers.

Cross entropy loss is often used as the loss function in CNNs for classification tasks. During training, the CNN adjusts its weights to minimize the cross entropy loss, thereby improving its predictions.
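
In practice, deep learning frameworks provide cross entropy loss out of the box. The sketch below assumes PyTorch; the tiny CNN architecture, tensor shapes, and fake data are illustrative, not taken from the text.

```python
import torch
import torch.nn as nn

# A tiny illustrative CNN for 10-class classification of 28x28 grayscale images.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),
    nn.MaxPool2d(2),                            # pooling layer
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # fully connected output layer
)

criterion = nn.CrossEntropyLoss()   # combines log-softmax and cross entropy

images = torch.randn(4, 1, 28, 28)  # a batch of 4 fake images
labels = torch.tensor([3, 0, 7, 1]) # fake class indices

logits = model(images)              # raw scores, shape (4, 10)
loss = criterion(logits, labels)    # scalar cross entropy loss
loss.backward()                     # gradients used to update the weights
```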

 

ReLU Activation Function in CNNs

The ReLU (Rectified Linear Unit) activation function is commonly used in CNNs. It introduces non-linearity into the model by converting negative values to zero while keeping positive values unchanged. The formula for ReLU is:

$$f(x) = \max(0, x)$$   

ReLU helps mitigate the vanishing gradient problem, allowing the model to train faster and often perform better.
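
A one-line NumPy version of ReLU (a minimal sketch with made-up input values) makes the formula concrete:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied elementwise.
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # prints roughly [0. 0. 0. 1.5 3.]
```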

 

Output Layer of a CNN        

The output layer of a CNN typically consists of a fully connected layer followed by a softmax activation function for multi-class classification tasks. The softmax function converts the raw scores from the fully connected layer into probabilities that sum to 1. The formula for the softmax function is:     

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

where $z_i$ is the raw score for class $i$.
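
The softmax conversion from raw scores to probabilities can be sketched in NumPy as follows; the score values are made up, and subtracting the maximum is a standard numerical-stability trick not mentioned in the text.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged mathematically.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])  # raw scores from the fully connected layer
probs = softmax(scores)
print(probs)        # e.g. [0.659 0.242 0.099]
print(probs.sum())  # ~1.0
```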

 

Example of Cross Entropy Loss Function with Training Data     

Consider a classification task with $N$ training samples. The cross entropy loss for the entire dataset can be expressed as:

$$L = -\frac{1}{N} \sum_{j=1}^{N} \sum_{i} y_{ij} \log(p_{ij})$$     

where $y_{ij}$ is the actual label for sample $j$ and class $i$, and $p_{ij}$ is the predicted probability for sample $j$ and class $i$.
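
The averaged form can be computed for a small batch as in the sketch below; the one-hot labels and probabilities are invented for illustration.

```python
import numpy as np

# One-hot labels y_ij and predicted probabilities p_ij for N = 3 samples, 3 classes.
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])

# L = -(1/N) * sum_j sum_i y_ij * log(p_ij)
loss = -np.mean(np.sum(Y * np.log(P), axis=1))
print(loss)  # ~0.363
```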

 

Binary Cross Entropy Loss     

Binary cross entropy loss is a special case of cross entropy loss used for binary classification tasks. It measures the difference between the actual binary labels and the predicted probabilities. The formula for binary cross entropy loss is:

$$L = -\frac{1}{N} \sum_{j=1}^{N} \left[ y_j \log(p_j) + (1 - y_j) \log(1 - p_j) \right]$$ 

where $y_j$ is the actual binary label (0 or 1) and $p_j$ is the predicted probability for sample $j$.
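
A direct NumPy translation of this formula, with made-up labels and probabilities, looks like:

```python
import numpy as np

y = np.array([1, 0, 1, 1], dtype=float)  # actual binary labels
p = np.array([0.9, 0.2, 0.7, 0.6])       # predicted probabilities

# L = -(1/N) * sum_j [ y_j*log(p_j) + (1 - y_j)*log(1 - p_j) ]
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(loss)  # ~0.30
```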

Example of Binary Classification Problem Using Binary Cross Entropy Loss            

Let's consider a binary classification problem where we want to predict whether an email is spam (1) or not spam (0). In a logistic regression model, binary cross entropy loss can be used to evaluate the performance of the model. Stochastic Gradient Descent (SGD) is an optimization algorithm that updates the model's weights iteratively based on the gradient of the loss function.

 

Example:

Initialize weights: Start with random weights. 

 

For each training sample: 

  • Compute the predicted probability $p_j$ using the logistic function:

$$p_j = \frac{1}{1 + e^{-w \cdot x_j}}$$

where $w$ is the weight vector and $x_j$ is the feature vector for sample $j$.

  • Compute the binary cross entropy loss:

$$L_j = -\left[ y_j \log(p_j) + (1 - y_j) \log(1 - p_j) \right]$$ 

  • Compute the gradient of the loss with respect to the weights:

$$\nabla L_j = (p_j - y_j) x_j$$ 

  • Update the weights using SGD:

$$w = w - \eta \nabla L_j$$

where $\eta$ is the learning rate.

By iteratively updating the weights, the model minimizes the binary cross entropy loss, improving its predictions.
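
The steps above can be assembled into a small logistic regression trainer. The sketch below is pure NumPy; the synthetic spam-like dataset, the number of epochs, and the learning rate are arbitrary choices for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 2 features per "email"; label 1 (spam) if their sum is positive.
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

w = rng.normal(size=2) * 0.01  # initialize weights randomly
eta = 0.1                      # learning rate

for epoch in range(20):
    for x_j, y_j in zip(X, y):
        p_j = 1.0 / (1.0 + np.exp(-w @ x_j))  # logistic function
        grad = (p_j - y_j) * x_j              # gradient of the loss w.r.t. w
        w -= eta * grad                       # SGD weight update

# Binary cross entropy on the training set after the final epoch.
p = 1.0 / (1.0 + np.exp(-X @ w))
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print("final BCE:", loss)
```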

 
