Aussie AI Blog
Why is Training Loss Computed in Batches?
April 26th, 2026
by David Spuler, Ph.D.
Batched Training
Training involves computing a "loss" between the probability that you want the model to predict for the correct next token and the probability it actually predicts. We use this loss to compute "gradients," which indicate the multi-dimensional direction in which the loss increases, so the model weights should be moved the opposite way. Then, the "learning rate" is used to subtract a small proportion of these gradients from each weight, and this is how we get updated weights. A small amount of change every time, over a lot of training data iterations.
Per batch.
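To make that update rule concrete, here is a minimal sketch in Python (NumPy). The sgd_update name, the learning rate, and the toy numbers are my own illustration, not any particular framework's API.

    # Minimal sketch of the per-step weight update described above:
    # new_weight = old_weight - learning_rate * gradient, applied to every weight.
    import numpy as np

    def sgd_update(weights: np.ndarray, gradients: np.ndarray, learning_rate: float = 1e-3) -> np.ndarray:
        """Move each weight a small step against its gradient (one training step)."""
        return weights - learning_rate * gradients

    # Example: one tiny update step on a toy weight vector.
    w = np.array([0.5, -0.2, 1.0])
    g = np.array([0.1, -0.4, 0.05])   # gradients of the loss w.r.t. each weight
    w = sgd_update(w, g)
    print(w)   # each weight nudged slightly in the direction that reduces the loss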
A batch is a group of sequences of tokens. LLM training averages the loss over all tokens in all the sequences in each batch, rather than just looking at one token position in one sequence. The result is a single scalar number, which is then propagated backwards to compute "gradients" and then update the model weights. So, it's merging the losses for multiple tokens (words) in multiple sequences (a batch).
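Here is a hedged sketch of that per-batch loss computation, assuming PyTorch; the tensor shapes, the random stand-in logits and targets, and the sizes are illustrative only.

    # Every token position in every sequence contributes to one scalar loss per batch.
    import torch
    import torch.nn.functional as F

    batch_size, seq_len, vocab_size = 4, 8, 1000

    # Stand-ins for real model outputs and training labels.
    logits = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)
    targets = torch.randint(0, vocab_size, (batch_size, seq_len))

    # Flatten so every token in every sequence is one row, then average them all
    # into a single scalar ("mean" reduction) before back-propagation.
    loss = F.cross_entropy(
        logits.view(-1, vocab_size),   # (batch*seq, vocab)
        targets.view(-1),              # (batch*seq,)
        reduction="mean",
    )
    loss.backward()   # one backward pass per batch, driven by the single averaged loss
    print(loss.item())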
On first impression, it sounds like an efficiency optimization to make training go faster, by computing losses only once per batch, and that it sacrifices some training accuracy to do so. Obviously, yes, it would be far more expensive to do back-propagation for every token without this per-batch averaging. But it would also seem to trade away some accuracy in the training run, because the averaged loss over a large batch of tokens hides per-token gradient information that could otherwise steer the update directions of the model weight vectors.
So, faster training but less accurate?
In fact, no. There's more to it than efficiency, and the reason for batched training is not just to reduce compute cost. Computing the loss at a more granular level introduces statistical problems that damage the training process, making it harder to converge and more prone to oscillating losses. There's a relationship between batch size and the stability of training, because the gradient averaged over a larger batch is a lower-variance estimate of the true gradient (see the numerical sketch after the reference list below), and there are a lot of research papers about gradients, variances, and batch sizes.
Here are a few of them:
- Xin Qian, Diego Klabjan, 27 Apr 2020, The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent, https://arxiv.org/abs/2004.13146
- Jichu Li, Xuan Tang, Difan Zou, 12 Feb 2026, The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient, https://arxiv.org/abs/2602.11557
- Hao Wu, 17 Apr 2026 (v3), Convergence of Riemannian Stochastic Gradient Descents: Varying Batch Sizes And Nonstandard Batch Forming, https://arxiv.org/abs/2604.06350
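To make the variance point concrete, here is a rough numerical sketch on a toy one-weight linear regression (NumPy); all names and numbers are invented for illustration. It estimates the gradient from random mini-batches of different sizes and prints how the spread of those estimates shrinks as the batch grows.

    # The averaged mini-batch gradient is a noisy estimate of the full-data gradient,
    # and its variance shrinks roughly in proportion to 1/batch_size.
    import numpy as np

    rng = np.random.default_rng(0)
    n, dim = 10_000, 1
    X = rng.normal(size=(n, dim))
    y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)   # true weight is 3.0
    w = 0.0                                              # current (bad) model weight

    def grad(xb, yb, w):
        """Gradient of mean squared error for a 1-weight linear model on one batch."""
        err = xb[:, 0] * w - yb
        return np.mean(2.0 * err * xb[:, 0])

    for batch_size in (1, 16, 256):
        grads = []
        for _ in range(1000):
            idx = rng.integers(0, n, size=batch_size)
            grads.append(grad(X[idx], y[idx], w))
        print(f"batch={batch_size:4d}  grad mean={np.mean(grads):+.3f}  std={np.std(grads):.3f}")
    # Larger batches -> much smaller spread around the same mean gradient,
    # i.e. steadier update directions and less oscillation during training.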