Aussie AI Blog

Why is Training Loss Computed in Batches?

  • April 26th, 2026
  • by David Spuler, Ph.D.

Batched Training

Training involves computing a "loss" that measures the gap between the probability the model should assign to the correct next token and the probability it actually assigns. We use this loss to compute "gradients," which indicate the multi-dimensional directions in which the model weights should move. Then the "learning rate" scales these gradients, so that only a small proportion is added to each weight, and this is how we get updated weights. A small amount of change each time, over many iterations of training data.
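
To make that concrete, here's a minimal sketch of one such update step in plain Python with NumPy. It's a toy linear model with made-up numbers (the names weights, x, target, and learning_rate are all illustrative), not a real LLM training loop, but the loss-gradient-update pattern is the same:

    import numpy as np

    # Toy model: prediction = weights . x (hypothetical example, not an LLM).
    weights = np.array([0.5, -0.2, 0.1])
    x = np.array([1.0, 2.0, 3.0])   # one training input
    target = 1.5                    # the value we want the model to predict
    learning_rate = 0.01            # small step size

    prediction = weights @ x              # forward pass
    loss = (prediction - target) ** 2     # squared-error loss (a scalar)

    # Gradient of the loss with respect to each weight, computed by hand here;
    # a framework like PyTorch would get this via backpropagation.
    gradients = 2.0 * (prediction - target) * x

    # Add a small proportion of the (negated) gradients to each weight.
    weights = weights - learning_rate * gradients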

Per batch.

A batch is a group of sequences of tokens. LLM training averages the loss over all tokens in all the sequences in each batch, rather than just looking at one token position in one sequence. The result is a single scalar number, which is then propagated backwards to compute the "gradients" used to update the model weights. So, it's merging the losses for multiple tokens (words) in multiple sequences (a batch).
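
In code, this per-batch averaging might look like the sketch below (plain NumPy; the shapes of logits, (batch, seq_len, vocab), and integer targets, (batch, seq_len), are assumptions for illustration). Frameworks do the same thing internally, e.g., PyTorch's cross-entropy loss with its default mean reduction:

    import numpy as np

    def batch_loss(logits, targets):
        # logits:  (batch, seq_len, vocab_size) raw model outputs
        # targets: (batch, seq_len) integer IDs of the correct next tokens
        # Returns one scalar: mean loss over every token in every sequence.

        # Log-softmax over the vocabulary dimension (numerically stabilized).
        shifted = logits - logits.max(axis=-1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

        # Log-probability assigned to the correct token at each position.
        b, s, _ = logits.shape
        correct = log_probs[np.arange(b)[:, None], np.arange(s)[None, :], targets]

        # Average the negative log-likelihood over all b*s token positions.
        return -correct.mean()

    # Example: batch of 2 sequences, 4 tokens each, vocabulary of 10 tokens.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(2, 4, 10))
    targets = rng.integers(0, 10, size=(2, 4))
    print(batch_loss(logits, targets))  # one scalar, ~2.3 for random logits

The key point is the final mean: many per-token losses collapse into one number before any backward pass happens.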

At first impression, this sounds like an efficiency optimization to make training go faster, by computing losses only once per batch, at the cost of some training accuracy. Obviously, yes, it would be far more expensive to run back-propagation for every token without this per-batch averaging. But it would also seem to trade away some accuracy in the training session, because the averaged loss over a large batch of tokens hides a lot of per-token gradient information that could otherwise steer the update directions of the model weight vectors.

So, faster training but less accurate?

In fact, no. There's more to it than efficiency, and the reason for batched training is not just to reduce compute cost. Computing the loss at a more granular level introduces statistical problems that damage the training process, making convergence harder to achieve and causing the loss to oscillate. There's a relationship between batch size and the stability of the training process, and many research papers analyze gradients, variances, and batch sizes.
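
A toy simulation shows the variance effect directly. The sketch below is an assumption-laden model, not real training: per-example gradients are faked as noisy samples around a single "true" gradient, with made-up values for true_gradient and noise_std, which is roughly how mini-batch SGD behaves:

    import numpy as np

    rng = np.random.default_rng(42)
    true_gradient = 1.0   # the "full-dataset" gradient (toy scalar)
    noise_std = 5.0       # per-example gradient noise (assumed value)

    for batch_size in (1, 4, 32, 256):
        # A mini-batch gradient is the mean of batch_size noisy per-example
        # gradients; averaging shrinks the noise by a factor of sqrt(batch_size).
        batch_grads = true_gradient + noise_std * rng.normal(
            size=(10_000, batch_size)).mean(axis=1)
        print(f"batch={batch_size:4d}  gradient std={batch_grads.std():.3f}")

    # Typical output: the std falls from ~5.0 at batch=1 to ~0.3 at batch=256,
    # i.e., larger batches give steadier update directions and fewer oscillations.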

Here are a few of those papers:

  1. Xin Qian, Diego Klabjan, 27 Apr 2020, The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent, https://arxiv.org/abs/2004.13146
  2. Jichu Li, Xuan Tang, Difan Zou, 12 Feb 2026, The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient, https://arxiv.org/abs/2602.11557
  3. Hao Wu, 17 Apr 2026 (v3), Convergence of Riemannian Stochastic Gradient Descents: Varying Batch Sizes And Nonstandard Batch Forming, https://arxiv.org/abs/2604.06350

Aussie AI Advanced C++ Coding Books



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency



Advanced C++ Memory Techniques: Efficiency & Safety:
  • Memory optimization techniques
  • Memory-efficient data structures
  • DIY memory safety techniques
  • Intercepting memory primitives
  • Preventive memory safety
  • Memory reduction optimizations

Get your copy from Amazon: Advanced C++ Memory Techniques



Safe C++: Fixing Memory Safety Issues:
  • The memory safety debate
  • Memory and non-memory safety
  • Pragmatic approach to safe C++
  • Rust versus C++
  • DIY memory safety methods
  • Safe standard C++ library

Get it from Amazon: Safe C++: Fixing Memory Safety Issues



Efficient C++ Multithreading: Modern Concurrency Optimization:
  • Multithreading optimization techniques
  • Reduce synchronization overhead
  • Standard container multithreading
  • Multithreaded data structures
  • Memory access optimizations
  • Sequential code optimizations

Get your copy from Amazon: Efficient C++ Multithreading



Efficient Modern C++ Data Structures:
  • Data structures overview
  • Modern C++ container efficiency
  • Time & space optimizations
  • Contiguous data structures
  • Multidimensional data structures

Get your copy from Amazon: Efficient C++ Data Structures



Low Latency C++: Multithreading and Hotpath Optimizations: advanced coding book:
  • Low Latency for AI and other applications
  • C++ multithreading optimizations
  • Efficient C++ coding
  • Time and space efficiency
  • C++ slug catalog

Get your copy from Amazon: Low Latency C++