Aussie AI

Hybrid MoE Dense FFN Architecture

  • Last Updated 23 April, 2026
  • by David Spuler, Ph.D.

What is Hybrid MoE Dense FFN Architecture?

A hybrid Mixture-of-Experts (MoE) and dense FFN architecture runs an MoE FFN in parallel with a dense, non-MoE FFN. The MoE routes each token to its sparse experts, the dense FFN runs like a classic always-on FFN, and the two results are combined at the end.
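In rough notation, the layer output looks something like:

    y = α · FFN_dense(x) + Σ_{i ∈ top-k} g_i(x) · E_i(x)

where g_i(x) are the router's gate weights for the selected experts E_i, and α is an illustrative mixing weight for the dense branch (the exact combination rule varies by model and is not specified here). The dense FFN term is always computed, regardless of routing.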

Isn't this just a shared expert?

Yes, they are similar, since a shared expert is effectively a fixed, always-run FFN, but there are some differences:

  • The dense FFN may have different dimensions from the other experts.
  • The dense FFN is not included in the normal MoE gating mechanism (although arguably, neither is a shared expert).
  • The MoE experts and the dense FFN may (possibly) be combined with different weightings at the end.

The Edge versions of the Gemma models are an example of this architecture. The idea is to run a slightly larger dense FFN alongside the various MoE experts.
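Below is a minimal PyTorch-style sketch of the idea. The module layout, hidden dimensions, top-k routing, and the learned scalar weight on the dense branch are all illustrative assumptions for this article, not the implementation of Gemma or any other particular model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridMoEDenseFFN(nn.Module):
    # A dense FFN running in parallel with a sparse MoE FFN; the two branch
    # outputs are summed at the end. The dense branch bypasses the router.
    def __init__(self, d_model=512, d_dense=2048, d_expert=1024,
                 num_experts=8, top_k=2):
        super().__init__()
        # Dense FFN: may use a different hidden dimension than the experts.
        self.dense_ffn = nn.Sequential(
            nn.Linear(d_model, d_dense), nn.GELU(), nn.Linear(d_dense, d_model))
        # Sparse experts plus their router (the dense FFN is not gated).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k
        # Learned scalar weighting of the dense branch (an assumption here).
        self.dense_scale = nn.Parameter(torch.ones(1))

    def forward(self, x):                         # x: [batch, seq, d_model]
        dense_out = self.dense_ffn(x)             # always-run dense branch
        gates = F.softmax(self.router(x), dim=-1)            # [B, S, num_experts]
        topk_w, topk_idx = gates.topk(self.top_k, dim=-1)    # sparse top-k routing
        moe_out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]                        # chosen expert per token
            w = topk_w[..., slot].unsqueeze(-1)              # its gate weight
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)              # tokens routed to expert e
                if mask.any():
                    # Computed densely for clarity; real kernels gather only routed tokens.
                    moe_out = moe_out + mask.float() * w * expert(x)
        # Combine the two parallel branches at the end.
        return self.dense_scale * dense_out + moe_out

The scalar weight on the dense branch is just one way to realize the "different weightings" point above; the dense output could equally be summed with weight 1, or given its own per-token gate.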

Research on Hybrid MoE Dense FFN Architecture

Research papers include:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency

More AI Research Topics

Read more about: