
Open Source Inference Engines for LLMs

  • Last Updated 29 August, 2025
  • by David Spuler, Ph.D.

AI inference for an LLM requires an engine. There are many open-source LLMs available online, notably from Meta and Mistral, but they all need an engine to run. Fortunately, there are multiple full open-source implementations of Transformer engines that can run inference for an LLM.
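For example, the Hugging Face Transformers library (one of the engines listed below) can run a small open model in a few lines of Python. This is a minimal sketch; the model name and generation settings are illustrative assumptions, not recommendations:

    # Minimal local inference sketch using Hugging Face Transformers.
    # The model name ("gpt2") and token count are placeholder assumptions.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    result = generator("Open source LLMs need an inference engine because",
                       max_new_tokens=40)
    print(result[0]["generated_text"])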

List of Open Source Inference Frameworks

Many examples are listed below, and it is quite an overwhelming group; a short usage sketch follows the list. The best known are PyTorch and TensorFlow, but there are many other full-stack frameworks. There are also several newer inference-specific engines, which have little or no training capability. Some of these frameworks are ML compilers (e.g., XLA and MLIR), and several others, such as LangChain and Ollama, have gained a reputation for running RAG architectures.

These frameworks are mostly offered under permissive, non-copyleft licenses that allow commercial usage (review each package for its license details).

Here's the list so far:

  • PyTorch
  • TensorFlow
  • LangChain
  • TensorRT (NVIDIA)
  • ROCm (AMD)
  • GGML
  • llama.cpp
  • MLIR (LLVM)
  • Ollama
  • LLMFarm
  • llama2.c
  • OpenVINO (Intel)
  • Transformers (Hugging Face)
  • FasterTransformer (NVIDIA)
  • vLLM
  • TGI (Text Generation Inference) (Hugging Face)
  • MXNet
  • CTranslate2
  • DeepSpeed/DeepSpeed-MII
  • OpenLLM
  • RayServe
  • tinygrad
  • MLX (Apple)
  • TinyChatEngine
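To give a flavor of the inference-specific engines, here is a minimal batch-inference sketch using vLLM from the list above; the model name and sampling settings are illustrative assumptions:

    # Minimal vLLM inference sketch; the model id and sampling
    # settings are placeholder assumptions for illustration.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-v0.1")  # any supported model id
    params = SamplingParams(temperature=0.8, max_tokens=64)
    outputs = llm.generate(["The main job of an inference engine is"], params)
    print(outputs[0].outputs[0].text)

vLLM's speedups come mainly from serving-oriented optimizations such as paged KV-cache memory management and continuous batching, which is why it appears so often in deployment stacks.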

ML compilers (graph compilers) that are open source:

  • ONNX (Industry coalition)
  • TVM (Apache)
  • MLC LLM
  • XLA (TensorFlow)
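These compilers generally consume a model graph exported from a training framework. For instance, a model exported to the ONNX format can be executed with ONNX Runtime. Here is a minimal sketch, where the file name and input shape are placeholder assumptions:

    # Minimal ONNX Runtime sketch; "model.onnx" and the input
    # shape (1, 3, 224, 224) are placeholder assumptions.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("model.onnx")
    input_name = session.get_inputs()[0].name
    x = np.random.randn(1, 3, 224, 224).astype(np.float32)  # dummy input
    outputs = session.run(None, {input_name: x})
    print(outputs[0].shape)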

And some non-open-source inference platforms:

  • CUDA (NVIDIA)
  • AICore (Google)
  • MediaPipe (Google)

Research on Inference Frameworks

Industry articles. Online blog articles and industry press releases on inference frameworks:

Research papers. Academic papers about inference frameworks, with evaluations or theoretical aspects:

Benchmarking papers. Various research papers with performance measurement and benchmarking of inference frameworks:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging
