Introduction

  • As AI becomes increasingly integral to modern software applications, deploying models directly on devices—such as smartphones, embedded systems, wearables, and edge computing nodes—has gained prominence. This approach, known as on-device machine learning, offers lower inference latency, improved privacy, and offline capability compared to cloud-based alternatives.

  • Several runtimes/inference engines have been developed to facilitate the efficient execution of ML models on diverse hardware architectures. These runtimes vary significantly in terms of platform compatibility, supported model formats, execution optimizations, and hardware acceleration. This primer covers a detailed comparison of key ML runtimes that support on-device inference:

    • TensorRT
    • Core ML
    • MLX (Apple MLX)
    • ONNX Runtime
    • ExecuTorch
    • LidarTLM
    • llama.cpp
    • TensorFlow Lite / TensorFlow Serving
  • This primer includes both general-purpose and specialized runtimes, ranging from Core ML and TensorFlow Lite to transformer-specific tools like llama.cpp and GPU-optimized engines such as TensorRT.

Architecture Overview of On-Device ML Runtimes

  • On-device machine learning runtimes are engineered to execute pre-trained models efficiently within the constraints of mobile devices, embedded platforms, and personal computers. Despite the diversity of runtimes, they typically share core architectural components that manage model parsing, hardware abstraction, and execution flow.
  • This section outlines common architectural patterns and then provides architecture summaries for each runtime discussed in this primer.

Common Architectural Layers

  • Most on-device ML runtimes follow a layered architecture consisting of the following components:

    • Model Loader / Parser: Responsible for reading serialized model files (e.g., .mlmodel, .tflite, .onnx, .pt, etc.) and converting them into an internal representation suitable for execution.

    • Serialization Format: Defines how models are stored on disk. Most runtimes use specialized formats (e.g., FlatBuffer in TFLite, Protobuf in TensorFlow/ONNX). Protobuf offers fast binary encoding and structured metadata representation, and is common in ONNX (.onnx) and TensorFlow (.pb) models.

    • Intermediate Representation (IR): Some runtimes convert models into an internal graph or IR that enables further optimization and abstraction from the original framework.

    • Kernel / Operator Library: A collection of pre-implemented mathematical operations (e.g., convolution, matmul, ReLU) that form the backbone of computation. These may be hand-optimized for specific CPU, GPU, NPU, or DSP targets.

    • Execution Engine / Scheduler: Coordinates the evaluation of the computational graph, manages dependencies, and dispatches workloads to the appropriate hardware accelerators.

    • Hardware Abstraction Layer (HAL): Encapsulates hardware-specific APIs and provides runtime support for leveraging specialized units like Apple’s ANE, Qualcomm’s Hexagon DSP, or CUDA cores on NVIDIA GPUs.

Architecture by Runtime

TensorRT

  • Model Format: .plan (TensorRT Engine)
  • Execution Flow:

    • Accepts models in ONNX, TensorFlow, or Caffe formats
    • Optimizes and compiles model into a serialized CUDA engine (.plan)
    • Engine executes directly via CUDA on supported NVIDIA GPUs
  • Hardware Support: NVIDIA GPUs (desktop, embedded, server)
  • Backend Design: Layer fusion, kernel autotuning, INT8/FP16 quantization, Tensor Cores
  • Strengths: Extreme inference speed on NVIDIA hardware, minimal latency, quantization support
  • Weaknesses: GPU-only, requires CUDA, less flexible for model updates at runtime

Core ML

  • Model Format: .mlmodel, optionally converted from other formats using coremltools
  • Execution Flow:

    • Model is compiled into a Core ML model package (.mlmodelc)
    • Uses internal execution graph
    • Runtime determines target hardware (CPU, GPU, or ANE) dynamically
  • Hardware Support: CPU, GPU, Apple Neural Engine (ANE)
  • Backend Design: Proprietary graph engine, no direct user-accessible IR
  • Strengths: Seamless Apple integration, high-level API, automatic hardware optimization
  • Weaknesses: Apple-platform only, opaque architecture, limited transparency for debugging

MLX (Apple MLX)

  • Model Format: No serialized format; models are expressed as Python tensor operations with a PyTorch-like syntax
  • Execution Flow:

    • Eager mode and graph execution both supported
    • Uses Metal Performance Shaders and ANE backend where possible
  • Hardware Support: Primarily Apple Silicon (M-series CPU, GPU, ANE)
  • Backend Design: Dynamic execution engine; uses MLX backend API
  • Strengths: Developer flexibility, research-oriented, direct tensor ops
  • Weaknesses: Early-stage, Apple-only, smaller community, fewer pre-built models

ONNX Runtime

  • Model Format: .onnx
  • Execution Flow:

    • Loads ONNX graph and converts to optimized IR
    • Graph optimization passes applied (e.g., constant folding, fusion)
    • Execution providers (EPs) handle hardware-specific execution
  • Hardware Support: CPU, GPU (CUDA, ROCm), NNAPI, DirectML, ARM, OpenVINO
  • Backend Design: Pluggable EP system, modular kernel dispatch
  • Strengths: Cross-platform, flexible, highly optimized
  • Weaknesses: Model conversion may be lossy or complex, mobile-specific tuning needed

ExecuTorch

  • Model Format: PyTorch models compiled to .ptc bytecode
  • Execution Flow:

    • TorchScript-traced models are compiled using an Ahead-of-Time (AOT) compiler
    • Produces a minimal runtime with only needed ops
    • Bytecode is executed on microcontroller or mobile device
  • Hardware Support: CPU, MCU, potentially DSP/NPU
  • Backend Design: AOT compiler, custom micro runtime, graph executor
  • Strengths: Lightweight, optimized for resource-constrained environments
  • Weaknesses: Limited model format support, newer toolchain

LidarTLM

  • Model Format: Custom or converted models for lidar data processing
  • Execution Flow:

    • Ingests sparse point cloud or voxel data
    • Uses spatial and temporal inference pipelines
  • Hardware Support: ARM CPUs, embedded GPU, or AI co-processors
  • Backend Design: Spatially-aware computation graph; sensor-fusion modules
  • Strengths: Specialized for lidar, supports sensor fusion
  • Weaknesses: Niche use case, limited community and documentation

llama.cpp

  • Model Format: Quantized LLM formats (GGUF, etc.)
  • Execution Flow:

    • Loads quantized model into memory
    • Performs batched matmul-based transformer inference
    • Multi-threaded CPU execution with optional GPU offload (via OpenCL, Metal)
  • Hardware Support: CPU, optionally GPU
  • Backend Design: Minimalist tensor framework, custom linear algebra, no IR
  • Strengths: Extremely portable, optimized for low-RAM devices, self-contained
  • Weaknesses: Focused only on LLMs, lower-level interface

TensorFlow Lite / Serving

  • Model Format: .tflite (Lite), .pb or SavedModel (Serving)
  • Execution Flow:

    • TFLite: uses FlatBuffer model, loads and interprets ops
    • Serving: REST/gRPC server for remote model inference
  • Hardware Support:

    • TFLite: CPU, GPU, EdgeTPU, NNAPI, Hexagon DSP
    • Serving: Primarily server-side; not for on-device use
  • Backend Design:

    • TFLite: statically compiled interpreters with kernel registry
    • TFLite delegates for hardware acceleration
  • Strengths: Broad compatibility, active ecosystem, stable
  • Weaknesses: Delegate configuration can be tricky, Serving not suitable for offline use

TensorRT Deep Dive

  • TensorRT is NVIDIA’s high-performance, low-latency inference runtime for deep learning models. It is purpose-built for GPU-accelerated inference and heavily optimized for NVIDIA’s hardware, including desktop GPUs, Jetson embedded boards, and datacenter GPUs with Tensor Cores.

Overview

  • Developer Target: Engineers deploying deep learning models on NVIDIA hardware
  • Use Cases: Vision inference, robotics, autonomous vehicles, embedded AI with Jetson, high-throughput servers
  • Model Format: ONNX, Caffe, TensorFlow (converted to .plan engine)
  • Conversion Tools: trtexec, TensorRT Python/C++ APIs

Architecture

  • TensorRT transforms trained models into an optimized engine using multiple optimization passes:

  • Execution Flow:

    1. Model Import: Loads model (typically ONNX) using TensorRT parser
    2. Optimization:

      • Layer fusion
      • Precision calibration (FP16, INT8)
      • Kernel selection and scheduling
    3. Engine Building:

      • Generates a .plan file (serialized CUDA engine)
      • This engine can be reused for fast deployment
    4. Inference Execution:

      • Input data fed through pre-allocated CUDA buffers
      • Execution is entirely GPU-bound using CUDA streams
  • Key Components:

    • Builder: Optimizes and generates runtime engine
    • Runtime: Loads and executes serialized engine
    • Execution Context: Holds all buffers and workspace
    • Calibrator: Generates INT8 quantization scale factors using sample data
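  • To make the build flow above concrete, here is a minimal sketch of building an FP16 engine from an ONNX file with the TensorRT Python API (a sketch assuming the TensorRT 8.x API; file names are placeholders):
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:        # placeholder ONNX model
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)      # precision calibration (FP16)
engine_bytes = builder.build_serialized_network(network, config)

with open("model.plan", "wb") as f:        # serialized CUDA engine
    f.write(engine_bytes)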

Implementation Details

  • Quantization Support:

    • FP32, FP16, and INT8 precision modes
    • INT8 requires calibration dataset (representative samples)
  • Layer Fusion:

    • Combines ops like conv + bias + activation into a single kernel
    • Reduces memory overhead and execution latency
  • Dynamic Shapes:

    • Supports engines that accept varying input sizes with shape profiles
  • Deployment:

    • Supports inference from Python or C++
    • Compatible with DeepStream SDK, TensorRT-LLM, and Jetson platforms
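  • As a hedged illustration of the dynamic-shape support noted above (continuing the builder sketch earlier in this section; the binding name "input" and the shape ranges are placeholders), an optimization profile declares the allowed input-shape range at engine build time:
profile = builder.create_optimization_profile()
profile.set_shape("input",                  # input binding name in the ONNX graph
                  min=(1, 3, 224, 224),     # smallest shape the engine accepts
                  opt=(8, 3, 224, 224),     # shape TensorRT tunes kernels for
                  max=(32, 3, 224, 224))    # largest shape the engine accepts
config.add_optimization_profile(profile)    # attach the profile to the builder config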

Pros and Cons

  • Pros:

    • Best-in-class GPU inference performance
    • Optimized for Tensor Cores (Ampere, Hopper, etc.)
    • Rich tooling (e.g., trtexec, calibration tools)
    • Integration with Jetson for embedded AI
  • Cons:

    • Requires NVIDIA GPU and CUDA runtime
    • Not suitable for CPU or cross-platform apps
    • Build/optimization pipeline adds complexity
    • Engine regeneration needed if input shape or model changes significantly

Example Workflow

  • Model Conversion (ONNX → Engine):
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16
  • C++ Inference:
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
std::ifstream engineFile("model.plan", std::ios::binary);
std::vector<char> engineData((std::istreambuf_iterator<char>(engineFile)),
                             std::istreambuf_iterator<char>());
nvinfer1::ICudaEngine* engine =
    runtime->deserializeCudaEngine(engineData.data(), engineData.size());
  • Python Inference:
import tensorrt as trt
TRT_LOGGER = trt.Logger()
with open("model.plan", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())

Suitable Applications

  • Real-time object detection on Jetson Nano/Xavier
  • Batch inference in ML inference servers
  • INT8-quantized NLP models for chatbots
  • High-throughput video analytics (via DeepStream)

  • TensorRT excels in performance-critical scenarios where latency, batch throughput, or GPU utilization is a bottleneck. It’s a specialized, production-grade runtime for teams fully committed to NVIDIA’s platform.

Core ML Deep Dive

  • Core ML is Apple’s on-device machine learning framework, designed to provide seamless model deployment and execution across the Apple ecosystem. It’s tailored for iOS, macOS, watchOS, and tvOS, offering tight integration with system-level APIs and hardware acceleration units like the Apple Neural Engine (ANE).

Overview

  • Developer Target: iOS/macOS developers
  • Use Cases: Image recognition, natural language processing, AR/VR, real-time gesture and object detection
  • Model Format: .mlmodel (converted to .mlmodelc at compile time)
  • Conversion Tools: coremltools, Apple Create ML, ONNX to Core ML converters

Architecture

  • Model Compiler: Converts .mlmodel to .mlmodelc, a compiled model package optimized for fast execution. It includes a serialized computation graph, weights, metadata, and hardware hints.

  • Execution Pipeline:

    1. Model Load: App loads the .mlmodelc file at runtime using the MLModel API.
    2. Prediction API: Developer calls prediction(input:), which triggers the internal compute graph.
    3. Backend Selection: Core ML dynamically selects the best available backend (CPU, GPU, ANE) based on model ops and hardware.
    4. Execution Engine: Executes the optimized graph using Apple’s proprietary kernel implementations.
    5. Output: Returns structured model output (class label, bounding box, etc.) as Swift-native objects.
  • Key Components:

    • MLModel Interface: Main interaction point for inference
    • MLMultiArray: N-dimensional tensor abstraction
    • MLFeatureValue / MLFeatureProvider: Input-output containers
    • NeuralNetwork.proto: Defines underlying graph schema for neural network layers

Supported Model Types

  • Neural Networks (CNNs, RNNs, Transformers)
  • Decision Trees and Ensembles (from XGBoost, scikit-learn)
  • Natural Language models (tokenizers, embeddings)
  • Audio signal processing
  • Custom models using Core ML’s custom layers

Implementation Details

  • Conversion Process:

    • Models from PyTorch, TensorFlow, scikit-learn, or XGBoost are converted either directly or via an intermediate format such as ONNX (see the conversion sketch after this list)
    • coremltools.convert() maps ops to Core ML equivalents and produces .mlmodel
    • Optional model quantization (e.g., 16-bit float) can be applied to reduce size
  • Hardware Utilization:

    • Automatically uses ANE if available (iPhone 8 and later)
    • Fallback to Metal GPU or CPU if ANE doesn’t support all ops
    • Internal heuristics determine fallback patterns and op partitioning
  • Custom Layers:

    • Developers can define MLCustomModel classes
    • Useful when Core ML lacks certain ops
    • Requires manual tensor handling and native Swift/Obj-C implementation
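  • As a hedged sketch of the conversion path described above (recent coremltools and torchvision versions are assumed; the TorchVision model is only an illustrative stand-in, and newer coremltools releases save the result as an .mlpackage container rather than a single .mlmodel file):
import torch
import torchvision
import coremltools as ct

# Trace a PyTorch model so coremltools can read its graph
torch_model = torchvision.models.mobilenet_v2(weights=None).eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(torch_model, example)

# Convert; coremltools maps PyTorch ops to Core ML equivalents
mlmodel = ct.convert(traced, inputs=[ct.TensorType(shape=example.shape)])
mlmodel.save("MobileNetV2.mlpackage")   # compiled to .mlmodelc when built into an app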

Pros and Cons

  • Pros:

    • Deep Apple integration (Vision, AVFoundation, ARKit, etc.)
    • Seamless use of hardware accelerators
    • High-level Swift API for rapid development
    • Secure and privacy-focused (no data leaves device)
    • Optimized runtime with minimal latency
  • Cons:

    • Apple-only ecosystem
    • Conversion limitations (unsupported ops in some models)
    • Limited visibility into runtime internals
    • Custom layer interface can be verbose and inflexible

Example Code Snippet

guard let model = try? MyImageClassifier(configuration: MLModelConfiguration()) else {
    fatalError("Model failed to load")
}

let input = try? MLMultiArray(shape: [1, 3, 224, 224], dataType: .float32)
// Fill input array with pixel data

let output = try? model.prediction(input: input!)
print(output?.classLabel ?? "Prediction failed")

MLX Deep Dive

  • MLX (Machine Learning eXperimentation) is a relatively new Apple-developed machine learning framework built specifically for Apple Silicon. It is designed for flexibility, research, and experimentation, offering a PyTorch-like Python API with eager and compiled execution. Unlike Core ML, which targets app integration and production deployment, MLX is meant for model development, prototyping, and edge inference—while taking full advantage of Apple hardware like the M-series chips.
  • Put simply, MLX is particularly well-suited for developers focused on rapid iteration and fine-tuning of models on Apple devices. It’s promising for LLMs and vision transformers on MacBooks and other Apple Silicon-powered hardware.

Overview

  • Developer Target: ML researchers and developers using Apple Silicon
  • Use Cases: Research, fine-tuning models on-device, LLM inference, Apple-optimized ML pipelines
  • Model Format: No proprietary serialized model format; models are expressed in Python source code using mlx.nn layers
  • Conversion Tools: Emerging support for PyTorch model import via mlx-trace and ONNX conversion

Architecture

  • MLX is a minimal and composable tensor library that uses Apple’s Metal Performance Shaders (MPS) and optionally the Apple Neural Engine (ANE) for hardware acceleration.

  • Execution Modes:

    • Eager Execution: Immediate computation for prototyping/debugging
    • Compiled Graph: Via mx.compile() (i.e., mlx.core.compile) for performance-critical inference
  • Core Components:

    • mlx.core: Tensor definitions and low-level math operations
    • mlx.nn: High-level neural network module abstraction (analogous to PyTorch’s nn.Module)
    • mlx.optimizers: Gradient-based optimizers for training
    • mlx.transforms: Preprocessing utilities (e.g., normalization, resizing)
  • Hardware Abstraction:

    • Primarily targets the GPU via MPS
    • MLX compiler performs static analysis to optimize kernel dispatch and memory usage
    • ANE support is still evolving and model-dependent

Implementation Details

  • Tensor Memory Model:

    • MLX tensors are immutable
    • Operations generate new tensors rather than mutating in-place
    • Enables functional purity and easier graph compilation
  • JIT Compilation:

    • While code is typically run in Python, MLX allows functions to be decorated with @mx.compile (from mlx.core) to trace and compile computation graphs
    • Reduces memory allocations and kernel overhead
  • Custom Modules:

    • Developers can create custom layers by subclassing mlx.nn.Module
    • Supports standard layers like Linear, Conv2d, LayerNorm, etc.
  • Interoperability:

    • MLX includes tools to convert PyTorch models using tracing (WIP)
    • No built-in ONNX or TensorFlow Lite importer yet, though development is ongoing
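  • A minimal sketch of the compilation toggle described above (the function and shapes are arbitrary and chosen only for illustration):
import mlx.core as mx

@mx.compile                 # trace once, then reuse the compiled graph on later calls
def gelu_approx(x):
    return x * mx.sigmoid(1.702 * x)

x = mx.random.normal((4, 128))
y = gelu_approx(x)          # first call compiles; subsequent calls are cheap
print(y.shape)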

Pros and Cons

  • Pros:

    • Highly optimized for Apple Silicon (especially M1/M2)
    • Lightweight and minimalist API with functional programming style
    • Supports training and inference on-device
    • Fast experimentation with eager mode and compilation toggle
    • Tensor API is intuitive for PyTorch users
  • Cons:

    • Only runs on macOS with Apple Silicon (no iOS, no Windows/Linux)
    • Ecosystem still maturing (e.g., fewer pre-trained models, limited documentation)
    • No official deployment format—source code is the model
    • Interop with other frameworks is under active development but not production-ready

Example Code Snippet

import mlx.core as mx
import mlx.nn as nn

class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(784, 256)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(256, 10)

    def __call__(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        return self.linear2(x)

model = SimpleMLP()
input = mx.random.normal((1, 784))
output = model(input)

print("Prediction:", output)
  • For accelerated inference:
compiled_fn = mx.compile(model)
output = compiled_fn(input)

ONNX Runtime Deep Dive

  • ONNX Runtime (ORT) is a cross-platform, high-performance inference engine for deploying models in the Open Neural Network Exchange (ONNX) format. Maintained by Microsoft, it is widely adopted due to its flexibility, extensibility, and support for numerous hardware backends. ONNX itself is an open standard that enables interoperability between ML frameworks like PyTorch, TensorFlow, and scikit-learn.

Overview

  • Developer Target: Application developers, MLOps teams, platform architects
  • Use Cases: Cross-framework inference, model portability, production deployments (cloud + edge), hardware acceleration
  • Model Format: .onnx (Open Neural Network Exchange format)
  • Conversion Tools: torch.onnx.export, tf2onnx, skl2onnx, and many others

Architecture

  • ONNX Runtime is structured around a pluggable and modular execution engine, making it suitable for CPU, GPU, and specialized accelerators. It uses an intermediate computation graph optimized at load time, and delegates computation to “Execution Providers” (EPs).

  • Execution Flow:

    1. Model Load: Parses the .onnx model file into an internal graph representation.
    2. Graph Optimization: Applies a set of graph rewrite passes—like constant folding, node fusion, and dead node elimination.
    3. Execution Provider Selection: Based on available hardware and EP priorities, operators are assigned to execution backends.
    4. Execution: ORT schedules and dispatches kernel calls for each partition of the graph.
    5. Output Handling: Results are returned in native types or via C/C++/Python APIs.
  • Key Components:

    • Session: InferenceSession is the main object for loading and running models.
    • Execution Providers (EPs): Modular backend plugins such as:

      • CPU (default)
      • CUDA (NVIDIA GPUs)
      • DirectML (Windows GPU)
      • OpenVINO (Intel accelerators)
      • NNAPI (Android)
      • CoreML (iOS/macOS)
      • TensorRT
      • QNN (Qualcomm AI Engine)
    • Graph Transformer: Rewrites and optimizes the computation graph
    • Kernel Registry: Maps ONNX ops to optimized implementations

Implementation Details

  • Model Format:

    • ONNX models are stored in protobuf format
    • Static computation graph with explicit type and shape information
    • Supports operator versioning to ensure backward compatibility
  • Customization:

    • Developers can register custom ops and execution providers
    • Optional use of external initializers and custom inference contexts
  • Execution Optimization:

    • Graph transformation level can be controlled (basic, extended, all)
    • EPs can share execution (e.g., some layers on CPU, others on GPU)
    • Quantization and sparsity-aware execution supported via tools like onnxruntime-tools
  • Mobile Support:

    • ONNX Runtime Mobile: A statically linked, size-reduced runtime
    • Works with Android and iOS, using NNAPI, Core ML, or CPU fallback
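  • The graph optimization level and EP priority described above are exposed through SessionOptions and the providers list; a brief sketch (the model path is a placeholder, and the CUDA EP is only used if the CUDA build of ONNX Runtime is installed):
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# EPs are tried in priority order; unsupported nodes fall back to the CPU EP
session = ort.InferenceSession(
    "resnet50.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())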

Pros and Cons

  • Pros:

    • Framework agnostic and highly interoperable
    • Broad hardware support via modular execution providers
    • Strong community and industrial backing (Microsoft, AWS, NVIDIA, etc.)
    • Mobile support with optimized builds and quantized execution
    • Extensive language bindings (Python, C++, C#, Java)
  • Cons:

    • Debugging can be complex across EPs
    • Conversion process from other frameworks may require custom scripts
    • ONNX opset compatibility issues can arise across versions
    • Mobile optimization (size, latency) requires manual tuning

Example Code Snippet (Python)

import onnxruntime as ort
import numpy as np

# Load ONNX model
session = ort.InferenceSession("resnet50.onnx")

# Prepare input
input_name = session.get_inputs()[0].name
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Run inference
outputs = session.run(None, {input_name: input_data})

print("Prediction shape:", outputs[0].shape)

Using CUDA Execution Provider:

session = ort.InferenceSession("resnet50.onnx", providers=['CUDAExecutionProvider'])

Use in Edge / On-Device Scenarios

  • ONNX Runtime Mobile is specifically designed for deployment on edge devices. Key features include:

    • Stripped-down build (~1–2 MB)
    • FlatBuffer format support in preview
    • Android NNAPI and iOS Core ML integration
    • Prebuilt minimal runtime packages for specific models
  • ONNX Runtime is best suited for applications where:

    • Portability across hardware is essential
    • Mixed execution (CPU + accelerator) is beneficial
    • The model pipeline involves multiple frameworks
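  • For edge deployments, post-training dynamic quantization is one of the simpler size/latency levers; a hedged sketch using the onnxruntime.quantization tooling (file names are placeholders):
from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights are quantized to INT8 offline; activations are quantized dynamically at runtime
quantize_dynamic(
    model_input="resnet50.onnx",
    model_output="resnet50.int8.onnx",
    weight_type=QuantType.QInt8,
)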

ExecuTorch Deep Dive

  • ExecuTorch is a lightweight runtime and deployment framework built by Meta (Facebook) to run PyTorch models on constrained edge devices, including microcontrollers (MCUs), embedded systems, and mobile hardware. It is designed with the principles of minimalism, portability, and execution efficiency. Unlike full PyTorch runtimes, ExecuTorch leverages Ahead-of-Time (AOT) compilation and produces compact bytecode representations of models.

Overview

  • Developer Target: Embedded ML engineers, mobile and edge system developers
  • Use Cases: Sensor fusion, vision at the edge, voice command detection, ultra-low-power AI applications
  • Model Format: Compiled TorchScript bytecode (.ptc)
  • Conversion Tools: PyTorch → TorchScript → ExecuTorch via AOT pipeline

Architecture

  • ExecuTorch redefines the execution pipeline for PyTorch models in low-resource environments. Its architecture includes a static graph compiler, a runtime interpreter, and pluggable dispatch interfaces for targeting different hardware backends.

  • Execution Flow:

    1. Model Export:

      • Model defined in PyTorch and traced/scripted via TorchScript.
      • ExecuTorch’s AOT compiler converts it into a compact bytecode format.
    2. Runtime Embedding:

      • The bytecode and necessary ops are compiled with the target runtime.
      • Optional op pruning removes unused operations.
    3. Deployment:

      • Model and runtime are flashed onto the device.
      • Inference is run via a lightweight VM interpreter.
  • Key Components:

    • Bytecode Format: .ptc files contain compiled operators and control flow
    • VM Runtime: A minimal interpreter that reads and executes bytecode
    • Dispatcher: Routes ops to backend implementations
    • Memory Arena: Static memory model, optionally no dynamic allocation

Implementation Details

  • AOT Compiler:

    • Converts scripted TorchScript models into bytecode and op kernels
    • Includes a model linker that statically binds required ops
    • Can target C/C++ or platform-specific formats (Zephyr, FreeRTOS)
  • Operator Handling:

    • Customizable op kernels allow device-specific optimization
    • Optional kernel fusion via compiler passes for performance
  • Runtime Constraints:

    • Code size: Can be <500 KB with aggressive pruning
    • No reliance on dynamic memory allocation (static buffer planning)
    • Designed for devices with as little as 256 KB RAM
  • Integration:

    • Written in C++
    • Can integrate with sensor pipelines, real-time OS, or MCU firmware
    • Open-sourced with tooling for building and flashing models to hardware

Pros and Cons

  • Pros:

    • Extremely lightweight, MCU-ready
    • AOT compilation reduces runtime overhead
    • Deterministic memory usage (good for real-time applications)
    • Modular and open-source with low-level control
    • PyTorch-compatible workflow for training and export
  • Cons:

    • Requires model to be written in a static subset of PyTorch
    • Limited dynamic control flow (must be scriptable)
    • Debugging and tooling less mature than mainstream PyTorch or TensorFlow Lite
    • Focused on inference only; no training support on-device

Example Workflow

  • Model Export (Python):
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

model = TinyModel()
scripted = torch.jit.script(model)
scripted.save("model.pt")
  • ExecuTorch AOT Compilation (CLI or CMake):
executorchc compile --model model.pt --output model.ptc --target cortex-m
  • Embedded Runtime Integration (C++):
#include "executorch/runtime/runtime.h"

executorch::load_model("model.ptc");
executorch::run_model(input_tensor, output_tensor);

Suitable Applications

  • Wake-word detection on MCUs
  • Gesture recognition using MEMS sensors
  • Smart agriculture (tiny vision models)
  • Battery-powered health monitoring devices

  • ExecuTorch fills a critical niche for deploying PyTorch-trained models on hardware where traditional runtimes like TensorFlow Lite or ONNX Runtime are too heavy.

LidarTLM Deep Dive

  • LidarTLM (LiDAR Tensor Layer Module) is a specialized, lower-profile runtime or processing pipeline designed for inference on LiDAR data using neural networks. It is not a mainstream or widely standardized runtime like TensorFlow Lite or ONNX Runtime, but rather refers to a class of embedded software tools tailored for 3D point cloud inference and fusion with temporal data—typically in autonomous systems, robotics, or advanced driver-assistance systems (ADAS).

  • Because LidarTLM is less commonly documented and may refer to proprietary or research-centric toolkits, this section will focus on generalized design principles, use cases, and what distinguishes LiDAR-focused runtimes from general-purpose ML engines.

Overview

  • Developer Target: Robotics, ADAS, and autonomous system engineers
  • Use Cases: Real-time 3D object detection, SLAM (Simultaneous Localization and Mapping), point cloud segmentation, obstacle avoidance
  • Model Format: Often custom or adapted from PyTorch/ONNX; serialized as tensors or voxel grids
  • Conversion Tools: Typically includes preprocessing pipelines from ROS, Open3D, or custom CUDA kernels

Architecture

  • LidarTLM-style systems typically deviate from conventional 2D image-based ML runtimes. They require efficient spatial processing, optimized memory layouts, and hardware support for sparse data structures.

  • Execution Flow:

    1. Sensor Input: Raw LiDAR packets or fused multi-sensor data (e.g., IMU + LiDAR) ingested
    2. Preprocessing: Point clouds downsampled, voxelized, or transformed to Bird’s-Eye View (BEV)
    3. Inference: Tensorized data passed through neural layers (e.g., 3D convolutions, attention modules)
    4. Postprocessing: Bounding boxes or semantic maps generated
    5. Fusion (Optional): Sensor fusion with radar, camera, or odometry
  • Key Components:

    • Spatial Encoder: Transforms sparse point clouds into dense tensor formats (e.g., voxel grids, range images)
    • Sparse CNNs or VoxelNet Layers: Specialized convolution ops for irregular input data
    • Temporal Modules: Optional RNN, attention, or transformer blocks for sequential scans
    • Hardware Abstraction: Targets CUDA-enabled GPUs or embedded AI processors (e.g., NVIDIA Xavier, TI Jacinto)

Implementation Details

  • Tensor Representation:

    • Often uses sparse tensors or hybrid dense-sparse structures
    • Libraries like MinkowskiEngine, SpConv, or custom CUDA kernels for voxel ops
    • Quantization may be used to reduce memory footprint in embedded settings
  • Optimization Techniques:

    • Efficient neighbor search (KD-trees, octrees) for local feature aggregation
    • Temporal caching of features from prior scans
    • Batch fusion for multi-sensor inputs
  • Deployment:

    • Embedded platforms like NVIDIA Jetson, TI DSPs, and ADAS-grade microcontrollers
    • Often integrated with ROS (Robot Operating System) for I/O and control flow
    • May use C++, CUDA, or even custom ASIC/NPU firmware for deterministic performance
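  • Since LidarTLM-style pipelines are largely bespoke, the following is only a generic NumPy sketch of the voxelization step referenced above, not any specific runtime's API:
import numpy as np

def voxelize(points, voxel_size=(0.1, 0.1, 0.1)):
    """Map an (N, 3) point cloud to integer voxel coordinates and keep
    one entry per occupied voxel (a sparse occupancy representation)."""
    idx = np.floor(points / np.asarray(voxel_size)).astype(np.int32)
    return np.unique(idx, axis=0)

scan = np.random.uniform(-20.0, 20.0, size=(100_000, 3)).astype(np.float32)
print("occupied voxels:", voxelize(scan).shape[0])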

Pros and Cons

  • Pros:

    • Designed for spatial and temporal data, not just 2D tensors
    • Optimized for sparse inputs and low-latency inference
    • Supports sensor fusion pipelines, enabling richer context
    • Can run on edge-grade GPUs or embedded NPUs
  • Cons:

    • Fragmented tooling, often bespoke or tightly coupled to hardware
    • Lack of standardized runtime interface (unlike ONNX or TFLite)
    • Difficult to deploy across platforms without custom engineering
    • Sparse community and documentation; often buried in academic or industrial codebases

Example Pseudocode Flow

# Step 1: Load point cloud
point_cloud = load_lidar_scan("/scans/frame_001.bin")

# Step 2: Convert to voxel grid
voxel_grid = voxelize(point_cloud, grid_size=(0.1, 0.1, 0.1))

# Step 3: Pass through 3D CNN
features = sparse_conv_net(voxel_grid)

# Step 4: Predict bounding boxes or labels
detections = decode_bounding_boxes(features)

# Step 5: Fuse with other sensors (optional)
fused_output = fuse_with_camera(detections, rgb_frame)

Suitable Applications

  • Autonomous vehicles (3D perception stacks)
  • Warehouse robots and drones
  • Industrial inspection systems
  • Advanced driver-assistance systems (ADAS)
  • SLAM systems for robotics

  • LidarTLM-like runtimes are not meant for general ML workloads but are highly optimized for 3D spatiotemporal inference, where conventional 2D model runtimes fall short. They tend to be integrated deep into hardware-specific SDKs or research frameworks.

llama.cpp Deep Dive

  • llama.cpp is an open-source, C++-based implementation of inference for large language models (LLMs), originally inspired by Meta’s LLaMA family. It focuses on efficient CPU (and optionally GPU) inference for quantized transformer models. Unlike full ML runtimes, llama.cpp is specialized, minimalist, and optimized for running LLMs—particularly on devices with constrained memory and compute budgets such as laptops, desktops, and even smartphones.

Overview

  • Developer Target: LLM researchers, app developers, hobbyists
  • Use Cases: Local chatbots, privacy-preserving LLM apps, embedded NLP on edge devices
  • Model Format: Quantized GGUF (GPT-generated GGML Unified Format)
  • Conversion Tools: Python conversion scripts from PyTorch checkpoints to GGUF

Architecture

  • llama.cpp does not use a traditional ML runtime stack. It is built from the ground up with custom tensor operations and a static execution loop tailored to transformer inference.

  • Execution Flow:

    1. Model Load: Quantized GGUF file loaded into memory
    2. KV Cache Allocation: Allocates buffers for key/value attention caching
    3. Token Embedding & Input Prep: Maps token IDs to embeddings
    4. Layer Execution Loop: Runs transformer blocks sequentially
    5. Logits Output: Computes next-token logits, passed to sampler
    6. Sampling & Token Generation: Greedy, top-k, nucleus, or temperature sampling
  • Key Components:

    • GGML Backend: Custom tensor library with support for CPU SIMD ops (AVX, FMA, NEON)
    • Quantization Layers: 4-bit, 5-bit, and 8-bit quantized matmuls
    • Inference Loop: Manually unrolled transformer stack—one layer at a time
    • KV Cache Management: Token sequence history for autoregressive decoding
  • Optional GPU Support:

    • Metal (macOS), OpenCL, CUDA support via modular backends
    • Offloading options: attention only, matmuls only, or full GPU
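  • The sampling step at the end of the loop is conceptually simple; here is a toy NumPy illustration of top-k plus temperature sampling (not llama.cpp's actual implementation):
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40):
    """Pick the next token id from raw logits using top-k + temperature sampling."""
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    candidates = np.argsort(scaled)[-top_k:]           # indices of the k most likely tokens
    probs = np.exp(scaled[candidates] - scaled[candidates].max())
    probs /= probs.sum()
    return int(np.random.choice(candidates, p=probs))

print(sample_next_token(np.random.randn(32000)))       # toy 32k-token vocabulary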

Implementation Details

  • Model Quantization:

    • Conversion scripts (e.g., convert.py) export PyTorch checkpoints to GGUF, and a separate quantize tool applies the chosen quantization type
    • Supports several quantization strategies (Q4_0, Q5_K, Q8_0, etc.)
    • Tradeoff between model size and accuracy
  • Tensor Engine:

    • No external libraries like BLAS, cuDNN, or MKL used by default
    • Uses hand-optimized C++ with platform-specific intrinsics
    • Cross-platform: macOS, Linux, Windows, WebAssembly (via WASM)
  • Memory Optimization:

    • Memory mapped file support (mmap)
    • Low memory mode: restricts KV cache or context length
    • Paging and streaming support for large contexts
  • Integration:

    • C API and Python bindings (llama-cpp-python)
    • Works with tools like LangChain, OpenRouter, and Ollama
    • Compatible with most LLaMA-family models: LLaMA, Alpaca, Vicuna, Mistral, etc.
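  • A hedged sketch of how these memory and offload options surface in the llama-cpp-python bindings (the model path is a placeholder; n_gpu_layers only has an effect when llama.cpp was built with a GPU backend):
from llama_cpp import Llama

llm = Llama(
    model_path="llama-7B.Q4_0.gguf",
    n_ctx=2048,        # context window in tokens
    n_threads=8,       # CPU threads for layers that stay on the CPU
    n_gpu_layers=20,   # number of transformer layers to offload to the GPU
    use_mmap=True,     # memory-map the GGUF file instead of copying it into RAM
)
out = llm("Explain the KV cache in one sentence.", max_tokens=48)
print(out["choices"][0]["text"])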

Pros and Cons

  • Pros:

    • Extremely fast CPU inference (real-time on M1/M2 MacBooks, and even on some Raspberry Pi 4 boards)
    • Portable and minimal dependencies
    • Quantization enables running models with <4 GB RAM
    • Easily embedded into apps, games, and command-line tools
    • Active community and ecosystem (used in projects like Ollama and LM Studio)
  • Cons:

    • Transformer-only; not a general ML runtime
    • No training support—strictly for inference
    • Manual conversion and tuning process required
    • Limited ops support; cannot easily add new ML layers

Example CLI Inference

./main -m models/llama-7B.Q4_0.gguf -p "What is the capital of France?" -n 64
  • Python Inference (via llama-cpp-python):
from llama_cpp import Llama

llm = Llama(model_path="llama-7B.Q4_0.gguf")
output = llm("Q: What is the capital of France?\nA:", max_tokens=32)
print(output["choices"][0]["text"])
  • WebAssembly Example (Browser):

    • Precompiled WASM version can run LLMs client-side using WebGPU
    • Useful for private, offline AI assistants directly in browser

Suitable Applications

  • Private, offline chatbots
  • Voice assistants embedded in hardware
  • Context-aware agents in games or productivity apps
  • Developer tools with local NLP capabilities

  • llama.cpp showcases what is possible with small, optimized transformer runtimes and CPU-centric design. It’s not a general-purpose ML runtime but a powerful engine for language inference where privacy, portability, or internet-free operation is desired.

TensorFlow Lite / TensorFlow Serving Deep Dive

  • TensorFlow Lite (TFLite) and TensorFlow Serving are two distinct components from the TensorFlow ecosystem optimized for inference, but they serve different purposes and deployment environments.

  • TensorFlow Lite is designed for on-device inference, particularly for mobile, embedded, and IoT platforms.
  • TensorFlow Serving is designed for cloud and server-side model deployment, providing high-throughput, low-latency model serving over gRPC or HTTP.

  • This section focuses primarily on TensorFlow Lite due to its relevance to on-device ML runtimes, with a comparative note on Serving at the end.

Overview

  • Developer Target: Mobile developers, embedded engineers, production ML ops
  • Use Cases: Real-time image classification, object detection, audio processing, NLP, edge analytics
  • Model Format: .tflite (FlatBuffer format)
  • Conversion Tools: TensorFlow → TFLite via TFLiteConverter

TensorFlow Lite Architecture

  • TFLite’s design emphasizes performance, size efficiency, and hardware acceleration. It is structured around a model interpreter, a delegate mechanism for hardware acceleration, and a set of optimized operator kernels.

  • Execution Flow:

    1. Model Conversion:

      • Uses TFLiteConverter to convert SavedModel or Keras models into a FlatBuffer-encoded .tflite model.
    2. Model Load:

      • The model is loaded by the Interpreter class on the target device.
    3. Tensor Allocation:

      • Memory buffers for input/output tensors are allocated.
    4. Inference Execution:

      • The interpreter evaluates the computation graph, optionally using delegates.
    5. Postprocessing:

      • Output tensors are read and interpreted by the application.
  • Key Components:

    • FlatBuffer Model: Compact, zero-copy, serializable model format
    • Interpreter: Core engine that evaluates the model graph
    • Delegate Interface: Offloads subgraphs to specialized hardware (GPU, DSP, NPU)
    • Kernel Registry: Maps ops to optimized C++ implementations (or delegates)

Implementation Details

  • Model Conversion:

    • Converts SavedModels, Keras .h5, or concrete functions to .tflite
    • Supports post-training quantization (dynamic, full integer, float16)
    • Model optimizations include constant folding, op fusion, and pruning
  • Delegates:

    • Optional hardware acceleration backends:

      • NNAPI (Android)
      • GPU Delegate (OpenCL, Metal)
      • Hexagon Delegate (Qualcomm DSP)
      • Core ML Delegate (iOS/macOS)
      • EdgeTPU Delegate (Coral devices)
    • Delegates work by “claiming” supported subgraphs during interpreter initialization

  • Threading and Performance:

    • Supports multi-threaded inference
    • Interpreter can be run in C++, Java, Kotlin, Python, Swift
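  • To complement the inference example later in this section, here is a minimal conversion sketch showing post-training dynamic-range quantization via TFLiteConverter (the Keras model is an arbitrary stand-in):
import tensorflow as tf

# Any Keras model works here; this tiny one is only for illustration
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic-range quantization
tflite_bytes = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_bytes)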

TensorFlow Serving (Short Overview)

  • Designed for scalable deployment of TensorFlow models on servers
  • Models are exposed as REST/gRPC endpoints
  • Automatically loads, unloads, and versions models
  • Uses SavedModel format, not .tflite
  • Not suitable for offline or embedded deployment

  • Use Case Comparison:


| Feature | TensorFlow Lite | TensorFlow Serving |
|---|---|---|
| Target Device | Mobile/Edge | Cloud/Server |
| Model Format | .tflite | SavedModel |
| Communication | In-process / Local | gRPC / REST |
| Latency | Milliseconds | Sub-second to seconds |
| Training Support | No | No (inference only) |
| Deployment Size | Small (~100s of KB) | Large, server framework |

Pros and Cons

  • Pros (TensorFlow Lite):

    • Compact and efficient format (FlatBuffer)
    • Broad hardware delegate support
    • Quantization-aware and post-training optimizations
    • Cross-platform support (iOS, Android, Linux, microcontrollers)
    • Strong ecosystem and pre-trained model zoo (tflite-model-maker)
  • Cons (TensorFlow Lite):

    • Supports only a subset of TensorFlow ops (unsupported ops require Select TF ops or custom ops)
    • Delegate behavior can be opaque and platform-dependent
    • Conversion can fail silently if unsupported ops are encountered
    • Debugging delegate fallbacks can be non-trivial

Example Inference (Python - TFLite)

import tensorflow as tf
import numpy as np

# Load model
interpreter = tf.lite.Interpreter(model_path="mobilenet_v2.tflite")
interpreter.allocate_tensors()

# Prepare input
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_data = np.random.rand(1, 224, 224, 3).astype(np.float32)

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
print("Prediction:", output_data)
  • Delegate usage (Android NNAPI, example via Java/Kotlin):
Interpreter.Options options = new Interpreter.Options();
options.addDelegate(new NnApiDelegate());
Interpreter interpreter = new Interpreter(tfliteModel, options);

Suitable Applications

  • On-device health and fitness apps
  • Real-time object detection in AR
  • Offline voice recognition
  • Edge anomaly detection
  • TinyML deployments with TensorFlow Lite for Microcontrollers

  • TensorFlow Lite remains one of the most production-hardened and flexible runtimes for on-device ML, particularly in mobile and embedded contexts. Its support for multiple delegates and optimizations makes it a go-to choice for developers deploying models outside the cloud.

Comparative Analysis

  • Below are detailed tabular comparisons that encapsulate the key aspects of the on-device ML runtimes discussed in this primer.

General Characteristics

| Attribute | TensorRT | Core ML | MLX | ONNX Runtime | ExecuTorch | LidarTLM | llama.cpp | TensorFlow Lite | TensorFlow Serving |
|---|---|---|---|---|---|---|---|---|---|
| Target Platform(s) | NVIDIA Jetson, Desktop, Server | Apple devices (iOS/macOS) | Apple Silicon (macOS only) | Cross-platform | Embedded, mobile, MCU | Robotics, automotive, ADAS | Desktop, mobile, browser | Cross-platform (mobile/edge) | Cloud / server environments |
| ML Task Focus | Optimized inference | General ML (vision, NLP) | Research, transformer/NLP | General ML | Ultra-light inference | 3D spatial perception | Large language model inference | General ML | Scalable inference serving |
| Inference Only? | Yes | Yes | No (supports training) | Yes | Yes | Yes | Yes | Yes | Yes |
| Open Source? | Partially (binaries open, tools closed) | Partially (via tools) | Yes | Yes | Yes | Partially / variable | Yes | Yes | Yes |

Model Formats and Conversion

| Attribute | TensorRT | Core ML | MLX | ONNX Runtime | ExecuTorch | LidarTLM | llama.cpp | TensorFlow Lite | TensorFlow Serving |
|---|---|---|---|---|---|---|---|---|---|
| Primary Format | .plan (TensorRT engine file) | .mlmodelc | Python-defined layers | .onnx | .ptc (compiled TorchScript) | Custom / converted .onnx / raw tensors | .gguf (quantized LLMs) | .tflite (FlatBuffer) | SavedModel (.pb, .pbtxt) |
| Supported Frameworks | PyTorch, ONNX | PyTorch, TF (via converters) | Native Python API | PyTorch, TensorFlow, others | PyTorch (TorchScript subset) | PyTorch, TensorFlow (via export) | LLaMA-family only | TensorFlow, Keras | TensorFlow only |
| Conversion Required? | Yes (from ONNX or PyTorch export) | Yes (via coremltools) | No | Yes (usually from PyTorch) | Yes (via AOT compiler) | Yes, often includes preprocessing | Yes (convert + quantize) | Yes (TFLiteConverter) | No (already in target format) |

Execution Model and Hardware Support

| Attribute | TensorRT | Core ML | MLX | ONNX Runtime | ExecuTorch | LidarTLM | llama.cpp | TensorFlow Lite | TensorFlow Serving |
|---|---|---|---|---|---|---|---|---|---|
| Execution Type | AOT compiled CUDA graph | Eager, dynamic hardware assignment | Eager + compiled graph | Static graph with runtime optimizations | Bytecode VM interpreter | Sparse 3D graph + temporal flow | Manual loop over transformer layers | Static interpreter + delegates | REST/gRPC inference pipeline |
| CPU Support | No (GPU only) | Yes (fallback) | Yes (M1/M2 optimized) | Yes (default EP) | Yes | Yes | Yes (highly optimized) | Yes | Yes |
| GPU Support | Yes (CUDA, Tensor Cores) | Yes (Metal) | Yes (via MPS) | Yes (CUDA, DirectML, etc.) | Limited | Yes (CUDA, embedded GPUs) | Optional (Metal, CUDA, OpenCL) | Yes (OpenCL, Metal) | No |
| NPU / DSP Support | No | Yes (Apple ANE) | Emerging ANE support | Yes (via NNAPI, OpenVINO, etc.) | Potential via backend interface | Yes (TI, Nvidia, ADAS accelerators) | No (LLM-focused, CPU-oriented) | Yes (NNAPI, EdgeTPU, Hexagon) | No |
| Hardware Abstraction | Low-level plugin engine, manual tuning | Automatic | Manual tuning via MLX | Modular Execution Providers (EPs) | Compiled dispatcher with targets | Device-specific optimization required | Low-level SIMD/CUDA offload | Delegate-based (pluggable) | N/A |

Optimization, Size, and Constraints

| Attribute | TensorRT | Core ML | MLX | ONNX Runtime | ExecuTorch | LidarTLM | llama.cpp | TensorFlow Lite | TensorFlow Serving |
|---|---|---|---|---|---|---|---|---|---|
| Model Optimization Support | Yes (kernel tuning, quantization, FP16/INT8) | Yes (ANE targeting, quantization) | No built-in, manual scripting | Yes (quantization, pruning, graph fusion) | Yes (operator pruning, bytecode fusion) | Yes (3D-aware compression and fusions) | Yes (quantized GGUF) | Yes (quantization, fusion) | Yes (batching, threading) |
| Runtime Size | Medium (~5–15 MB) | Medium (~5–10 MB) | Medium | Large (5–30 MB) | Very small (<1 MB) | Medium–Large | Small–Medium | Small (~0.5–5 MB) | Very large (>100 MB) |
| Memory Footprint (Inference) | Low to moderate (GPU memory bound) | Low to moderate | Moderate (GPU-heavy) | Variable (depends on EPs) | Ultra-low (sub-MB possible) | High (large point cloud buffers) | Low (~3–6 GB RAM for 7B models) | Low | High |
| Latency | Very low (sub-ms possible) | Low (with ANE/GPU) | Medium (eager mode) | Variable (highly EP dependent) | Very low | Moderate to high (depends on density) | Low (for small LLMs) | Low (under 10 ms typical) | Moderate to high |

Flexibility, Debugging, and Ecosystem

| Attribute | TensorRT | Core ML | MLX | ONNX Runtime | ExecuTorch | LidarTLM | llama.cpp | TensorFlow Lite | TensorFlow Serving |
|---|---|---|---|---|---|---|---|---|---|
| Custom Ops Support | Yes (via plugin library API) | Limited (via MLCustomModel) | Full (via Python subclassing) | Yes (custom EPs and ops) | Yes (C++ op authoring) | Yes (often required) | No (fixed transformer kernel set) | Yes (C++/C custom kernels) | Yes |
| Community & Documentation | Strong NVIDIA developer support, active forums | Strong, Apple developer-centric | Niche, growing | Very strong | Growing (Meta-sponsored) | Limited / hardware-vendor specific | Active open-source base | Mature, large community | Very mature in production |
| Debugger Support | Nsight Systems, profiling tools, verbose logging | Xcode tools | Python debug console | Moderate (model inspection tools) | Minimal (CLI, log-based) | Custom tooling per device | Log-level output only | TensorBoard-lite, CLI tools | Monitoring via Prometheus, etc. |
| Ease of Use | Medium (manual optimization, engine building) | High for Apple developers | Medium (researchers, tinkerers) | Moderate to high (depends on EP) | Medium (steep setup curve) | Low (requires system integration) | High (once model is quantized) | High (especially with model maker) | Medium to high (requires infra) |

Comparative Summary and Guidance

Feature Comparison Table

  • This section provides a side-by-side comparison of the on-device ML runtimes discussed, highlighting their architectural differences, platform support, performance characteristics, and ideal use cases. This helps clarify which runtime best fits various project needs, from embedded development to mobile apps and language model inference.
| Runtime | Platform Support | Model Format | Hardware Acceleration | Optimized For | Custom Ops | Size Footprint |
|---|---|---|---|---|---|---|
| TensorRT | NVIDIA GPUs (desktop, Jetson, server) | ONNX, `.plan` (engine file) | CUDA, Tensor Cores | Low-latency GPU inference | Yes (via plugin system) | Medium (~5–15 MB) |
| Core ML | Apple only (iOS/macOS) | `.mlmodelc` | CPU, GPU, ANE | App integration on Apple devices | Limited | Medium (~2–10 MB) |
| MLX | Apple Silicon (macOS) | Python code | MPS, ANE (partial) | Research & experimentation | Yes | Medium (~2–5 MB) |
| ONNX Runtime | Cross-platform (Mobile & Desktop) | `.onnx` | CUDA, NNAPI, DirectML, etc. | Cross-framework interoperability | Yes | Large (~5–30 MB) |
| ExecuTorch | Embedded, MCUs, Android | Compiled TorchScript (`.ptc`) | CPU, MCU, DSP | Ultra-light edge inference | Yes | Very small (<1 MB) |
| LidarTLM | Embedded/Robotics | Custom/ONNX | CUDA, DSP, NPU | Sparse point cloud inference | Yes | Medium–Large |
| llama.cpp | Desktop, Mobile, WASM | Quantized GGUF | CPU, Optional GPU | Efficient LLM inference | Limited | Small–Medium (CPU) |
| TFLite | Cross-platform (MCU to mobile) | `.tflite` | NNAPI, GPU, DSP, EdgeTPU | Mobile and embedded AI | Yes | Small (~500 KB–5 MB) |
| TF Serving | Cloud/Server | SavedModel | N/A | Scalable online inference | Yes | Very large (>100 MB) |

Strengths by Runtime

  • Core ML: Best for iOS/macOS developers needing deep system integration with the Apple ecosystem. Ideal for apps that use Vision, SiriKit, or ARKit.

  • MLX: Best for Mac-based researchers and developers who want PyTorch-like flexibility and native hardware performance without deploying to iOS.

  • ONNX Runtime: Best for cross-platform deployments and teams needing a unified inference backend across mobile, desktop, and cloud. Excellent hardware flexibility.

  • ExecuTorch: Best for extremely constrained devices like MCUs, or custom silicon. Perfect for edge intelligence with hard memory and latency budgets.

  • LidarTLM: Best for autonomous systems, robotics, and 3D SLAM applications that involve high-bandwidth spatial data like LiDAR or radar.

  • llama.cpp: Best for private, local LLM inference on personal devices or embedding transformer models into apps without requiring cloud or heavy runtimes.

  • TFLite: Best all-around runtime for mobile and embedded ML. Huge ecosystem, widespread delegate support, and tooling maturity.

  • TF Serving: Best for cloud applications needing high-volume model serving (e.g., for APIs). Not designed for local or offline inference.

Runtime Selection Guidance

  • If you’re deploying to iOS or macOS:

    • Use Core ML for production apps.
    • Use MLX for research, local experimentation, or custom modeling.
  • If you’re deploying to embedded edge devices:

    • Use ExecuTorch for PyTorch-based workflows.
    • Use TensorFlow Lite for Microcontrollers for tight memory constraints.
    • Consider LidarTLM-style tools if dealing with 3D spatial data.
  • If you’re targeting Android or need portability:

    • Use TensorFlow Lite or ONNX Runtime with delegates like NNAPI or GPU.
  • If you’re working with LLMs locally:

    • Use llama.cpp for best CPU-based inference and minimal setup.
  • If you want cross-framework model portability:

    • Use ONNX Runtime with models exported from PyTorch, TensorFlow, or others.
  • If you require real-time, high-volume cloud inference:

    • Use TensorFlow Serving or ONNX Runtime Server.

Final Thoughts

  • Choosing the right on-device ML runtime depends heavily on the following factors:

    • Deployment environment (mobile, embedded, desktop, web, cloud)
    • Model architecture (CNN, RNN, transformer, etc.)
    • Performance requirements (latency, FPS, memory usage)
    • Development preferences (PyTorch, TensorFlow, raw C++, etc.)
    • Hardware capabilities (CPU, GPU, NPU, DSP, etc.)
  • Each runtime discussed in this primer is best-in-class for a certain domain or design constraint. Rather than a “one-size-fits-all” solution, success in on-device ML depends on thoughtful matching between the model, target platform, and available tools. Here’s a summary of the best-fit runtime across a range of scenarios:

    • Best for Apple-native app development: Core ML
    • Best for Apple-based model experimentation: MLX
    • Best for cross-platform portability and hardware access: ONNX Runtime
    • Best for minimal embedded inference: ExecuTorch
    • Best for 3D LiDAR/robotics: LidarTLM-like stacks
    • Best for on-device LLM inference: llama.cpp
    • Best for mobile/embedded general ML: TensorFlow Lite
    • Best for scalable cloud inference: TensorFlow Serving
  • In machine learning runtimes, how a model is serialized—i.e., stored and structured on disk—is critical for performance, compatibility, and portability. Serialization formats determine how the computation graph, parameters, metadata, and sometimes even execution plans are encoded and interpreted by the runtime. Each runtime typically adopts a format aligned with its optimization goals: whether that’s minimal size, fast loading, platform neutrality, or human-readability for debugging.
  • Here we briefly compare four major serialization formats used across popular on-device ML runtimes: Protocol Buffers (Protobuf), FlatBuffer, GGUF, and Bytecode formats, reinforcing how data structures are stored, loaded, and interpreted at runtime.

Protocol Buffers (Protobuf)

  • Used by: TensorFlow (SavedModel, .pb), ONNX (.onnx)

  • Developed by: Google

  • Type: Binary serialization framework

  • Key Characteristics:

    • Encodes structured data using .proto schemas
    • Supports code generation in multiple languages (Python, C++, Java, etc.)
    • Strict type definitions with schema versioning
    • Produces portable, efficient, extensible binary files
  • Advantages:

    • Highly compact, faster than JSON/XML
    • Strong backward and forward compatibility through schema evolution
    • Ideal for representing complex hierarchical graphs (e.g., model computation trees)
  • In ML context:

    • TensorFlow: Stores entire computation graph, tensor shapes, and metadata in .pb (protobuf binary)
    • ONNX: Defines all model ops, weights, and IR-level metadata via Protobuf-defined schema
  • Limitations:

    • Parsing requires full message decoding into memory
    • Less suited for minimal-footprint scenarios (e.g., MCUs)
  • Example:

    • Used in: TensorFlow (.pb, SavedModel), ONNX (.onnx)

    • Protobuf defines a schema in .proto files and serializes structured binary data. Here’s a simplified view:

    • Schema Definition (graph.proto):

        message TensorShape {
          repeated int64 dim = 1;
        }
      
        message Node {
          string op_type = 1;
          string name = 2;
          repeated string input = 3;
          repeated string output = 4;
        }
      
        message Graph {
          repeated Node node = 1;
          repeated TensorShape input_shape = 2;
          repeated TensorShape output_shape = 3;
        }
      
    • Example Python Usage (ONNX-style):

        import onnx
      
        model = onnx.load("resnet50.onnx")
        print(model.graph.node[0])  # Shows first operation (e.g., Conv)
      
    • Serialized File:

      • A binary .onnx or .pb file that’s unreadable in plain text but represents a complete computation graph, including ops, shapes, attributes, and weights.

FlatBuffer

  • Used by: TensorFlow Lite (.tflite)

  • Developed by: Google

  • Type: Binary serialization library with zero-copy design

  • Key Characteristics:

    • Allows direct access to data without unpacking (zero-copy reads)
    • Compact binary representation optimized for low-latency parsing
    • Built-in schema evolution support
  • Advantages:

    • Near-instantaneous loading—no deserialization overhead
    • Perfect for mobile/embedded devices with tight latency or startup constraints
    • Schema-aware tooling for validation
  • In ML context:

    • .tflite files store computation graphs, tensors, and metadata using FlatBuffer encoding
    • Facilitates runtime interpretation without converting the graph into a different memory format
  • Limitations:

    • Harder to inspect/debug than JSON or Protobuf
    • Limited dynamic structure capabilities compared to Protobuf
  • Example:

    • Used in: TensorFlow Lite (.tflite)

    • FlatBuffer does not require unpacking into memory. Instead, the graph is directly accessed as a binary blob using precompiled accessors.

    • FlatBuffer Schema (simplified):

        table Tensor {
          shape: [int];
          type: int;
          buffer: int;
        }
      
        table Operator {
          opcode_index: int;
          inputs: [int];
          outputs: [int];
        }
      
        table Model {
          tensors: [Tensor];
          operators: [Operator];
        }
      
    • Example Python Usage:

        import tensorflow as tf
      
        interpreter = tf.lite.Interpreter(model_path="mobilenet_v2.tflite")
        interpreter.allocate_tensors()
        print(interpreter.get_input_details())
      
    • Serialized File:

      • A .tflite file with FlatBuffer encoding, which includes all tensors, ops, and buffers in an efficient, zero-copy layout.

GGUF (GPT-generated GGML Unified Format)

  • Used by: llama.cpp and its LLM-compatible ecosystem

  • Developed by: Community (successor to GGML model format)

  • Type: Lightweight binary tensor format for large language models

  • Key Characteristics:

    • Encodes quantized transformer weights and architecture metadata
    • Designed for efficient memory mapping and low-RAM usage
    • Built for CPU-first inference (with optional GPU support)
  • Advantages:

    • Extremely compact, especially with quantization (4–8 bit)
    • Simple, fast memory-mapped loading (mmap)
    • Compatible with CPU-based inference engines (no dependencies)
  • In ML context:

    • Stores models like LLaMA, Mistral, Alpaca after quantization
    • Used by llama.cpp, llm.cpp, text-generation-webui, and other local LLM tools
  • Limitations:

    • Not general-purpose—only suitable for transformer LLMs
    • Lacks complex graph control (branching, dynamic ops)
  • Example:

    • Used in: llama.cpp, quantized LLMs

    • GGUF (GGML Unified Format) is a binary container for transformer weights and metadata.

    • Header Block (example layout in binary format):

        GGUF
        version: 3
        tensor_count: 397
        metadata:
          model_type: llama
          vocab_size: 32000
          quantization: Q4_0
      
    • Python conversion (from PyTorch):

        python convert.py --input model.bin --output model.gguf --format Q4_0
      
    • Reading from llama.cpp:

        gguf_context *ctx = gguf_init_from_file("llama-7B.Q4_0.gguf");
        ggml_tensor *wq = gguf_get_tensor_by_name(ctx, "layers.0.attn.wq");
      
    • Serialized File:

      • A .gguf file storing quantized tensors, model metadata, and attention layer structure—compact and mmap-compatible.

Bytecode Format (ExecuTorch)

  • Used by: ExecuTorch

  • Developed by: Meta

  • Type: Custom AOT-compiled bytecode

  • Key Characteristics:

    • Outputs compact bytecode (.ptc) from PyTorch models via TorchScript tracing
    • Prunes unused operators to reduce binary size
    • Embeds minimal op metadata needed for runtime VM
  • Advantages:

    • Highly portable and minimal—can run on MCUs and RTOS platforms
    • Deterministic memory usage and low overhead
    • Enables static linking of models and kernels for bare-metal systems
  • In ML context:

    • Targets constrained devices (sub-MB RAM)
    • Supports fixed operator sets with predictable memory and runtime behavior
  • Limitations:

    • Rigid format—not well suited for dynamic models or rich graph structures
    • Tied closely to PyTorch tracing and compilation pipeline.
  • Example:

    • Used in: ExecuTorch (.ptc format)

    • ExecuTorch compiles PyTorch models into bytecode similar to a virtual machine instruction set.

    • Model Compilation:

        import torch
      
        class Net(torch.nn.Module):
            def forward(self, x):
                return torch.relu(x)
      
        scripted = torch.jit.script(Net())
        scripted.save("net.pt")  # TorchScript
      
        # Compile to ExecuTorch format
        !executorchc compile --model net.pt --output net.ptc
      
    • Runtime Use in C++:

        executorch::Runtime runtime;
        runtime.load_model("net.ptc");
        runtime.invoke(input_tensor, output_tensor);
      
    • Serialized File:

      • A .ptc file containing static bytecode for model logic, stripped of unused ops, ready for microcontroller inference.

Comparative Analysis

  • Understanding the serialization format is crucial when choosing a runtime—especially for performance, portability, and debugging. Developers targeting mobile and embedded environments often prefer FlatBuffer or bytecode for efficiency, while cloud/server or cross-platform projects benefit from Protobuf’s rich graph encoding.
| Format | Used By | Format Type | Example File | Viewability | Tool to Inspect | Strengths | Limitations |
|---|---|---|---|---|---|---|---|
| Protobuf | TensorFlow, ONNX | Binary (schema-driven) | model.onnx, model.pb | Binary | onnx, tf.saved_model_cli | Cross-platform, schema evolution, rich structure | Larger footprint, full deserialization |
| FlatBuffer | TensorFlow Lite | Zero-copy binary | model.tflite | Binary | flatc, tflite API | Instant loading, ideal for embedded use | Harder to inspect/debug |
| GGUF | llama.cpp | Binary tensor map | llama-7B.Q4_0.gguf | Binary | llama.cpp, gguf_dump.py | Ultra-compact, mmap-friendly, quantized | LLM-specific only |
| Bytecode | ExecuTorch | Compiled AOT VM | model.ptc | Binary | executorchc, ExecuTorch API | Tiny runtime, embedded-friendly | Limited flexibility, PyTorch-only |
| TensorRT Engine | TensorRT | Binary CUDA engine | model.plan | Binary | TensorRT API (trtexec) | Hardware-optimized, precompiled inference | NVIDIA-only, not portable |

Further Reading

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledMLRuntimes,
  title   = {ML Runtimes},
  author  = {Chadha, Aman},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://aman.ai}}
}