Primers • On-device Transformers
- Architectural Overview and Compute Characteristics of Transformers
- Hardware-Specific Characteristics (CPU, GPU, TPU, NPU)
- Optimization Techniques for On-Device Transformers
- Practical Considerations and Pitfalls When Deploying Transformers on CPU, GPU, and NPUs
- Further Reading
- Citation
Architectural Overview and Compute Characteristics of Transformers
Transformer models, introduced by Vaswani et al. in 2017, rely on a self-attention mechanism that allows them to process sequences without recurrence. Their architecture consists primarily of encoders, decoders, or both — depending on the task. For instance:
- BERT: Encoder-only (used for classification, embedding, etc.)
- GPT: Decoder-only (used for generation)
- T5 / BART: Encoder-Decoder (used for translation, summarization)
Each part presents different computational demands, and this impacts how they perform across different hardware platforms (CPUs, GPUs, TPUs, NPUs). Below is a breakdown of the compute vs. memory characteristics of each component.
Encoder: Compute-bound Nature
The encoder performs operations on entire sequences in parallel. The main computational components are:
- Multi-Head Self-Attention (MHSA)
- Feedforward Networks (FFN)
- LayerNorm / Dropout (lightweight)
The attention operation in the encoder is computed over the full sequence:

Attention Score Calculation:
\[\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right)V\]
where:
- $Q = XW^Q$, $K = XW^K$, $V = XW^V$
- $X$: Input sequence
- $W^Q, W^K, W^V$: Projection matrices
This means the compute scales as $\mathcal{O}(n^2 \cdot d)$, where $n$ is the sequence length and $d$ is the hidden dimension.
The FFN consists of two matrix multiplications:
\[\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\]
Typically, $W_1 \in \mathbb{R}^{d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times d}$.
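To make the shapes concrete, here is a minimal PyTorch sketch of single-head scaled dot-product attention and the position-wise FFN; the batch size, sequence length ($n = 128$), and hidden dimension ($d = 512$) are illustrative choices, not tied to any particular model.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, n, d_k); the score matrix is (n, n), hence O(n^2 * d) compute.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, n, n)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                   # (batch, n, d_k)

def ffn(x, W1, b1, W2, b2):
    # Two matmuls with a ReLU in between: max(0, xW1 + b1)W2 + b2
    return torch.relu(x @ W1 + b1) @ W2 + b2

# Illustrative shapes: n = 128 tokens, d = 512 hidden units, 4d FFN width.
n, d = 128, 512
X = torch.randn(1, n, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
attn_out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)

W1, b1 = torch.randn(d, 4 * d), torch.zeros(4 * d)
W2, b2 = torch.randn(4 * d, d), torch.zeros(d)
out = ffn(attn_out, W1, b1, W2, b2)   # (1, n, d)
```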
These operations are highly parallelizable and benefit from vectorized matrix operations, making them well-suited for GPUs and TPUs. CPUs can also achieve good performance with optimized kernels such as oneDNN (formerly MKL-DNN), but they are generally slower due to fewer cores and lower memory bandwidth.
In short:
- Memory usage: Moderate
- Compute intensity: High
- Best suited hardware: GPU, TPU, NPU (CPU feasible for small models or batch sizes)
Decoder: Memory-bound Nature
The decoder is autoregressive — it generates one token at a time and feeds it back in:
- Attention involves both self-attention (causal) and encoder-decoder attention
- Cannot parallelize the same way as the encoder during generation
- During generation, keys and values grow with each step, leading to large KV caches
Hence, the decoder is more memory-bound than compute-bound during generation:
- Storing and reading from the KV cache dominates the time per token
- The cost grows linearly with sequence length
At each step $t$, the model computes attention for the new token's query against all cached keys and values:
\[\text{Attention}(Q_t, K_{1:t}, V_{1:t}) = \text{softmax}\left( \frac{Q_t K_{1:t}^T}{\sqrt{d_k}} \right)V_{1:t}\]
Consequences:
- Prefill (encoding the initial prompt/context in one pass) is parallelizable and compute-intensive
- Decode (autoregressive, step-by-step generation) is sequential, leading to limited parallelism and more cache reads

In summary:
- Memory usage: High (growing with the generated sequence; see the estimate below)
- Compute intensity: Low per step
- Best suited hardware: Depends on phase
  - Prefill: GPU/NPU
  - Decode: CPU often sufficient unless batch generation is needed
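To make the memory growth concrete, the back-of-the-envelope estimate below computes the KV-cache footprint for a hypothetical decoder; the layer count, head layout, and FP16 storage are assumptions chosen for illustration, not the configuration of any specific model.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values, stored per layer, per head, per token.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

# Example: a hypothetical 2B-class decoder (24 layers, 16 heads of dim 128)
# generating 4096 tokens with FP16 cache entries.
size = kv_cache_bytes(n_layers=24, n_heads=16, head_dim=128, seq_len=4096)
print(f"{size / 2**20:.0f} MiB per sequence")  # -> 768 MiB per sequence
```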
Hardware-Specific Characteristics (CPU, GPU, TPU, NPU)
- Understanding the architectural and operational differences between CPUs, GPUs, TPUs, and NPUs is key to optimizing Transformer inference and training. Each hardware type has unique characteristics that impact compute throughput, memory latency, and scalability — especially when applied to different Transformer phases (prefill vs decode).
CPU (Central Processing Unit)
Strengths
- Excellent single-thread performance
- Low latency; ideal for on-demand and low-batch inference
- Flexible memory access and branching logic
Limitations
- Limited parallelism (typically 4–64 cores vs 1000s of GPU cores)
- Lower throughput for matrix-heavy operations
- Cache-friendly but bandwidth-limited compared to accelerators
When to use
- Autoregressive decoding for small models (e.g., 1–2B parameters)
- Edge or offline devices with no GPU/NPU
- Lightweight server-side deployments for on-demand inference
Watchouts
- Attention mechanisms (especially multi-head) can become bottlenecks due to cache locality issues
- Performance highly dependent on software stack: use optimized libraries (e.g., oneDNN, ONNX Runtime, Intel Extension for PyTorch)
Best Practices
- Use quantization (INT8, or FP16 if supported); a minimal sketch follows after this list
- Fuse layers to reduce memory bandwidth needs
- Apply KV cache optimization aggressively for decoder workloads
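As a sketch of the quantization bullet above, PyTorch's dynamic INT8 quantization can be applied to a model's linear layers in one call; the two-layer module below is a stand-in for a Transformer FFN, and real deployments should still benchmark accuracy after conversion.

```python
import torch
import torch.nn as nn

# Stand-in for a Transformer block's linear-heavy submodules.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).eval()

# Dynamic INT8 quantization: weights stored as INT8, activation scales computed
# on the fly at inference time. Targets nn.Linear modules only.
# (Lives under torch.quantization in older PyTorch versions.)
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.inference_mode():
    out = qmodel(torch.randn(1, 512))
```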
GPU (Graphics Processing Unit)
Strengths
- Massive parallelism via thousands of CUDA cores
- High-bandwidth memory (HBM), e.g., 600+ GB/s
- Excellent for batched inference and training
Limitations
- Higher latency per operation than CPUs
- Can be underutilized for small batch sizes or single-token decoding
- More complex memory hierarchy
When to use
- Prefill operations where entire input sequence can be parallelized
- Training and fine-tuning large models
- Batch decoding (e.g., chat applications with multiple users)
Watchouts
- Decode stage underutilizes GPU unless using batching or advanced techniques like speculative decoding
- Need to manage memory carefully — attention cache can become large
Best Practices
- Use Tensor Cores (e.g., with FP16/bfloat16) for matrix operations; see the sketch after this list
- Leverage libraries like NVIDIA’s FasterTransformer or TensorRT
- Minimize memory copies between host (CPU) and device (GPU)
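As a sketch of the Tensor Core bullet above, inference can be wrapped in autocast so matrix multiplications run in FP16; the module is a placeholder and a CUDA-capable GPU is assumed.

```python
import torch
import torch.nn as nn

# Stand-in for a Transformer sub-block; in practice this would be the full model.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda().eval()
x = torch.randn(8, 128, 1024, device="cuda")

# Autocast runs matmuls in FP16 (eligible for Tensor Cores) while keeping
# numerically sensitive ops in FP32.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)
```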
TPU (Tensor Processing Unit)
Strengths
- Designed specifically for tensor operations (dense matmuls, convolutions)
- Excellent for training and large-scale inference
- Systolic array architecture delivers massive throughput
Limitations
- Limited flexibility (less suited for irregular control flow)
- Software stack tied to Google ecosystem (JAX, TensorFlow)
When to use
- Training large encoder-based models (e.g., T5, BERT)
- Batched prefill in inference
Watchouts
- Autoregressive decoding underperforms due to lack of dynamic control flow
- Fixed size memory buffers limit dynamic sequence handling
Best Practices
- Keep compute on TPU, avoid offloading to CPU
- Use XLA compilation to fuse ops and reduce memory overhead
- Align tensor shapes to match hardware block sizes
NPU (Neural Processing Unit)
Strengths
- Optimized for inference on mobile and edge devices
- Energy-efficient; often integrated with smartphone SoCs
- Dedicated acceleration for quantized models (e.g., INT8, FP16)
Limitations
- Capabilities and programming APIs vary widely across vendors (e.g., Apple’s ANE, Qualcomm’s Hexagon, Huawei’s Ascend)
- Difficult to customize beyond supported ops
When to use
- On-device generation (e.g., real-time voice assistants)
- Low-latency use-cases with small LLMs or distilled models
Watchouts
- Must quantize model to match NPU-supported formats
- Performance depends heavily on vendor-specific SDKs (e.g., CoreML, NNAPI, SNPE)
Best Practices
- Use static quantization and operation fusing
- Prune and distill models before deployment
- Use vendor-optimized Transformer blocks when available
Comparative Analysis
| Feature | CPU | GPU | TPU | NPU (Edge) |
|---|---|---|---|---|
| Parallelism | Low | High | Very High | Moderate |
| Prefill Performance | Moderate | Excellent | Excellent | Moderate |
| Decode Performance | Good (low-batch) | Moderate | Poor | Good (quantized) |
| Memory Bandwidth | Low | High | Very High | Low–Moderate |
| Flexibility | Very High | High | Low | Low |
| Quantization Support | Yes (INT8) | Yes (TensorRT) | Yes (bfloat16) | Yes (INT8/INT4) |
Optimization Techniques for On-Device Transformers
- Efficiently running Transformers on CPUs, GPUs, and NPUs requires reducing both compute and memory footprints while preserving accuracy. Below are key techniques used in modern deployments.
Key-Value (KV) Cache Optimization
In decoder-based models (e.g., GPT), each new token requires computing self-attention with all past tokens. Without optimization, this scales as:
\[O(n^2 \cdot d)\]
where $n$ is the number of generated tokens and $d$ is the hidden dimension.
KV cache principle:
- Store the key (K) and value (V) projections for all past tokens during generation.
For a new token, compute only its query (Q) and perform:
\[\text{Attention}(Q_t, K_{1:t}, V_{1:t})\]
This reduces complexity per step to:
\[O(n \cdot d)\]
Hardware impact:
- Reduces compute but increases memory bandwidth requirements, since past K/V matrices must be accessed repeatedly.
- Critical for CPU-based decoding; also essential on NPUs with limited memory.
A minimal sketch of the cache mechanics follows below.
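The sketch below illustrates the cache for a single attention head: only the new token's Q/K/V projections are computed each step, while previously cached keys and values are reused. Shapes and the head dimension are illustrative.

```python
import math
import torch

class KVCache:
    """Minimal per-head KV cache: append the new token's K/V, attend with its Q."""

    def __init__(self):
        self.K = None  # (batch, cached_len, d_k)
        self.V = None

    def step(self, q_t, k_t, v_t):
        # q_t, k_t, v_t: (batch, 1, d_k) projections of the newly generated token.
        self.K = k_t if self.K is None else torch.cat([self.K, k_t], dim=1)
        self.V = v_t if self.V is None else torch.cat([self.V, v_t], dim=1)
        d_k = q_t.size(-1)
        scores = q_t @ self.K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, 1, cached_len)
        return torch.softmax(scores, dim=-1) @ self.V             # (batch, 1, d_k)

# Illustrative decode loop: one token per step, cache grows linearly.
cache, d_k = KVCache(), 64
for t in range(5):
    q, k, v = (torch.randn(1, 1, d_k) for _ in range(3))
    out = cache.step(q, k, v)
print(cache.K.shape)  # torch.Size([1, 5, 64])
```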
Speculative Decoding
Speculative decoding improves throughput by parallelizing token generation while maintaining correctness.
Workflow:
- Use a small draft model to propose multiple tokens ahead.
- A larger target model validates or rejects the proposed tokens in parallel.
- Accepted tokens are appended to the sequence; rejected tokens trigger fallback to standard autoregressive decoding.
Benefit:
- Reduces the total number of sequential steps for the large model.
- Best suited for GPUs or NPUs, where batched verification of tokens can be parallelized.
A simplified sketch of one speculative step follows below.
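Below is a simplified, greedy variant of a single speculative-decoding step. `draft_model` and `target_model` are hypothetical callables that map a token tensor of shape (1, seq) to logits of shape (1, seq, vocab); production systems layer probabilistic acceptance rules and KV caching on top of this idea.

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, tokens, k=4):
    """One greedy speculative-decoding step (illustrative sketch)."""
    # 1) Draft model proposes k tokens autoregressively (cheap).
    draft = tokens
    for _ in range(k):
        nxt = draft_model(draft)[:, -1:].argmax(dim=-1)      # (1, 1)
        draft = torch.cat([draft, nxt], dim=1)
    proposed = draft[:, tokens.size(1):]                      # (1, k)

    # 2) Target model scores prompt + proposals in ONE forward pass.
    logits = target_model(draft)                              # (1, seq + k, vocab)
    # Target's greedy choice at each proposal position.
    preds = logits[:, tokens.size(1) - 1:-1].argmax(dim=-1)   # (1, k)

    # 3) Accept the longest prefix where the target agrees with the draft.
    agree = (preds == proposed)[0].long()
    n_accept = int(agree.cumprod(dim=0).sum())
    accepted = proposed[:, :n_accept]

    # 4) The target's own prediction at the first disagreement (or one bonus
    #    token if everything was accepted) keeps generation moving.
    bonus = logits[:, tokens.size(1) - 1 + n_accept].argmax(dim=-1, keepdim=True)
    return torch.cat([tokens, accepted, bonus], dim=1)
```

Each call advances the sequence by at least one token chosen by the target model, so the greedy output matches standard autoregressive decoding while amortizing the large model's forward passes over several tokens.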
Quantization (4/8-bit)
Quantization reduces model weights and activations from FP32 or FP16 to lower precision (e.g., INT8 or INT4).
- Static Quantization: Precompute scale factors during calibration on a representative dataset.
- Dynamic Quantization: Scales are computed on-the-fly during inference, less accurate but easier to apply.
Equation:
\[x_{\text{quant}} = \text{round}\left(\frac{x}{s}\right)\]
where $s$ is the scaling factor.
Benefits:
- Reduces memory footprint by 4–8x.
- Increases throughput on CPUs (via AVX512/AMX) and NPUs (native INT8 support).
Trade-offs:
- May reduce accuracy, especially for attention layers and small models.
- Requires per-channel quantization for best results.
A small numeric example of symmetric INT8 quantization follows below.
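The example below applies the equation above with symmetric per-tensor quantization, deriving the scale from the tensor's maximum absolute value; real pipelines typically use per-channel scales and a calibration dataset.

```python
import torch

# Symmetric per-tensor INT8 quantization of a weight tensor (illustrative values).
w = torch.randn(256, 256)

s = w.abs().max() / 127.0                        # scale from the observed range
w_q = torch.clamp(torch.round(w / s), -127, 127).to(torch.int8)
w_dq = w_q.float() * s                           # dequantize to measure error

print(f"max abs error: {(w - w_dq).abs().max():.4f}")
print(f"memory: {w.numel() * 4} bytes (FP32) -> {w_q.numel()} bytes (INT8)")
```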
Knowledge Distillation
Use a large, accurate teacher model to train a smaller student model:
Objective function includes both hard labels and soft probabilities:
\[L = \alpha L_{\text{CE}} + (1 - \alpha) \, \text{KL}(p_{\text{teacher}} \| p_{\text{student}})\]
Benefit:
- The student model retains much of the teacher’s performance with fewer parameters.
- Ideal for NPUs and CPU edge deployments.
A minimal sketch of the combined loss appears below.
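A minimal PyTorch sketch of the loss above; the logits and labels are random placeholders, and the temperature scaling common in practice is omitted to keep the code aligned with the formula.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    # Hard-label term: standard cross-entropy against ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL(p_teacher || p_student) over the output distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1 - alpha) * kl

# Illustrative shapes: batch of 8, vocabulary of 1000 classes.
student = torch.randn(8, 1000, requires_grad=True)
teacher = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
loss = distillation_loss(student, teacher, labels)
```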
Weight Pruning and Low-Rank Approximations
- Pruning removes less important weights (e.g., structured pruning of entire attention heads).
Low-rank factorization decomposes weight matrices:
\[W \in \mathbb{R}^{d \times d} \approx U V^T, \quad U, V \in \mathbb{R}^{d \times r}, \; r \ll d\]
- Both techniques reduce memory and FLOPs; an SVD-based factorization example follows below.
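A sketch of the low-rank idea using truncated SVD; the weight matrix here is random, so the reconstruction error is large, whereas trained Transformer weights are typically much closer to low rank. The rank $r = 64$ is an arbitrary illustrative choice.

```python
import torch

# Truncated-SVD low-rank approximation of a weight matrix (illustrative sizes).
d, r = 512, 64
W = torch.randn(d, d)

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
# Fold singular values into the left factor: W ~ (U_r diag(S_r)) @ Vh_r
A = U[:, :r] * S[:r]           # (d, r), plays the role of U in the formula above
B = Vh[:r, :]                  # (r, d), plays the role of V^T
W_approx = A @ B

params_full = W.numel()                   # d * d
params_lowrank = A.numel() + B.numel()    # 2 * d * r
print(params_full, "->", params_lowrank)  # 262144 -> 65536
print(f"relative error: {torch.norm(W - W_approx) / torch.norm(W):.3f}")
```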
Operator and Graph Fusion
- Merge operations like Linear + Bias + LayerNorm to reduce memory reads/writes.
- Frameworks: TensorRT (GPU), OpenVINO (CPU), CoreML (NPU).
Sequence Length and Batch Optimizations
- Use sliding window attention for long sequences.
- Tune batch size to match hardware occupancy (especially on GPUs).
Hardware-Specific Optimization Notes
- CPU: Quantization (INT8), operator fusion, KV caching.
- GPU: Mixed precision (FP16/bfloat16), speculative decoding, batching.
- TPU: XLA graph compilation, dense matmul optimizations.
- NPU: Static quantization (INT8/INT4), model distillation, pruning.
Practical Considerations and Pitfalls When Deploying Transformers on CPU, GPU, and NPUs
- Deploying Transformer models involves multiple trade-offs, particularly around latency, throughput, memory usage, and hardware constraints. This section outlines the key considerations and common pitfalls when working with CPUs, GPUs, and NPUs.
CPU Deployment Considerations
Key Strengths
- General-purpose flexibility
- No need for specialized drivers or runtime
- Lower memory bandwidth than accelerators, but with good cache locality
Watch Out For
- MatMul Bottlenecks:
  - Transformer layers involve large GEMMs (General Matrix-Matrix Multiplications).
  - Use optimized libraries such as oneDNN (formerly Intel MKL-DNN) or OpenBLAS with AVX2/AVX-512 instructions.
- Cache Contention:
  - The KV cache can overflow L2/L3 caches, especially with long sequences.
  - Performance drops significantly when cache locality is lost.
- Thread Over-subscription:
  - Avoid naive use of multithreading; prefer thread pools and NUMA-aware scheduling.
  - Profile using perf, Intel VTune, or similar tools.
- Memory Bandwidth:
  - CPUs often become memory-bound during decoder operation.
  - Optimize memory access patterns and fuse operations to reduce transfers.
- Quantization Challenges:
  - INT8 quantization is highly effective, but requires calibration and may degrade attention accuracy.
  - Use dynamic quantization only if static quantization is not viable.
GPU Deployment Considerations
Key Strengths
- Ideal for batched operations and compute-heavy encoder layers
- High memory bandwidth, many-core architecture
Watch Out For
- Underutilization During Decoding:
  - Single-token decode workloads don’t fully occupy GPU cores.
  - Consider batching requests or speculative decoding to improve efficiency.
- Kernel Launch Overhead:
  - GPU launch latency can dominate runtime in step-by-step decoding.
  - Use fused kernels and persistent caches.
- Host-Device Memory Transfers:
  - Frequent CPU-GPU synchronization (especially during dynamic decoding) can be costly.
  - Minimize PCIe transfers and pre-load inputs onto the device.
- Precision Handling:
  - Use FP16 or bfloat16 for faster inference, but be cautious of numerical stability.
  - Validate that quantized weights preserve generation quality.
- Maximize Tensor Core Use:
  - Use NVIDIA’s cuBLAS, cuDNN, or FasterTransformer for optimized linear algebra.
TPU Deployment Considerations
Key Strengths
- Extremely efficient for matrix operations
- Suitable for large-batch inference and training
Watch Out For
- Poor Fit for Decoding:
  - TPUs struggle with dynamic token-wise decoding due to their fixed compute graph.
  - May need to run the decode loop on CPU or use alternative methods (e.g., GSPMD).
- Static Shape Requirements:
  - Input/output shapes often must be known ahead of time.
  - This makes dynamic batching and long-sequence handling difficult.
- Limited Ecosystem:
  - XLA and TensorFlow/JAX integration are essential.
  - Less flexibility than PyTorch or ONNX-based deployments.
NPU Deployment Considerations (Edge/SoC Devices)
Key Strengths
- Extremely efficient for low-power inference
- On-chip memory reduces latency
Watch Out For
- Operator Support Limitations:
  - Custom or unsupported operations must fall back to the CPU, drastically hurting performance.
  - Stay within the vendor’s supported op-set (e.g., for Apple ANE, Qualcomm Hexagon, etc.).
- Model Size Constraints:
  - The total model size must fit within the NPU’s SRAM or limited DRAM window.
  - Quantization (INT8 or INT4) is non-negotiable for many NPUs.
- Toolchain Lock-in:
  - Conversion pipelines (e.g., PyTorch → ONNX → CoreML) must be strictly validated; a minimal export sketch follows after this list.
  - Vendor SDKs may lack transparency and debugging tools.
- No Runtime Reallocation:
  - Static allocation of the KV cache, input tensors, and batch sizes is often required.
  - This can lead to fragmentation or wasted space.
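To illustrate the first stage of such a pipeline, the sketch below exports a toy PyTorch model to ONNX with fixed input shapes (a common NPU requirement); the model, file name, and opset version are placeholders, and every subsequent conversion step (e.g., ONNX → CoreML or SNPE) should be validated against the vendor's tooling.

```python
import torch
import torch.nn as nn

# Stand-in model; a real pipeline would export the actual (quantized/distilled)
# Transformer before handing it to the vendor toolchain.
class TinyLM(nn.Module):
    def __init__(self, vocab=32000, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.proj = nn.Linear(d, vocab)

    def forward(self, input_ids):
        return self.proj(self.embed(input_ids))

model = TinyLM().eval()
# Fixed input shape (batch=1, seq=128): many NPU toolchains require static shapes.
dummy = torch.randint(0, 32000, (1, 128))
torch.onnx.export(
    model, (dummy,), "model.onnx",
    input_names=["input_ids"], output_names=["logits"],
    opset_version=17,
)
```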
Summary Table: Key Pitfalls by Hardware
| Platform | Key Pitfall | Best Practices |
|---|---|---|
| CPU | Memory-bound decode + small caches | Use quantization, cache-friendly layouts |
| GPU | Underutilized during decoding | Use batching, speculative decoding, fused kernels |
| TPU | Poor step-by-step token generation | Offload decode to CPU or use static batching |
| NPU | Operator and memory constraints | Quantize, simplify architecture, preallocate buffers |
Further Reading
- Efficient Inference with Transformer Models on CPUs
- Speculative Decoding for Accelerated Transformer Inference
- Fast Transformers with Memory-Efficient Attention via KV Cache Optimization
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- Intel Extension for PyTorch: Boosting Transformer Inference on CPUs
- FasterTransformer GitHub Repository (NVIDIA)
- vLLM: Easy and Fast LLM Serving with State-of-the-Art Throughput
- Deploying Transformer Models on Edge Devices with TensorRT
- Quantization Aware Training in PyTorch
- ONNX Runtime: Accelerating Transformer Inference
- Speculative Decoding in vLLM (Medium article)
- Running LLMs on Mobile: Lessons from Distilling and Quantizing GPT-2
- Optimizing LLM Serving on NVIDIA GPUs with TensorRT-LLM
- LLM INT4 Inference with ONNX Runtime
- Efficient Transformer Inference on Edge with EdgeTPU
Citation
If you found our work useful, please cite it as:
@article{Chadha2020DistilledOnDeviceTransformers,
title = {On-device Transformers},
author = {Chadha, Aman},
journal = {Distilled AI},
year = {2020},
note = {\url{https://aman.ai}}
}