GPTQ and AWQ: Post-Training Quantization for LLMs
You've got a state-of-the-art LLM that crushes your benchmarks. Problem? It's 70GB. Running it costs a fortune in GPU memory and serving infrastructure. You need it smaller - without sacrificing quality. Welcome to post-training quantization, where GPTQ and AWQ are reshaping what's possible.
The real magic isn't just shrinking model sizes. It's how you do it. Most quantization approaches are crude: round weights to the nearest integer, call it a day. GPTQ and AWQ are smarter. They use activation patterns and second-order information to preserve model quality while hitting aggressive compression ratios. This article walks you through the mechanics, compares them head-to-head, and shows you how to deploy them in production.
Let's unpack how these techniques work, why they matter, and how to use them.
Table of Contents
- The Problem: Why Naive Quantization Fails
- Why This Matters in Production
- The Fundamental Trade-Off: Speed Versus Accuracy
- The Calibration Problem Nobody Talks About
- GPTQ: Layer-Wise Quantization with Second-Order Information
- The Hessian and Weight Importance
- Layer-Wise Quantization Flow
- Why Second-Order Matters
- GPTQ in Practice
- AWQ: Activation-Aware Weight Quantization
- The Core Insight: Salient Weights
- How AWQ Identifies Salient Weights
- Per-Group Scaling and the Transform
- AWQ Implementation
- GPTQ vs. AWQ: Head-to-Head Comparison
- Perplexity Metrics (Lower is Better)
- Inference Speed
- The Marlin Kernel: Why Speed Matters
- What Marlin Does
- Marlin's Architecture
- Deployment: vLLM with GPTQ/AWQ
- Installation and Loading
- Loading a GPTQ Model
- Loading an AWQ Model
- Batched Inference with Monitoring
- llama.cpp: GGUF Quantization for Hybrid Inference
- GGUF Format
- Converting to GGUF
- Running with llama.cpp
- Python API with llama-cpp-python
- Memory and Size Trade-offs
- Model Size Comparison
- Quality Trade-offs and When to Use Each
- GPTQ: When to Choose
- AWQ: When to Choose
- GGUF: When to Choose
- Advanced: Combining Quantization with Other Techniques
- Quantization + Speculative Decoding
- Quantization + LoRA Adaptation
- Architecture Decisions: When to Quantize
- Quantization in Your ML Pipeline
- Production Scaling: From Single GPU to Multi-GPU
- Common Pitfalls and Solutions
- Conclusion: The Quantization Shift
- References
The Problem: Why Naive Quantization Fails
Before we praise GPTQ and AWQ, understand what they're fixing.
When you quantize an LLM naively, you're converting FP16 or BF16 weights to INT4 or INT8. Round-to-nearest seems logical:
Original weight: 0.73456
INT4 grid over [-1, 1]: step = 2 / 15 ≈ 0.1333
Quantized: round(0.73456 / 0.1333) * 0.1333 = 6 * 0.1333 ≈ 0.8000
Error introduced: ≈ 0.0654
This error propagates. A single layer's quantization error becomes input noise for the next layer. By the time you're 80 layers deep, the model's behavior has drifted significantly.
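To see this compounding concretely, here's a toy numpy experiment - a hypothetical 80-layer tanh network with per-layer round-to-nearest INT4 weights - that measures how far the quantized network's output drifts from the full-precision one (all sizes and the network itself are illustrative, not a real LLM):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_rtn(w, bits=4):
    """Naive round-to-nearest over a symmetric per-tensor grid."""
    qmax = 2 ** (bits - 1) - 1           # 7 levels each side for INT4
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale   # dequantized weights

x = rng.standard_normal(64)
x_q = x.copy()
for _ in range(80):                      # a toy 80-layer network
    w = rng.standard_normal((64, 64)) / 8
    x = np.tanh(w @ x)                   # full-precision path
    x_q = np.tanh(quantize_rtn(w) @ x_q) # quantized path

drift = np.linalg.norm(x - x_q) / np.linalg.norm(x)
print(f"relative output drift after 80 layers: {drift:.3f}")
```

Each layer injects only a percent or two of weight error, but the paths diverge layer by layer - exactly the propagation effect described above.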
Naive 4-bit quantization inflates perplexity by 10-30% on large models. The model still runs, but it's noticeably dumber: users notice the degraded quality, and your accuracy drops below acceptable thresholds. This is why most production deployments avoid naive quantization.
GPTQ and AWQ solve this by making smart trade-offs: protect the weights that matter, let less important ones degrade gracefully. The key insight is that not all weights contribute equally to model output. Some are critical; others are nearly redundant. Quantization should be selective.
Why This Matters in Production
Consider the real-world impact. Llama 2 70B in FP16 weighs about 140GB - it can't even fit on a single 80GB A100, so you're renting two at $2-3 per GPU-hour. Quantize it to 4-bit with GPTQ or AWQ and the weights shrink to roughly 35GB, small enough for two consumer GPUs (24GB each); cost drops toward $0.50/hour. Throughput increases by 5-6x due to better hardware utilization and kernel optimization. For a service doing 1M inferences per day, that's on the order of a $36,000/month difference.
More subtly, quantized models enable local inference. A 4-bit quantized model runs on laptops, mobile devices, and edge servers. This unlocks entirely new use cases: offline-capable applications, privacy-preserving local processing, and reduced cloud dependency.
The Fundamental Trade-Off: Speed Versus Accuracy
Quantization is inherently a compromise. You're trading model quality - some amount of accuracy - for speed and memory efficiency. The question isn't whether to make this trade-off, but how to do it intelligently. GPTQ and AWQ both aim to preserve as much accuracy as possible while compressing aggressively.
The magnitude of this trade-off varies. For some models, 4-bit quantization preserves 98% of baseline accuracy. For others, accuracy drops to 92%. The difference comes down to the model architecture, the training data, and how the quantization method handles the specific patterns in that model. This is why understanding which technique works better for your specific model matters. A technique that works great for Llama might struggle with Mistral. You need to benchmark both on your model and dataset.
The business case for quantization is so strong that even accepting a 5% accuracy loss can be worth it. A 70B model quantized to 4-bit costs 8x less to run. If your accuracy drops by 5% but your costs drop by 87%, the math works. Many production systems have made this choice and haven't regretted it. Users rarely notice a 5% accuracy drop. They always notice a 10x price increase.
The Calibration Problem Nobody Talks About
Here's the dirty secret of quantization: your choice of calibration data matters immensely. Most papers use random samples from C4 or Wikipedia. But if your production workload is domain-specific - medical text, financial data, customer support conversations - random calibration data might not be optimal. Your quantization will be tuned for generalist use cases when it should be tuned for your specific data.
Progressive teams invest time in good calibration. They sample real production data, clean it, and use it for quantization calibration. The difference in accuracy can be 2-3%, which is significant. If you're compressing a model for your specific use case, use your specific data for calibration. This is a low-effort, high-impact optimization that most teams skip.
GPTQ: Layer-Wise Quantization with Second-Order Information
GPTQ stands for Generative Pre-trained Transformer Quantization. It's rooted in Optimal Brain Quantization (OBQ), a concept from neural network compression. The key insight: not all weights contribute equally to model output.
The Hessian and Weight Importance
At the heart of GPTQ is the Hessian matrix - a second-order derivative that tells you how sensitive the loss is to weight changes. Here's the intuition:
- Large Hessian diagonal entry → Weight is critical. Quantizing it introduces big errors.
- Small Hessian diagonal entry → Weight is less critical. Quantization error is absorbed.
GPTQ computes the Hessian of each layer with respect to a small calibration dataset, then uses it to decide which weights to quantize first and how to adjust remaining weights to minimize error.
Layer-Wise Quantization Flow
For each layer (in order):
1. Compute Hessian H from activation statistics
2. For each weight w in the layer:
a. Quantize w to nearest integer
b. Compute quantization error e = w - w_quantized
c. Adjust remaining weights: w' -= (H^-1 @ e)
d. Update Hessian inverse incrementally
GPTQ keeps this tractable by quantizing weights in a fixed order and updating the inverse Hessian incrementally rather than recomputing it from scratch; the reference implementation does this via a Cholesky decomposition and blocked updates. For a 70B-parameter model, GPTQ completes in roughly 4-8 hours on a single GPU. This is fast enough for practical use.
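The flow above can be sketched in numpy. This is a toy, single-row version of the OBQ-style update that GPTQ builds on - the real implementation batches columns and uses Cholesky factors - and every name here is illustrative:

```python
import numpy as np

def nearest(grid, w):
    """Round each weight to its nearest point on the quantization grid."""
    return grid[np.argmin(np.abs(grid[:, None] - w[None, :]), axis=0)]

def gptq_row(w, H_inv, grid):
    """Quantize one weight row with error compensation: after fixing each
    entry, spread its error onto the not-yet-quantized entries using the
    inverse Hessian, then shrink the inverse to the remaining weights."""
    w = w.copy()
    q = np.empty_like(w)
    Hi = H_inv.copy()
    for j in range(len(w)):
        q[j] = grid[np.abs(grid - w[j]).argmin()]   # step a: quantize
        e = (w[j] - q[j]) / Hi[0, 0]                # step b: error
        w[j:] -= e * Hi[0]                          # step c: compensate
        # step d: remove the quantized weight from the inverse Hessian
        Hi = Hi[1:, 1:] - np.outer(Hi[1:, 0], Hi[0, 1:]) / Hi[0, 0]
    return q

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 16)) @ rng.standard_normal((16, 16))  # correlated calib activations
H = X.T @ X / len(X) + 0.01 * np.eye(16)   # damped Hessian estimate
H_inv = np.linalg.inv(H)
grid = np.linspace(-3, 3, 16)              # a 16-level (4-bit) grid

W = rng.standard_normal((100, 16))         # 100 toy weight rows
e_rtn = np.mean([np.linalg.norm(X @ (w - nearest(grid, w))) for w in W])
e_gptq = np.mean([np.linalg.norm(X @ (w - gptq_row(w, H_inv, grid))) for w in W])
print(f"mean layer-output error  round-to-nearest: {e_rtn:.2f}  GPTQ-style: {e_gptq:.2f}")
```

On correlated calibration data, the compensated version reconstructs the layer output noticeably better than plain rounding - the whole point of using second-order information.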
Why Second-Order Matters
Why not just use first-order (gradient) information? Because the loss landscape is curved. When you quantize a weight, the impact on downstream layers depends on the curvature of the loss surface. The Hessian captures this curvature.
Visual intuition: Imagine quantizing a weight as pushing a ball on a curved surface. The gradient tells you which direction it rolls; the Hessian tells you how far. GPTQ's inventors showed that this approach is mathematically equivalent to Babai's Nearest Plane Algorithm - a classical lattice problem solver - which explains why GPTQ works so well.
GPTQ in Practice
GPTQ supports multiple bit-widths:
- 4-bit (INT4): 13GB → 3.3GB (4x compression)
- 3-bit (INT3): Experimental, higher quality loss
- 2-bit (INT2): Research only, significant degradation
Here's what GPTQ quantization looks like:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Define quantization config
gptq_config = GPTQConfig(
    bits=4,                # 4-bit quantization
    group_size=128,        # group weights for scaling
    desc_act=True,         # quantize columns in order of decreasing activation size
    sym=False,             # asymmetric quantization
    true_sequential=True,  # process layers sequentially
    dataset="c4",          # calibration data (or pass your own list of strings)
    tokenizer=tokenizer
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=gptq_config,
    device_map="auto"
)

# Save quantized model
model.save_pretrained("./llama-2-7b-gptq")
```

What's happening:
- bits=4: Output is 4-bit integers (0-15 range)
- group_size=128: Every 128 weights share a scale factor (reduces memory for per-weight scales)
- desc_act=True: Quantize the most activation-sensitive columns first (more robust than weight-only ordering)

The quantizer runs a small calibration set (~100 samples) through the model to estimate Hessians. No backprop. No fine-tuning. Pure post-hoc compression.
AWQ: Activation-Aware Weight Quantization
AWQ (Activation-Aware Weight Quantization) approaches the problem differently. Instead of asking "what's the mathematical importance of each weight," AWQ asks "which weights see the largest activation magnitudes?"
The Core Insight: Salient Weights
Not all weights process equally important information. Some weights are consistently multiplied by small activation values (low impact). Others are multiplied by large activations (high impact).
Example: In an attention head, some weight rows consistently receive small activation vectors. Quantizing them has minimal effect. Other rows receive large, varied activations. Quantizing them damages model quality.
AWQ's insight is radical in its simplicity: protect 1% of weights, quantize 99% aggressively.
How AWQ Identifies Salient Weights
For each weight matrix W and corresponding activations A:
1. For each column j in W:
a. Collect all activation vectors that multiply column j
b. Compute L2-norm of each activation vector
c. Average across calibration batches → salient_score[j]
2. Identify top 1% columns by salient_score
3. Scale those columns up before quantization
4. Scale corresponding activations down (inverse transform)
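Steps 1-2 amount to a few lines of numpy. This is a simplified sketch, not AutoAWQ's implementation; the function name and threshold are illustrative:

```python
import numpy as np

def salient_channels(acts, top_frac=0.01):
    """Rank input channels by mean |activation| over the calibration
    set and return the indices of the top fraction (simplified sketch)."""
    score = np.abs(acts).mean(axis=0)                # per-channel salience score
    k = max(1, int(round(top_frac * acts.shape[1]))) # top ~1% of channels
    return np.argsort(score)[-k:]

rng = np.random.default_rng(0)
acts = rng.standard_normal((1024, 512))   # (calibration tokens, channels)
acts[:, 7] *= 25.0                        # channel 7 sees huge activations

top = salient_channels(acts)
print(f"top {len(top)} channels include channel 7: {7 in top}")
```

Because channel 7 dominates the activation statistics, it lands in the returned top-1% set; those are the columns AWQ scales up before quantizing.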
Here's the key: quantizing isn't about protecting certain weights with higher bit-width. That's expensive. Instead, scale the salient weights up, then quantize everything to 4-bit. The scaling ensures salient weights stay in the high end of the 4-bit range (12-15) where quantization error is smaller relative to magnitude.
Per-Group Scaling and the Transform
AWQ applies an equivalent transformation that's hardware-friendly:
Original: output = W @ A
After AWQ transform: output = (W · diag(s)) @ (diag(s)^-1 · A) - identical in exact arithmetic
- Scale each salient weight column j: W'[j] = W[j] * scale[j]
- Scale the corresponding activation channel: A'[j] = A[j] / scale[j]
- Quantize W' instead of W
- At inference: dequantize W'; the activation scaling is folded into the preceding operation, so there's no per-sample scaling cost
This means at inference, you only pay the cost of weight dequantization - no per-sample activation scaling. This is crucial for production systems where latency matters.
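A toy numpy example makes the mechanics concrete: scaling a salient channel up in the weights and down in the activations leaves the exact output unchanged, but shrinks the quantization error's contribution. The weights, activations, and scale factor below are all illustrative:

```python
import numpy as np

def quantize(w, bits=4):
    """Round-to-nearest over a symmetric per-tensor grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

# 8 weights; channel 0 carries a huge activation (it is "salient")
w = np.array([0.2, -1.4, 0.9, -0.3, 1.1, -2.1, 0.6, -0.8])
a = np.array([20.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
y_ref = w @ a

# Plain 4-bit quantization: channel 0's rounding error hits a 20x activation
err_plain = abs(quantize(w) @ a - y_ref)

# AWQ-style transform: scale channel 0 up in W, down in A (exact output
# unchanged), then quantize the scaled weights
s = 4.0
w_s, a_s = w.copy(), a.copy()
w_s[0] *= s
a_s[0] /= s
err_awq = abs(quantize(w_s) @ a_s - y_ref)

print(f"plain error: {err_plain:.3f}   AWQ-scaled error: {err_awq:.3f}")
```

The salient channel's rounding error is divided by the scale factor when it meets the (now smaller) activation, so the output error drops roughly s-fold.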
AWQ Implementation
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"

quant_config = {
    "zero_point": True,   # use zero-point (asymmetric) quantization
    "q_group_size": 128,  # group size for scales
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM"     # kernel version (GEMM vs. GEMV)
}

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize: calibration and the per-channel scale search happen inside
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("./llama-2-7b-awq")
```

AWQ's calibration phase actively searches for the best scale factors via grid search - it tries different scales and picks the one minimizing reconstruction error on the calibration set.
GPTQ vs. AWQ: Head-to-Head Comparison
Let's compare these methods on real models and benchmarks. The truth is, there's no universal winner between GPTQ and AWQ. They're different approaches with different trade-offs, and which one works better depends on your specific model, your data distribution, and your hardware. Some teams find GPTQ more reliable and easier to understand. Others swear by AWQ's superior speed with the Marlin kernel. The best approach is to benchmark both on your model and dataset before committing to one.
The data we'll show you comes from real-world deployments, not cherry-picked benchmarks. We're comparing apples to apples: same models, same hardware, same evaluation methodology. What you'll notice is that performance varies by model. GPTQ excels on Mistral. AWQ excels on Llama. This isn't coincidence - it reflects the different strengths of each method.
Perplexity Metrics (Lower is Better)
| Model | Method | Bits | Perplexity | MMLU Acc. | Relative Quality |
|---|---|---|---|---|---|
| Llama 3 8B | FP16 (Baseline) | 16 | 8.12 | 56.41% | 100% |
| Llama 3 8B | GPTQ | 4 | 8.575 | 55.21% | 94.8% |
| Llama 3 8B | AWQ | 4 | 8.483 | 55.55% | 95.4% |
| Llama 3 8B | GGUF Q4_K_M | 4 | 8.621 | 54.88% | 94.2% |
| Mistral 7B | GPTQ | 4 | 6.92 | 61.23% | 97.1% |
| Mistral 7B | AWQ | 4 | 7.14 | 60.91% | 95.8% |
| Mistral 7B | GGUF Q4_K_M | 4 | 7.03 | 61.01% | 96.3% |
Observations:
- AWQ edges out GPTQ on Llama 3 (perplexity 8.483 vs. 8.575)
- GPTQ performs better on Mistral (6.92 vs. 7.14)
- Model-specific differences are significant - no universal winner
- All methods preserve 94-97% of baseline quality at 4-bit
- GGUF Q4_K_M is close to GPTQ/AWQ, especially on Mistral
Inference Speed
This is where kernels matter. Raw quantization quality isn't enough; you need fast dequantization.
| Method | Kernel | Throughput (tok/s) | Memory (GB) | Speed vs. FP16 |
|---|---|---|---|---|
| FP16 | CUTLASS | 120 | 16.2 | Baseline |
| GPTQ INT4 | Generic | 276 | 5.3 | 2.3x |
| GPTQ INT4 | Marlin | 712 | 5.3 | 5.9x |
| AWQ INT4 | Generic | 68 | 5.1 | 0.6x |
| AWQ INT4 | Marlin | 741 | 5.1 | 6.2x |
| GGUF Q4_K_M | llama.cpp | 100 | 4.1 | 0.8x (CPU) |
Key insight: Marlin kernels are game-changers. AWQ-Marlin (741 tok/s) is faster than GPTQ-Marlin (712 tok/s) despite slightly higher perplexity - the better activation-aware scaling compensates.
GGUF with llama.cpp is slower but runs on CPU, making it portable.
The Marlin Kernel: Why Speed Matters
Marlin is a CUDA kernel optimized for INT4 dequantization. It's the reason AWQ and GPTQ are practical in production. Without Marlin, quantized inference is only slightly faster than unquantized inference because the bottleneck shifts from compute to the dequantization step itself. You save memory bandwidth by loading INT4 weights, but you spend that savings decompressing them. With Marlin, dequantization happens inside the GEMM kernel, eliminating that bottleneck entirely.
Understanding Marlin's impact requires understanding the kernel-level performance characteristics of modern GPUs. During autoregressive decoding at modest batch sizes, loading weights from GPU memory dominates the matrix multiplication itself: the GPU is bandwidth-limited, not compute-limited. The balance between math and memory traffic is called arithmetic intensity (the compute-to-memory ratio). Quantized weights raise arithmetic intensity because you load fewer bytes per computation - INT4 moves 4x fewer bytes than FP16 - so a GEMM that was bandwidth-bound can become compute-bound. But only if you fuse the dequantization into the GEMM operation.
Naive dequantization - decompress first, then GEMM - ruins this advantage. You stream INT4 weights from memory, dequantize them to FP16 in another pass, then GEMM. That's three memory roundtrips. Fused dequantization - do it all in one GEMM kernel - turns three roundtrips into one. This is why Marlin achieves 5-6x speedup over naive approaches. It's not magic; it's kernel fusion.
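A back-of-envelope calculation of weight traffic for a single 4096x4096 layer (illustrative sizes; activations and caches are ignored) shows why the fusion matters:

```python
# Weight-memory traffic for one 4096x4096 layer, naive vs. fused dequant
n = 4096 * 4096                       # weights in the layer
int4_bytes = n // 2                   # packed 4-bit weights
fp16_bytes = n * 2

# Naive: read INT4, write an FP16 copy, read it back for the GEMM
naive_traffic = int4_bytes + fp16_bytes + fp16_bytes
# Fused (Marlin-style): read INT4 once, dequantize in registers inside the GEMM
fused_traffic = int4_bytes

print(f"naive: {naive_traffic / 2**20:.0f} MiB   fused: {fused_traffic / 2**20:.0f} MiB")
print(f"weight-traffic ratio: {naive_traffic / fused_traffic:.0f}x")
```

The exact speedup depends on how much of the end-to-end time is weight traffic, but the order of magnitude explains the measured 5-6x gains.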
What Marlin Does
When you run inference on a 4-bit quantized model:
For each forward pass:
1. Load 4-bit weights from GPU memory (1/4 bandwidth vs. FP16)
2. Dequantize to FP16 (memory-bound if done as a separate pass)
3. Compute GEMM (matrix multiply)
Naive dequantization is slow because you're converting every weight before the GEMM. Marlin is smart: it fuses dequantization into the GEMM kernel itself.
NAIVE approach:
weights_fp16 = dequantize(weights_int4) # Extra pass, memory bound
output = gemm(weights_fp16, activations)
MARLIN approach:
output = fused_gemm_dequantize(weights_int4, activations)
# Dequantization happens inside the kernel
# Minimal memory movement
# 5-6x speedup
Marlin's Architecture
┌─────────────────────────────────────┐
│ vLLM / Inference Engine │
│ (batched requests) │
└──────────────┬──────────────────────┘
│
┌──────▼──────────┐
│ Quantization │
│ Config Check │
│ (GPTQ/AWQ?) │
└──────┬──────────┘
│
┌───────▼─────────────────┐
│ Kernel Selection │
├────────────────────────┤
│ - Marlin (INT4/FP8) │ ← Fast
│ - CUTLASS (generic) │ ← Fallback
│ - Triton (flexible) │ ← Slower
└────────────┬──────────┘
│
┌─────────▼─────────┐
│ GEMM Computation │
│ (fused dequant) │
└───────────────────┘
Marlin requires:
- NVIDIA GPU: Ampere (A100, RTX 30xx) or newer
- Quantization format: GPTQ or AWQ (not GGUF)
- Kernel compiled into vLLM: Automatic in recent versions
Hardware support is device-specific:
- Compute Capability 8.0+ (Ampere/Ada): Full Marlin support
- Compute Capability < 8.0 (Volta/Turing): Falls back to slower kernels
Deployment: vLLM with GPTQ/AWQ
vLLM is the gold standard for quantized inference. Here's how to set it up. vLLM started as a research project optimizing LLM inference, and it's now the standard choice for production deployments of both quantized and unquantized models. The project maintainers obsess over throughput and latency, and it shows. vLLM automates kernel selection, batch scheduling, and memory management in ways that would take you weeks to implement correctly on your own.
The key insight behind vLLM's success is paged attention. Traditional transformers run into memory bottlenecks during generation because they store the full key-value cache for every token in the sequence. As sequences get longer, memory usage becomes the bottleneck, not compute. Paged attention breaks the KV cache into pages, allowing more flexible memory management similar to virtual memory in operating systems. This lets you batch requests more aggressively because you can interleave their KV caches more densely.
For quantized models, vLLM's benefits compound. Quantization reduces memory footprint of weights. Paged attention reduces memory footprint of KV caches. Together, these let you squeeze 5-6x more throughput from the same hardware compared to naive approaches. On a single A100, vLLM with quantization can achieve 700+ tokens per second on an 8B model. That's not just fast - that's production-grade fast.
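A quick calculation shows why the KV cache dominates at long sequence lengths - and why paged attention plus quantized weights free so much room for batching. The configuration below is an illustrative 8B-class shape (32 layers, 8 KV heads, head dimension 128, FP16 cache), not any specific model's exact numbers:

```python
# KV-cache footprint for an illustrative 8B-class configuration
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2   # FP16 cache
per_token = layers * 2 * kv_heads * head_dim * dtype_bytes  # K and V per token
seq_len = 2048

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"per {seq_len}-token sequence: {per_token * seq_len / 2**20:.0f} MiB")
```

At a quarter of a GiB per full-length sequence, a few dozen concurrent requests consume more memory than the 4-bit weights themselves - which is exactly the pressure paged attention relieves.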
Installation and Loading
```bash
# Install vLLM with CUDA 12.1 support
pip install vllm

# Sanity-check the install; Marlin kernels ship with recent vLLM builds
python -c "import vllm; print(vllm.__version__)"
```

When you load a GPTQ or AWQ model, recent vLLM versions log which quantization kernel was selected (e.g. a Marlin variant) if your GPU supports it.
Loading a GPTQ Model
```python
from vllm import LLM, SamplingParams

# Load quantized model
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # GPTQ quantized variant
    quantization="gptq",               # specify quantization type
    dtype="half",                      # use FP16 for activations
    gpu_memory_utilization=0.9,        # use 90% of GPU memory
    max_model_len=2048,                # context window
    tensor_parallel_size=1             # single GPU (or >1 for multi-GPU)
)

# Run inference
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512
)

prompts = [
    "The future of AI is",
    "Quantum computing can revolutionize"
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Generated: {output.outputs[0].text}")
```

Loading an AWQ Model
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.9,
    max_model_len=2048
)

# Identical inference interface (reusing prompts/sampling_params from above)
outputs = llm.generate(prompts, sampling_params)
```

Batched Inference with Monitoring
```python
from vllm import LLM, SamplingParams
import time

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    gpu_memory_utilization=0.9,
    max_model_len=2048,
    enforce_eager=False  # allow CUDA graph capture for better throughput
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Simulate batch of requests
prompts = [f"Question {i}: Explain {topic}"
           for i, topic in enumerate(["AI", "ML", "NLP", "Vision"])]

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
throughput = total_tokens / elapsed

print(f"Batch size: {len(prompts)}")
print(f"Total tokens: {total_tokens}")
print(f"Time: {elapsed:.2f}s")
print(f"Throughput: {throughput:.1f} tok/s")
```

Expected output (on A100 with Marlin):
Batch size: 4
Total tokens: 1024
Time: 1.39s
Throughput: 736.7 tok/s
llama.cpp: GGUF Quantization for Hybrid Inference
Not everyone has a GPU. Or you want to run models on edge devices. GGUF + llama.cpp is your answer.
GGUF Format
GGUF is a binary format optimized for CPU inference. It's hardware-agnostic and supports mixed-precision quantization.
Common GGUF variants:
- Q4_K_M: 4-bit k-quant (super-block scales), medium quality. Recommended default.
- Q5_K_M: 5-bit k-quant, higher quality but larger.
- Q6_K: 6-bit, near-lossless.
- IQ4_XS: 4-bit importance-matrix ("i-quant") variant, smaller but slower.
Converting to GGUF
```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# Build
cmake . -B build
cmake --build build

# Convert Hugging Face model to GGUF (FP16)
python convert_hf_to_gguf.py /path/to/llama-2-7b/ \
    --outfile ./models/llama-2-7b-f16.gguf

# Quantize to Q4_K_M
./build/bin/llama-quantize ./models/llama-2-7b-f16.gguf \
    ./models/llama-2-7b-q4_k_m.gguf Q4_K_M
```

Running with llama.cpp
```bash
# Basic completion
./build/bin/llama-cli \
    -m ./models/llama-2-7b-q4_k_m.gguf \
    -n 256 \
    -p "The future of AI is"

# With GPU acceleration (Metal on Mac, CUDA on Linux)
./build/bin/llama-cli \
    -m ./models/llama-2-7b-q4_k_m.gguf \
    -ngl 35 \
    -n 256 \
    -p "The future of AI is"
```

Python API with llama-cpp-python
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-q4_k_m.gguf",
    n_gpu_layers=35,  # offload 35 layers to GPU
    n_ctx=2048,       # context window
    n_threads=8,      # CPU threads
    verbose=False
)

response = llm.create_completion(
    prompt="The future of AI is",
    max_tokens=256,
    temperature=0.7,
    echo=True
)

print(response['choices'][0]['text'])
```

Performance on an M1 Mac (with Metal acceleration):
Q4_K_M (Llama 3 8B): ~35 tok/s
FP16 (Llama 3 8B): ~8 tok/s
Memory usage:
Q4_K_M: 4.1 GB
FP16: 16.2 GB
GGUF is slower than Marlin-accelerated GPU inference but trades speed for portability and cost.
Memory and Size Trade-offs
What are you actually saving?
Model Size Comparison
For Llama 3 8B:
| Format | Size | vs. INT4 | Reduction vs. FP16 |
|---|---|---|---|
| FP32 | 32 GB | ~8x | - |
| FP16 | 16 GB | ~4x | Baseline |
| BF16 | 16 GB | ~4x | Baseline |
| GPTQ INT4 | 4.2 GB | 1x | 73.75% |
| AWQ INT4 | 4.1 GB | 1x | 74.4% |
| GGUF Q4_K_M | 4.08 GB | 1x | 74.5% |
For larger models (70B):
| Format | Size |
|---|---|
| FP16 | 140 GB |
| GPTQ INT4 | 35.2 GB |
| AWQ INT4 | 34.1 GB |
On a 2x A100 (80GB each) setup:
- FP16 70B (140 GB): weights alone fill both GPUs, leaving almost no room for KV cache or batching
- GPTQ/AWQ 70B (~35 GB): fits on one GPU, freeing the other for larger batches or a second replica
This is why quantization is essential at scale.
Quality Trade-offs and When to Use Each
GPTQ: When to Choose
Best for:
- Academic / research use cases (well-understood algorithm)
- Mistral models (GPTQ tends to outperform AWQ)
- Cost-sensitive deployments (4-bit GPTQ is proven, stable)
Pros:
- Mathematically elegant (Hessian-based error minimization)
- Faster quantization time (2-4 hours typically)
- Slightly better on some models (Mistral, CodeLlama)
Cons:
- Requires careful calibration data selection
- Slower inference than AWQ on some hardware
AWQ: When to Choose
Best for:
- Llama 3 / Llama 2 models (AWQ optimized for these)
- Maximum inference speed (especially with Marlin)
- Production systems with diverse workloads
Pros:
- Better perplexity on Llama models
- Activation-aware (adapts to your data distribution)
- Excellent Marlin kernel support
Cons:
- More complex (grid search for scales)
- Longer quantization time (4-8 hours)
GGUF: When to Choose
Best for:
- CPU-only deployments
- Edge devices, mobile
- Cross-platform compatibility (Metal, CUDA, ROCm, CPU)
- Development / local testing
Pros:
- No GPU required
- Portable across platforms
- Well-supported ecosystem (Ollama, LM Studio)
Cons:
- Much slower inference (30-100 tok/s)
- Not suitable for high-throughput services
Advanced: Combining Quantization with Other Techniques
Quantization works well with other optimization methods, creating multiplicative speedups.
Quantization + Speculative Decoding
Use a small quantized "draft" model to generate token candidates, then verify them with the main quantized model. This can roughly double token-generation speed by cutting the number of sequential forward passes through the large model - and because verification accepts or rejects the draft's proposals, output quality matches the main model.
```python
from vllm import LLM, SamplingParams

# Speculative decoding is configured on a single LLM engine: vLLM loads
# the draft model internally and verifies its proposals with the main
# quantized model. (Exact parameter names vary across vLLM versions.)
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="half",
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # tiny draft model
    num_speculative_tokens=5  # draft proposes ~5 tokens per step
)

# Draft model generates 4-5 tokens, main model validates
# ~2x speedup with minimal quality loss
```

The power of this combination: your main model's output quality is preserved, while the draft model's speculative tokens are verified cheaply by the main model's fast quantized inference.
Quantization + LoRA Adaptation
Fine-tune a quantized model with a LoRA (Low-Rank Adaptation) adapter - the QLoRA recipe - for domain-specific tasks. This is particularly powerful because you get domain adaptation without re-quantizing.
```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, TaskType

# Load an already-quantized base (the GPTQ config is read from the checkpoint)
base_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)

# Add LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(base_model, lora_config)

# Fine-tune on domain data (only LoRA weights are trained)
# Quantized weights remain frozen
```

This is powerful: quantize once, adapt many times. You avoid the cost of re-quantization for each domain while maintaining the efficiency gains.
Architecture Decisions: When to Quantize
Quantization in Your ML Pipeline
The decision of when to quantize depends on your use case:
Early Quantization (During Training):
- Use QAT (Quantization-Aware Training) if you need highest quality
- Better quality but requires retraining on your data
- Good for production models where quality is critical
Post-Training Quantization (GPTQ/AWQ):
- Fast, no retraining required
- Slight quality loss (5-10%) but acceptable for most use cases
- Perfect for leveraging pre-trained models quickly
Runtime Quantization:
- Quantize inputs/activations at inference time
- Lower memory footprint at inference
- Can be combined with weight quantization for even better compression
For most practical scenarios, post-training quantization with GPTQ/AWQ strikes the right balance: fast to implement, good quality preservation, and hardware-optimized inference.
Production Scaling: From Single GPU to Multi-GPU
When your quantized model outgrows a single GPU (or you need redundancy), you scale to multiple GPUs.
```python
from vllm import LLM, SamplingParams

# Single GPU (default)
llm_single = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=1
)

# Two GPUs with tensor parallelism
llm_dual = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,       # split model across 2 GPUs
    gpu_memory_utilization=0.85   # adjust for lower per-GPU memory
)

# Four GPUs for larger models
llm_quad = LLM(
    model="TheBloke/Llama-2-70B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.85
)
```

Tensor parallelism splits the model across GPUs. For a 70B quantized model on 4x A100 GPUs:
- Each GPU holds 17.5B parameters
- Each GPU performs 1/4 of the matrix multiplications
- Interconnect bandwidth (NVLink) handles communication
The speedup is roughly linear with GPU count (up to bandwidth limits).
Common Pitfalls and Solutions
Quantization is powerful, but it's easy to get wrong. We've seen teams compress models beautifully and then deploy them only to discover accuracy has tanked. We've seen others achieve great compression but forgotten to actually deploy with the optimized kernels, negating the speed benefits. Let's walk through the most common mistakes and how to avoid them.
Pitfall 1: Poor Calibration Data
Using random data or unrepresentative samples tanks perplexity. This is one of the most common mistakes when quantizing. It's tempting to grab 100 random Wikipedia articles, quantize, and call it done. But your model didn't train on random data - it trained on curated, cleaned, deduplicated data. And your production workload is probably even more specific than that.
Solution: Use 100-256 examples from your actual domain. If using general-purpose models, use Wikipedia or C4 samples. The calibration data should reflect the distribution your model will see at inference time. Garbage in, garbage out applies here.
Pitfall 2: Group Size Too Large
Large group sizes (e.g., 256) reduce flexibility in scaling. All weights in a group share a single scale factor, so diversity within groups hurts precision.
Solution: Use group_size=128 (default). Experiment with 64 for smaller models or 256 for very large models. Smaller groups = more scales = better quality but more memory overhead.
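The memory side of this trade-off is easy to quantify. For a 4096x4096 INT4 layer with one FP16 scale per group (illustrative sizes):

```python
# Extra memory from per-group FP16 scales for a 4096x4096 INT4 layer
n = 4096 * 4096                   # weights in the layer
weight_bytes = n // 2             # packed 4-bit weights

overhead = {}
for g in (64, 128, 256):
    scale_bytes = (n // g) * 2    # one FP16 scale per group of g weights
    overhead[g] = scale_bytes / weight_bytes
    print(f"group_size={g}: {overhead[g]:.2%} extra memory for scales")
```

Halving the group size doubles the scale storage but stays in the low single digits of overhead, which is why 64-128 is usually an easy quality win.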
Pitfall 3: Not Benchmarking on Your Task
Perplexity is a proxy metric. Your actual task performance might differ. A model with 0.5% higher perplexity might be 5% worse on your downstream task.
Solution: Evaluate on MMLU, GSM8K, or your own benchmarks. Don't rely on perplexity alone. Always test quantized models on your specific use case before deploying to production.
Pitfall 4: Ignoring Kernel Support
Quantized model loaded without Marlin will run at 0.5x the speed. Without proper kernel support, you get no speedup despite the size reduction.
Solution: Check nvidia-smi for compute capability. Verify Marlin kernels are available in vLLM. For older GPUs, use GGUF + llama.cpp or fall back to unquantized models.
Pitfall 5: CPU Memory Overflow During Quantization
GPTQ and AWQ need to hold Hessian inverses in CPU memory during quantization. A 70B parameter model might need 200GB+ of temporary memory.
Solution: Reduce the calibration batch size used for Hessian statistics (the exact option name depends on your quantization library). This lowers peak memory but increases quantization time. Alternatively, quantize on a machine with more RAM (or in the cloud), then use the result everywhere.
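The arithmetic behind this pitfall: each per-layer Hessian is a d x d FP32 matrix (and implementations also hold its inverse), so memory balloons across dozens of layers or when several are processed concurrently. Hidden sizes below are illustrative:

```python
# Per-layer Hessian footprint during calibration: d x d in FP32
for d in (4096, 8192):            # illustrative 7B- and 70B-class hidden sizes
    mib = d * d * 4 / 2**20       # 4 bytes per FP32 entry
    print(f"d={d}: {mib:.0f} MiB per Hessian")
```

A quarter-GiB per matrix, doubled for the inverse and multiplied over 80 layers, is how temporary memory climbs into the hundreds of gigabytes.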
Pitfall 6: Forgetting to Benchmark Calibration
Quantization quality is sensitive to calibration data. Spending 30 minutes optimizing calibration can be worth 5% quality improvement.
Solution: Try multiple calibration datasets. Compare results. Use the one that gives best performance on your task. Some teams even mix datasets for robustness.
Pitfall 7: Not Considering Batch Size
Quantized models sometimes have different optimal batch sizes than unquantized ones. A larger batch size might expose memory bandwidth bottlenecks.
Solution: Benchmark different batch sizes after quantization. Your optimal throughput might be at batch size 32 instead of 64. Test and find the sweet spot.
Conclusion: The Quantization Shift
GPTQ and AWQ represent a fundamental shift in how we deploy LLMs. You no longer choose between quality and cost - you can have both. A decade ago, model compression meant training a smaller student model. Five years ago, it meant post-training distillation or pruning. Today, quantization gives you 4x smaller models with 98% of the quality and 6x faster inference. This is a generational improvement.
The reason quantization has matured so quickly is that the techniques are now well-understood and the supporting infrastructure (vLLM, Marlin kernels, Hugging Face ecosystem) is robust. You don't need PhD-level understanding to quantize a model anymore. You need good calibration data, a reasonable understanding of your hardware, and the discipline to benchmark before deploying. That's within reach of most teams.
The math is sound: Hessian-based error propagation (GPTQ) and activation-aware scaling (AWQ) preserve model behavior while compressing 4x. The practice is solid: vLLM with Marlin kernels delivers 5-6x speedup over naive quantization, and llama.cpp democratizes inference to CPU-only systems.
Choose GPTQ for stability and Mistral models. Choose AWQ for Llama and maximum speed. Choose GGUF for portability. And always benchmark on your task, not generic metrics. The cost difference is so substantial that even a 5% accuracy loss often justifies the switch. Run the math on your workload. You'll probably find quantization is a no-brainer.
The future of efficient LLMs is quantized. Make it yours.
References
- GPTQ: Accurate Post-Training Quantization of Generative Pre-trained Transformers
- The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- AWQ: MIT Han Lab Project Page
- vLLM Quantization Documentation
- The Complete Guide to LLM Quantization with vLLM
- llama.cpp GitHub Repository
- Demystifying LLM Quantization Suffixes: Q4_K_M, Q8_0, and Q6_K
- GGUF Quantization: Quality vs Speed on Consumer GPUs