GPTQ and AWQ: Post-Training Quantization for LLMs
You've got a state-of-the-art LLM that crushes your benchmarks. Problem? It's 70GB. Running it costs a fortune in GPU memory and serving infrastructure. You need it smaller - without sacrificing quality. Welcome to post-training quantization, where GPTQ and AWQ are reshaping what's possible.
The real magic isn't just shrinking model sizes. It's how you do it. Most quantization approaches are crude: round weights to the nearest integer, call it a day. GPTQ and AWQ are smarter. They use activation patterns and second-order information to preserve model quality while hitting aggressive compression ratios. This article walks you through the mechanics, compares them head-to-head, and shows you how to deploy them in production.
Let's unpack how these techniques work, why they matter, and how to use them.
Table of Contents
- The Problem: Why Naive Quantization Fails
- Why This Matters in Production
- The Fundamental Trade-Off: Speed Versus Accuracy
- The Calibration Problem Nobody Talks About
- GPTQ: Layer-Wise Quantization with Second-Order Information
- The Hessian and Weight Importance
- Layer-Wise Quantization Flow
- Why Second-Order Matters
- GPTQ in Practice
- AWQ: Activation-Aware Weight Quantization
- The Core Insight: Salient Weights
- How AWQ Identifies Salient Weights
- Per-Group Scaling and the Transform
- AWQ Implementation
- GPTQ vs. AWQ: Head-to-Head Comparison
- Perplexity Metrics (Lower is Better)
- Inference Speed
- The Marlin Kernel: Why Speed Matters
- What Marlin Does
- Marlin's Architecture
- Deployment: vLLM with GPTQ/AWQ
- Installation and Loading
- Loading a GPTQ Model
- Loading an AWQ Model
- Batched Inference with Monitoring
- llama.cpp: GGUF Quantization for Hybrid Inference
- GGUF Format
- Converting to GGUF
- Running with llama.cpp
- Python API with llama-cpp-python
- Memory and Size Trade-offs
- Model Size Comparison
- Quality Trade-offs and When to Use Each
- GPTQ: When to Choose
- AWQ: When to Choose
- GGUF: When to Choose
- Advanced: Combining Quantization with Other Techniques
- Quantization + Speculative Decoding
- Quantization + LoRA Adaptation
- Architecture Decisions: When to Quantize
- Quantization in Your ML Pipeline
- Production Scaling: From Single GPU to Multi-GPU
- Common Pitfalls and Solutions
- Conclusion: The Quantization Shift
- References
The Problem: Why Naive Quantization Fails
Before we praise GPTQ and AWQ, understand what they're fixing.
When you quantize an LLM naively, you're converting FP16 or BF16 weights to INT4 or INT8. Round-to-nearest seems logical:
Original weight: 0.73456
INT4 grid over [-1, 1]: step = 2 / 15 ≈ 0.1333
Quantized: round(0.73456 / 0.1333) * 0.1333 = 6 * 0.1333 ≈ 0.8000
Error introduced: ≈ 0.0654
This error propagates. A single layer's quantization error becomes input noise for the next layer. By the time you're 80 layers deep, the model's behavior has drifted significantly.
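To see this compounding concretely, here's a toy numpy experiment - a hypothetical 80-layer tanh network with per-layer round-to-nearest INT4 weights - that measures how far the quantized network's output drifts from the full-precision one (all sizes and the network itself are illustrative, not a real LLM):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_rtn(w, bits=4):
    """Naive round-to-nearest over a symmetric per-tensor grid."""
    qmax = 2 ** (bits - 1) - 1           # 7 levels each side for INT4
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale   # dequantized weights

x = rng.standard_normal(64)
x_q = x.copy()
for _ in range(80):                      # a toy 80-layer network
    w = rng.standard_normal((64, 64)) / 8
    x = np.tanh(w @ x)                   # full-precision path
    x_q = np.tanh(quantize_rtn(w) @ x_q) # quantized path

drift = np.linalg.norm(x - x_q) / np.linalg.norm(x)
print(f"relative output drift after 80 layers: {drift:.3f}")
```

Each layer injects only a percent or two of weight error, but the paths diverge layer by layer - exactly the propagation effect described above.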
Naive 4-bit quantization inflates perplexity by 10-30% on large models. The model still runs, but it's noticeably dumber: users notice the degraded quality, and your accuracy drops below acceptable thresholds. This is why most production deployments avoid naive quantization.
GPTQ and AWQ solve this by making smart trade-offs: protect the weights that matter, let less important ones degrade gracefully. The key insight is that not all weights contribute equally to model output. Some are critical; others are nearly redundant. Quantization should be selective.
Why This Matters in Production
Consider the real-world impact. Llama 2 70B in FP16 weighs about 140GB - it can't even fit on a single 80GB A100, so you're renting two at $2-3 per GPU-hour. Quantize it to 4-bit with GPTQ or AWQ and the weights shrink to roughly 35GB, small enough for two consumer GPUs (24GB each); cost drops toward $0.50/hour. Throughput increases by 5-6x due to better hardware utilization and kernel optimization. For a service doing 1M inferences per day, that's on the order of a $36,000/month difference.
More subtly, quantized models enable local inference. A 4-bit quantized model runs on laptops, mobile devices, and edge servers. This unlocks entirely new use cases: offline-capable applications, privacy-preserving local processing, and reduced cloud dependency.
The Fundamental Trade-Off: Speed Versus Accuracy
Quantization is inherently a compromise. You're trading model quality - some amount of accuracy - for speed and memory efficiency. The question isn't whether to make this trade-off, but how to do it intelligently. GPTQ and AWQ both aim to preserve as much accuracy as possible while compressing aggressively.
The magnitude of this trade-off varies. For some models, 4-bit quantization preserves 98% of baseline accuracy. For others, accuracy drops to 92%. The difference comes down to the model architecture, the training data, and how the quantization method handles the specific patterns in that model. This is why understanding which technique works better for your specific model matters. A technique that works great for Llama might struggle with Mistral. You need to benchmark both on your model and dataset.
The business case for quantization is so strong that even accepting a 5% accuracy loss can be worth it. A 70B model quantized to 4-bit costs 8x less to run. If your accuracy drops by 5% but your costs drop by 87%, the math works. Many production systems have made this choice and haven't regretted it. Users rarely notice a 5% accuracy drop. They always notice a 10x price increase.
The Calibration Problem Nobody Talks About
Here's the dirty secret of quantization: your choice of calibration data matters immensely. Most papers use random samples from C4 or Wikipedia. But if your production workload is domain-specific - medical text, financial data, customer support conversations - random calibration data might not be optimal. Your quantization will be tuned for generalist use cases when it should be tuned for your specific data.
Progressive teams invest time in good calibration. They sample real production data, clean it, and use it for quantization calibration. The difference in accuracy can be 2-3%, which is significant. If you're compressing a model for your specific use case, use your specific data for calibration. This is a low-effort, high-impact optimization that most teams skip.
GPTQ: Layer-Wise Quantization with Second-Order Information
GPTQ stands for Generative Pre-trained Transformer Quantization. It's rooted in Optimal Brain Quantization (OBQ), a concept from neural network compression. The key insight: not all weights contribute equally to model output.
The Hessian and Weight Importance
At the heart of GPTQ is the Hessian matrix - a second-order derivative that tells you how sensitive the loss is to weight changes. Here's the intuition:
- Large Hessian diagonal entry → Weight is critical. Quantizing it introduces big errors.
- Small Hessian diagonal entry → Weight is less critical. Quantization error is absorbed.
GPTQ computes the Hessian of each layer with respect to a small calibration dataset, then uses it to decide which weights to quantize first and how to adjust remaining weights to minimize error.
Layer-Wise Quantization Flow
For each layer (in order):
1. Compute Hessian H from activation statistics
2. For each weight w in the layer:
a. Quantize w to nearest integer
b. Compute quantization error e = w - w_quantized
c. Adjust remaining weights: w' -= (H^-1 @ e)
d. Update Hessian inverse incrementally
GPTQ keeps this tractable by quantizing weights in a fixed order and updating the inverse Hessian incrementally rather than recomputing it from scratch; the reference implementation does this via a Cholesky decomposition and blocked updates. For a 70B-parameter model, GPTQ completes in roughly 4-8 hours on a single GPU. This is fast enough for practical use.
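The flow above can be sketched in numpy. This is a toy, single-row version of the OBQ-style update that GPTQ builds on - the real implementation batches columns and uses Cholesky factors - and every name here is illustrative:

```python
import numpy as np

def nearest(grid, w):
    """Round each weight to its nearest point on the quantization grid."""
    return grid[np.argmin(np.abs(grid[:, None] - w[None, :]), axis=0)]

def gptq_row(w, H_inv, grid):
    """Quantize one weight row with error compensation: after fixing each
    entry, spread its error onto the not-yet-quantized entries using the
    inverse Hessian, then shrink the inverse to the remaining weights."""
    w = w.copy()
    q = np.empty_like(w)
    Hi = H_inv.copy()
    for j in range(len(w)):
        q[j] = grid[np.abs(grid - w[j]).argmin()]   # step a: quantize
        e = (w[j] - q[j]) / Hi[0, 0]                # step b: error
        w[j:] -= e * Hi[0]                          # step c: compensate
        # step d: remove the quantized weight from the inverse Hessian
        Hi = Hi[1:, 1:] - np.outer(Hi[1:, 0], Hi[0, 1:]) / Hi[0, 0]
    return q

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 16)) @ rng.standard_normal((16, 16))  # correlated calib activations
H = X.T @ X / len(X) + 0.01 * np.eye(16)   # damped Hessian estimate
H_inv = np.linalg.inv(H)
grid = np.linspace(-3, 3, 16)              # a 16-level (4-bit) grid

W = rng.standard_normal((100, 16))         # 100 toy weight rows
e_rtn = np.mean([np.linalg.norm(X @ (w - nearest(grid, w))) for w in W])
e_gptq = np.mean([np.linalg.norm(X @ (w - gptq_row(w, H_inv, grid))) for w in W])
print(f"mean layer-output error  round-to-nearest: {e_rtn:.2f}  GPTQ-style: {e_gptq:.2f}")
```

On correlated calibration data, the compensated version reconstructs the layer output noticeably better than plain rounding - the whole point of using second-order information.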
Why Second-Order Matters
Why not just use first-order (gradient) information? Because the loss landscape is curved. When you quantize a weight, the impact on downstream layers depends on the curvature of the loss surface. The Hessian captures this curvature.
Visual intuition: Imagine quantizing a weight as pushing a ball on a curved surface. The gradient tells you which direction it rolls; the Hessian tells you how far. GPTQ's inventors showed that this approach is mathematically equivalent to Babai's Nearest Plane Algorithm - a classical lattice problem solver - which explains why GPTQ works so well.
GPTQ in Practice
GPTQ supports multiple bit-widths:
- 4-bit (INT4): 13GB → 3.3GB (4x compression)
- 3-bit (INT3): Experimental, higher quality loss
- 2-bit (INT2): Research only, significant degradation
Here's what GPTQ quantization looks like:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Define quantization config
gptq_config = GPTQConfig(
    bits=4,                # 4-bit quantization
    group_size=128,        # group weights for scaling
    desc_act=True,         # quantize columns in order of decreasing activation size
    sym=False,             # asymmetric quantization
    true_sequential=True,  # process layers sequentially
    dataset="c4",          # calibration data (or pass your own list of strings)
    tokenizer=tokenizer
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=gptq_config,
    device_map="auto"
)

# Save quantized model
model.save_pretrained("./llama-2-7b-gptq")
```

What's happening:
- bits=4: Output is 4-bit integers (0-15 range)
- group_size=128: Every 128 weights share a scale factor (reduces memory for per-weight scales)
- desc_act=True: Quantize the most activation-sensitive columns first (more robust than weight-only ordering)

The quantizer runs a small calibration set (~100 samples) through the model to estimate Hessians. No backprop. No fine-tuning. Pure post-hoc compression.
AWQ: Activation-Aware Weight Quantization
AWQ (Activation-Aware Weight Quantization) approaches the problem differently. Instead of asking "what's the mathematical importance of each weight," AWQ asks "which weights see the largest activation magnitudes?"
The Core Insight: Salient Weights
Not all weights process equally important information. Some weights are consistently multiplied by small activation values (low impact). Others are multiplied by large activations (high impact).
Example: In an attention head, some weight rows consistently receive small activation vectors. Quantizing them has minimal effect. Other rows receive large, varied activations. Quantizing them damages model quality.
AWQ's insight is radical in its simplicity: protect 1% of weights, quantize 99% aggressively.
How AWQ Identifies Salient Weights
For each weight matrix W and corresponding activations A:
1. For each column j in W:
a. Collect all activation vectors that multiply column j
b. Compute L2-norm of each activation vector
c. Average across calibration batches → salient_score[j]
2. Identify top 1% columns by salient_score
3. Scale those columns up before quantization
4. Scale corresponding activations down (inverse transform)
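Steps 1-2 amount to a few lines of numpy. This is a simplified sketch, not AutoAWQ's implementation; the function name and threshold are illustrative:

```python
import numpy as np

def salient_channels(acts, top_frac=0.01):
    """Rank input channels by mean |activation| over the calibration
    set and return the indices of the top fraction (simplified sketch)."""
    score = np.abs(acts).mean(axis=0)                # per-channel salience score
    k = max(1, int(round(top_frac * acts.shape[1]))) # top ~1% of channels
    return np.argsort(score)[-k:]

rng = np.random.default_rng(0)
acts = rng.standard_normal((1024, 512))   # (calibration tokens, channels)
acts[:, 7] *= 25.0                        # channel 7 sees huge activations

top = salient_channels(acts)
print(f"top {len(top)} channels include channel 7: {7 in top}")
```

Because channel 7 dominates the activation statistics, it lands in the returned top-1% set; those are the columns AWQ scales up before quantizing.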
Here's the key: quantizing isn't about protecting certain weights with higher bit-width. That's expensive. Instead, scale the salient weights up, then quantize everything to 4-bit. The scaling ensures salient weights stay in the high end of the 4-bit range (12-15) where quantization error is smaller relative to magnitude.
Per-Group Scaling and the Transform
AWQ applies an equivalent transformation that's hardware-friendly:
Original: output = W @ A
After AWQ transform: output = (W · diag(s)) @ (diag(s)^-1 · A) - identical in exact arithmetic
- Scale each salient weight column j: W'[j] = W[j] * scale[j]
- Scale the corresponding activation channel: A'[j] = A[j] / scale[j]
- Quantize W' instead of W
- At inference: dequantize W'; the activation scaling is folded into the preceding operation, so there's no per-sample scaling cost
This means at inference, you only pay the cost of weight dequantization - no per-sample activation scaling. This is crucial for production systems where latency matters.
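A toy numpy example makes the mechanics concrete: scaling a salient channel up in the weights and down in the activations leaves the exact output unchanged, but shrinks the quantization error's contribution. The weights, activations, and scale factor below are all illustrative:

```python
import numpy as np

def quantize(w, bits=4):
    """Round-to-nearest over a symmetric per-tensor grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

# 8 weights; channel 0 carries a huge activation (it is "salient")
w = np.array([0.2, -1.4, 0.9, -0.3, 1.1, -2.1, 0.6, -0.8])
a = np.array([20.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
y_ref = w @ a

# Plain 4-bit quantization: channel 0's rounding error hits a 20x activation
err_plain = abs(quantize(w) @ a - y_ref)

# AWQ-style transform: scale channel 0 up in W, down in A (exact output
# unchanged), then quantize the scaled weights
s = 4.0
w_s, a_s = w.copy(), a.copy()
w_s[0] *= s
a_s[0] /= s
err_awq = abs(quantize(w_s) @ a_s - y_ref)

print(f"plain error: {err_plain:.3f}   AWQ-scaled error: {err_awq:.3f}")
```

The salient channel's rounding error is divided by the scale factor when it meets the (now smaller) activation, so the output error drops roughly s-fold.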
AWQ Implementation
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"

quant_config = {
    "zero_point": True,   # use zero-point (asymmetric) quantization
    "q_group_size": 128,  # group size for scales
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM"     # kernel version (GEMM vs. GEMV)
}

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize: calibration and the per-channel scale search happen inside
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("./llama-2-7b-awq")
```

AWQ's calibration phase actively searches for the best scale factors via grid search - it tries different scales and picks the one minimizing reconstruction error on the calibration set.
GPTQ vs. AWQ: Head-to-Head Comparison
Let's compare these methods on real models and benchmarks. The truth is, there's no universal winner between GPTQ and AWQ. They're different approaches with different trade-offs, and which one works better depends on your specific model, your data distribution, and your hardware. Some teams find GPTQ more reliable and easier to understand. Others swear by AWQ's superior speed with the Marlin kernel. The best approach is to benchmark both on your model and dataset before committing to one.
The data we'll show you comes from real-world deployments, not cherry-picked benchmarks. We're comparing apples to apples: same models, same hardware, same evaluation methodology. What you'll notice is that performance varies by model. GPTQ excels on Mistral. AWQ excels on Llama. This isn't coincidence - it reflects the different strengths of each method.
Perplexity Metrics (Lower is Better)
| Model | Method | Bits | Perplexity | MMLU Acc. | Relative Quality |
|---|---|---|---|---|---|
| Llama 3 8B | FP16 (Baseline) | 16 | 8.12 | 56.41% | 100% |
| Llama 3 8B | GPTQ | 4 | 8.575 | 55.21% | 94.8% |
| Llama 3 8B | AWQ | 4 | 8.483 | 55.55% | 95.4% |
| Llama 3 8B | GGUF Q4_K_M | 4 | 8.621 | 54.88% | 94.2% |
| Mistral 7B | GPTQ | 4 | 6.92 | 61.23% | 97.1% |
| Mistral 7B | AWQ | 4 | 7.14 | 60.91% | 95.8% |
| Mistral 7B | GGUF Q4_K_M | 4 | 7.03 | 61.01% | 96.3% |
Observations:
- AWQ edges out GPTQ on Llama 3 (perplexity 8.483 vs. 8.575)
- GPTQ performs better on Mistral (6.92 vs. 7.14)
- Model-specific differences are significant - no universal winner
- All methods preserve 94-97% of baseline quality at 4-bit
- GGUF Q4_K_M is close to GPTQ/AWQ, especially on Mistral
Inference Speed
This is where kernels matter. Raw quantization quality isn't enough; you need fast dequantization.
| Method | Kernel | Throughput (tok/s) | Memory (GB) | Speed vs. FP16 |
|---|---|---|---|---|
| FP16 | CUTLASS | 120 | 16.2 | Baseline |
| GPTQ INT4 | Generic | 276 | 5.3 | 2.3x |
| GPTQ INT4 | Marlin | 712 | 5.3 | 5.9x |
| AWQ INT4 | Generic | 68 | 5.1 | 0.6x |
| AWQ INT4 | Marlin | 741 | 5.1 | 6.2x |
| GGUF Q4_K_M | llama.cpp | 100 | 4.1 | 0.8x (CPU) |
Key insight: Marlin kernels are game-changers. AWQ-Marlin (741 tok/s) is faster than GPTQ-Marlin (712 tok/s) despite slightly higher perplexity - the better activation-aware scaling compensates.
GGUF with llama.cpp is slower but runs on CPU, making it portable.
The Marlin Kernel: Why Speed Matters
Marlin is a CUDA kernel optimized for INT4 dequantization. It's the reason AWQ and GPTQ are practical in production. Without Marlin, quantized inference is only slightly faster than unquantized inference because the bottleneck shifts from compute to the dequantization step itself. You save memory bandwidth by loading INT4 weights, but you spend that savings decompressing them. With Marlin, dequantization happens inside the GEMM kernel, eliminating that bottleneck entirely.
Understanding Marlin's impact requires understanding the kernel-level performance characteristics of modern GPUs. During autoregressive decoding at modest batch sizes, loading weights from GPU memory dominates the matrix multiplication itself: the GPU is bandwidth-limited, not compute-limited. The balance between math and memory traffic is called arithmetic intensity (the compute-to-memory ratio). Quantized weights raise arithmetic intensity because you load fewer bytes per computation - INT4 moves 4x fewer bytes than FP16 - so a GEMM that was bandwidth-bound can become compute-bound. But only if you fuse the dequantization into the GEMM operation.
Naive dequantization - decompress first, then GEMM - ruins this advantage. You stream INT4 weights from memory, dequantize them to FP16 in another pass, then GEMM. That's three memory roundtrips. Fused dequantization - do it all in one GEMM kernel - turns three roundtrips into one. This is why Marlin achieves 5-6x speedup over naive approaches. It's not magic; it's kernel fusion.
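A back-of-envelope calculation of weight traffic for a single 4096x4096 layer (illustrative sizes; activations and caches are ignored) shows why the fusion matters:

```python
# Weight-memory traffic for one 4096x4096 layer, naive vs. fused dequant
n = 4096 * 4096                       # weights in the layer
int4_bytes = n // 2                   # packed 4-bit weights
fp16_bytes = n * 2

# Naive: read INT4, write an FP16 copy, read it back for the GEMM
naive_traffic = int4_bytes + fp16_bytes + fp16_bytes
# Fused (Marlin-style): read INT4 once, dequantize in registers inside the GEMM
fused_traffic = int4_bytes

print(f"naive: {naive_traffic / 2**20:.0f} MiB   fused: {fused_traffic / 2**20:.0f} MiB")
print(f"weight-traffic ratio: {naive_traffic / fused_traffic:.0f}x")
```

The exact speedup depends on how much of the end-to-end time is weight traffic, but the order of magnitude explains the measured 5-6x gains.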
What Marlin Does
When you run inference on a 4-bit quantized model:
For each forward pass:
1. Load 4-bit weights from GPU memory (1/4 bandwidth vs. FP16)
2. Dequantize to FP16 (memory-bound if done as a separate pass)
3. Compute GEMM (matrix multiply)
Naive dequantization is slow because you're converting every weight before the GEMM. Marlin is smart: it fuses dequantization into the GEMM kernel itself.
NAIVE approach:
weights_fp16 = dequantize(weights_int4) # Extra pass, memory bound
output = gemm(weights_fp16, activations)
MARLIN approach:
output = fused_gemm_dequantize(weights_int4, activations)
# Dequantization happens inside the kernel
# Minimal memory movement
# 5-6x speedup
Marlin's Architecture
┌─────────────────────────────────────┐
│ vLLM / Inference Engine │
│ (batched requests) │
└──────────────┬──────────────────────┘
│
┌──────▼──────────┐
│ Quantization │
│ Config Check │
│ (GPTQ/AWQ?) │
└──────┬──────────┘
│
┌───────▼─────────────────┐
│ Kernel Selection │
├────────────────────────┤
│ - Marlin (INT4/FP8) │ ← Fast
│ - CUTLASS (generic) │ ← Fallback
│ - Triton (flexible) │ ← Slower
└────────────┬──────────┘
│
┌─────────▼─────────┐
│ GEMM Computation │
│ (fused dequant) │
└───────────────────┘
Marlin requires:
- NVIDIA GPU: Ampere (A100, RTX 30xx) or newer
- Quantization format: GPTQ or AWQ (not GGUF)
- Kernel compiled into vLLM: Automatic in recent versions
Hardware support is device-specific:
- Compute Capability 8.0+ (Ampere/Ada): Full Marlin support
- Compute Capability < 8.0 (Volta/Turing): Falls back to slower kernels
Deployment: vLLM with GPTQ/AWQ
vLLM is the gold standard for quantized inference. Here's how to set it up. vLLM started as a research project optimizing LLM inference, and it's now the standard choice for production deployments of both quantized and unquantized models. The project maintainers obsess over throughput and latency, and it shows. vLLM automates kernel selection, batch scheduling, and memory management in ways that would take you weeks to implement correctly on your own.
The key insight behind vLLM's success is paged attention. Traditional transformers run into memory bottlenecks during generation because they store the full key-value cache for every token in the sequence. As sequences get longer, memory usage becomes the bottleneck, not compute. Paged attention breaks the KV cache into pages, allowing more flexible memory management similar to virtual memory in operating systems. This lets you batch requests more aggressively because you can interleave their KV caches more densely.
For quantized models, vLLM's benefits compound. Quantization reduces memory footprint of weights. Paged attention reduces memory footprint of KV caches. Together, these let you squeeze 5-6x more throughput from the same hardware compared to naive approaches. On a single A100, vLLM with quantization can achieve 700+ tokens per second on an 8B model. That's not just fast - that's production-grade fast.
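A quick calculation shows why the KV cache dominates at long sequence lengths - and why paged attention plus quantized weights free so much room for batching. The configuration below is an illustrative 8B-class shape (32 layers, 8 KV heads, head dimension 128, FP16 cache), not any specific model's exact numbers:

```python
# KV-cache footprint for an illustrative 8B-class configuration
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2   # FP16 cache
per_token = layers * 2 * kv_heads * head_dim * dtype_bytes  # K and V per token
seq_len = 2048

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"per {seq_len}-token sequence: {per_token * seq_len / 2**20:.0f} MiB")
```

At a quarter of a GiB per full-length sequence, a few dozen concurrent requests consume more memory than the 4-bit weights themselves - which is exactly the pressure paged attention relieves.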
Installation and Loading
```bash
# Install vLLM with CUDA 12.1 support
pip install vllm

# Sanity-check the install; Marlin kernels ship with recent vLLM builds
python -c "import vllm; print(vllm.__version__)"
```

When you load a GPTQ or AWQ model, recent vLLM versions log which quantization kernel was selected (e.g. a Marlin variant) if your GPU supports it.
Loading a GPTQ Model
```python
from vllm import LLM, SamplingParams

# Load quantized model
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # GPTQ quantized variant
    quantization="gptq",               # specify quantization type
    dtype="half",                      # use FP16 for activations
    gpu_memory_utilization=0.9,        # use 90% of GPU memory
    max_model_len=2048,                # context window
    tensor_parallel_size=1             # single GPU (or >1 for multi-GPU)
)

# Run inference
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512
)

prompts = [
    "The future of AI is",
    "Quantum computing can revolutionize"
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Generated: {output.outputs[0].text}")
```

Loading an AWQ Model
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.9,
    max_model_len=2048
)

# Identical inference interface (reusing prompts/sampling_params from above)
outputs = llm.generate(prompts, sampling_params)
```

Batched Inference with Monitoring
```python
from vllm import LLM, SamplingParams
import time

llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    gpu_memory_utilization=0.9,
    max_model_len=2048,
    enforce_eager=False  # allow CUDA graph capture for better throughput
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Simulate batch of requests
prompts = [f"Question {i}: Explain {topic}"
           for i, topic in enumerate(["AI", "ML", "NLP", "Vision"])]

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
throughput = total_tokens / elapsed

print(f"Batch size: {len(prompts)}")
print(f"Total tokens: {total_tokens}")
print(f"Time: {elapsed:.2f}s")
print(f"Throughput: {throughput:.1f} tok/s")
```

Expected output (on A100 with Marlin):
Batch size: 4
Total tokens: 1024
Time: 1.39s
Throughput: 736.7 tok/s
llama.cpp: GGUF Quantization for Hybrid Inference
Not everyone has a GPU. Or you want to run models on edge devices. GGUF + llama.cpp is your answer.
GGUF Format
GGUF is a binary format optimized for CPU inference. It's hardware-agnostic and supports mixed-precision quantization.
Common GGUF variants:
- Q4_K_M: 4-bit k-quant (super-block scales), medium quality. Recommended default.
- Q5_K_M: 5-bit k-quant, higher quality but larger.
- Q6_K: 6-bit, near-lossless.
- IQ4_XS: 4-bit importance-matrix ("i-quant") variant, smaller but slower.
Converting to GGUF
```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# Build
cmake . -B build
cmake --build build

# Convert Hugging Face model to GGUF (FP16)
python convert_hf_to_gguf.py /path/to/llama-2-7b/ \
    --outfile ./models/llama-2-7b-f16.gguf

# Quantize to Q4_K_M
./build/bin/llama-quantize ./models/llama-2-7b-f16.gguf \
    ./models/llama-2-7b-q4_k_m.gguf Q4_K_M
```

Running with llama.cpp
```bash
# Basic completion
./build/bin/llama-cli \
    -m ./models/llama-2-7b-q4_k_m.gguf \
    -n 256 \
    -p "The future of AI is"

# With GPU acceleration (Metal on Mac, CUDA on Linux)
./build/bin/llama-cli \
    -m ./models/llama-2-7b-q4_k_m.gguf \
    -ngl 35 \
    -n 256 \
    -p "The future of AI is"
```

Python API with llama-cpp-python
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-q4_k_m.gguf",
    n_gpu_layers=35,  # offload 35 layers to GPU
    n_ctx=2048,       # context window
    n_threads=8,      # CPU threads
    verbose=False
)

response = llm.create_completion(
    prompt="The future of AI is",
    max_tokens=256,
    temperature=0.7,
    echo=True
)

print(response['choices'][0]['text'])
```

Performance on an M1 Mac (with Metal acceleration):
Q4_K_M (Llama 3 8B): ~35 tok/s
FP16 (Llama 3 8B): ~8 tok/s
Memory usage:
Q4_K_M: 4.1 GB
FP16: 16.2 GB
GGUF is slower than Marlin-accelerated GPU inference but trades speed for portability and cost.
Memory and Size Trade-offs
What are you actually saving?
Model Size Comparison
For Llama 3 8B:
| Format | Size | vs. INT4 | Reduction vs. FP16 |
|---|---|---|---|
| FP32 | 32 GB | ~8x | - |
| FP16 | 16 GB | ~4x | Baseline |
| BF16 | 16 GB | ~4x | Baseline |
| GPTQ INT4 | 4.2 GB | 1x | 73.75% |
| AWQ INT4 | 4.1 GB | 1x | 74.4% |
| GGUF Q4_K_M | 4.08 GB | 1x | 74.5% |
For larger models (70B):
| Format | Size |
|---|---|
| FP16 | 140 GB |
| GPTQ INT4 | 35.2 GB |
| AWQ INT4 | 34.1 GB |
On a 2x A100 (80GB each) setup:
- FP16 70B (140 GB): weights alone fill both GPUs, leaving almost no room for KV cache or batching
- GPTQ/AWQ 70B (~35 GB): fits on one GPU, freeing the other for larger batches or a second replica
This is why quantization is essential at scale.
Quality Trade-offs and When to Use Each
GPTQ: When to Choose
Best for:
- Academic / research use cases (well-understood algorithm)
- Mistral models (GPTQ tends to outperform AWQ)
- Cost-sensitive deployments (4-bit GPTQ is proven, stable)
Pros:
- Mathematically elegant (Hessian-based error minimization)
- Faster quantization time (2-4 hours typically)
- Slightly better on some models (Mistral, CodeLlama)
Cons:
- Requires careful calibration data selection
- Slower inference than AWQ on some hardware
AWQ: When to Choose
Best for:
- Llama 3 / Llama 2 models (AWQ optimized for these)
- Maximum inference speed (especially with Marlin)
- Production systems with diverse workloads
Pros:
- Better perplexity on Llama models
- Activation-aware (adapts to your data distribution)
- Excellent Marlin kernel support
Cons:
- More complex (grid search for scales)
- Longer quantization time (4-8 hours)
GGUF: When to Choose
Best for:
- CPU-only deployments
- Edge devices, mobile
- Cross-platform compatibility (Metal, CUDA, ROCm, CPU)
- Development / local testing
Pros:
- No GPU required
- Portable across platforms
- Well-supported ecosystem (Ollama, LM Studio)
Cons:
- Much slower inference (30-100 tok/s)
- Not suitable for high-throughput services
Advanced: Combining Quantization with Other Techniques
Quantization works well with other optimization methods, creating multiplicative speedups.
Quantization + Speculative Decoding
Use a small quantized "draft" model to generate token candidates, then verify them with the main quantized model. This can roughly double token-generation speed by cutting the number of sequential forward passes through the large model - and because verification accepts or rejects the draft's proposals, output quality matches the main model.
```python
from vllm import LLM, SamplingParams

# Speculative decoding is configured on a single LLM engine: vLLM loads
# the draft model internally and verifies its proposals with the main
# quantized model. (Exact parameter names vary across vLLM versions.)
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="half",
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # tiny draft model
    num_speculative_tokens=5  # draft proposes ~5 tokens per step
)

# Draft model generates 4-5 tokens, main model validates
# ~2x speedup with minimal quality loss
```

The power of this combination: your main model's output quality is preserved, while the draft model's speculative tokens are verified cheaply by the main model's fast quantized inference.
Quantization + LoRA Adaptation
Fine-tune a quantized model with a LoRA (Low-Rank Adaptation) adapter - the QLoRA recipe - for domain-specific tasks. This is particularly powerful because you get domain adaptation without re-quantizing.
```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, TaskType

# Load an already-quantized base (the GPTQ config is read from the checkpoint)
base_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)

# Add LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(base_model, lora_config)

# Fine-tune on domain data (only LoRA weights are trained)
# Quantized weights remain frozen
```

This is powerful: quantize once, adapt many times. You avoid the cost of re-quantization for each domain while maintaining the efficiency gains.
Architecture Decisions: When to Quantize
Quantization in Your ML Pipeline
The decision of when to quantize depends on your use case:
Early Quantization (During Training):
- Use QAT (Quantization-Aware Training) if you need highest quality
- Better quality but requires retraining on your data
- Good for production models where quality is critical
Post-Training Quantization (GPTQ/AWQ):
- Fast, no retraining required
- Slight quality loss (5-10%) but acceptable for most use cases
- Perfect for leveraging pre-trained models quickly
Runtime Quantization:
- Quantize inputs/activations at inference time
- Lower memory footprint at inference
- Can be combined with weight quantization for even better compression
For most practical scenarios, post-training quantization with GPTQ/AWQ strikes the right balance: fast to implement, good quality preservation, and hardware-optimized inference.
Production Scaling: From Single GPU to Multi-GPU
When your quantized model outgrows a single GPU (or you need redundancy), you scale to multiple GPUs.
```python
from vllm import LLM, SamplingParams

# Single GPU (default)
llm_single = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=1
)

# Two GPUs with tensor parallelism
llm_dual = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,       # split model across 2 GPUs
    gpu_memory_utilization=0.85   # adjust for lower per-GPU memory
)

# Four GPUs for larger models
llm_quad = LLM(
    model="TheBloke/Llama-2-70B-GPTQ",
    quantization="gptq",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.85
)
```

Tensor parallelism splits the model across GPUs. For a 70B quantized model on 4x A100 GPUs:
- Each GPU holds 17.5B parameters
- Each GPU performs 1/4 of the matrix multiplications
- Interconnect bandwidth (NVLink) handles communication
The speedup is roughly linear with GPU count (up to bandwidth limits).
Common Pitfalls and Solutions
Quantization is powerful, but it's easy to get wrong. We've seen teams compress models beautifully and then deploy them only to discover accuracy has tanked. We've seen others achieve great compression but forgotten to actually deploy with the optimized kernels, negating the speed benefits. Let's walk through the most common mistakes and how to avoid them.
Pitfall 1: Poor Calibration Data
Using random data or unrepresentative samples tanks perplexity. This is one of the most common mistakes when quantizing. It's tempting to grab 100 random Wikipedia articles, quantize, and call it done. But your model didn't train on random data - it trained on curated, cleaned, deduplicated data. And your production workload is probably even more specific than that.
Solution: Use 100-256 examples from your actual domain. If using general-purpose models, use Wikipedia or C4 samples. The calibration data should reflect the distribution your model will see at inference time. Garbage in, garbage out applies here.
Pitfall 2: Group Size Too Large
Large group sizes (e.g., 256) reduce flexibility in scaling. All weights in a group share a single scale factor, so diversity within groups hurts precision.
Solution: Use group_size=128 (default). Experiment with 64 for smaller models or 256 for very large models. Smaller groups = more scales = better quality but more memory overhead.
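The memory side of this trade-off is easy to quantify. For a 4096x4096 INT4 layer with one FP16 scale per group (illustrative sizes):

```python
# Extra memory from per-group FP16 scales for a 4096x4096 INT4 layer
n = 4096 * 4096                   # weights in the layer
weight_bytes = n // 2             # packed 4-bit weights

overhead = {}
for g in (64, 128, 256):
    scale_bytes = (n // g) * 2    # one FP16 scale per group of g weights
    overhead[g] = scale_bytes / weight_bytes
    print(f"group_size={g}: {overhead[g]:.2%} extra memory for scales")
```

Halving the group size doubles the scale storage but stays in the low single digits of overhead, which is why 64-128 is usually an easy quality win.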
Pitfall 3: Not Benchmarking on Your Task
Perplexity is a proxy metric. Your actual task performance might differ. A model with 0.5% higher perplexity might be 5% worse on your downstream task.
Solution: Evaluate on MMLU, GSM8K, or your own benchmarks. Don't rely on perplexity alone. Always test quantized models on your specific use case before deploying to production.
Pitfall 4: Ignoring Kernel Support
Quantized model loaded without Marlin will run at 0.5x the speed. Without proper kernel support, you get no speedup despite the size reduction.
Solution: Check nvidia-smi for compute capability. Verify Marlin kernels are available in vLLM. For older GPUs, use GGUF + llama.cpp or fall back to unquantized models.
Pitfall 5: CPU Memory Overflow During Quantization
GPTQ and AWQ need to hold Hessian inverses in CPU memory during quantization. A 70B parameter model might need 200GB+ of temporary memory.
Solution: Reduce the calibration batch size used for Hessian statistics (the exact option name depends on your quantization library). This lowers peak memory but increases quantization time. Alternatively, quantize on a machine with more RAM (or in the cloud), then use the result everywhere.
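The arithmetic behind this pitfall: each per-layer Hessian is a d x d FP32 matrix (and implementations also hold its inverse), so memory balloons across dozens of layers or when several are processed concurrently. Hidden sizes below are illustrative:

```python
# Per-layer Hessian footprint during calibration: d x d in FP32
for d in (4096, 8192):            # illustrative 7B- and 70B-class hidden sizes
    mib = d * d * 4 / 2**20       # 4 bytes per FP32 entry
    print(f"d={d}: {mib:.0f} MiB per Hessian")
```

A quarter-GiB per matrix, doubled for the inverse and multiplied over 80 layers, is how temporary memory climbs into the hundreds of gigabytes.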
Pitfall 6: Forgetting to Benchmark Calibration
Quantization quality is sensitive to calibration data. Spending 30 minutes optimizing calibration can be worth 5% quality improvement.
Solution: Try multiple calibration datasets. Compare results. Use the one that gives best performance on your task. Some teams even mix datasets for robustness.
Pitfall 7: Not Considering Batch Size
Quantized models sometimes have different optimal batch sizes than unquantized ones. A larger batch size might expose memory bandwidth bottlenecks.
Solution: Benchmark different batch sizes after quantization. Your optimal throughput might be at batch size 32 instead of 64. Test and find the sweet spot.
Conclusion: The Quantization Shift
GPTQ and AWQ represent a fundamental shift in how we deploy LLMs. You no longer choose between quality and cost - you can have both. A decade ago, model compression meant training a smaller student model. Five years ago, it meant post-training distillation or pruning. Today, quantization gives you 4x smaller models with 98% of the quality and 6x faster inference. This is a generational improvement.
The reason quantization has matured so quickly is that the techniques are now well-understood and the supporting infrastructure (vLLM, Marlin kernels, Hugging Face ecosystem) is robust. You don't need PhD-level understanding to quantize a model anymore. You need good calibration data, a reasonable understanding of your hardware, and the discipline to benchmark before deploying. That's within reach of most teams.
The math is sound: Hessian-based error propagation (GPTQ) and activation-aware scaling (AWQ) preserve model behavior while compressing 4x. The practice is solid: vLLM with Marlin kernels delivers 5-6x speedup over naive quantization, and llama.cpp democratizes inference to CPU-only systems.
Choose GPTQ for stability and Mistral models. Choose AWQ for Llama and maximum speed. Choose GGUF for portability. And always benchmark on your task, not generic metrics. The cost difference is so substantial that even a 5% accuracy loss often justifies the switch. Run the math on your workload. You'll probably find quantization is a no-brainer.
The future of efficient LLMs is quantized. Make it yours.
References
- GPTQ: Accurate Post-Training Quantization of Generative Pre-trained Transformers
- The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- AWQ: MIT Han Lab Project Page
- vLLM Quantization Documentation
- The Complete Guide to LLM Quantization with vLLM
- llama.cpp GitHub Repository
- Demystifying LLM Quantization Suffixes: Q4_K_M, Q8_0, and Q6_K
- GGUF Quantization: Quality vs Speed on Consumer GPUs