If you’re deploying Large Language Models (LLMs) in production, your GPU decision is the single biggest driver of throughput, cost, and the maximum model size you can support.
This guide breaks down NVIDIA’s three most-deployed data-center GPUs—T4, A10, and A100—with deep dives into their architecture, supported quantization levels (FP32, FP16, FP8, INT8, INT4), real-world token-per-second numbers, and cloud cost efficiency.
Overview: Why GPU Choice Matters
Modern LLMs—from Mistral to Llama-3 to Qwen-2.5—demand huge compute and memory. Picking the right GPU means:
- Higher throughput (more requests/sec)
- Lower latency
- Lower cost-per-token
- Support for larger or more models per node
- Flexibility for quantized (FP8, INT8, INT4) or full-precision (FP16, FP32) deployments
Whether you’re building a chatbot, a multi-modal assistant, or running document Q&A, everything comes down to two numbers: tokens-per-second (TPS) and dollars-per-month. Choosing the right GPU for your LLM workload ensures you get the best possible performance and lowest cost—while supporting the models (and model sizes) your users need.
But there’s more: modern inference engines like vLLM, TensorRT-LLM, and Ollama each have unique strengths—and limitations—when it comes to batching, quantization support, and tool-calling.
Let’s break down the key technical points, then give you a foolproof way to plan your own deployment.
GPU Architecture & Specs
GPU | Architecture | Launch | FP32 TFLOPS | FP16 Tensor TFLOPS | Low-Precision Support | VRAM | Bandwidth |
---|---|---|---|---|---|---|---|
T4 | Turing | 2018 | 8.1 TF | 65 TF | INT8/INT4 | 16GB | 320 GB/s |
A10 | Ampere | 2021 | 31 TF | 125 TF | INT8/INT4 (FP8 via software) | 24GB | 600 GB/s |
A100 40G | Ampere | 2020 | 156 TF (TF32 Tensor) | 312 TF | INT8/INT4 (FP8 via software) | 40GB | 1.6 TB/s |
Note: none of these cards has native FP8 tensor cores (those arrive with NVIDIA's Hopper/Ada generations). On A10/A100, FP8-quantized checkpoints rely on engine-level support, while INT8/INT4 run directly on the tensor cores; the T4 supports INT8/INT4 but not FP8.
Not all GPUs are created equal. Here’s what really affects LLM deployment:
- FP16/FP8/INT8/INT4 support: Lower-precision means smaller models and higher speed—if your card (and engine) support it.
- GPU RAM: Larger models (more parameters) and longer conversation windows need more memory. Quantization lets you squeeze in more.
- Tensor throughput: This is the real-world “muscle” for fast text generation, especially with large batches.
The NVIDIA Lineup:
- T4: Entry-level, 16 GB RAM, older Turing architecture, solid for small/medium models, excels at INT8, budget-friendly.
- A10: Modern Ampere, 24 GB RAM, up to 3× T4 speed, strong support for FP16/INT8/INT4, great value for most production cases.
- A100: Premium Ampere, 40–80 GB HBM2, huge bandwidth and tensor throughput, best for massive models and highest concurrency.
What Models (and Quantization Levels) Fit on Which GPU?
One of the most common mistakes is trying to deploy a model that simply doesn’t fit in GPU memory, or that leaves no space for prompt cache/KV cache.
Model Fitting Formula: The Calculator You Need
To estimate whether a model will fit in your GPU:
Model Memory (GB) = (Num Parameters × Precision (bytes)) ÷ (1024³)
Where:
- Num Parameters = e.g., 7B = 7 × 10⁹
- Precision (bytes): FP32 = 4, FP16 = 2, FP8/INT8 = 1, INT4 = 0.5
- Add at least 10–20% headroom for KV cache, CUDA, runtime buffers.
Example:
- Llama-2 7B at FP16: (7e9 × 2) ÷ 1024³ ≈ 13 GB
- Llama-2 70B at INT4: (70e9 × 0.5) ÷ 1024³ ≈ 32.6 GB (needs sharding on A10 or A100)
Rule:
- If Model Memory + Overhead ≤ GPU VRAM, it fits!
- Otherwise, use multiple GPUs (sharding) or further quantize.
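Here is that calculation as a minimal Python sketch. The byte sizes come from the list above; the 20% headroom factor is an assumption you can tune for your own KV-cache and runtime overhead:

```python
# Quick estimator for whether a model's weights fit in GPU VRAM.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def model_memory_gb(num_params: float, precision: str, headroom: float = 0.20) -> float:
    """Weight memory in GB, plus a headroom factor for KV cache / CUDA buffers."""
    weights_gb = num_params * BYTES_PER_PARAM[precision] / 1024**3
    return weights_gb * (1 + headroom)

print(f"Llama-2 7B  FP16: {model_memory_gb(7e9,  'fp16'):.1f} GB")   # ~15.6 GB incl. headroom
print(f"Llama-2 70B INT4: {model_memory_gb(70e9, 'int4'):.1f} GB")   # ~39.1 GB incl. headroom
```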
How Quantization (FP16, FP8, INT8, INT4) Changes the Game
Quantization reduces model size and increases throughput—often with negligible accuracy loss. Here’s how each works:
- FP16: Half precision for weights; near-full quality, but model size is capped by GPU VRAM.
- FP8/INT8: One byte per weight (half the size of FP16); minimal accuracy loss. INT8 is well supported on all three cards, while FP8 on A10/A100 depends on engine-level support.
- INT4: Shrinks models to ¼ the size of FP16; enables running 70B+ models on mid-tier GPUs with modern inference engines.
Caution: Not all engines and models support every quantization—check your stack’s documentation!
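For reference, here is a minimal, non-authoritative sketch of loading a 4-bit (AWQ) checkpoint with vLLM. The model name is illustrative, and the quantization argument must match how the weights were actually quantized:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # illustrative 4-bit AWQ checkpoint
    quantization="awq",                     # must match the checkpoint's quantization scheme
    gpu_memory_utilization=0.90,            # leave some VRAM headroom for the KV cache
)

outputs = llm.generate(
    ["Explain the KV cache in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```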
Throughput Benchmarks: FP16, FP8, INT8, INT4
Model (quantization) | T4 (tok/s) | A10 (tok/s) | A100 (tok/s) |
---|---|---|---|
Llama-3 8B FP16 | 20 | 42 | 130 |
Mistral 7B FP16 | 12 | 30 | 95 |
Qwen-2.5 72B FP16 (8-way) | – | 55* | 120* |
Qwen-2.5 72B INT4 (2-way) | – | 38* | 100–120* |
Llama-3 70B FP16 (A100) | – | – | 45–80 |
Llama-3 70B INT4 | – | ~20 (with 2 A10) | 70–100 |
*“Multi-way” = model is split across 2–8 GPUs, typical for >30B models.
Observations:
- A10 is about 3× faster than T4, and delivers about ⅓ of A100’s raw throughput.
- Quantization (INT8, INT4, FP8) often increases throughput by 10–40% and lets you fit much larger models on each card.
- INT4 kernels run best on Ampere GPUs (A10, A100), while the T4 is better served by INT8 than INT4 due to kernel support.
- That ~3× throughput edge over the T4 usually comes at only 1.5–2× the hourly price.
- A100 is the “supercar”: fastest, but far more expensive—only worth it for the highest loads or largest models.
Cost-Per-Token: The True Measure of GPU Value
It’s not just about speed, but also about dollars per million tokens generated.
Example (A10, INT4, 2 GPUs, 24×7 use):
- 38 tokens/sec × 86,400 sec/day = ~3.28M tokens/day
- $6.52/hour × 24 = $156/day
- Cost per million tokens: $156 ÷ 3.28M ≈ $48 per million tokens for a single request stream; with batching (many concurrent requests), aggregate throughput rises and cost per token drops dramatically (see the helper sketch and the table below).
Higher throughput = lower cost per token, if you keep GPUs busy!
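To plug in your own numbers, here is that arithmetic as a small helper. The 30× factor in the second call is a hypothetical batching/concurrency level, used only to show how the much lower $/M figures in the table below can arise:

```python
def cost_per_million_tokens(tokens_per_sec: float, dollars_per_hour: float) -> float:
    """Dollars per 1M generated tokens, assuming the GPUs are billed 24/7."""
    tokens_per_day = tokens_per_sec * 86_400
    dollars_per_day = dollars_per_hour * 24
    return dollars_per_day / (tokens_per_day / 1e6)

print(cost_per_million_tokens(38, 6.52))       # single stream on 2x A10: ~$48 / M tokens
print(cost_per_million_tokens(38 * 30, 6.52))  # ~30 concurrent streams: ~$1.59 / M tokens
```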
Cloud Pricing & Cost-Per-Token
VM SKU | GPU | Hourly Rate | 30-Day Cost (24/7) |
---|---|---|---|
NV18ads | 1 × A10 | $1.60 | $1,152 |
NV72ads | 2 × A10 | $6.52 | $4,694 |
n1-standard-4 + T4 (GCP) | 1 × T4 | $0.55 | $396 |
ND40rs v2 | 1 × A100 | ~$12 | $8,640 |
Cost per 1 Million tokens (typical LLM serving)
Cluster | TPS | Daily $ | Tokens/day | $/M tok |
---|---|---|---|---|
2 × A10, INT4 | 38 | $156 | 98M | $1.59 |
1 × A100, INT4 | 120 | $288 | 300M | $0.96 |
1 × T4, INT8 | 13 | $13.20 | 34M | $0.38 |
Note: the tokens/day figures assume batched serving with many concurrent requests, so aggregate throughput is far above the single-stream TPS column. Higher throughput = lower $/token IF you keep the GPU fully loaded.
Inference Stack Choices: vLLM vs TensorRT-LLM vs Ollama
What does your inference engine enable?
- vLLM: Great batching, paged attention, easy multi-GPU with Ray, simple to use, supports FP16/INT8/INT4. Best for flexible, rapid deployments.
- TensorRT-LLM: Highest raw TPS, especially for INT4/FP8. Complex initial setup (model compilation), full OpenAI-compatible tool calling (multi-call).
- Ollama: User-friendly, desktop/edge deployment, supports INT4/INT8 for small models, not suited for big concurrent server workloads.
Practical Planning: Step-By-Step
- Pick your model: Find out parameter count (e.g., 7B, 70B).
- Choose quantization: INT4 if you want to run big models; FP16 for max accuracy.
- Calculate model memory: Use the formula above.
- Add 20% for overhead: For attention cache, CUDA, etc.
- Compare to GPU VRAM: If not enough, use more GPUs (sharding) or smaller models/stronger quantization (see the fit-check sketch after this list).
- Match to your inference stack: vLLM for fast deployment, TensorRT-LLM for highest TPS, Ollama for small/edge jobs.
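As a rough illustration of the planning steps above, here is a simple fit check. The VRAM figures come from the spec table earlier; the even-split sharding logic is a simplification for planning purposes, not a deployment guarantee:

```python
import math

GPU_VRAM_GB = {"T4": 16, "A10": 24, "A100-40G": 40}
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def gpus_needed(num_params: float, precision: str, gpu: str, headroom: float = 0.20) -> int:
    """Number of cards needed to hold the weights plus ~20% overhead (naive even split)."""
    need_gb = num_params * BYTES_PER_PARAM[precision] / 1024**3 * (1 + headroom)
    return math.ceil(need_gb / GPU_VRAM_GB[gpu])

print(gpus_needed(70e9, "int4", "A10"))       # 2 -> shard across two A10s
print(gpus_needed(70e9, "int4", "A100-40G"))  # 1 -> tight fit on a single A100 40GB
print(gpus_needed(7e9,  "fp16", "T4"))        # 1
```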
Stack | Batching | Quant Support | Multi-GPU/Node | Tool Calling |
---|---|---|---|---|
vLLM | Yes | FP16/INT8/INT4 | Yes (Ray) | Single-call, guided |
TensorRT-LLM + Triton | Yes | FP8/INT8/INT4 | Yes | Multi-call, strict OpenAI |
Ollama | Limited | INT4/INT8 | No | Queue only |
Alternative inference stacks (no Ray)
Stack | GPU parallel back-end | K8s integration | When to choose |
---|---|---|---|
DeepSpeed-Inference | Megatron tensor- & pipeline-parallel; NCCL/MP | Helm chart / raw Deployment | Mixed FP8/INT8 kernels; huge text models (deepspeed.ai) |
TensorRT-LLM + Triton | Custom CUDA kernels; gRPC all-reduce daemons | Triton charts, KServe | Max TPS; needs model → TRT conversion (aws.amazon.com) |
Hugging Face TGI | Accelerate tensor-parallel (MPI/NCCL) | Official Helm chart | Smaller 7–70B models; auto batching (huggingface.co)
Horovod serving | MPI/NCCL; user-written loop | Any K8s | DIY; good if org already uses Horovod (github.com) |
KServe + Triton | Triton multi-instance | Native KServe | Good for multi-model fleets (alibabacloud.com) |
How do vLLM and TensorRT-LLM differ in their approach to tool and function calling?
The short answer: both vLLM and TensorRT-LLM support the workflows required for tool and function calling, but they do so with different philosophies and technical implementations, which makes each better suited to certain scenarios.
Recent vLLM releases do expose the same "tool / function-calling" schema that OpenAI and TensorRT-LLM's OpenAI frontend use, but there are a few feature gaps and behavioural differences you should know about before switching stacks. vLLM implements tool calling through guided decoding (via the Outlines library) and accepts the `tools`, `tool_choice`, and `tool_calls` fields on the `/v1/chat/completions` endpoint, just as GPT-4o or Triton do (docs.vllm.ai).
In practice it "passes through" any valid OpenAI-style request, guarantees syntactically correct JSON for the function name + arguments, and streams the response. What it does not yet do is (a) support every `tool_choice` mode (`auto`/`none` are fine, `required` has corner cases), (b) inject the tool schema into the prompt for you, or (c) post-process multiple tool calls in one turn; those items still sit on vLLM's roadmap.
How vLLM’s tool calling works
Endpoint & request format
- Start your server exactly the same way you would for normal chat:
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-8b-Instruct
- POST a request that contains a `tools` array (OpenAI JSON schema) and, optionally, a `tool_choice` (docs.vllm.ai). vLLM runs guided decoding, so the next assistant message will be a JSON blob like the one below (a client-side sketch of such a request follows the example).
{
"role":"assistant",
"tool_calls":[
{"id":"call_01","name":"getWeather","arguments":{"city":"Doha"}}
]
}
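And a minimal client-side sketch using the OpenAI Python SDK, assuming the server started above is listening on localhost:8000; the getWeather tool is the hypothetical function from the JSON example:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "getWeather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3-8b-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Doha?"}],
    tools=tools,
    tool_choice="auto",
)

# vLLM's guided decoding guarantees parseable JSON in the tool-call arguments.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```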
Guarantees & current limits
- Structural validity: Outlines ensures every return value parses as JSON (docs.vllm.ai).
- One-shot only: the stream stops after the first tool call; chaining tools requires orchestration in your app (reddit.com).
- `required` edge cases: if you set `tool_choice={"type":"function","function":{"name":"..."}}` you may still get a second call; GitHub issue #9991 tracks this (github.com).
- Strict schemas: libraries that validate the OpenAI spec (e.g., the Vercel AI SDK) need `"type":"function"` in each chunk; a patch is pending (github.com).
Comparison with TensorRT-LLM’s OpenAI frontend
Feature | vLLM 0.4+ | TensorRT-LLM 24.04 |
---|---|---|
Accepts tools , tool_choice , returns tool_calls | ✔ (single call) (docs.vllm.ai) | ✔ (multi-call supported) (docs.nvidia.com) |
Autogenerates function JSON if model ignores schema | ✖ (caller must prompt) (github.com) | ✔ (guided decoding with fallback) (docs.nvidia.com) |
Multiple tools in one turn | WIP [#1869] (github.com) | ✔ |
Streaming chunks pass strict OpenAI validators | type field patch pending [#16340] (github.com) | ✔ |
Backend | Python, PagedAttention; Ray optional | C++ Triton backend; TensorRT engines |
Typical decode TPS (A10 dual-GPU) | 38 tok/s (4-bit Qwen-72B) | 48–55 tok/s with the same engine (github.com) |
When to keep vLLM vs. move to TensorRT-LLM
- Keep vLLM if you need rapid prototyping, frequent model swaps, or rely on Ray-native multi-model serving. Tool calling already covers the majority of LangChain/LCEL agent use-cases; you just add a prompt template that embeds your tool schema.
- Switch to TensorRT-LLM if you want:
- Max throughput / lowest latency on fixed models;
- Multiple tool calls per turn orchestrated inside the server;
- Guaranteed compliance with strict OpenAI-SDK validators out-of-the-box.
Both stacks expose the same `/v1/chat/completions` shape once configured, so client code remains unchanged.
Bottom line
Tool/function calling is primarily an LLM behavioural feature, but it only works reliably when the inference engine cooperates by constraining decoding and parsing calls. vLLM and TensorRT-LLM both offer that support; TensorRT-LLM is faster and covers multi-call, while vLLM is lighter to set up and easier to tweak. Choose the engine that best matches your latency budget and ops skill-set; your LangChain (or any OpenAI-compatible) client code will keep working either way.
Recommendations: Tabular & Narrative
Scenario | Best GPU / Stack |
---|---|
Entry-level, 7B LLM, lowest cost | T4, INT8, Ollama |
Modern chatbot, 7–17B, 70B INT4 | A10, vLLM or TensorRT |
70B+ FP16 or >100 TPS | A100, TensorRT-LLM |
High concurrency, multi-tool agents | A10/A100, TensorRT-LLM |
Fine-tuning, frequent swaps | A10, vLLM |
Narrative summary:
- T4 is a budget choice, great for development or light inference, but not for large models or high throughput.
- A10 balances speed, memory, and price—making it the most pragmatic choice for most LLM deployments in 2024.
- A100 is overkill unless you’re pushing the largest models or extreme concurrency, but can deliver the lowest $/token if fully loaded.
- Quantization (especially INT4) lets you fit much larger models on any card—just don’t forget to check if your inference engine and use-case support it.
OK, that’s it; we are done for now. If you have any questions or suggestions, please feel free to comment. I’ll come up with more topics on Machine Learning and Data Engineering soon. Please also comment and subscribe if you like my work; any suggestions are welcome and appreciated.