AI Metrics Decoded: From Parameters to TOPS

AI Metrics Decoded: The Numbers That Actually Matter in Production

Why You Need to Know This (Before Your First Production Incident)

Picture this: your team picks a 70B parameter model for a new feature. It runs great on your MacBook. You push to production. The GPU bill arrives. Your manager is not happy.

Or this: your AI API costs explode halfway through the month and nobody knows why.

These are not horror stories. They happen to real engineers, usually the ones who skipped learning the core units of measurement behind AI systems.

As a junior engineer, you're going to face questions like:

- "Can our GPU handle this model?"
- "Why is the response so slow?"
- "How many tokens are we burning per user per day?"
- "Should we use a 7B or 70B model for this use case?"

Understanding the eight core metrics below gives you the language, and the instincts, to answer confidently. Let's break them down.

Category 1: Model Size (Parameters & Tokens)

Parameters

What it is: The learned weights inside a neural network. Think of them as the "memory" of the model: numbers that get adjusted during training to capture patterns in data.

The unit: Just a raw count. We usually express it in:

- M = millions (e.g., BERT-base = 110M)
- B = billions (e.g., LLaMA 3 8B)
- T = trillions (e.g., GPT-4, estimated at ~1.8T)

Why it matters to you:

| Parameter Count | Approx. VRAM for Weights (fp16) | Typical Use Case |
|---|---|---|
| 1B–3B | ~2–6 GB | Mobile / edge apps |
| 7B–8B | ~16 GB | Single consumer GPU |
| 13B–14B | ~28 GB | Single pro GPU (A100 40GB) |
| 70B | ~140 GB | Multi-GPU setup |
| 405B+ | ~800 GB+ | Cluster of H100s |

Rule of thumb: 1 billion parameters ≈ 2 GB of VRAM in half precision (fp16), because each weight takes 2 bytes. Double it for full precision (fp32). That figure covers the weights only; leave headroom for activations and the KV cache.

More parameters usually means a more capable model, and it always means more memory and more compute per token.

Tokens

What it is: The unit of text that a model reads and generates. Not words, but fragments of words.

Quick visual (an illustrative split; the exact pieces depend on the tokenizer):

    Input text:  "Learning AI is fun!"
        ↓ Tokenizer
    Tokens:      ["Learn"] ["ing"] [" AI"] [" is"] [" fun"] ["!"]
    Token count: 6 tokens

Why it matters to you:

- API cost is billed per token, and input and output tokens are usually priced separately.
- The context window is measured in tokens: the model can only "see" so much at once.
- Speed (TPS, covered below) is measured in tokens per second.

    # Quick check: how many tokens is your prompt?
    # Uses tiktoken, OpenAI's tokenizer library (pip install tiktoken).
    # Other model families ship their own tokenizers, so counts will differ.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Learning AI is fun!"
    tokens = enc.encode(text)  # list of integer token IDs

    print(f"Token count: {len(tokens)}")  # a handful of tokens for this short sentence
    print(f"Tokens: {tokens}")            # the IDs themselves depend on the encoding

Quick cheat sheet:

- 1 token ≈ 0.75 English words
- 1,000 tokens ≈ 750 words ≈ 1.5 pages
- Non-English text (Hindi, Mandarin, Arabic) often uses 30–70% more tokens for the same content

Category 2: Hardware Power (FLOPS vs. TOPS)

This is where a lot of junior engineers get confused. FLOPS and TOPS sound similar. They are not the same thing.

FLOPS (Floating Point Operations Per Second)

What it is: A measure of raw compute power for floating point arithmetic, the kind of math needed for training and running neural networks.

The scale:

| Unit | Value | Context |
|---|---|---|
| GFLOPS | 10⁹ FLOPS | Your laptop GPU |
| TFLOPS | 10¹² FLOPS | Cloud GPUs (A100: ~312 TFLOPS at FP16) |
| PFLOPS | 10¹⁵ FLOPS | Entire GPU clusters |

Used for: Server-scale training and inference. When someone says "the H100 delivers 989 TFLOPS of FP16 performance", this is what they mean. These are peak numbers from vendor spec sheets; sustained throughput on real workloads is lower.

Common GPUs you'll actually use:

| GPU | FP16 TFLOPS (peak) | Best For |
|---|---|---|
| RTX 4090 | ~165 | Local dev / fine-tuning |
| A100 40GB | ~312 | Production inference |
| H100 SXM | ~989 | Large-scale training |

TOPS (Tera Operations Per Second)

What it is: Similar idea, but counted in trillions of integer or mixed-precision operations per second, and quoted mostly for edge hardware and NPUs (Neural Processing Units).
The key difference:

    FLOPS → Floating point math  → GPUs / server chips → Training & inference at scale
    TOPS  → Integer / INT8 math  → NPUs / edge chips   → On-device inference

Real-world examples:

| Device | TOPS | Use Case |
|---|---|---|
| Apple M4 Neural Engine | ~38 TOPS | On-device ML on MacBook |
| Qualcomm Snapdragon X Elite | ~45 TOPS | AI PCs / laptops |
| NVIDIA Jetson Orin | ~275 TOPS | Edge AI / robotics |
| Google TPU v5e | ~393 TOPS | Cloud inference at scale |

Vendors quote TOPS at different precisions, and sometimes with sparsity enabled, so treat cross-vendor comparisons as rough.

When do you care about TOPS? When you're deploying a model to a phone, a laptop, or an embedded device, not a data centre. If you're picking a chip for on-device inference, TOPS is your number.

Category 3: Training Cost (Cumulative FLOPs)

Yes, confusingly, FLOPs (lowercase "s", a plain plural with no "per second") is a different metric from FLOPS.

What it is: The total number of floating point operations performed during an entire training run. It's a measure of compute budget, not hardware speed.

The unit: Usually expressed as:

- A raw count in scientific notation (e.g., 10²³ FLOPs)
- PetaFLOP/s-days: one petaFLOP/s (10¹⁵ operations per second) sustained for a full day, about 8.64 × 10¹⁹ operations

Real-world examples:

| Model | Estimated Training FLOPs |
|---|---|
| GPT-3 (175B) | ~3.14 × 10²³ |
| LLaMA 2 70B | ~8 × 10²³ (estimated) |
| Gemini Ultra | ~5 × 10²⁴ (estimated) |

Why it matters to you: Directly as a junior engineer, probably not yet. But understanding it helps you reason about:

- Why training a model from scratch is prohibitively expensive
- Why fine-tuning (starting from a pre-trained model) is so much cheaper
- Why companies like Anthropic and OpenAI have massive infrastructure teams

Quick analogy: FLOPS (the hardware rate) is your car's horsepower. FLOPs (training cost) is the total miles driven on a road trip. One is speed, one is distance.

Category 4: Speed & Latency (TTFT, TPS, TPM)

These three are the metrics you'll track the most in production. They live in your dashboards, your SLAs, and your post-mortems.

TTFT (Time To First Token)

What it is: How long (in milliseconds) it takes from sending your request to receiving the first token of the response.
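To make the TTFT definition concrete, here is a minimal timing sketch. It assumes your client exposes the response as an iterator of tokens; `fake_token_stream` is a hypothetical stand-in for that stream (not a real API), and `measure_stream` is an illustrative helper name.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(n_tokens: int = 5, first_delay: float = 0.05,
                      gap: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response (not a real API)."""
    time.sleep(first_delay)      # simulated prompt processing before the first token
    for i in range(n_tokens):
        if i > 0:
            time.sleep(gap)      # simulated per-token generation time
        yield f"tok{i}"


def measure_stream(stream: Iterable[str]) -> Tuple[float, int, float]:
    """Return (ttft_ms, token_count, total_seconds) for one streamed response."""
    start = time.perf_counter()  # the clock starts when the request is sent
    ttft_ms = float("nan")       # stays NaN if the stream yields nothing
    count = 0
    for _ in stream:
        if count == 0:
            # The first token just arrived: this elapsed time is the TTFT.
            ttft_ms = (time.perf_counter() - start) * 1000.0
        count += 1
    total_seconds = time.perf_counter() - start
    return ttft_ms, count, total_seconds


if __name__ == "__main__":
    ttft_ms, count, total = measure_stream(fake_token_stream())
    print(f"TTFT: {ttft_ms:.0f} ms, {count} tokens in {total:.2f} s")
```

In a real app you would pass your client's streaming iterator instead of the fake one and log the three values per request. Dividing the token count by the time spent after the first token gives the generation rate.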
Why it matters: This is what determines whether your app feels fast. Even if the full response takes 10 seconds, a 200ms TTFT makes the experience feel responsive. It's the AI equivalent of "First Contentful Paint" in web dev.

    User sends prompt
          ↓
    [ ... processing ... ]   ← this duration is TTFT
          ↓
    First token arrives → streaming begins → user sees output

Reasonable TTFT targets:

| Scenario | Target TTFT |
|---|---|
| Real-time chat | < 300ms |
| Interactive coding assistant | < 500ms |
| Background document processing | < 2,000ms (acceptable) |

TPS (Tokens Per Second)

What it is: How many tokens the model generates per second during the response. Also called generation speed or throughput.

Why it matters: TPS determines whether your streaming response feels smooth or painfully slow. A human reads comfortably at roughly 3–5 tokens per second. Models generating at < 10 TPS feel sluggish. Modern API servers target 50–150+ TPS for good UX.

What affects TPS:

- Model size (bigger = slower per request)
- Hardware (H100 >> A100 >> consumer GPU)
- Batch size (serving multiple requests at once raises total throughput but can lower per-request TPS)
- Quantization (INT4/INT8 models run faster, with a small accuracy tradeoff)

TPM (Tokens Per Minute)

What it is: Your rate limit from the API provider. The maximum number of tokens your account can process per minute.

Why it matters: Hit your TPM limit and your requests start getting throttled or rejected with 429 Too Many Requests. This is a very common production issue for junior engineers on their first real deployment.
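One common way to survive an occasional 429 is to retry with exponential backoff. The sketch below is a minimal illustration, not any provider's official client: `RateLimitError` is a hypothetical exception standing in for however your client reports a 429, and `call_with_backoff` is an illustrative helper name.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""


def call_with_backoff(call: Callable[[], T], max_retries: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(); on RateLimitError wait 1s, 2s, 4s, ... (capped) and retry."""
    attempt = 0
    while True:
        try:
            return call()
        except RateLimitError:
            if attempt >= max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1
```

Usage would look like `call_with_backoff(lambda: call_llm_api(prompt))`. If your provider sends a `Retry-After` header, honouring it is usually better than guessing a delay. Backoff handles the occasional spike; for large batch jobs you still want to pace requests up front.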
    # A common mistake: not accounting for TPM in batch jobs.
    # load_10000_prompts, call_llm_api and process are placeholders for your own code.
    prompts = load_10000_prompts()  # each ~500 tokens

    for prompt in prompts:
        response = call_llm_api(prompt)  # you'll hit the TPM limit fast
        process(response)

    # Better approach: add rate limiting
    import time

    TPM_LIMIT = 40_000  # tokens per minute (check your plan)
    tokens_this_minute = 0
    minute_start = time.time()

    for prompt in prompts:
        # Rough estimate of input tokens only; budget extra for the response.
        estimated_tokens = len(prompt.split()) * 1.3

        # Start a fresh window once a full minute has passed.
        if time.time() - minute_start >= 60:
            tokens_this_minute = 0
            minute_start = time.time()

        # If this request would exceed the budget, wait out the rest of the window.
        if tokens_this_minute + estimated_tokens > TPM_LIMIT:
            time.sleep(max(0.0, 60 - (time.time() - minute_start)))
            tokens_this_minute = 0
            minute_start = time.time()

        response = call_llm_api(prompt)
        tokens_this_minute += estimated_tokens
        process(response)

Senior Engineer's Note: How It All Connects

Let me show you a real decision you'll face: "Should we use an 8B or 70B model?"

Here's how the metrics interact:

                        8B Model            70B Model
    -------------------------------------------------------
    Parameters          8 billion           70 billion
    VRAM Required       ~16 GB              ~140 GB
    GPU Setup           1× A100 40GB        4× A100 40GB
    Est. TPS            ~80–120 TPS         ~15–30 TPS
    TTFT (A100)         ~150ms              ~400ms
    API Cost (est.)     ~$0.15/M tokens     ~$0.90/M tokens
    Quality             Good                Excellent
    -------------------------------------------------------

The real-world math:

Say your app handles 1,000 users/day, each generating ~2,000 tokens per session.

    Daily tokens    = 1,000 users × 2,000 tokens = 2,000,000 tokens (2M)
    8B model cost:    2M tokens × $0.15 per M = $0.30/day ≈ $9/month
    70B model cost:   2M tokens × $0.90 per M = $1.80/day ≈ $54/month

That's a 6× cost difference, and the ratio stays the same as traffic grows. For a startup, that matters.

The senior engineer's question isn't "which model is better?" It's "which model is good enough for this use case at this scale?"

Start with the smaller model. Benchmark it against your quality requirements. Scale up only if you have to.
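The comparison above can be turned into a small sizing helper. This is a sketch under the article's own assumptions: 2 bytes per parameter at fp16 (weights only), a 30-day month, and the illustrative per-million-token prices from the table. The function names are hypothetical.

```python
def weight_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone: 1B parameters at 2 bytes (fp16) is about 2 GB."""
    return params_billions * bytes_per_param


def monthly_api_cost(users_per_day: int, tokens_per_user: int,
                     usd_per_million_tokens: float, days: int = 30) -> float:
    """Monthly spend when the price is quoted per million tokens."""
    daily_tokens = users_per_day * tokens_per_user
    return daily_tokens / 1_000_000 * usd_per_million_tokens * days


if __name__ == "__main__":
    # Prices are the illustrative estimates from the comparison above, not real quotes.
    for name, size_b, price in [("8B", 8, 0.15), ("70B", 70, 0.90)]:
        vram = weight_vram_gb(size_b)
        cost = monthly_api_cost(1_000, 2_000, price)
        print(f"{name}: ~{vram:.0f} GB of weights, ~${cost:.2f}/month")
```

Keeping the price in dollars per million tokens, the way providers quote it, avoids the easy mistake of mixing up per-token, per-thousand and per-million rates.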
Quick Reference Cheat Sheet

| Metric | Full Name | Measures | Typical Unit |
|---|---|---|---|
| Parameters | n/a | Model size / capacity | M, B, T |
| Tokens | n/a | Text unit for I/O and cost | count |
| FLOPS | Floating Point Ops/sec | Hardware speed (server) | TFLOPS |
| TOPS | Tera Operations/sec | Hardware speed (edge/NPU) | TOPS |
| FLOPs | Floating Point Ops (total) | Training compute cost | FLOPs, petaFLOP/s-days |
| TTFT | Time To First Token | Latency / responsiveness | milliseconds |
| TPS | Tokens Per Second | Generation speed | tokens/sec |
| TPM | Tokens Per Minute | API rate limit | tokens/min |

Where to Go Next

You now have the vocabulary. Here's how to build on it:

- Experiment with tokenizers: platform.openai.com/tokenizer
- Benchmark models on your hardware: try llama.cpp or Ollama locally
- Track TTFT and TPS in your own apps: add timing logs around your API calls from day one
- Read model cards: major model releases usually publish parameter counts and benchmark scores, and often training compute. They're not marketing fluff; they're specs.

The engineers who understand these numbers don't just write code. They make better architectural decisions, avoid expensive surprises, and earn trust faster. That's the real reason to care.

Got questions? Drop them in the comments.