Model Parameters
Definition
Model parameters are the trainable weights in a neural network: the values in attention projection matrices, feed-forward network weight matrices, embedding tables, and layer normalization parameters that are updated during training to minimize the prediction loss. A '7 billion parameter' model has 7×10^9 such weights. These parameters collectively encode the model's knowledge—world facts, language patterns, reasoning heuristics, and coding skills—in a distributed, subsymbolic form. More parameters generally mean more capacity to store knowledge and perform complex reasoning, which is why larger models tend to score better on benchmarks. Each float16 parameter occupies 2 bytes, so a 7B model needs ~14GB just for its weights.
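The weight-memory arithmetic above can be sketched in a few lines. This is a rough estimate only; real deployments need additional memory for activations and the KV cache:

```python
# Rough weight-only memory estimate at different numeric precisions.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

n = 7e9  # a "7B" model
print(weight_memory_gb(n, 2.0))   # float16: 14.0 GB
print(weight_memory_gb(n, 1.0))   # int8:     7.0 GB
print(weight_memory_gb(n, 0.5))   # int4:     3.5 GB
```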
Why It Matters
Parameter count is the most commonly cited model size metric and a primary determinant of quality, cost, and hardware requirements. A 7B model fits on a consumer GPU (8-16GB VRAM in int8); a 70B model requires a high-end server GPU (80GB+ VRAM in float16); a 405B model requires multiple GPUs. For 99helpers teams evaluating models, parameter count helps quickly filter options: 7B models are fast and cheap but have limited reasoning; 13B models offer a quality bump with manageable hardware; 70B models provide near-frontier quality for self-hosting; 400B+ models are typically accessed via API only. Quality doesn't scale uniformly with parameters—training data quality and fine-tuning matter equally.
How It Works
Parameter scaling in transformers: N_params ≈ 12 × d_model² × n_layers (the dominant term for large models; it neglects embeddings, biases, and normalization parameters). A 7B model might have d_model=4096 and n_layers=32: 12 × 4096² × 32 ≈ 6.4B parameters. Each layer has attention weights (4 matrices of d_model × d_model), feed-forward weights (2 matrices of d_model × 4d_model), and normalization parameters. The embedding table (vocab_size × d_model) adds ~0.5B parameters for typical vocabulary sizes. Memory for inference = model weights + activations + KV cache. Float16 weights: 2 bytes × N_params ≈ 14GB for 7B, ≈ 140GB for 70B. INT4 quantization: ~0.5 bytes × N_params ≈ 3.5GB for 7B.
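The per-layer breakdown above can be turned into a small estimator. The 128,000-token vocabulary is an assumption chosen to match the "~0.5B embedding parameters" figure; biases and normalization parameters are ignored, as in the approximation:

```python
# Sketch of the 12·d_model²·n_layers approximation for a standard dense
# transformer, plus the embedding table. Vocab size is an assumed value.
def approx_params(d_model: int, n_layers: int, vocab_size: int = 128_000) -> int:
    attn = 4 * d_model * d_model       # Q, K, V, and output projections
    ffn = 2 * d_model * (4 * d_model)  # up- and down-projection (hidden = 4d)
    per_layer = attn + ffn             # = 12 * d_model**2
    embed = vocab_size * d_model       # token embedding table
    return n_layers * per_layer + embed

print(approx_params(4096, 32))  # ≈ 7.0e9 for the 7B-class config above
```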
Parameter Scale Comparison
[Chart: capability (benchmark score) and generation speed (tokens/sec, higher = faster) across parameter scales]
Real-World Example
A 99helpers team evaluates three Llama-3 variants: 8B, 70B, and 405B. The 8B model (16GB in float16) runs on their 2x A10G servers at 80 tokens/second, costs $0.0001/query, and achieves 71% on their support benchmark. The 70B model (140GB in float16) requires their 4x A100 server, runs at 25 tokens/second, costs $0.0008/query, and achieves 84% on the benchmark. The 405B model is accessed via API at $0.003/query and achieves 89% on the benchmark. They deploy 8B for simple FAQ queries (70% of volume), 70B for technical queries (25%), and the 405B API for complex edge cases (5%)—optimizing the quality/cost tradeoff across query tiers.
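The blended per-query cost of this tiered routing follows directly from the numbers in the example; the score blend assumes all three models were scored on the same internal benchmark, so it is only indicative:

```python
# Blended cost/quality of the three-tier routing described above.
# (label, traffic share, $/query, benchmark score)
tiers = [
    ("8B self-hosted",  0.70, 0.0001, 71),
    ("70B self-hosted", 0.25, 0.0008, 84),
    ("405B via API",    0.05, 0.0030, 89),
]
blended_cost = sum(share * cost for _, share, cost, _ in tiers)
blended_score = sum(share * score for _, share, _, score in tiers)
print(f"${blended_cost:.5f}/query")  # $0.00042/query
print(f"~{blended_score:.0f}% blended benchmark score")
```

Routing 95% of traffic to self-hosted models keeps the blended cost far below the API-only $0.003/query while giving up only a few benchmark points on the bulk of queries.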
Common Mistakes
- ✕Using parameter count as the sole quality proxy—training data quality, fine-tuning methodology, and architecture details all significantly affect quality independently of parameter count.
- ✕Confusing active parameters (in MoE models) with total parameters—Mixtral-8x7B has 47B total parameters but activates only ~13B per token; comparing active parameters is more relevant for inference cost.
- ✕Underestimating total memory requirements—model weights alone don't determine hardware requirements; KV cache, activations, and overhead can double total GPU memory usage.
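The third mistake is easy to quantify. The sketch below estimates KV-cache memory for a hypothetical 7B-class configuration (the layer count, head count, head dimension, context length, and batch size are illustrative assumptions, not a specific model's values):

```python
# KV-cache memory: 2 tensors (K and V) per layer, cached for every token
# in the context, for every sequence in the batch.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# 32 layers, 32 KV heads of dim 128, 4096-token context, batch of 8, float16
print(kv_cache_gb(32, 32, 128, 4096, 8))  # ~17.2 GB on top of ~14 GB of weights
```

At this (assumed) batch size the cache alone exceeds the weights, which is why grouped-query attention and quantized caches are common in production serving.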
Related Terms
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Scaling Laws
Scaling laws describe predictable mathematical relationships between LLM performance and scale—model size, training data, and compute—enabling researchers to forecast model capability improvements before building larger models.
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
Mixture of Experts (MoE)
Mixture of Experts is an LLM architecture where a router dynamically activates only a subset of specialized 'expert' neural network layers for each token, achieving high model capacity while keeping per-token compute cost manageable.
Foundation Model
A foundation model is a large AI model trained on broad, diverse data that can be adapted to a wide range of downstream tasks through fine-tuning or prompting, serving as a base for many applications.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →