Large Language Models (LLMs)

Model Parameters

Definition

Model parameters are the trainable weights in a neural network: the values in attention projection matrices, feed-forward weight matrices, embedding tables, and layer normalization parameters, all updated during training to minimize prediction loss. A '7 billion parameter' model has 7×10^9 such weights. These parameters collectively encode the model's knowledge—world facts, language patterns, reasoning heuristics, and coding skills—in a distributed, subsymbolic form. A larger parameter count generally means more capacity to store knowledge and perform complex reasoning, which is why larger models tend to score higher on benchmarks. Each float16 parameter occupies 2 bytes, so a 7B model needs ~14GB just for its weights.
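The byte arithmetic above is easy to sketch; a minimal illustration (the helper name is ours, not an established API):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory to hold just the weights, in GB (1 GB = 1e9 bytes).

    float16 = 2 bytes/param; int8 = 1; int4 = 0.5.
    """
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9))   # 14.0 -> ~14 GB for a 7B model in float16
```

The same call with `bytes_per_param=0.5` gives the int4 figure (~3.5 GB for 7B).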

Why It Matters

Parameter count is the most commonly cited model size metric and a primary determinant of quality, cost, and hardware requirements. A 7B model fits on a consumer GPU (8-16GB VRAM in int8); a 70B model requires a high-end server GPU (80GB+ VRAM in float16); a 405B model requires multiple GPUs. For 99helpers teams evaluating models, parameter count helps quickly filter options: 7B models are fast and cheap but have limited reasoning; 13B models offer a quality bump with manageable hardware; 70B models provide near-frontier quality for self-hosting; 400B+ models are typically accessed via API only. Quality doesn't scale uniformly with parameters—training data quality and fine-tuning matter equally.

How It Works

Parameter scaling in transformers: N_params ≈ 12 × d_model² × n_layers. This counts the transformer blocks only—the dominant term for large models—and neglects embeddings and normalization parameters. A 7B model might have d_model=4096 and n_layers=32: 12 × 4096² × 32 ≈ 6.4B parameters. Each layer has attention weights (4 matrices of d_model × d_model), feed-forward weights (2 matrices of d_model × 4·d_model, contributing the other 8 × d_model²), and a small number of normalization parameters. The embedding table (vocab_size × d_model) adds roughly 0.1–0.5B parameters depending on vocabulary size. Inference memory = model weights + activations + KV cache. Float16 weights: 2 bytes × N_params ≈ 14GB for 7B, ~140GB for 70B. INT4 quantization: ~0.5 bytes × N_params ≈ 3.5GB for 7B.
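The scaling rule plus an embedding term can be coded directly. A sketch under stated assumptions: the 128,000-token default vocabulary is illustrative (roughly Llama-3-scale), and small terms like biases and normalization weights are ignored.

```python
def transformer_params(d_model: int, n_layers: int, vocab_size: int = 128_000) -> int:
    """Rough parameter count: 12*d^2 per layer (4 attention matrices of
    d x d plus 2 FFN matrices of d x 4d) plus the embedding table."""
    blocks = 12 * d_model * d_model * n_layers
    embedding = vocab_size * d_model
    return blocks + embedding

n = transformer_params(4096, 32)
print(f"{n / 1e9:.2f}B")   # 6.97B -> in the right ballpark for a "7B" model
```

Real architectures deviate from this (grouped-query attention, different FFN expansion ratios), so treat the result as an order-of-magnitude estimate.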

Parameter Scale Comparison

  Model   Parameters     VRAM (float16)   Cost
  7B      7 billion      ~14 GB           $
  13B     13 billion     ~26 GB           $$
  70B     70 billion     ~140 GB          $$$
  405B+   405+ billion   ~800 GB+         $$$$

Capability (benchmark score): 7B ≈ 55%, 13B ≈ 68%, 70B ≈ 88%, 405B+ ≈ 98%

Speed (relative tokens/sec, higher is faster): 7B ≈ 95%, 13B ≈ 80%, 70B ≈ 45%, 405B+ ≈ 15%
More parameters = higher capability but more memory, slower inference, and higher cost
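The VRAM figures in the comparison follow from the same bytes-per-parameter arithmetic. A rough sketch of weight memory only, ignoring KV cache and activation overhead:

```python
BYTES_PER_PARAM = {"float16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_vram_gb(billions: float, dtype: str) -> float:
    """GB needed for the weights alone at a given precision."""
    return billions * BYTES_PER_PARAM[dtype]

for size in (7, 13, 70, 405):
    row = ", ".join(f"{d}: {weights_vram_gb(size, d):g} GB" for d in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
# e.g. 7B -> float16: 14 GB, int8: 7 GB, int4: 3.5 GB
```

As the Common Mistakes below note, actual GPU memory needs can be far higher once the KV cache and activations are included.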

Real-World Example

A 99helpers team evaluates three Llama-3 variants: 8B, 70B, and 405B. The 8B model (16GB in float16) runs on their 2x A10G servers at 80 tokens/second, costs $0.0001/query, and achieves 71% on their support benchmark. The 70B model (140GB in float16) requires their 4x A100 server at 25 tokens/second, costs $0.0008/query, and achieves 84%. The 405B model is accessed via API at $0.003/query and achieves 89%. They deploy 8B for simple FAQ queries (70% of volume), 70B for technical queries (25%), and the 405B API for complex edge cases (5%)—optimizing the quality/cost tradeoff across query tiers.
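A tiered deployment like this is often implemented as a simple router. The sketch below is hypothetical: the difficulty score, thresholds, and model names are illustrative assumptions, not a 99helpers API.

```python
def route(difficulty: float) -> str:
    """Send each query to the cheapest tier expected to handle it.

    `difficulty` is assumed to come from an upstream classifier in [0, 1];
    thresholds are tuned so tiers roughly match the volume split above.
    """
    if difficulty < 0.70:        # simple FAQs (~70% of volume)
        return "llama-3-8b"
    if difficulty < 0.95:        # technical queries (~25%)
        return "llama-3-70b"
    return "llama-3-405b-api"    # complex edge cases (~5%)

print(route(0.2), route(0.8), route(0.99))
```

In practice the classifier itself can be a small model, so routing overhead stays negligible next to generation cost.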

Common Mistakes

  • Using parameter count as the sole quality proxy—training data quality, fine-tuning methodology, and architecture details all significantly affect quality independently of parameter count.
  • Confusing active parameters (in MoE models) with total parameters—Mixtral-8x7B has 47B total parameters but activates only ~13B per token; comparing active parameters is more relevant for inference cost.
  • Underestimating total memory requirements—model weights alone don't determine hardware requirements; KV cache, activations, and overhead can double total GPU memory usage.
