Open-Source and Open-Weight Models

Open-weight models (Llama, Mistral, DeepSeek, Qwen, Gemma, Phi) are now credible production choices for most tasks. This note covers model selection, licensing, hardware requirements, and the specific strengths of each family.

These are the models you can download, self-host, and fine-tune. The open ecosystem has caught up to frontier proprietary models on many benchmarks, making open weights a credible production choice for most use cases.


Why Open Models

  • Cost: no per-token API charges. At scale (billions of tokens/month), self-hosting beats API cost by 10-100x.
  • Privacy: data never leaves your infrastructure.
  • Control: fine-tune, quantise, and serve however you want.
  • Latency: local inference eliminates network round trips.
  • No rate limits: burst as hard as your hardware allows.

Tradeoffs: operational overhead (serving infra, updates, monitoring), VRAM costs, no vendor SLA.
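
A back-of-the-envelope break-even sketch makes the cost claim concrete; every number below is an illustrative assumption, not a quoted price:

# All prices are illustrative assumptions, not quotes.
API_PRICE_PER_M = 0.50           # assumed blended API price, $/1M tokens
GPU_COST_PER_HOUR = 2.00         # assumed on-demand GPU rate, $/hour
TOKENS_PER_SEC = 2_000           # assumed serving throughput on that GPU

tokens_per_hour_m = TOKENS_PER_SEC * 3600 / 1e6          # millions of tokens/hour
self_host_per_m = GPU_COST_PER_HOUR / tokens_per_hour_m  # $/1M tokens at full load
print(f"self-host ≈ ${self_host_per_m:.3f}/M vs API ${API_PRICE_PER_M:.2f}/M")
# At low volume the GPU bills while idle, which is why the API wins below
# roughly 10M tokens/month (see Key Facts).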


The Llama Family (Meta)

The anchor of the open ecosystem. Meta releases open weights under a custom license (most commercial use allowed; restrictions on apps with 700M+ monthly users).

Llama 3.1

Sizes:       8B, 70B, 405B
Context:     128K tokens
Training:    15T tokens (multilingual)
Instruction: Meta-Llama-3.1-{8,70,405}B-Instruct
License:     Meta Llama 3 Community License (commercial OK)

  • 8B: fits on a single RTX 4090 (24GB) in BF16, or a 16GB GPU in INT4. Best small open model for many tasks.
  • 70B: state of the art at the 70B tier. Matches GPT-3.5 on most benchmarks.
  • 405B: first open model competitive with GPT-4 at launch. Requires multi-GPU (8× A100 for BF16).
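
A quick way to sanity-check these hardware figures: weights-only VRAM is roughly parameter count times bytes per weight, plus headroom for the KV cache and activations. A minimal sketch; the 20% overhead factor is an assumption:

def vram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    # Weights-only estimate; the 20% overhead for KV cache/activations is assumed.
    return params_b * bits / 8 * overhead

for size in (8, 70, 405):
    print(f"{size}B: BF16 ≈ {vram_gb(size, 16):.0f} GB, INT4 ≈ {vram_gb(size, 4):.0f} GB")
# 8B: BF16 ≈ 19 GB (fits a 24GB 4090), INT4 ≈ 5 GB (fits a 16GB GPU)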

from transformers import pipeline

# Quick start (the repo is gated: accept the Llama license on Hugging Face
# and log in with an HF token before downloading)
pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)
messages = [{"role": "user", "content": "Explain attention in one paragraph."}]
result = pipe(messages, max_new_tokens=300)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply

Llama 3.2 Vision

Multimodal Llama. 11B and 90B variants. First competitive open multimodal model.

Llama 3.3 70B

Released late 2024. Improved instruction following, 70B at near-405B quality on reasoning.


Mistral Family

European lab. Core models are released under Apache 2.0 (Codestral is a notable exception; see below). Architecture innovations (sliding window attention, GQA) influenced later models.

Mistral 7B

Params:   7.3B
Context:  32K (sliding window: 4K local attention)
License:  Apache 2.0
VRAM:     14GB BF16, 5GB INT4

Strong baseline. First 7B model to beat Llama 2 13B.
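
A minimal sketch of the sliding-window attention mask: each token attends causally, but only to the most recent `window` positions (4K in the real model; tiny sizes here for illustration):

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: causal AND within the last `window` tokens.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (i - j < window)

print(sliding_window_causal_mask(seq_len=6, window=3).int())
# Information still flows beyond the window as layers stack, which is how
# a 4K window supports a much longer effective context.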

Mixtral 8x7B

Total params:   46.7B
Active params:  12.9B (2 of 8 experts active)
Context:        32K
License:        Apache 2.0
VRAM:           90GB BF16, 24GB INT4

MoE architecture: each FFN layer holds 8 experts, and a router selects 2 of them per token, so only 12.9B of the 46.7B params are active per forward pass. Competitive with GPT-3.5, and significantly faster than a dense 47B model.
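
A minimal sketch of top-2 expert routing; module shapes and names here are illustrative, not Mixtral's actual layout:

import torch
import torch.nn.functional as F

def top2_moe(hidden, router, experts):
    # hidden: (tokens, d_model); router scores every expert for every token
    logits = router(hidden)                # (tokens, num_experts)
    weights, idx = logits.topk(2, dim=-1)  # keep the 2 best experts per token
    weights = F.softmax(weights, dim=-1)   # renormalise the chosen pair
    out = torch.zeros_like(hidden)
    for k in range(2):
        for e, expert in enumerate(experts):
            sel = idx[:, k] == e           # tokens routed to expert e in slot k
            if sel.any():
                out[sel] += weights[sel, k].unsqueeze(-1) * expert(hidden[sel])
    return out

experts = [torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.SiLU(),
                               torch.nn.Linear(256, 64)) for _ in range(8)]
router = torch.nn.Linear(64, 8)
print(top2_moe(torch.randn(10, 64), router, experts).shape)  # (10, 64)

Only the selected experts run, which is why 46.7B total params cost roughly 12.9B params of compute per token.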

# Via Hugging Face (BF16 needs ~90GB of VRAM; quantise to fit smaller cards)
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Codestral

Mistral's code model: 22B params, 32K context. Supports fill-in-the-middle (FIM) for code completion; best open code model at the 22B tier. Note the license: unlike the core Mistral models, Codestral shipped under the Mistral Non-Production License, so check the terms before commercial use.
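
To make FIM concrete, a hedged sketch: the model receives the code before and after the cursor and generates the middle. The control tokens below are placeholders, not Codestral's real template; check the model's tokenizer for the exact format:

def build_fim_prompt(prefix: str, suffix: str) -> str:
    # [PREFIX]/[SUFFIX] are PLACEHOLDER markers; the real FIM template is
    # model-specific (consult the Codestral tokenizer for its control tokens).
    return f"[SUFFIX]{suffix}[PREFIX]{prefix}"

prompt = build_fim_prompt(
    prefix="def mean(xs):\n    return ",
    suffix="\n\nprint(mean([1, 2, 3]))",
)
# The model is expected to generate the middle, e.g. "sum(xs) / len(xs)"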


DeepSeek Family

Chinese lab. Caused significant disruption in early 2025 by achieving frontier reasoning at a fraction of the usual cost.

DeepSeek V3

Total params:   671B MoE
Active params:  37B (token routing)
Context:        128K
License:        MIT
Training cost:  ~$5.6M claimed (vs estimated $100M+ for comparable proprietary models)

Competitive with Claude Sonnet 4.6 and GPT-4o on coding and reasoning.

DeepSeek R1

Architecture:  671B MoE, reasoning model
Training:      GRPO with verifiable rewards (math/code) — no human labels
License:       MIT
Performance:   o1-level on AIME, MATH, SWE-bench
API cost:      96% cheaper than o1

Key insight: high-quality reasoning emerged from GRPO with rule-based rewards. No human preference data required. Changed understanding of what's needed for reasoning models.
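
A minimal sketch of the group-relative advantage at the core of GRPO; the full objective wraps this in a clipped policy-gradient loss with a KL penalty, so this shows only the reward-normalisation step:

import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: one rule-based score per sampled completion of the SAME prompt,
    # e.g. 1.0 if the final answer passes an automatic checker, else 0.0.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# 6 sampled answers to one math problem, graded by a verifier
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # correct answers get positive advantage
# No reward model and no human labels: the checker is the reward.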

DeepSeek R1 Distilled

Smaller models distilled from R1's outputs:

Model                   Base             AIME 2024
R1-Distill-Qwen-7B      Qwen 2.5 7B      55.5%
R1-Distill-Qwen-14B     Qwen 2.5 14B     69.7%
R1-Distill-Qwen-32B     Qwen 2.5 32B     72.6%
R1-Distill-Llama-70B    Llama 3.1 70B    70.0%

7B distill outperforms GPT-4o on math benchmarks.
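
Mechanically, this distillation is plain supervised fine-tuning on the teacher's outputs (DeepSeek reports ~800K curated R1 samples). A hedged sketch using TRL; the trace file and its contents are hypothetical:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# r1_traces.jsonl is a HYPOTHETICAL file of prompts plus teacher reasoning traces
traces = load_dataset("json", data_files="r1_traces.jsonl")["train"]

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B",    # student base model
    train_dataset=traces,       # learn to imitate the teacher's chain of thought
    args=SFTConfig(output_dir="r1-distill-qwen-7b"),
)
trainer.train()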


Qwen Family (Alibaba)

Strong multilingual performance, especially Chinese-English. Most sizes are Apache 2.0 licensed (the 3B and 72B variants ship under the more restrictive Qwen license).

Qwen 2.5

Sizes:    0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
Context:  128K
License:  Apache 2.0 (3B and 72B: Qwen license)
Strong:   Chinese + English, math, code

Qwen 2.5 72B is competitive with Llama 3.1 70B on English benchmarks and significantly better on Chinese tasks.

Qwen 2.5-Coder

Sizes:   1.5B, 3B, 7B, 14B, 32B, 72B
Context: 128K
Best:    Code generation and completion

Competitive with GPT-4o on HumanEval at the 32B tier. Best open code model family as of April 2026.

QwQ-32B

Qwen's reasoning model. Competitive with DeepSeek R1 on math, open weights.


Gemma (Google)

Lighter-weight models with a strong quality-to-size ratio, released under the Gemma Terms of Use rather than a standard open-source license.

Gemma 3

Sizes:    1B, 4B, 12B, 27B
Context:  128K (32K for the 1B size)
License:  Gemma Terms of Use (permissive for most commercial use, but not OSI Apache 2.0)

Gemma 3 4B fits on a Raspberry Pi 5 (INT4). 27B competitive with Llama 3.1 70B on some benchmarks.
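
A hedged sketch of running an INT4 (GGUF) Gemma locally with llama-cpp-python; the GGUF filename is an assumption, so point model_path at whichever quantised file you actually downloaded:

from llama_cpp import Llama

# model_path is an ASSUMED filename; substitute your local quantised GGUF file
llm = Llama(model_path="gemma-3-4b-it-Q4_K_M.gguf", n_ctx=8192)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantisation in two sentences."}]
)
print(out["choices"][0]["message"]["content"])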


Phi (Microsoft)

Focus: maximum capability at minimum size.

Phi-4

Params:   14B
Context:  16K
License:  MIT
Strong:   Reasoning, math, code

Best open model at the 14B tier as of release. Trained primarily on synthetic "textbook quality" data rather than web scrapes.

Phi-3.5

Sizes:  3.8B (mini), 7B
Strong: On-device deployment (iPhone, Pixel)


Choosing an Open Model

Need                          Recommended
Best quality, run locally     Llama 3.1 70B or Qwen 2.5 72B
Reasoning (math/code)         DeepSeek R1 Distill 32B or QwQ-32B
Smallest footprint            Phi-4 14B or Qwen 2.5 7B
MoE efficiency                Mixtral 8x7B or DeepSeek V3
Code generation               Qwen 2.5-Coder 32B or Codestral
Chinese language              Qwen 2.5 72B
Mobile/edge                   Gemma 3 4B or Phi-3.5 mini
Commercially safest license   MIT/Apache 2.0: DeepSeek, Qwen, Mistral, Phi


Running Open Models

See infra/inference-serving for vLLM (production) and llama.cpp (local). See infra/gpu-hardware for GPU requirements. See infra/huggingface for loading with transformers.


Key Facts

  • Llama 3.1 8B: fits on RTX 4090 in BF16; competitive with GPT-3.5 on most benchmarks
  • Llama Community License: commercial OK up to 700M monthly active users
  • Mixtral 8x7B: 46.7B total params, 12.9B active; competitive with GPT-3.5 at lower compute
  • DeepSeek V3 training cost: ~$5.6M claimed (vs estimated $100M+ for comparable proprietary models)
  • DeepSeek R1 API: 96% cheaper than o1; MIT license
  • DeepSeek R1 Distill Qwen-7B: 55.5% AIME 2024 (outperforms GPT-4o on math)
  • Qwen 2.5 72B: competitive with Llama 3.1 70B on English; significantly better on Chinese tasks
  • Below ~10M tokens/month: API is cheaper than self-hosting any open model


Open Questions

  • When does Llama 4 ship and does it maintain Meta's track record of best-in-tier quality?
  • How does QwQ-32B reasoning quality compare to DeepSeek R1 on code tasks specifically?
  • At what model size does the quality gap between open and frontier proprietary models become unacceptable for production use?