2025-04-05 20:20:28

Don't ₿elieve the Hype 🦊 on Nostr

Llama 4 is out. 👀
Llama 4 Maverick (400B) and Scout (109B) - natively multimodal, multilingual, and scaled to a 10 MILLION token context! BEATS DeepSeek v3 🔥

Llama 4 Maverick:

> 17B active parameters, 128 experts, 400B total parameters
> Beats GPT-4o & Gemini 2.0 Flash, competitive with DeepSeek v3 at half the active parameters
> 1417 ELO on LMArena (chat performance)
> Optimized for image understanding, reasoning, and multilingual tasks

Llama 4 Scout:

> 17B active parameters, 16 experts, 109B total parameters
> Best-in-class multimodal model for its size, fits on a single H100 GPU with Int4 quantization (rough memory math below)
> 10M token context window
> Outperforms Gemma 3, Gemini 2.0 Flash-Lite, Mistral 3.1 on benchmarks
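
The single-H100 claim is easy to sanity-check with back-of-the-envelope arithmetic (my own numbers, not from the post): at 4 bits per weight, 109B parameters take roughly 51 GiB, leaving room on an 80 GB H100 for the KV cache and activations.

```python
# Rough weight-memory estimate for a 109B-parameter model.
# Ignores KV cache, activations, and framework overhead.
params = 109e9

for fmt, bytes_per_param in {"bf16": 2.0, "int8": 1.0, "int4": 0.5}.items():
    gib = params * bytes_per_param / 2**30
    print(f"{fmt}: ~{gib:.0f} GiB of weights")

# bf16: ~203 GiB -> needs multiple GPUs
# int8: ~102 GiB -> still over a single 80 GB H100
# int4: ~51 GiB  -> fits on one 80 GB H100 with headroom
```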

Architecture & Innovations

Mixture-of-Experts (MoE):
> First natively multimodal Llama models with MoE
> Llama 4 Maverick: 128 experts, shared expert + routed experts for better efficiency
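
To make the shared-plus-routed-experts idea concrete, here is a minimal toy sketch (my own illustration, not Meta's code): every token always passes through the shared expert, while a router sends it to just one routed expert, which is why per-token compute tracks the ~17B active parameters rather than the ~400B total.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy MoE block: an always-on shared expert plus top-1 routing
    over a small pool of routed experts (sizes are illustrative)."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                                 nn.Linear(d_ff, d_model))
        self.shared = ffn()
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                          # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)      # routing probabilities per token
        top = gate.argmax(dim=-1)                  # top-1 routed expert per token
        routed = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():                         # only the chosen expert runs on its tokens
                routed[mask] = gate[mask, i:i + 1] * expert(x[mask])
        return self.shared(x) + routed             # shared expert sees every token

layer = MoELayer()
print(layer(torch.randn(10, 64)).shape)            # torch.Size([10, 64])
```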

Native Multimodality & Early Fusion:
> Jointly pre-trained on text, images, video (30T+ tokens, 2x Llama 3)
> MetaCLIP-based vision encoder, optimized for LLM integration
> Supports multi-image inputs (pre-trained with up to 48 images, tested with up to 8)
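
"Early fusion" here means image (and video-frame) features are projected into the same embedding space as text tokens and concatenated into one sequence before the transformer backbone, rather than bolted on through a separate cross-attention stage. A toy sketch of that input path (modules and dimensions are made up for illustration):

```python
import torch
import torch.nn as nn

d_model = 64
text_embed = nn.Embedding(32_000, d_model)     # stand-in text vocabulary
vision_proj = nn.Linear(768, d_model)          # stand-in projection from a vision encoder

text_ids = torch.randint(0, 32_000, (1, 12))   # 12 text tokens
patch_feats = torch.randn(1, 16, 768)          # 16 image-patch features

# Early fusion: both modalities become ordinary tokens in one joint sequence.
fused = torch.cat([vision_proj(patch_feats), text_embed(text_ids)], dim=1)
print(fused.shape)                              # torch.Size([1, 28, 64])
```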

Long Context & iRoPE Architecture:
> 10M token support (Llama 4 Scout)
> Interleaved attention layers (no positional embeddings)
> Temperature-scaled attention for better length generalization
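
Meta hasn't published the exact temperature-scaling rule, so the snippet below is only a plausible sketch of the idea: sharpen the attention softmax as the context grows past a reference length, so scores don't wash out over millions of tokens (`beta` and `ref_len` are invented illustrative parameters).

```python
import math
import torch

def temperature_scaled_attention(q, k, v, seq_len, beta=0.1, ref_len=8192):
    """Plain scaled-dot-product attention with an extra length-dependent
    temperature: logits are sharpened once seq_len exceeds ref_len."""
    temp = 1.0 + beta * max(0.0, math.log(seq_len / ref_len))
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    return torch.softmax(temp * scores, dim=-1) @ v

q = k = v = torch.randn(1, 4, 16, 64)            # (batch, heads, tokens, head_dim)
print(temperature_scaled_attention(q, k, v, seq_len=16).shape)  # torch.Size([1, 4, 16, 64])
```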

Training Efficiency:
> FP8 precision (390 TFLOPs/GPU on 32K GPUs for Behemoth; aggregate math below)
> MetaP technique: Auto-tuning hyperparameters (learning rates, initialization)
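
For scale, the 390 TFLOPs/GPU on 32K GPUs figure works out to roughly 12.5 EFLOP/s of sustained FP8 throughput across the cluster (my arithmetic, not a number from the post):

```python
# Aggregate FP8 throughput implied by the per-GPU figure quoted above.
per_gpu = 390e12                  # 390 TFLOP/s per GPU
n_gpus = 32_000
total = per_gpu * n_gpus
print(f"{total:.3e} FLOP/s (~{total / 1e18:.1f} EFLOP/s across the cluster)")
# 1.248e+19 FLOP/s (~12.5 EFLOP/s across the cluster)
```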

Revamped Pipeline:
> Lightweight Supervised Fine-Tuning (SFT) → Online RL → Lightweight DPO
> Hard-prompt filtering (50%+ easy data removed) for better reasoning/coding; a toy filter is sketched below
> Continuous Online RL: Adaptive filtering for medium/hard prompts
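
A toy illustration of what hard-prompt filtering looks like (the difficulty heuristic is a dummy stand-in; the post doesn't say how Meta actually scores prompt difficulty):

```python
# Toy hard-prompt filter: score each prompt and drop the easy ones
# so fine-tuning/RL data skews toward reasoning- and coding-heavy prompts.

def difficulty(prompt: str) -> float:
    """Dummy heuristic: longer, code/math-flavored prompts count as harder."""
    score = min(len(prompt) / 200, 1.0)
    if any(k in prompt.lower() for k in ("prove", "implement", "debug")):
        score += 0.5
    return score

prompts = [
    "What is the capital of France?",
    "Implement a lock-free queue in C++ and prove it is linearizable.",
    "Say hi.",
    "Debug this segfault in a CUDA kernel that indexes shared memory.",
]

hard = [p for p in prompts if difficulty(p) >= 0.5]   # keep medium/hard prompts
print(hard)   # only the two coding/reasoning prompts survive the filter
```
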
Author Public Key: npub1nxa4tywfz9nqp7z9zp7nr7d4nchhclsf58lcqt5y782rmf2hefjquaa6q8