Llama 4 Scout 17b 16e Instruct

Provider: Meta. Status: Generally Available.

Meta's Llama 4 Scout is a natively multimodal model with 17 billion active parameters and 16 experts. It uses a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.

Context: 131K tokens
Input: $0.27 per MTok
Output: $0.85 per MTok
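
As a rough illustration of how the per-MTok prices above translate into a per-request cost, the sketch below multiplies token counts by the listed rates (MTok meaning one million tokens). The helper name and the example token counts are hypothetical; only the two prices come from this page, and actual billing may be calculated differently.

```python
# Listed prices from this page, in US dollars per million tokens (MTok).
INPUT_USD_PER_MTOK = 0.27
OUTPUT_USD_PER_MTOK = 0.85


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    )


if __name__ == "__main__":
    # Example: a 10,000-token prompt with a 500-token completion.
    print(f"${estimate_cost_usd(10_000, 500):.6f}")
```

Because output tokens cost roughly three times as much as input tokens, capping `max_tokens` is usually the more effective way to bound the cost of a request.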

Advanced Capabilities

Structured Outputs: JSON schema-constrained generation
Multi-turn Tool Calling: Chained tool calls in one session
Agentic Workload Ready: Tool use + structured output combined
Vision Input: Accepts image inputs
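
To make the "tool use + structured output combined" capability more concrete, here is a minimal sketch of a request body that offers one tool through `tools` and constrains the final answer through `response_format`. The tool (`get_weather`), the answer schema, and the helper name are hypothetical. The OpenAI-style tool layout and the `{"type": "json_schema", "json_schema": ...}` shape are assumptions that should be checked against the published API schema.

```python
import json

# Hypothetical tool definition in the OpenAI-style function format
# (assumed to be accepted by the `tools` parameter).
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Hypothetical JSON schema that the final answer should satisfy.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "summary": {"type": "string"},
    },
    "required": ["city", "summary"],
}


def build_agent_request(question: str) -> dict:
    """Build a request body that offers one tool and constrains the output shape."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "tools": [WEATHER_TOOL],
        # Assumed shape for schema-constrained output.
        "response_format": {"type": "json_schema", "json_schema": ANSWER_SCHEMA},
    }


if __name__ == "__main__":
    print(json.dumps(build_agent_request("What is the weather in Lisbon?"), indent=2))
```

The resulting dictionary can be sent as the JSON body of the same endpoint used in the code examples on this page.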

Code Examples

# curl
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/meta/llama-4-scout-17b-16e-instruct \
  -H "Authorization: Bearer $CLOUDFLARE_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain quantum entanglement in one sentence." }
    ]
  }'

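# Python (requests)
import os

import requests

# Credentials are read from the environment rather than hard-coded.
ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
TOKEN = os.environ["CLOUDFLARE_AUTH_TOKEN"]

response = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-4-scout-17b-16e-instruct",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum entanglement in one sentence."},
        ],
    },
    timeout=60,  # Avoid hanging indefinitely on a stalled connection.
)
response.raise_for_status()  # Surface HTTP errors instead of printing an error body.
print(response.json())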

// TypeScript (Cloudflare Worker); expects a Workers AI binding exposed as env.AI.
interface Env { AI: Ai }

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const messages = [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user",   content: "Explain quantum entanglement in one sentence." },
    ];
    const response = await env.AI.run("@cf/meta/llama-4-scout-17b-16e-instruct", { messages });
    return Response.json(response);
  },
};
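
The examples above wait for the complete response. When `stream` is set to true, the response arrives as server-sent events instead. The sketch below shows one way to collect the text from such a stream in Python. It assumes each event is a line of the form `data: {"response": "..."}` and that the stream ends with `data: [DONE]`; neither detail is confirmed on this page, and the function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from SSE lines (assumed event format, see above)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # Skip blank separator lines and non-data fields.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # Assumed end-of-stream sentinel.
        chunk = json.loads(payload)
        text = chunk.get("response")  # Assumed field name for the text fragment.
        if text:
            yield text


if __name__ == "__main__":
    # Canned lines standing in for a live stream.
    sample = [
        'data: {"response": "Entangled "}',
        "",
        'data: {"response": "particles."}',
        "data: [DONE]",
    ]
    print("".join(iter_stream_text(sample)))
```

With `requests`, one option is to pass `stream=True` to `requests.post`, add `"stream": True` to the JSON body, and decode each line from `response.iter_lines()` as UTF-8 before handing it to the function.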

API Parameters

Temperature: 0 – 5

Name	Type	Default	Description
`messages` required	array	—	An array of message objects representing the conversation history.
`prompt` required	string	—	The input text prompt for the model to generate a response.
`requests` required	array	—	—
`frequency_penalty`	number	—	Decreases the likelihood of the model repeating the same lines verbatim.
`functions` deprecated	array	—	Deprecated. Use `tools` instead.
`guided_json`	object	—	JSON schema that should be fulfilled for the response.
`max_tokens`	integer	256	The maximum number of tokens to generate in the response.
`presence_penalty`	number	—	Increases the likelihood of the model introducing new topics.
`raw`	boolean	false	If true, a chat template is not applied and you must adhere to the specific model's expected formatting.
`repetition_penalty`	number	—	Penalty for repeated tokens; higher values discourage repetition.
`response_format`	object	—	Constrain output to a JSON schema or an enum (structured outputs).
`seed`	integer	—	Random seed for reproducibility of the generation.
`stream`	boolean	false	If true, the response is streamed back incrementally as server-sent events (SSE).
`temperature`	number	0.15	Controls the randomness of the output; higher values produce more random results.
`tools`	array	—	A list of tools available for the assistant to use.
`top_k`	integer	—	Limits the AI to choose from the top 'k' most probable words. Lower values make responses more focused; higher values introduce more variety and potential surprises.
`top_p`	number	—	Adjusts the creativity of the AI's responses by controlling how many possible words it considers. Lower values make outputs more predictable; higher values allow for more varied and creative responses.

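Sourced from the model's published API schema. The parameters marked required belong to alternative request forms: the code examples on this page send only `messages`, so `messages`, `prompt`, and `requests` are not all required in the same request.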
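
As a small illustration of the table, the sketch below assembles a request body and checks `temperature` against the 0 to 5 range before anything is sent. The helper name is hypothetical, the defaults are copied from the table, and treating the range as inclusive at both ends is an assumption.

```python
from typing import Optional

TEMPERATURE_RANGE = (0.0, 5.0)  # "Temperature: 0 – 5" on this page; inclusive bounds assumed
DEFAULT_TEMPERATURE = 0.15      # default listed in the table
DEFAULT_MAX_TOKENS = 256        # default listed in the table


def build_payload(
    messages: list,
    temperature: float = DEFAULT_TEMPERATURE,
    max_tokens: int = DEFAULT_MAX_TOKENS,
    seed: Optional[int] = None,
    top_p: Optional[float] = None,
) -> dict:
    """Assemble a request body, rejecting values outside the documented range."""
    low, high = TEMPERATURE_RANGE
    if not low <= temperature <= high:
        raise ValueError(f"temperature must be between {low} and {high}")
    if max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    payload = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    # Optional parameters are only included when set, so server defaults apply otherwise.
    if seed is not None:
        payload["seed"] = seed
    if top_p is not None:
        payload["top_p"] = top_p
    return payload
```

For example, `build_payload([{"role": "user", "content": "Hello"}], seed=7)` keeps the table defaults for `temperature` and `max_tokens` and adds a fixed `seed` for reproducibility.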