Llama 3.2 1B Instruct

Meta

Generally Available

Llama 3.2 1B is a 1-billion-parameter language model focused on efficiently performing natural language tasks, such as summarization, dialogue, and multilingual text analysis. Its smaller size allows it to operate...

Context: 60K tokens

Input: $0.027 per MTok

Output: $0.20 per MTok
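The per-MTok prices above translate into a per-request cost with simple arithmetic. The sketch below assumes MTok means one million tokens and that input and output tokens are billed independently at the listed rates; `estimate_cost_usd` is a hypothetical helper, not part of any SDK.

```python
# Prices copied from the listing above, in USD per million tokens (MTok).
INPUT_USD_PER_MTOK = 0.027
OUTPUT_USD_PER_MTOK = 0.20


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Example: a 2,000-token prompt with a 256-token completion.
print(f"${estimate_cost_usd(2_000, 256):.6f}")
```

Actual billing may round or aggregate usage differently, so treat the result as an estimate.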


About

Modalities

Input: Text

Output: Text

Advanced Capabilities

Structured Outputs: JSON schema-constrained generation
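A request that uses schema-constrained generation might be built as in the sketch below. The shape of `response_format` (`{"type": "json_schema", "json_schema": <schema>}`) is an assumption here and should be checked against the published API schema; `build_structured_request` and `CITY_SCHEMA` are hypothetical names used only for illustration.

```python
import json


def build_structured_request(user_prompt: str, schema: dict) -> dict:
    """Build a request body asking for output that matches `schema`.

    Assumption: `response_format` accepts
    {"type": "json_schema", "json_schema": <schema>}.
    """
    return {
        "messages": [
            {"role": "system", "content": "Reply only with JSON."},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_schema", "json_schema": schema},
    }


# Hypothetical schema: an object with two required string fields.
CITY_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "country": {"type": "string"},
    },
    "required": ["city", "country"],
}

body = build_structured_request("Name the capital of France.", CITY_SCHEMA)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the JSON body of the same endpoint used in the code examples.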

Code Examples

# REST call with curl; expects CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_AUTH_TOKEN in the environment.
curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/llama-3.2-1b-instruct \
  -H "Authorization: Bearer $CLOUDFLARE_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain quantum entanglement in one sentence." }
    ]
  }'

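import os

import requests

# Credentials are read from the environment; a KeyError means one is unset.
ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
TOKEN = os.environ["CLOUDFLARE_AUTH_TOKEN"]

response = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/llama-3.2-1b-instruct",
    headers={"Authorization": f"Bearer {TOKEN}"},
    # `json=` serializes the body and sets the Content-Type header.
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum entanglement in one sentence."},
        ],
    },
    timeout=30,  # avoid hanging indefinitely on a stalled connection
)
response.raise_for_status()  # raise on HTTP errors instead of printing an error body
print(response.json())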

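// Workers AI binding; `Ai` is the binding type from Cloudflare's Workers type definitions.
interface Env { AI: Ai }

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    // Conversation history: a system instruction followed by the user's question.
    const messages = [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user",   content: "Explain quantum entanglement in one sentence." },
    ];
    // Run the model through the binding, which handles authentication.
    const response = await env.AI.run("llama-3.2-1b-instruct", { messages });
    // Return the model's result to the caller as JSON.
    return Response.json(response);
  },
};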

API Parameters

Temperature range: 0 to 5
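The 0 to 5 temperature range stated above can be enforced on the client before a request is sent. This is a minimal sketch: it assumes both bounds are inclusive, and `with_temperature` is a hypothetical helper rather than part of the API.

```python
# Bounds taken from the documented range; inclusivity is an assumption.
TEMPERATURE_MIN = 0.0
TEMPERATURE_MAX = 5.0


def with_temperature(body: dict, temperature: float) -> dict:
    """Return a copy of `body` with a range-checked `temperature` field."""
    if not TEMPERATURE_MIN <= temperature <= TEMPERATURE_MAX:
        raise ValueError(
            f"temperature must be between {TEMPERATURE_MIN} and {TEMPERATURE_MAX}"
        )
    # Copy rather than mutate so the caller's dictionary stays unchanged.
    return {**body, "temperature": temperature}


request_body = with_temperature(
    {"messages": [{"role": "user", "content": "Say hello."}]},
    0.2,
)
print(request_body)
```

Validating locally gives a clearer error than waiting for the service to reject an out-of-range value.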

Name	Type	Default	Description
`messages` required	array	—	An array of message objects representing the conversation history. Supply either `messages` or `prompt`.
`prompt` required	string	—	The input text prompt for the model to generate a response. Supply either `prompt` or `messages`.
`frequency_penalty`	number	—	Decreases the likelihood of the model repeating the same lines verbatim.
`functions` deprecated	array	—	Deprecated; use `tools` instead.
`lora`	string	—	Name of the LoRA (Low-Rank Adaptation) model to fine-tune the base model.
`max_tokens`	integer	256	The maximum number of tokens to generate in the response.
`presence_penalty`	number	—	Increases the likelihood of the model introducing new topics.
`raw`	boolean	false	If true, a chat template is not applied and you must adhere to the specific model's expected formatting.
`repetition_penalty`	number	—	Penalty for repeated tokens; higher values discourage repetition.
`response_format`	object	—	Constrain output to a JSON schema or an enum (structured outputs).
`seed`	integer	—	Random seed for reproducibility of the generation.
`stream`	boolean	false	If true, the response is streamed back incrementally using Server-Sent Events (SSE).
`temperature`	number	0.6	Controls the randomness of the output; higher values produce more random results.
`tools`	array	—	A list of tools available for the assistant to use.
`top_k`	integer	—	Restricts sampling to the 'k' most probable tokens. Lower values make responses more focused; higher values introduce more variety.
`top_p`	number	—	Nucleus sampling: restricts sampling to the smallest set of tokens whose cumulative probability reaches `top_p`. Lower values make outputs more predictable; higher values allow more varied responses.

Sourced from the model's published API schema.
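When `stream` is true, the response arrives as Server-Sent Events, so the client has to reassemble the text from individual events. The sketch below assumes each event is a `data: <json>` line that carries the text fragment in a `response` field, and that the stream ends with `data: [DONE]`; both details are assumptions to verify against real output. `iter_stream_text` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator, Union


def iter_stream_text(lines: Iterable[Union[str, bytes]]) -> Iterator[str]:
    """Yield text fragments from SSE lines.

    Assumptions: events look like `data: {"response": "..."}` and the
    stream is terminated by `data: [DONE]`.
    """
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        yield chunk.get("response", "")


# Offline demonstration with made-up event lines.
sample = [
    'data: {"response": "Hello"}',
    "",
    'data: {"response": ", world"}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

With the `requests` library, one option is to pass `stream=True` to `requests.post` and feed `response.iter_lines()` into this function.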