gpt-oss-120b

OpenAI | Generally Available

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized...

Context: 131K tokens

Input: $0.039 per MTok

Output: $0.18 per MTok
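As a rough illustration of how the listed per-MTok rates translate into per-request cost, the following Python sketch multiplies token counts by the prices shown above. The function name `estimate_cost_usd` and the example token counts are assumptions made for illustration; actual billing may round or meter differently.

```python
# Prices copied from the listing above (USD per million tokens).
INPUT_USD_PER_MTOK = 0.039
OUTPUT_USD_PER_MTOK = 0.18


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the listed rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical request: a 10,000-token prompt and a 1,000-token completion.
print(f"${estimate_cost_usd(10_000, 1_000):.6f}")
```

At these rates, output tokens cost roughly 4.6 times as much as input tokens, so long completions tend to dominate the bill.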


About

Modalities

Input: Text

Output: Text

Advanced Capabilities

Multi-turn Tool Calling: chained tool calls in one session
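To make "chained tool calls in one session" concrete, here is a minimal Python sketch of a client-side loop that keeps calling the model until it stops requesting tools. It assumes an OpenAI-style message shape (`tool_calls` on the assistant message, `role: "tool"` replies), which this page does not confirm. `get_weather`, `run_tool_loop`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real API call so the sketch runs offline.

```python
import json
from typing import Callable

# Hypothetical local tool; the name and behaviour are illustrative only.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}


def run_tool_loop(call_model: Callable[[list], dict], messages: list, max_turns: int = 5) -> str:
    """Call the model repeatedly, running requested tools until it answers in text."""
    for _ in range(max_turns):
        reply = call_model(messages)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply.get("content", "")
        messages.append(reply)  # keep the assistant's tool request in the history
        for call in tool_calls:
            fn = call["function"]
            result = TOOLS[fn["name"]](**json.loads(fn["arguments"]))
            # Assumed OpenAI-style tool result message.
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": result})
    raise RuntimeError("model did not finish within max_turns")


# Stand-in for a real API call: requests one tool, then answers from its result.
def fake_model(messages: list) -> dict:
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": f"Forecast: {messages[-1]['content']}"}
    return {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})}}]}


answer = run_tool_loop(fake_model, [{"role": "user", "content": "Weather in Paris?"}])
print(answer)
```

Passing the model call in as a function keeps the loop independent of the transport, so the same logic could wrap a REST request or a Workers binding. The `max_turns` cap is a design choice to stop a runaway chain of tool requests.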

Code Examples

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/gpt-oss-120b \
  -H "Authorization: Bearer $CLOUDFLARE_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain quantum entanglement in one sentence." }
    ]
  }'

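import os

import requests

# Credentials are read from the environment so they never appear in source.
ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
TOKEN = os.environ["CLOUDFLARE_AUTH_TOKEN"]

response = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/gpt-oss-120b",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum entanglement in one sentence."},
        ],
    },
    timeout=60,  # avoid hanging indefinitely on a stalled connection
)
response.raise_for_status()  # raise on HTTP errors instead of printing an error body
print(response.json())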

interface Env { AI: Ai }

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const messages = [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user",   content: "Explain quantum entanglement in one sentence." },
    ];
    const response = await env.AI.run("gpt-oss-120b", { messages });
    return Response.json(response);
  },
};

API Parameters

Name	Type	Default	Description
`input` required	one of	—	Input messages in Responses API format. Refer to the OpenAI Responses API docs for the supported content types.
`messages` required	array	—	An array of message objects representing the conversation history.
`prompt` required	string	—	The input text prompt for the model to generate a response.
`requests` required	array	—	—
`frequency_penalty`	number	—	Decreases the likelihood of the model repeating the same lines verbatim.
`functions` deprecated	array	—	Deprecated. Use tools.
`lora`	string	—	Name of the LoRA (Low-Rank Adaptation) model to fine-tune the base model.
`max_tokens`	integer	256	The maximum number of tokens to generate in the response.
`presence_penalty`	number	—	Increases the likelihood of the model introducing new topics.
`raw`	boolean	false	If true, a chat template is not applied and you must adhere to the specific model's expected formatting.
`reasoning`	object	—	Configuration for extended-thinking / reasoning mode.
`repetition_penalty`	number	—	Penalty for repeated tokens; higher values discourage repetition.
`response_format`	object	—	Constrain output to a JSON schema or an enum (structured outputs).
`seed`	integer	—	Random seed for reproducibility of the generation.
`stream`	boolean	false	If true, the response is streamed back incrementally as server-sent events (SSE).
`temperature`	number	0.6	Controls the randomness of the output; higher values produce more random results.
`tools`	array	—	A list of tools available for the assistant to use.
`top_k`	integer	—	Limits the AI to choose from the top 'k' most probable words. Lower values make responses more focused; higher values introduce more variety and potential surprises.
`top_p`	number	—	Adjusts the creativity of the AI's responses by controlling how many possible words it considers. Lower values make outputs more predictable; higher values allow for more varied and creative responses.

Sourced from the model's published API schema.
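The parameters above can be combined into a single request body. The following Python sketch builds one from a user prompt, using the defaults listed in the table (`max_tokens` 256, `temperature` 0.6, `stream` false) and including `seed` only when it is set. The helper name `build_payload` is an assumption for illustration, and the sketch only constructs the JSON body; it does not send a request.

```python
import json
from typing import Optional


def build_payload(
    user_text: str,
    *,
    max_tokens: int = 256,       # default from the parameter table
    temperature: float = 0.6,    # default from the parameter table
    seed: Optional[int] = None,  # no default listed, so omitted unless set
    stream: bool = False,        # default from the parameter table
) -> dict:
    """Build a chat-style request body using parameters from the table."""
    payload = {
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }
    if seed is not None:
        payload["seed"] = seed
    return payload


# A fixed seed is passed here to request reproducible generation.
print(json.dumps(build_payload("Hello", seed=42), indent=2))
```

The resulting dictionary could be passed as the `json=` argument of an HTTP POST to the endpoint shown in the code examples.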