Nemotron 3 120b A12b

Other Generally Available

NVIDIA Nemotron 3 Super is a hybrid MoE model with leading accuracy for multi-agent applications and specialized agentic AI systems.

Context: 256K tokens

Input: $0.50 per MTok

Output: $1.50 per MTok

Advanced Capabilities

Structured Outputs: JSON schema-constrained generation

Multi-turn Tool Calling: chained tool calls in one session

Agentic Workload Ready: tool use and structured output combined

Parallel Tool Calls: multiple tools invoked in one turn
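The capabilities above can be combined in a single request body. The following is a minimal sketch, not a definitive implementation: it assumes the OpenAI-style shapes for `tools` and `response_format`, and the tool name `get_weather`, the schema name `weather_answer`, and both helper functions are hypothetical. Only the parameter names `messages`, `tools`, `tool_choice`, `parallel_tool_calls`, and `response_format` come from the parameter table on this page.

```python
import json

# Hypothetical tool definition, assuming the OpenAI-style function-calling shape.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Hypothetical JSON schema that the final answer should conform to.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "summary": {"type": "string"},
    },
    "required": ["city", "summary"],
}


def build_request(messages):
    """Build a request body that combines tool use with structured output."""
    return {
        "messages": messages,
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
        "parallel_tool_calls": True,
        # Assumed shape for schema-constrained output.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "weather_answer", "schema": ANSWER_SCHEMA},
        },
    }


def append_tool_result(messages, assistant_message, tool_call_id, result):
    """Return a new message list extended with the tool call and its result."""
    return messages + [
        assistant_message,
        {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)},
    ]


messages = [{"role": "user", "content": "What is the weather in Paris?"}]
body = build_request(messages)
print(json.dumps(body, indent=2))
```

For a multi-turn session, the list returned by `append_tool_result` would be passed back to `build_request` for the next turn, so that each tool result stays in the conversation history.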

Code Examples

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/nvidia/nemotron-3-120b-a12b \
  -H "Authorization: Bearer $CLOUDFLARE_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain quantum entanglement in one sentence." }
    ]
  }'

import os

import requests

# Credentials are read from the environment rather than hard-coded.
ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
TOKEN = os.environ["CLOUDFLARE_AUTH_TOKEN"]

response = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/nvidia/nemotron-3-120b-a12b",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum entanglement in one sentence."},
        ],
    },
    timeout=60,  # avoid hanging indefinitely on a stalled connection
)
response.raise_for_status()  # raise on HTTP errors instead of printing an error body
print(response.json())

interface Env { AI: Ai }

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const messages = [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user",   content: "Explain quantum entanglement in one sentence." },
    ];
    const response = await env.AI.run("@cf/nvidia/nemotron-3-120b-a12b", { messages });
    return Response.json(response);
  },
};
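The examples above wait for the complete response. When `stream` is set to true, partial message deltas arrive as server-sent events, and the client has to decode them line by line. A minimal sketch of that decoding step follows. It assumes each event is a `data:` line carrying JSON and that the stream ends with a `data: [DONE]` sentinel, which is a common convention but is not confirmed by this page. The helper name `iter_sse_events` and the sample payloads are hypothetical.

```python
import json


def iter_sse_events(lines):
    """Yield decoded JSON payloads from an iterable of server-sent-event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(data)


# Hypothetical sample stream; real event payloads may be shaped differently.
sample = [
    'data: {"response": "Hello"}',
    "",
    'data: {"response": " world"}',
    "",
    "data: [DONE]",
]
events = list(iter_sse_events(sample))
print(events)
```

With the `requests` library, the same helper could be fed from `response.iter_lines(decode_unicode=True)` after posting with `stream=True`.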

API Parameters

Name	Type	Default	Description
`messages` required	array	—	A list of messages comprising the conversation so far.
`prompt` required	string	—	The input text prompt for the model to generate a response.
`audio`	one of	—	Audio-output configuration (voice + format) when modalities includes "audio".
`chat_template_kwargs`	object	—	Provider-specific keyword arguments for the chat template.
`frequency_penalty`	one of	0	Penalizes new tokens based on their existing frequency in the text so far.
`function_call` deprecated	one of	—	Deprecated. Use tool_choice.
`functions` deprecated	array	—	Deprecated. Use tools.
`logit_bias`	one of	—	Modify the likelihood of specified tokens appearing in the completion. Maps token IDs to bias values from -100 to 100.
`logprobs`	one of	false	Whether to return log probabilities of the output tokens.
`max_completion_tokens`	one of	—	An upper bound for the number of tokens that can be generated for a completion.
`max_tokens` deprecated	one of	—	Deprecated in favor of max_completion_tokens. The maximum number of tokens to generate.
`metadata`	one of	—	Set of up to 16 key-value pairs that can be attached to the object.
`modalities`	one of	—	Output types requested from the model (e.g. ['text'] or ['text', 'audio']).
`model`	string	—	ID of the model to use (e.g. '@cf/zai-org/glm-4.7-flash').
`n`	one of	1	How many chat completion choices to generate for each input message.
`parallel_tool_calls`	boolean	true	Whether to enable parallel function calling during tool use.
`prediction`	one of	—	Predicted output content for accelerated decoding.
`presence_penalty`	one of	0	Penalizes new tokens based on whether they appear in the text so far.
`reasoning_effort`	one of	—	Constrains effort on reasoning for reasoning models (o1, o3-mini, etc.).
`response_format`	one of	—	Constrain output to a JSON schema or an enum (structured outputs).
`seed`	one of	—	If specified, the system will make a best effort to sample deterministically.
`service_tier`	one of	auto	Specifies the processing type used for serving the request.
`stop`	one of	—	Up to 4 sequences where the API will stop generating further tokens.
`store`	one of	false	Whether to store the output for model distillation / evals.
`stream`	one of	false	If true, partial message deltas will be sent as server-sent events.
`stream_options`	one of	—	Options for the streaming response (e.g. include_usage).
`temperature`	one of	1	Sampling temperature between 0 and 2.
`tool_choice`	one of	—	Controls which (if any) tool is called: "none", "auto", "required", or a specific tool.
`tools`	array	—	A list of tools the model may call.
`top_logprobs`	one of	—	How many top log probabilities to return at each token position (0-20). Requires logprobs=true.
`top_p`	one of	1	Nucleus sampling: considers the results of the tokens with top_p probability mass.
`user`	string	—	A unique identifier representing your end-user, for abuse monitoring.
`web_search_options`	one of	—	Configuration for web-search tool augmentation.

Sourced from the model's published API schema.
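
Several rows in the table state numeric limits or deprecations that can be checked on the client before a request is sent. The sketch below is one way to do that, not part of the API: the function name `validate_params` and the `DEPRECATED` mapping are hypothetical, and the limits are taken only from the descriptions in the table above.

```python
# Deprecated parameter names and their replacements, as listed in the table.
DEPRECATED = {
    "max_tokens": "max_completion_tokens",
    "functions": "tools",
    "function_call": "tool_choice",
}


def validate_params(params):
    """Return a list of problems found in a request-parameter dict."""
    problems = []

    temperature = params.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        problems.append("temperature must be between 0 and 2")

    top_logprobs = params.get("top_logprobs")
    if top_logprobs is not None:
        if not 0 <= top_logprobs <= 20:
            problems.append("top_logprobs must be between 0 and 20")
        if not params.get("logprobs"):
            problems.append("top_logprobs requires logprobs=true")

    stop = params.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 sequences")

    for token_id, bias in (params.get("logit_bias") or {}).items():
        if not -100 <= bias <= 100:
            problems.append(f"logit_bias[{token_id}] must be between -100 and 100")

    for old, new in DEPRECATED.items():
        if old in params:
            problems.append(f"{old} is deprecated; use {new}")

    return problems


print(validate_params({"temperature": 2.5, "top_logprobs": 5, "max_tokens": 100}))
```

An empty list means no documented limit was violated. The server may still reject a request for reasons this check does not cover.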