Nemotron 3 120b A12b

Other Generally Available

NVIDIA Nemotron 3 Super is a hybrid MoE model with leading accuracy for multi-agent applications and specialized agentic AI systems.

Context: 256K tokens
Input: $0.50 per MTok
Output: $1.50 per MTok
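
As a worked example of the pricing above, the cost of a single request can be estimated as input_tokens / 1,000,000 × $0.50 + output_tokens / 1,000,000 × $1.50. The Python sketch below assumes billing is purely linear in token count, with no minimum charge, rounding, or caching discount (none of which this page specifies). The function name `estimate_cost_usd` is illustrative only.

```python
# Prices taken from the listing above, in USD per million tokens (MTok).
INPUT_USD_PER_MTOK = 0.50
OUTPUT_USD_PER_MTOK = 1.50


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request, assuming linear per-token billing."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Example: 200,000 input tokens and 10,000 output tokens.
print(f"${estimate_cost_usd(200_000, 10_000):.4f}")  # prints $0.1150
```

Because output tokens cost three times as much as input tokens, long generations dominate the bill faster than long prompts do.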

Advanced Capabilities

Structured Outputs: JSON schema-constrained generation
Multi-turn Tool Calling: Chained tool calls in one session
Agentic Workload Ready: Tool use + structured output combined
Parallel Tool Calls: Multiple tools invoked in one turn
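
To make the capabilities above concrete, here is a minimal Python sketch of a request body that combines parallel tool calls with schema-constrained output. It only builds the JSON and sends nothing. The top-level parameter names (`tools`, `tool_choice`, `parallel_tool_calls`, `response_format`) appear in this model's parameter list, but the nested shapes follow the common OpenAI-style convention and are an assumption here. The tools `get_weather` and `get_local_time` and the schema name `city_report` are hypothetical.

```python
import json


def function_tool(name: str, description: str, properties: dict, required: list) -> dict:
    """Describe one callable tool (assumed OpenAI-style function-tool shape)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def build_agentic_payload(question: str) -> dict:
    """Build a chat request that offers two tools and constrains the final answer."""
    city = {"city": {"type": "string", "description": "City name"}}
    return {
        "messages": [{"role": "user", "content": question}],
        # Hypothetical tools, for illustration only.
        "tools": [
            function_tool("get_weather", "Current weather for a city", city, ["city"]),
            function_tool("get_local_time", "Current local time for a city", city, ["city"]),
        ],
        "tool_choice": "auto",        # let the model decide whether to call a tool
        "parallel_tool_calls": True,  # allow several tool calls in a single turn
        # Assumed shape for JSON schema-constrained generation.
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "city_report",
                "schema": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "summary": {"type": "string"},
                    },
                    "required": ["city", "summary"],
                },
            },
        },
    }


payload = build_agentic_payload("What is the weather and local time in Lisbon?")
print(json.dumps(payload, indent=2))
```

In a multi-turn session, each tool result would be appended to `messages` before the next request. The exact message format for tool results is not described on this page, so check the published schema before relying on it.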

Code Examples

curl https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/ai/run/@cf/nvidia/nemotron-3-120b-a12b \
  -H "Authorization: Bearer $CLOUDFLARE_AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Explain quantum entanglement in one sentence." }
    ]
  }'
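
The same request can be sketched in Python using only the standard library. The URL, headers, and body mirror the curl command above. The helper names `build_request` and `send` are illustrative, and the placeholder credentials must be replaced with real values. The response body is returned as parsed JSON without assuming its structure, since this page does not document the response shape.

```python
import json
import urllib.request

MODEL = "@cf/nvidia/nemotron-3-120b-a12b"


def build_request(account_id: str, token: str, messages: list) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    url = f"https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run/{MODEL}"
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum entanglement in one sentence."},
]
# Placeholder credentials; nothing is sent until send(request) is called.
request = build_request("YOUR_ACCOUNT_ID", "YOUR_AUTH_TOKEN", messages)
print(request.full_url)
```

Calling `send(request)` with valid credentials performs the network call. Keeping request construction separate from sending makes the payload easy to inspect or log first.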

API Parameters

Name | Type | Description
messages | array, required | A list of messages comprising the conversation so far.
prompt | string, required | The input text prompt for the model to generate a response.
audio | one of | Audio-output configuration (voice + format) when modalities includes "audio".
chat_template_kwargs | object | Provider-specific keyword arguments for the chat template.
frequency_penalty | one of | Penalizes new tokens based on their existing frequency in the text so far.
function_call | one of, deprecated | Deprecated. Use tool_choice.
functions | array, deprecated | Deprecated. Use tools.
logit_bias | one of | Modify the likelihood of specified tokens appearing in the completion. Maps token IDs to bias values from -100 to 100.
logprobs | one of | Whether to return log probabilities of the output tokens.
max_completion_tokens | one of | An upper bound for the number of tokens that can be generated for a completion.
max_tokens | one of, deprecated | Deprecated in favor of max_completion_tokens. The maximum number of tokens to generate.
metadata | one of | Set of up to 16 key-value pairs that can be attached to the object.
modalities | one of | Output types requested from the model (e.g. ['text'] or ['text', 'audio']).
model | string | ID of the model to use (e.g. '@cf/zai-org/glm-4.7-flash').
n | one of | How many chat completion choices to generate for each input message.
parallel_tool_calls | boolean | Whether to enable parallel function calling during tool use.
prediction | one of | Predicted output content for accelerated decoding.
presence_penalty | one of | Penalizes new tokens based on whether they appear in the text so far.
reasoning_effort | one of | Constrains effort on reasoning for reasoning models (o1, o3-mini, etc.).
response_format | one of | Constrain output to a JSON schema or an enum (structured outputs).
seed | one of | If specified, the system will make a best effort to sample deterministically.
service_tier | one of | Specifies the processing type used for serving the request.
stop | one of | Up to 4 sequences where the API will stop generating further tokens.
store | one of | Whether to store the output for model distillation / evals.
stream | one of | If true, partial message deltas will be sent as server-sent events.
stream_options | one of | Options for the streaming response (e.g. include_usage).
temperature | one of | Sampling temperature between 0 and 2.
tool_choice | one of | Controls which (if any) tool is called: "none", "auto", "required", or a specific tool.
tools | array | A list of tools the model may call.
top_logprobs | one of | How many top log probabilities to return at each token position (0-20). Requires logprobs=true.
top_p | one of | Nucleus sampling: considers the results of the tokens with top_p probability mass.
user | string | A unique identifier representing your end-user, for abuse monitoring.
web_search_options | one of | Configuration for web-search tool augmentation.

Sourced from the model's published API schema.
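
Several parameters above carry numeric limits that can be checked client-side before a request is sent. The sketch below is a hypothetical helper, not part of any Cloudflare SDK. It encodes only the limits stated in the parameter descriptions, plus two labeled assumptions: that `messages` and `prompt` are alternative input forms (both are marked required), and that 16 is the maximum number of `metadata` pairs.

```python
def validate_params(params: dict) -> list:
    """Return a list of problems; an empty list means no documented limit is violated."""
    problems = []

    # Assumption: messages and prompt are alternatives, so one of them must be present.
    if "messages" not in params and "prompt" not in params:
        problems.append("either messages or prompt is required")

    temperature = params.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        problems.append("temperature must be between 0 and 2")

    # stop may be a single string or a list; only a list can exceed the limit.
    stop = params.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 sequences")

    top_logprobs = params.get("top_logprobs")
    if top_logprobs is not None:
        if not 0 <= top_logprobs <= 20:
            problems.append("top_logprobs must be between 0 and 20")
        if params.get("logprobs") is not True:
            problems.append("top_logprobs requires logprobs=true")

    for token_id, bias in (params.get("logit_bias") or {}).items():
        if not -100 <= bias <= 100:
            problems.append(f"logit_bias[{token_id}] must be between -100 and 100")

    # Assumption: 16 is an upper bound on the number of metadata pairs.
    if len(params.get("metadata") or {}) > 16:
        problems.append("metadata accepts at most 16 key-value pairs")

    return problems


print(validate_params({"messages": [], "temperature": 2.5, "top_logprobs": 5}))
# → ['temperature must be between 0 and 2', 'top_logprobs requires logprobs=true']
```

A check like this catches obvious mistakes early, but the service's own validation remains authoritative.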