Adding gpt-oss guides #1982

Merged · 3 commits · Aug 5, 2025

674 changes: 674 additions & 0 deletions articles/gpt-oss/fine-tune-transfomers.ipynb

Large diffs are not rendered by default.

123 changes: 123 additions & 0 deletions articles/gpt-oss/handle-raw-cot.md
@@ -0,0 +1,123 @@
# How to handle the raw chain of thought in gpt-oss

The [gpt-oss models](https://openai.com/open-models) provide access to a raw chain of thought (CoT) intended for analysis and safety research by model implementers, but it’s also crucial for tool-calling performance, since tool calls can be performed as part of the CoT. At the same time, the raw CoT might contain potentially harmful content or reveal information that the person implementing the model did not intend to expose (such as rules specified in the instructions given to the model). You should therefore not show the raw CoT to end users.

## Harmony / chat template handling

The model encodes its raw CoT as part of our [harmony response format](https://cookbook.openai.com/articles/openai-harmony). If you are authoring your own chat templates or handling tokens directly, make sure to [check out the harmony guide first](https://cookbook.openai.com/articles/openai-harmony).

To summarize the key points (a simplified sketch follows the list):

1. CoT is emitted on the `analysis` channel.
2. Once the model has produced a message on the `final` channel, all earlier `analysis` messages should be dropped in subsequent sampling turns. Function calls on the `commentary` channel can remain.
3. If the assistant's last message was a tool call of any type, the `analysis` messages issued since the previous `final` message should be preserved in subsequent sampling turns, until a new `final` message gets issued.
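
For illustration, here is a simplified, hypothetical harmony-formatted exchange (see the [harmony guide](https://cookbook.openai.com/articles/openai-harmony) for the authoritative token layout). The `analysis` message carries the raw CoT and should be dropped once a later `final` message exists:

```
<|start|>assistant<|channel|>analysis<|message|>The user wants a short greeting, nothing sensitive here...<|end|>
<|start|>assistant<|channel|>final<|message|>Hello! How can I help you today?<|return|>
```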

## Chat Completions API

If you are implementing a Chat Completions-compatible API, there is no official spec for handling chain of thought in the published OpenAI specs, as our hosted models will not offer this feature for the time being. Instead, we ask you to follow [the convention from OpenRouter](https://openrouter.ai/docs/use-cases/reasoning-tokens), including:

1. Raw CoT will be returned as part of the response unless `reasoning: { exclude: true }` is specified as part of the request. [See details here](https://openrouter.ai/docs/use-cases/reasoning-tokens#legacy-parameters)
2. The raw CoT is exposed as a `reasoning` property on the message in the output
3. For delta events the delta has a `reasoning` property
4. On subsequent turns you should be able to receive the previous reasoning (as `reasoning`) and handle it in accordance with the behavior specified in the chat template section above.

When in doubt, please follow the convention / behavior of the OpenRouter implementation.
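
As an illustrative (non-authoritative) sketch, reading the raw CoT with the OpenAI Python SDK against a gpt-oss server that follows this convention might look like the following. The base URL, model name, and `api_key` are placeholders, and the exact field shapes are defined by the OpenRouter docs linked above:

```py
from openai import OpenAI

# Placeholder base URL for a Chat Completions-compatible gpt-oss server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Briefly explain MXFP4 quantization."}],
    # Set `reasoning: {"exclude": true}` if you do not want the raw CoT back
    extra_body={"reasoning": {"exclude": False}},
)

message = response.choices[0].message
raw_cot = message.model_dump().get("reasoning")  # raw CoT -- never show this to end users
print(raw_cot)
print(message.content)  # final answer, safe to display
```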

## Responses API

For the Responses API, we augmented the spec to cover this case. Below are the changes to the spec as type definitions. At a high level, we are:

1. Introducing a new `content` property on `reasoning`. This allows a reasoning `summary` that could be displayed to the end user to be returned at the same time as the raw CoT (which should not be shown to the end user, but which might be helpful for interpretability research).
2. Introducing a new content type called `reasoning_text`
3. Introducing two new events: `response.reasoning_text.delta` to stream the deltas of the raw CoT, and `response.reasoning_text.done` to indicate that a turn of CoT has completed
4. On subsequent turns you should be able to receive the previous reasoning and handle it in accordance with the behavior specified in the chat template section above.

**Item type changes**

```typescript
type ReasoningItem = {
  id: string;
  type: "reasoning";
  summary: SummaryContent[];
  // new
  content: ReasoningTextContent[];
};

type ReasoningTextContent = {
  type: "reasoning_text";
  text: string;
};

type ReasoningTextDeltaEvent = {
  type: "response.reasoning_text.delta";
  sequence_number: number;
  item_id: string;
  output_index: number;
  content_index: number;
  delta: string;
};

type ReasoningTextDoneEvent = {
  type: "response.reasoning_text.done";
  sequence_number: number;
  item_id: string;
  output_index: number;
  content_index: number;
  text: string;
};
```

**Event changes**

```typescript
...
{
  type: "response.content_part.added"
  ...
}
{
  type: "response.reasoning_text.delta",
  sequence_number: 14,
  item_id: "rs_67f47a642e788191aec9b5c1a35ab3c3016f2c95937d6e91",
  output_index: 0,
  content_index: 0,
  delta: "The "
}
...
{
  type: "response.reasoning_text.done",
  sequence_number: 18,
  item_id: "rs_67f47a642e788191aec9b5c1a35ab3c3016f2c95937d6e91",
  output_index: 0,
  content_index: 0,
  text: "The user asked me to think"
}
```

**Example responses output**

```typescript
"output": [
{
"type": "reasoning",
"id": "rs_67f47a642e788191aec9b5c1a35ab3c3016f2c95937d6e91",
"summary": [
{
"type": "summary_text",
"text": "**Calculating volume of gold for Pluto layer**\n\nStarting with the approximation..."
}
],
"content": [
{
"type": "reasoning_text",
"text": "The user asked me to think..."
}
]
}
]

```

## Displaying raw CoT to end-users

If you are providing a chat interface to users, you should not show the raw CoT because it might contain potentially harmful content or other information that you might not intend to expose (for example, instructions in the developer message). Instead, we recommend showing a summarized CoT, similar to our production implementations in the API and ChatGPT, where a summarizer model reviews the CoT and blocks harmful content from being shown.
163 changes: 163 additions & 0 deletions articles/gpt-oss/run-locally-ollama.md
@@ -0,0 +1,163 @@
# How to run gpt-oss locally with Ollama

Want to get [**OpenAI gpt-oss**](https://openai.com/open-models) running on your own hardware? This guide will walk you through using [Ollama](https://ollama.ai) to set up **gpt-oss-20b** or **gpt-oss-120b** locally, chat with it offline, use it through an API, and even connect it to the Agents SDK.

Note that this guide is meant for consumer hardware, like running a model on a PC or Mac. For server applications with dedicated GPUs like NVIDIA’s H100s, [check out our vLLM guide](https://cookbook.openai.com/articles/gpt-oss/run-vllm).

## Pick your model

Ollama supports both model sizes of gpt-oss:

- **`gpt-oss-20b`**
  - The smaller model
  - Best with **≥16GB VRAM** or **unified memory**
  - Perfect for higher-end consumer GPUs or Apple Silicon Macs
- **`gpt-oss-120b`**
  - Our larger full-sized model
  - Best with **≥60GB VRAM** or **unified memory**
  - Ideal for multi-GPU or beefy workstation setups

**A couple of notes:**

- These models ship **MXFP4 quantized** out of the box; there is currently no other quantization available
- You _can_ offload to CPU if you’re short on VRAM, but expect it to run slower.

## Quick setup

1. **Install Ollama** → [Get it here](https://ollama.com/download)
2. **Pull the model you want:**

```shell
# For 20B
ollama pull gpt-oss:20b

# For 120B
ollama pull gpt-oss:120b
```

## Chat with gpt-oss

Ready to talk to the model? You can fire up a chat in the app or the terminal:

```shell
ollama run gpt-oss:20b
```

Ollama applies a **chat template** out of the box that mimics the [OpenAI harmony format](https://cookbook.openai.com/articles/openai-harmony). Type your message and start the conversation.

## Use the API

Ollama exposes a **Chat Completions-compatible API**, so you can use the OpenAI SDK without changing much. Here’s a Python example:

```py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Local Ollama API
    api_key="ollama"                       # Dummy key
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)

print(response.choices[0].message.content)
```

If you’ve used the OpenAI SDK before, this will feel instantly familiar.

Alternatively, you can use the Ollama SDKs in [Python](https://github.com/ollama/ollama-python) or [JavaScript](https://github.com/ollama/ollama-js) directly.
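
For example, a minimal sketch with the Ollama Python SDK (assuming `pip install ollama` and the model already pulled) might look like this:

```py
import ollama  # pip install ollama

response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Explain what MXFP4 quantization is."}],
)

# The response carries the assistant message with the final answer
print(response["message"]["content"])
```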

## Using tools (function calling)

Ollama can:

- Call functions
- Use a **built-in browser tool** (in the app)

Example of invoking a function via Chat Completions:

```py
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools
)

print(response.choices[0].message)
```

Since the models can perform tool calling as part of their chain of thought (CoT), it’s important to pass the reasoning returned by the API back in on each subsequent request where you provide a tool call’s result, until the model reaches a final answer.
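
A rough sketch of that loop, continuing the example above, could look like the following. Whether the assistant message carries an extra `reasoning` field depends on your Ollama version, so the verbatim pass-through below is an assumption:

```py
import json

messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]

while True:
    response = client.chat.completions.create(
        model="gpt-oss:20b",
        messages=messages,
        tools=tools,
    )
    message = response.choices[0].message

    if not message.tool_calls:
        print(message.content)  # final answer
        break

    # Feed the assistant turn back verbatim, including any reasoning it carried
    messages.append(message.model_dump(exclude_none=True))

    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = f"Sunny, 20°C in {args['city']}"  # stand-in for a real weather lookup
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result,
        })
```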

## Responses API workarounds

Ollama doesn’t (yet) support the **Responses API** natively.

If you do want to use the Responses API you can use [**Hugging Face’s `Responses.js` proxy**](https://github.com/huggingface/responses.js) to convert Chat Completions to Responses API.

For basic use cases you can also [**run our example Python server with Ollama as the backend**](https://github.com/openai/gpt-oss?tab=readme-ov-file#responses-api). This is a basic example server and is not intended for production use.

```shell
pip install gpt-oss
python -m gpt_oss.responses_api.serve \
  --inference_backend=ollama \
  --checkpoint gpt-oss:20b
```
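
Once the server is up, you can point the OpenAI SDK at it. The port, path prefix, and `output_text` usage below are assumptions for illustration — check the server's startup output and the gpt-oss repo for the actual address and response shape:

```py
from openai import OpenAI

# Assumed address of the example Responses API server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.responses.create(
    model="gpt-oss:20b",
    input="Explain what MXFP4 quantization is.",
)

# Convenience accessor for the final text output, if the server populates it
print(response.output_text)
```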

## Agents SDK integration

Want to use gpt-oss with OpenAI’s **Agents SDK**?

Both Agents SDKs (Python and TypeScript) let you override the OpenAI base client to point at Ollama through Chat Completions, or at your Responses.js proxy, for your local models. Alternatively, you can use their built-in integrations for third-party models:

- **Python:** Use [LiteLLM](https://openai.github.io/openai-agents-python/models/litellm/) to proxy to Ollama through LiteLLM
- **TypeScript:** Use [AI SDK](https://openai.github.io/openai-agents-js/extensions/ai-sdk/) with the [ollama adapter](https://ai-sdk.dev/providers/community-providers/ollama)

Here’s a Python Agents SDK example using LiteLLM:

```py
import asyncio
from agents import Agent, Runner, function_tool, set_tracing_disabled
from agents.extensions.models.litellm_model import LitellmModel

set_tracing_disabled(True)

@function_tool
def get_weather(city: str):
    print(f"[debug] getting weather for {city}")
    return f"The weather in {city} is sunny."


async def main():
    agent = Agent(
        name="Assistant",
        instructions="You only respond in haikus.",
        model=LitellmModel(model="ollama/gpt-oss:120b"),
        tools=[get_weather],
    )

    result = await Runner.run(agent, "What's the weather in Tokyo?")
    print(result.final_output)


if __name__ == "__main__":
    asyncio.run(main())
```