Add NVIDIA TensorRT-LLM optimization guide for GPT-OSS models #1983

Merged 4 commits on Aug 5, 2025
219 changes: 219 additions & 0 deletions articles/run-nvidia.ipynb
@@ -0,0 +1,219 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Optimizing OpenAI GPT-OSS Models with NVIDIA TensorRT-LLM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook provides a step-by-step guide on how to optimizing `gpt-oss` models using NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in performant way.\n",
"\n",
"\n",
"TensorRT-LLM supports both models:\n",
"- `gpt-oss-20b`\n",
"- `gpt-oss-120b`\n",
"\n",
"In this guide, we will run `gpt-oss-20b`, if you want to try the larger model or want more customization refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/blogs/tech_blog) deployment guide."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hardware\n",
"To run the 20B model and the TensorRT-LLM build process, you will need an NVIDIA GPU with at least 20 GB of VRAM.\n",
"\n",
"> Recommended GPUs: NVIDIA RTX 50 Series (e.g.RTX 5090), NVIDIA H100, or L40S.\n",
"\n",
"### Software\n",
"- CUDA Toolkit 12.8 or later\n",
"- Python 3.12 or later\n",
"- Access to the Orangina model checkpoint from Hugging Face"
]
},
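{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before installing anything, it can help to confirm that the GPU and driver are visible from your environment. The cell below is an optional, minimal check: it simply shells out to `nvidia-smi`, which should report the GPU model, driver version, and available VRAM if the hardware prerequisites above are met."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: list visible GPUs, driver version, and free VRAM.\n",
"# If no devices are reported, revisit the hardware and driver prerequisites above.\n",
"!nvidia-smi"
]
},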
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installling TensorRT-LLM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using NGC\n",
"\n",
"Pull the pre-built TensorRT-LLM container for GPT-OSS from NVIDIA NGC.\n",
"This is the easiest way to get started and ensures all dependencies are included.\n",
"\n",
"`docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`\n",
"`docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`\n",
"\n",
"## Using Docker (build from source)\n",
"\n",
"Alternatively, you can build the TensorRT-LLM container from source.\n",
"This is useful if you want to modify the source code or use a custom branch.\n",
"See the official instructions here: https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker\n",
"\n",
"The following commands will install required dependencies, clone the repository,\n",
"check out the GPT-OSS feature branch, and build the Docker container:\n",
" ```\n",
"#Update package lists and install required system packages\n",
"sudo apt-get update && sudo apt-get -y install git git-lfs build-essential cmake\n",
"\n",
"# Initialize Git LFS (Large File Storage) for handling large model files\n",
"git lfs install\n",
"\n",
"# Clone the TensorRT-LLM repository\n",
"git clone https://github.com/NVIDIA/TensorRT-LLM.git\n",
"cd TensorRT-LLM\n",
"\n",
"# Check out the branch with GPT-OSS support\n",
"git checkout feat/gpt-oss\n",
"\n",
"# Initialize and update submodules (required for build)\n",
"git submodule update --init --recursive\n",
"\n",
"# Pull large files (e.g., model weights) managed by Git LFS\n",
"git lfs pull\n",
"\n",
"# Build the release Docker image\n",
"make -C docker release_build\n",
"\n",
"# Run the built Docker container\n",
"make -C docker release_run \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TensorRT-LLM will be available through pip soon"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a matching CUDA Toolkit version for your version of PyTorch."
]
},
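{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to see which architecture applies to your machine, the short check below is a sketch that assumes PyTorch is available in your environment (it ships in the TensorRT-LLM containers). It prints the detected GPU, its CUDA compute capability (e.g., `(9, 0)` for Hopper), and the CUDA version PyTorch was built against, which helps diagnose the compatibility warnings mentioned above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"# Print the detected GPU, its compute capability, and the CUDA version PyTorch\n",
"# was built with, to help diagnose sm_90 / sm_120 compatibility warnings.\n",
"if torch.cuda.is_available():\n",
"    print(\"GPU:\", torch.cuda.get_device_name(0))\n",
"    print(\"Compute capability:\", torch.cuda.get_device_capability(0))\n",
"    print(\"PyTorch built with CUDA:\", torch.version.cuda)\n",
"else:\n",
"    print(\"No CUDA-capable GPU detected by PyTorch.\")"
]
},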
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Verifying TensorRT-LLM Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tensorrt_llm import LLM, SamplingParams"
]
},
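{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the import above succeeds, the installation is usable. As an additional optional check, you can print the package version; this sketch assumes the installed build exposes the usual `__version__` attribute."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tensorrt_llm\n",
"\n",
"# Print the installed TensorRT-LLM version (assumes the package exposes __version__).\n",
"print(\"TensorRT-LLM version:\", tensorrt_llm.__version__)"
]
},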
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Utilizing TensorRT-LLM Python API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the next code cell, we will demonstrate how to use the TensorRT-LLM Python API to:\n",
"1. Download the specified model weights from Hugging Face (using your HF_TOKEN for authentication).\n",
"2. Automatically build the TensorRT engine for your GPU architecture if it does not already exist.\n",
"3. Load the model and prepare it for inference.\n",
"4. Run a simple text generation example to verify everything is working.\n",
"\n",
"**Note**: The first run may take several minutes as it downloads the model and builds the engine.\n",
"Subsequent runs will be much faster, as the engine will be cached."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llm = LLM(model=\"openai/gpt-oss-20b\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prompts = [\"Hello, my name is\", \"The capital of France is\"]\n",
"sampling_params = SamplingParams(temperature=0.8, top_p=0.95)\n",
"for output in llm.generate(prompts, sampling_params):\n",
" print(f\"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion and Next Steps\n",
"Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API.\n",
"\n",
"In this notebook, you have learned how to:\n",
"- Set up your environment with the necessary dependencies.\n",
"- Use the `tensorrt_llm.LLM` API to download a model from the Hugging Face Hub.\n",
"- Automatically build a high-performance TensorRT engine tailored to your GPU.\n",
"- Run inference with the optimized model.\n",
"\n",
"\n",
"You can explore more advanced features to further improve performance and efficiency:\n",
"\n",
"- Benchmarking: Try running a [benchmark](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/benchmarking-default-performance.html#benchmarking-with-trtllm-bench) to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time.\n",
"\n",
"- Quantization: TensorRT-LLM [supports](https://github.com/NVIDIA/TensorRT-Model-Optimizer) various quantization techniques (like INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. This is a powerful feature for deploying models on resource-constrained hardware.\n",
"\n",
"- Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine using the [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/latest/) for robust, scalable, and multi-model serving.\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
5 changes: 5 additions & 0 deletions authors.yaml
@@ -2,6 +2,11 @@

# You can optionally customize how your information shows up cookbook.openai.com over here.
# If your information is not present here, it will be pulled from your GitHub profile.
jayrodge:
name: "Jay Rodge"
website: "https://www.linkedin.com/in/jayrodge/"
avatar: "https://developer-blogs.nvidia.com/wp-content/uploads/2024/05/Jay-Rodge.png"

rajpathak-openai:
name: "Raj Pathak"
website: "https://www.linkedin.com/in/rajpathakopenai/"
12 changes: 12 additions & 0 deletions registry.yaml
@@ -4,6 +4,17 @@
# should build pages for, and indicates metadata such as tags, creation date and
# authors for each page.


- title: Using NVIDIA TensorRT-LLM to run the gpt-oss-20b model
path: articles/run-nvidia.ipynb
date: 2025-08-05
authors:
- jayrodge
tags:
- gpt-oss
- open-models


- title: Fine-tuning with gpt-oss and Hugging Face Transformers
path: articles/gpt-oss/fine-tune-transfomers.ipynb
date: 2025-08-05
@@ -61,6 +72,7 @@
- gpt-oss
- harmony


- title: Temporal Agents with Knowledge Graphs
path: examples/partners/temporal_agents_with_knowledge_graphs/temporal_agents_with_knowledge_graphs.ipynb
date: 2025-07-22