Description
Describe the bug
When I test the mit-han-lab/nunchaku-flux.1-kontext-dev model, it runs normally when requests are processed one at a time, but throws an error when I send concurrent requests to the same pipeline.
My GPU is a single RTX 4090D.
How can I serve concurrent requests on a single GPU?
Thank you in advance for your help.
Here is my error message:
[2025-08-08 17:14:50.242] [info] Initializing QuantizedFluxModel on device 0
[2025-08-08 17:14:50.382] [info] Loading partial weights from pytorch
[2025-08-08 17:14:51.445] [info] Done.
Injecting quantized module
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 99.47it/s]
Loading pipeline components...: 57%|████████████████████████████████████████████████████████████████████████████████████████▌ | 4/7 [00:00<00:00, 28.54it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 19.02it/s]
Generation `height` and `width` have been adjusted to 752 and 1360 to fit the model requirements.
Generation `height` and `width` have been adjusted to 880 and 1168 to fit the model requirements.
43%|███████████████████████████████████████████████████████████████████████████████▎ | 12/28 [00:17<00:23, 1.45s/it]
57%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 16/28 [00:18<00:13, 1.17s/it]
Error while processing image: index 29 is out of bounds for dimension 0 with size 29
Error while processing image: index 29 is out of bounds for dimension 0 with size 29
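One possible reading of this error, not confirmed by the maintainers: the two progress bars stop at 12/28 and 16/28, which add up to 28 steps, and the failing index (29) is one past the end of a 29-entry array, i.e. 28 steps plus one. That pattern would fit both threads advancing a single step counter held by the shared pipeline. The toy sketch below reproduces that failure mode in plain Python, without diffusers. `ToyScheduler` and `run_interleaved` are hypothetical names, and the real scheduler's internals are assumed here, not inspected.

```python
class ToyScheduler:
    """Hypothetical stand-in for a stateful scheduler (not the diffusers class)."""

    def set_timesteps(self, num_steps):
        # num_steps + 1 noise levels, mirroring "size 29" for 28 steps.
        self.sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]
        self.step_index = 0

    def step(self):
        sigma = self.sigmas[self.step_index]
        # Raises IndexError once the shared counter runs past the last step.
        sigma_next = self.sigmas[self.step_index + 1]
        self.step_index += 1
        return sigma_next - sigma


def run_interleaved(num_steps, num_jobs):
    """Simulate num_jobs denoising loops taking turns on ONE scheduler.

    Returns (steps completed across all jobs, error message or None).
    """
    shared = ToyScheduler()
    for _ in range(num_jobs):
        shared.set_timesteps(num_steps)  # every job resets the same state
    steps_done = 0
    try:
        for _ in range(num_steps):
            for _ in range(num_jobs):
                shared.step()
                steps_done += 1
    except IndexError as exc:
        return steps_done, str(exc)
    return steps_done, None


if __name__ == "__main__":
    print(run_interleaved(28, 1))  # a single job completes all of its steps
    print(run_interleaved(28, 2))  # two jobs fail after 28 steps in total
```

With one job the loop finishes normally. With two jobs the shared counter is used up after 28 steps in total, the same split seen in the progress bars above.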
Reproduction
import time
from concurrent.futures import ThreadPoolExecutor

import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image
from nunchaku import NunchakuFluxTransformer2dModel
from nunchaku.utils import get_precision


def get_result(image_path, pipeline):
    time_begin = time.time()
    image = load_image(image_path).convert("RGB")
    size = image.size
    # Scale the long side to 1440 and round the short side to a multiple of 32.
    large_now = 1440
    small_now = round(1440 * (min(size) / max(size)) / 32) * 32
    width, height = (large_now, small_now) if size[0] > size[1] else (small_now, large_now)
    prompt = "Remove the watermark from the picture"
    image = pipeline(
        image=image,
        prompt=prompt,
        guidance_scale=2.5,
        num_inference_steps=28,
        height=height,
        width=width,
    ).images[0]
    image.save(image_path[:-4] + "_result.png")


def nunchaku_test(concurrency, pipeline):
    # The file names mean "floor-plan watermark" and "bedroom watermark".
    test_images = ["房型图水印.jpg", "卧室水印.png"] * concurrency
    test_images = test_images[:concurrency]
    overall_start = time.time()
    # Every worker thread calls the same pipeline object.
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [executor.submit(get_result, img_path, pipeline) for img_path in test_images]
        results = []
        for future in futures:
            try:
                results.append(future.result())
            except Exception as e:
                print(f"Error while processing image: {e}")
    overall_time = time.time() - overall_start


if __name__ == '__main__':
    transformer = NunchakuFluxTransformer2dModel.from_pretrained(
        f"/root/autodl-tmp/nunchaku-flux.1-kontext-dev/svdq-{get_precision()}_r32-flux.1-kontext-dev.safetensors"
    )
    pipeline = FluxKontextPipeline.from_pretrained(
        "/root/autodl-tmp/FLUX.1-Kontext-dev", transformer=transformer, torch_dtype=torch.bfloat16
    ).to("cuda")
    # Arguments follow the signature nunchaku_test(concurrency, pipeline).
    nunchaku_test(2, pipeline)
    nunchaku_test(4, pipeline)
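If the shared pipeline object is not safe to call from several threads at once (an assumption, since this report does not confirm it), one conservative workaround is to let only one thread run it at a time. The sketch below is one way to do that with a lock. `SerializedPipeline`, `FakePipeline`, and `run_jobs` are hypothetical names, and `FakePipeline` stands in for the real pipeline so the example runs without a GPU.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SerializedPipeline:
    """Hypothetical wrapper: only one thread may call the wrapped pipeline at a time."""

    def __init__(self, pipeline):
        self._pipeline = pipeline
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        with self._lock:
            return self._pipeline(*args, **kwargs)


class FakePipeline:
    """Stand-in for the real pipeline; records how many calls overlap."""

    def __init__(self):
        self.active = 0
        self.max_active = 0

    def __call__(self, prompt):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # pretend to run the denoising loop
        self.active -= 1
        return prompt.upper()


def run_jobs(prompts, workers):
    """Submit all prompts from a thread pool, going through the lock."""
    fake = FakePipeline()
    safe = SerializedPipeline(fake)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(safe, prompts))
    return results, fake.max_active


if __name__ == "__main__":
    results, max_active = run_jobs(["a", "b", "c", "d"], workers=4)
    print(results, max_active)
```

In the reproduction above, this would mean passing `SerializedPipeline(pipeline)` to `nunchaku_test` in place of `pipeline`. Requests can then be accepted concurrently, but generation still runs one image at a time, so this avoids the shared-state error without increasing GPU throughput.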
Logs
System Info
~/FLUX.1-Kontext-Dev-nunchaku# diffusers-cli env
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- 🤗 Diffusers version: 0.35.0.dev0
- Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
- Running on Google Colab?: No
- Python version: 3.12.3
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.33.1
- Transformers version: 4.53.0
- Accelerate version: 1.8.1
- PEFT version: not installed
- Bitsandbytes version: not installed
- Safetensors version: 0.5.3
- xFormers version: not installed
- Accelerator: NVIDIA GeForce RTX 4090 D, 24564 MiB
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response