Hi friends. Coffee, second cup. Today's post is for the people on RTX cards who saw NVIDIA's TensorRT announcement for Stable Diffusion 3.5 a while back, nodded politely, and never got around to actually setting it up because the install instructions read like a kernel commit message. I sat with it over the weekend, fought through the parts that weren't documented, and ran a fair benchmark on two different RTX cards. I also have, in writing, the moments where TensorRT is genuinely a no-brainer and the moments where you should just leave the base diffusers pipeline alone and go for a walk. Both moments exist. They are not the same moment.
The Short Version
TensorRT is NVIDIA's inference acceleration framework. For Stable Diffusion 3.5, NVIDIA shipped an integration that compiles the SD3.5 model graph into a TensorRT engine specifically optimized for your card. The compiled engine runs the model faster than the standard PyTorch path because it fuses operations, picks better kernels, and precomputes things the standard pipeline computes at runtime.
The good news. The speed boost is real and meaningful on Ada and Blackwell-generation RTX cards. The bad news. The compile step is slow, the engine files are large, and you lose flexibility (different resolution or batch size means a new engine compile). The honest takeaway. If you generate a lot of images at the same resolution and batch size, this is a clear win. If your work is exploratory across different aspect ratios and prompt counts, base diffusers is fine.
What I Tested
| Setup | Card | Driver | Resolution | Steps |
|---|---|---|---|---|
| Workstation | RTX 4070 Ti SUPER (16 GB) | NVIDIA Studio 581.x | 1024x1024 | 30 |
| Older laptop | RTX 3060 Mobile (6 GB) | NVIDIA Studio 581.x | 1024x1024 + 768x768 fallback | 30 |
Both runs were warmed up with three throwaway generations to get the GPU into a steady-state thermal range. All numbers below are the average of ten subsequent generations on the same prompt and seed. The base path used the official Stable Diffusion 3.5 medium checkpoint via the diffusers library. The accelerated path used the TensorRT engine compiled from that same checkpoint.
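If you want to reproduce the methodology, the timing harness is about ten lines. Here is a minimal sketch, assuming a `generate()` callable that wraps whichever pipeline you are measuring with a fixed prompt, seed, and step count (the function name is mine, not part of any library):

```python
import statistics
import time

import torch

def benchmark(generate, warmup=3, runs=10):
    # generate() is assumed to produce one image at fixed prompt/seed/steps
    for _ in range(warmup):       # throwaway runs to reach steady-state clocks
        generate()
    times = []
    for _ in range(runs):
        torch.cuda.synchronize()  # don't start the clock with work still queued
        t0 = time.perf_counter()
        generate()
        torch.cuda.synchronize()  # wait for the GPU to actually finish
        times.append(time.perf_counter() - t0)
    return statistics.mean(times)
```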
The Numbers
| Card | Base diffusers (sec/image) | TensorRT (sec/image) | Speedup |
|---|---|---|---|
| RTX 4070 Ti SUPER | ~6.8 | ~3.7 | ~1.84x |
| RTX 3060 Mobile (768x768) | ~13.2 | ~9.1 | ~1.45x |
The 4070 Ti SUPER ran 1024x1024. The 3060 Mobile dropped to 768x768 because the TensorRT engine for SD3.5 medium at 1024 will spill VRAM on a 6 GB card after the first generation in a way that wipes out the speed advantage. At 768 the 3060 fits comfortably and you get a real, measurable boost.
What you should not do is take these numbers and treat them as your numbers. They will be different on a 3090, a 4090, an A6000, an RTX 5070, and a Blackwell PRO card. The pattern (1.4x to 1.9x speedup, smaller on older cards) holds. The only way to get the exact number for your card is to run it.
The Setup, In Plain English
The official guide is technically correct and emotionally devastating. Here is the path that actually worked for me on Windows 11 with a clean Python environment.
Step 1: A clean conda environment
Do not skip this. Mixing TensorRT into an existing diffusers environment is the single most common reason the install fails silently and produces an engine that runs but generates noise. Make a new environment. Install only what TensorRT needs.
```bash
conda create -n trt-sd35 python=3.11 -y
conda activate trt-sd35
pip install --upgrade pip
```
Step 2: Match TensorRT, CUDA, and your driver
NVIDIA's TensorRT for SD3.5 ships as wheels that target a specific CUDA major version and need a driver new enough to support it. As of this post, the combination that works without surprises is a recent NVIDIA Studio driver, CUDA 12.x, and the TensorRT 10.x wheels. Older drivers will appear to work, then crash on the first generation with a cryptic CUBLAS error.
```bash
pip install nvidia-cudnn-cu12
pip install tensorrt
pip install onnx onnxruntime-gpu
# CUDA build of PyTorch; on Windows a plain `pip install torch` is CPU-only
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install diffusers transformers accelerate sentencepiece
```
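Before moving on, it is worth thirty seconds to confirm the versions actually line up. A quick sanity check, assuming the installs above succeeded:

```python
import tensorrt
import torch

print("TensorRT:", tensorrt.__version__)           # expect 10.x
print("CUDA (torch build):", torch.version.cuda)   # expect 12.x
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```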
Step 3: Pull the SD3.5 checkpoint
Pull the official Stability AI Stable Diffusion 3.5 medium checkpoint from Hugging Face. You will need an access token for the gated repo. The checkpoint is several gigabytes; let it finish before moving on.
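If you prefer to script the download rather than click through the browser, huggingface_hub can do it. A sketch, assuming you have already accepted the license on the model page and generated a token:

```python
from huggingface_hub import snapshot_download

# Gated repo: accept the license on the model page first.
snapshot_download(
    "stabilityai/stable-diffusion-3.5-medium",
    token="hf_your_token_here",  # or run `huggingface-cli login` once and omit this
)
```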
Step 4: Compile the TensorRT engine
This is the slow step. The compile is going to take 20 to 45 minutes on a 4070 Ti SUPER and longer on older cards. The compiler is doing real work: profiling kernels, picking layouts, generating optimized code paths for your specific GPU. Do not interrupt it. Do not run it on a card that is also rendering Blender in the background. Let it finish in peace.
The compile produces an engine file in the gigabyte range. That file is married to your card. If you swap GPUs, you compile again. If you upgrade your driver to a different major version, sometimes you compile again.
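NVIDIA's scripts drive the compile for you, but if you are curious what the slow step is actually doing, it is the standard TensorRT ONNX-to-engine build. A bare-bones sketch of that shape (the file names here are mine; the real scripts split SD3.5 into several engines and add optimization profiles):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network()          # explicit batch is the default in TRT 10
parser = trt.OnnxParser(network, logger)

# Hypothetical ONNX export of the SD3.5 transformer
with open("sd35_mmdit.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)       # half precision, a big part of the speedup

# The 20-to-45-minute step: kernel profiling and code generation for this GPU
engine = builder.build_serialized_network(network, config)

with open("sd35_mmdit.plan", "wb") as f:    # the engine file, tied to this card + driver
    f.write(engine)
```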
Step 5: Use the engine
Once the engine exists, calling it is fast. The diffusers TensorRT pipeline accepts a prompt, a seed, a step count, and produces an image. The interface is similar enough to the base pipeline that swapping between them in your own code is one constructor call.
The first generation after loading the engine is slower than subsequent generations because of one-time GPU warm-up. Throw away the first result. Then start measuring.
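For reference, here is the base diffusers side of that swap. This is the standard StableDiffusion3Pipeline API; the TensorRT pipeline is called the same way once you point its constructor at your engine (the exact class name depends on which integration you installed):

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium",
    torch_dtype=torch.float16,
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)  # fixed seed, per the methodology above
image = pipe(
    "a studio photo of a ceramic mug",  # any fixed prompt
    num_inference_steps=30,
    generator=generator,
).images[0]
image.save("out.png")
```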
When TensorRT Is Worth It
- Production batches. If you are generating 200 product mockups at 1024x1024 week after week, the compile cost amortizes within the first few batches (the break-even arithmetic is sketched just after this list) and every image past that point is pure wall-clock savings.
- Repeated runs at the same settings. Anime character sets, headshot batches, A/B prompt experiments at fixed resolution. The engine pays for itself.
- Workstation hardware where time costs money. RTX A6000 or RTX PRO 6000 Blackwell users running client work bill in hours; a 1.8x speedup is not theoretical, it is hours per week back in the budget.
- Long jobs you would otherwise run overnight. Something that finishes by dinner instead of by morning is, materially, a different workflow.
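The break-even arithmetic from the first bullet, using my 4070 Ti SUPER numbers and assuming a 30-minute compile:

```python
compile_cost_s = 30 * 60          # assume a 30-minute engine compile
saved_per_image_s = 6.8 - 3.7     # base minus TensorRT, sec/image at 1024x1024
print(round(compile_cost_s / saved_per_image_s))  # ~581 images to break even
```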
When TensorRT Is Not Worth It
- Exploratory hobby work. If you generate ten images at five resolutions while figuring out what you want, you will spend more time recompiling engines than generating.
- Older RTX cards under 8 GB. The VRAM overhead can cancel out the speedup once spillover starts.
- Workflows with frequent model swaps. Trying SD3.5 today, an SD3.5-Large LoRA tomorrow, and a custom community fine-tune next week will land you in compile-loop hell.
- Anyone whose bottleneck is creative iteration, not generation speed. If you spend an hour on prompt design and 12 seconds on the actual gen, halving that 12 seconds did not change your day.
The Honest Take
TensorRT is one of those tools that is underrated by hobbyists and understandably loved by professionals. The setup cost is high. The payoff scales with how repetitive your workload is. The middle of the bell curve, which is most of us doing personal projects with constantly shifting prompts and resolutions, gets a real but smaller benefit and may not justify the friction. The right tail of the curve, which is anyone running production batches or workstation-grade work, should absolutely set this up.
If you are not sure where you fall, run the base diffusers pipeline for a week, count how many generations you actually ran at the same resolution and batch size, and then decide. If the answer is "more than two hundred," compile. If the answer is "thirty across six different resolutions," skip it.
One Last Practical Tip
Keep the base diffusers pipeline installed alongside the TensorRT one. There are nights when you will want to test a new model, swap aspect ratios, or just play with a prompt without committing to a 30-minute recompile. The base pipeline is your sketchpad. The TensorRT engine is your printer. You want both, and they coexist fine in the same conda environment if you give them their own pipeline objects.
Have fun with it. Don't let the install scare you off if your work is the kind that benefits. And if your card is Ampere (30xx) or older, don't sweat skipping this one. The base path on a 30xx card running SD3.5 medium is, honestly, plenty fast for personal-project work in 2026.