NSAM's built the AI from hell with some Nvidia GPUs, go ahead and doubt it


Date: February 26th, 2026 11:52 PM
Author: Jared Baumeister



(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698772)




Date: February 27th, 2026 12:20 AM
Author: wangfei

specs and config

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698808)




Date: February 27th, 2026 12:21 AM
Author: Jared Baumeister

2x 3090s, a 5090, and a 5060 ti + i5-14400 with 128gb DDR5. llama.cpp in a Debian 12 container

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698810)




Date: February 27th, 2026 12:22 AM
Author: wangfei

what mb. how so many pcie lanes?

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698811)




Date: February 27th, 2026 12:27 AM
Author: Jared Baumeister

Gigabyte z790 with PCIe bifurcation on the top PCIe 5.0 slot, plus two x16-size PCIe 4.0x4 slots on the mobo. I think all the Gigabyte z790 motherboards give you lanes out the ass on the slots, even the budget series
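
If you want to confirm what link each card actually negotiated after bifurcation, nvidia-smi reports it directly (a generic query, nothing specific to this board; cards drop to a lower PCIe gen at idle, so check under load):

# current vs. maximum PCIe generation and width, per GPU
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv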

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698814)




Date: February 27th, 2026 12:32 AM
Author: wangfei

damn. what is total gpu vram? does llama see it all? how many parameter u can run?

edit, wait u run 5090 in 4.0 pcie slot?

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698820)




Date: February 27th, 2026 12:43 AM
Author: Jared Baumeister

The PCIe 5.0 slot is split into x8x8, and the 5090 only uses 5.0x8. But it doesn't matter: for inference you're never going to be limited by bandwidth until you drop below PCIe 3.0, because the GPUs aren't sending/receiving that much data to begin with. I rarely see any GPU spike over 900 MiB/s in nvtop.

By far the biggest difference is Blackwell vs non-Blackwell, but it doesn't matter to me because I have multiple Debian containers with different GPU passthrough configs. So if I want to load big 70b models on the 3090s and I just need another 8-12gb of VRAM, I can put the 5060 Ti in that container and get its extra 16gb. Right now that's what I'm doing because the 5090 runs so well by itself. But I can also move the 5060 Ti into the 5090's container if I need more than 32gb and want to keep the Blackwell features. And of course I can put all four in one container for 96gb, though I've seen no need to do that so far. Deepseek 4 is a wildcard, I have no idea what to expect
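
You can get the same per-model grouping without separate containers by masking devices per launch; CUDA_VISIBLE_DEVICES is respected by anything built on CUDA, llama.cpp included (the device indices and model paths below are placeholders, check nvidia-smi for the real ordering on your box):

# expose only the two 3090s to this run
CUDA_VISIBLE_DEVICES=1,2 ./llama-server -m /models/llama-70b-q4_k_m.gguf -ngl 99

# expose the 5090 + 5060 Ti instead, keeping the run all-Blackwell
CUDA_VISIBLE_DEVICES=0,3 ./llama-server -m /models/other-model.gguf -ngl 99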

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698830)




Date: February 27th, 2026 12:50 AM
Author: wangfei

so 5090 and 5060 split 5.0 8x8? and the 3090s running on 4.0 lanes? does llama see aggregate vram or you running containers that can only see portion of total vram? i am confused.

edit, saw your last 2 sentences got it. damn just slam everything into one container and see what you can do.

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698833)




Date: February 27th, 2026 1:07 AM
Author: Jared Baumeister

The way Blackwell does KV offloading is black magic. The 5090 by itself will run 48gb models no problem. It fills up the VRAM and then only touches <1gb of system RAM. I have no idea how to account for the missing 16gb. How can a 48gb model use only 32gb of VRAM and no system RAM?
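
One unverified guess at the mechanism: llama.cpp memory-maps GGUF files by default, so layers that don't fit in VRAM get read through the OS page cache and never show up as the process's RAM. A quick way to see where the bytes actually live (assuming the server binary is llama-server):

# VRAM actually allocated per GPU
nvidia-smi --query-gpu=name,memory.used --format=csv

# resident memory of the llama.cpp process; mmap'd pages sitting in the page cache don't count here
grep VmRSS /proc/$(pgrep -f llama-server | head -n1)/status

# disabling mmap forces a full load, which makes the real footprint visible
./llama-server -m model.gguf -ngl 99 --no-mmap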



(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698841)




Date: February 27th, 2026 1:13 AM
Author: wangfei

ok fuck it. im buying a 5090 tmr. only running 4090 right now.

edit, ive been trying to snipe a 5090 FE, no luck. i will just get whatever like an asus.

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698849)




Date: February 27th, 2026 1:28 AM
Author: Jared Baumeister

the MSIs are really good. I have the Gaming X Trio, which is in between the Ventus and the Suprim. The FEs are considered inferior and more likely to overheat or malfunction. Some Asus cards have fit issues with the power connectors too, need to research that

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698855)




Date: February 27th, 2026 1:47 AM
Author: wangfei



(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698869)




Date: February 27th, 2026 1:33 AM
Author: Jared Baumeister

also this magic KV offloading requires llama.cpp, and only works with gguf files. vLLM and SGLang won't do it. Ollama will do it but it's literally 1/10 the speed of llama.cpp with Blackwell
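
For reference, a bare-bones llama.cpp launch against a GGUF looks like this (paths, context size, and port are placeholders, not the setup from this thread):

./llama-server -m /models/some-model-q4_k_m.gguf -ngl 99 -c 8192 --host 127.0.0.1 --port 8080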

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698862)




Date: February 27th, 2026 10:11 AM
Author: Taylor Swift is not a hobby she is a lifestyle (🇺🇸 🇵🇱)

Buy an RTX Pro 6000. It's far better than dual 5090s.

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49699206)




Date: February 27th, 2026 1:06 PM
Author: Jared Baumeister

Oh wow my stalker can use Google!

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49699699)




Date: February 27th, 2026 4:04 PM
Author: wangfei

i would, and i agree, but i dont want to dump that much $ into my llm, which is only a hobby. i am willing to wait out the inevitable price drops on nvidias enterprise stuff during ai data center upgrade cycles. we are just at the very beginning of this.

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49700462)




Date: February 27th, 2026 2:54 AM
Author: Lab Diamond Dallas Trump

how important are the extra GPUs? am i gonna be happy with the performance of a local model if i set up a 5090?

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49698903)




Date: February 27th, 2026 7:31 AM
Author: Jared Baumeister

No idea. I also haven't probed the limits of the 5090 yet. All I will say is that I was initially disappointed with its performance in Ollama, and I didn't see big gains until I started using llama.cpp. You HAVE to compile llama.cpp and let it build whatever features it thinks it needs into the binaries
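
For anyone following along, a standard CUDA build from a recent checkout looks roughly like this (the CMake option names have changed over time, so check the repo's docs if it errors):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# the binaries (llama-server, llama-cli, etc.) land in build/bin/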

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49699009)




Date: February 27th, 2026 9:28 AM
Author: Jared Baumeister

PS if you're using llama.cpp you can ask Claude how to tune the parameters to your particular situation.

You should also have all drivers installed before you compile llama.cpp, so that it detects them and builds in the right modules

Also, if you have a mix of GPUs with different amounts of VRAM, you have to tell llama.cpp how many layers to offload to each one and how many layers (if any) go to the CPU. It's such a grind that I'm making a spreadsheet of scripts for launching different models in different configurations
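
The relevant knobs are -ngl (--n-gpu-layers) and --tensor-split; the values below are purely illustrative, not a split from this thread:

# offload 70 of the model's layers to the GPUs, split roughly in proportion to 24gb/24gb/16gb of VRAM;
# any remaining layers stay on the CPU
./llama-server -m /models/llama-70b-q4_k_m.gguf -ngl 70 --split-mode layer --tensor-split 24,24,16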

This is the PCIe bifurcation card I use. Even though it says it's only 4.0, it shows up as 5.0x8 in nvtop

https://a.co/d/07XNtyYc

Finally, all of these GPUs can be drastically power limited so that they run off one PSU. The 3090s can be run at 200W each (400W combined), the 5060 Ti uses 170W so you're at 570W, then power limit the 5090 to 380W. So total GPU draw is only about 950W (and my CPU can only pull 65W), so it's not a problem to run them all on one PSU. I'm using a 1600W PSU but I could probably get by with 1200W. Performance isn't an issue unless you're gaming, and cooler temps extend longevity
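
Setting the limits is just nvidia-smi with the wattages above; the index numbers are whatever nvidia-smi assigns on your machine, nvidia-smi will refuse anything below a card's allowed minimum, and the limits reset at reboot unless you reapply them from a startup script:

sudo nvidia-smi -pm 1           # persistence mode, so the driver (and the limits) stay loaded
sudo nvidia-smi -i 0 -pl 380    # 5090
sudo nvidia-smi -i 1 -pl 200    # 3090 #1
sudo nvidia-smi -i 2 -pl 200    # 3090 #2
sudo nvidia-smi -i 3 -pl 170    # 5060 Ti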

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49699135)




Date: February 27th, 2026 4:07 PM
Author: wangfei

that card is 1/3 the price if you buy straight from china/aliexpress.

(http://www.autoadmit.com/thread.php?thread_id=5838844&forum_id=2Firm#49700478)