An Inferno on the Head of a Pin

Hi, I’m Nick from Stack Overflow! Hope I can clarify some bits here:

While we tried this, we didn’t end up going with the 4-GPUs-in-1U config. Luckily we were able to test various configurations in Dell’s test lab. The way you fit 4 in a 1U server is in a Dell C4130 chassis. The 4 cards are not consumer cards, but the professional Tesla variants. They have higher power draw, more memory, and, more importantly, come in both airflow directions. The 4 cards are installed across the front bezel, and the airflow is the reverse of any consumer card you’d buy.

Here’s my tweet stream from first researching it, which includes pictures. Here’s a clear view:

Note that every other card is upside down, and those are 300W cards. It takes dual 1600W PSUs to power the rig.

What we actually ended up going with (we’re finally ordering hardware this week for many things) is a Dell R730 chassis. Since we can put cards in the back of it, instead of using Tesla cards, which run about $5,000 apiece in a server that starts far higher (IIRC $20K with a single card was our starting point), we can use consumer cards. In an R730 (a 2U server), one PCIe config allows 2 full-height cards in the back. We can use consumer GTX 1080s, which are $700 apiece and about 60-65% as fast for our use case. Marc Gravell has detailed test data we can share; we tested each scenario with singles and multiples of K40s, K80s, M40s, GTX 980s, and GTX 1080s.

In an R730, we can’t order processors above 140W TDP together with the GPU config we want, because of how Dell limits thermal configurations. But they don’t take into account that a GTX 1080 has a 180W TDP, not the 300W TDP of the Tesla models they sell (remember, times 2, so 240W less cooling needed). Given this, we couldn’t (by default) directly order the Intel E5-2687W v4 processors we want (which are 160W TDP) for the workloads on these boxes. Luckily, we can work around this by ordering the server and GPU kit separately, and we’ll install the cards ourselves.

The first server should be in place in the February/March timeframe if all goes well, and we’ll of course share the adventure. Happy to answer more questions as they arise.

Here are some extra informational links:


I found a couple when I searched for “1U liquid cooling”.

Sounds interesting.


Read the Why Ruby blog post first. There’s no question that Ruby benefits immensely from high clock speed CPUs. Every single bit of data or research we’ve ever done supports this conclusion.


LOLed at “Indoor enthusiast”.

IMO, as long as the enterprise can sustain it in the long run, the language doesn’t really matter. For one, I still recommend whatever the client usually has and can get hold of easily, and that tends to be Java, just because developers and application server administrators are a dime a dozen by comparison. Of course, I’d sooner push things off to cloud-based infrastructure for anything that isn’t sensitive, organization-critical data (e.g. Google Analytics, ZenDesk support).

Most enterprises are not really up to the scale of Twitter, Stack Overflow, or Facebook when it comes to public data load. However, they should learn lessons from them, especially how to handle scale at reasonable cost: things like enterprise caches, NoSQL data stores, CDNs, and RESTful OAuth2-protected APIs.

Some do, and some overdo it.

Can you explain the benefit of the GPUs for your application? Is there a previous article on this? AFAIK, the server is used to run a web app. How does serving a web app benefit from multiple GPUs?

Thanks


I am guessing it must be some high-level math where they run the same bunch of operations over different data. If that were the case, they’d be using something like CUDA to program the computation on the GPU and just send it batches of different data.

The GPU can run those in parallel much faster than a general-purpose CPU can.
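Something like this minimal CUDA sketch is the shape of it (purely illustrative; the kernel, names, and math are mine, not Stack Overflow’s actual code): one kernel launched across a million elements, each thread doing identical math on its own slice of the data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same operation applied to every element, just with different data --
// the pattern GPUs excel at. Each thread handles one index.
__global__ void scaleAndAdd(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * a[i] + b[i];  // arbitrary math; imagine your scoring formula here
}

int main() {
    const int n = 1 << 20;  // ~1M elements
    size_t bytes = n * sizeof(float);

    // Managed memory is visible to both CPU and GPU.
    float *a, *b, *out;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&out, bytes);
    for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    // 256 threads per block; enough blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleAndAdd<<<blocks, threads>>>(a, b, out, n);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);  // expect 2*42 + 84 = 168
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```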

I am here to tell you that Intel’s TDP figure of 140 watts for the 6-core version of this CPU is a terrible, scurrilous lie!

TDP isn’t max power consumption. It’s more like: for the application we intend this processor for (servers), this is how much heat output you (the machine designer) should design for. They don’t really tell you the conditions under which the TDP figure was chosen. Ark says:

Thermal Design Power (TDP) represents the average power, in watts, the processor dissipates when operating at Base Frequency with all cores active under an Intel-defined, high-complexity workload.

This gets really fuzzy for things like laptops where a 5W TDP CPU can draw 40W for short periods, limited primarily by the laptop’s ability to remove the consequent heat.
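A back-of-the-envelope sketch of why short bursts work (the thermal-capacitance figure is my own round guess, just to show the shape of the math):

```latex
% Net heating rate when draw exceeds what the cooler can remove:
\frac{dT}{dt} = \frac{P_{in} - P_{cool}}{C_{th}}
             = \frac{40\,\mathrm{W} - 5\,\mathrm{W}}{10\,\mathrm{J/K}}
             = 3.5\,\mathrm{K/s}
% From ~45 C idle to a ~100 C throttle point, that allows roughly
% (100 - 45) / 3.5, or about 16 seconds, of full-tilt burst.
```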

My curiosity about the GPUs led me to listen to this episode, https://scaleyourcode.com/interviews/interview/31, where the GPUs are briefly discussed. They have a custom tag engine that runs in CPU RAM; they ported this engine to run in the RAM of a GPU.


Ruby is one of the slowest scripting languages out there. Almost anything is better than Ruby.

For instance, Heroku, which came into existence to support Ruby projects, doesn’t use any Ruby itself. They used to, but it was too much of a waste of hardware.

Over 1 kW per RU? That’s kinda insane. I mean, sure, you could design carefully and probably push enough air through to cool the individual machine, but then you’re talking about 40-60 kW per rack. Powering and cooling that kind of density seems rather… challenging… compared to just having 2U or 4U servers with the same specs.

Yes, but:

Intel says both of these are 140W TDP, despite one of them having 50% more cores (4 → 6) at only marginally lower clock speeds. I am here to tell you that’s… kinda blatantly not true. Or alt-facts, as we’re saying these days, apparently.

This is also why clock speed decreases quite linearly with the number of cores; otherwise the sockets would melt.
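A rough way to see it is the textbook dynamic-power approximation (the shape of the trade-off, not Intel’s actual numbers):

```latex
% Dynamic power scales with core count, switched capacitance,
% supply voltage squared, and clock frequency:
P_{dyn} \approx N_{\mathrm{cores}} \cdot C \cdot V^{2} \cdot f
% Holding P at the 140W label while N goes 4 -> 6 (x1.5) forces
% f (and V, which tracks f) down; otherwise the real draw climbs
% past the label, which is what the measurements above suggest.
```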


Ah! But that’s exactly my point! TDP isn’t “this is how much heat the CPU puts out.” It’s “this is how much heat your machine should be able to handle.”

I admit it’s not so clear for server CPUs. The relationship between base clock speed (another alt-fact!), TDP, and performance is clearer on laptop CPUs, where you have the exact same silicon sold at base clock speeds proportional to the TDP. It can all run at 3-ish GHz, but with a 5W thermal solution you’re only going to run continuously at 1GHz without overheating. With a 35W thermal solution, you can run nonstop at 2.4GHz. That’s what TDP is telling you.
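Those two data points roughly hang together, too; since voltage rises modestly with frequency, sustained power grows somewhere between f² and f³ (a sanity check on the quoted figures, not vendor data):

```latex
\frac{35\,\mathrm{W}}{5\,\mathrm{W}} = 7, \qquad
\left(\frac{2.4\,\mathrm{GHz}}{1\,\mathrm{GHz}}\right)^{2} \approx 5.8, \qquad
\left(\frac{2.4}{1}\right)^{3} \approx 13.8
% 7x power for a 2.4x clock sits between pure f^2 and pure f^3
% scaling, consistent with V rising modestly alongside f.
```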

I suspect that in the server situation they just mark it all 140W for simplicity. There aren’t many off-the-shelf servers that have a 290W thermal solution.

Just kind of an FYI here: compared to embedded systems, your power draw is quite high. I’m working on 2 projects right now. One has a power budget of 5mA at 3.5V (17.5mW). The other has a power budget of 15µA at 3V. That’s 45 MICRO watts. That’s a span of about 7 orders of magnitude from the monstrous beasts down to the itty-bitty little peon.
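For the curious, the arithmetic (taking the dual-1600W rig upthread as the big end):

```latex
15\,\mu\mathrm{A} \times 3\,\mathrm{V} = 45\,\mu\mathrm{W}, \qquad
5\,\mathrm{mA} \times 3.5\,\mathrm{V} = 17.5\,\mathrm{mW}
% Big end versus small end:
\frac{2 \times 1600\,\mathrm{W}}{45 \times 10^{-6}\,\mathrm{W}} \approx 7 \times 10^{7}
% about 7.9 orders of magnitude, so the claimed 7-order span holds.
```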


What a great opportunity for some creative 3D printing, custom fan/CPU shrouds!

Have you looked into piezoelectric fans? The one I am most familiar with is GE’s DCJ: an order of magnitude higher airflow at an order of magnitude lower power consumption, space, and cost. They were developed for keeping LEDs cool in industrial lighting applications. Fantastic technology.


Interesting timeline. Technology has matured over time. It’s not just about the hardware, but the software as well.

The major changes, as we see them, are that hardware processes things faster and keeps getting more compact: servers used to be the size of a room, and now you have processors on your watch. The same goes for software: it once took a fair amount of code just to process text, and now it’s less code and more processing. Technology has truly evolved to the point that there’s no looking back.

Moving forward, we have IoT, AR, machine learning, data science, and other things that will make technology even more powerful.

In related news, I just built 6 more of these with the i7-7700k instead of the i7-6700k.

Note the higher base and turbo clock speeds, but the same TDP of 91W.

Unfortunately, two out of the three new servers that I tested indeed hit 80°C temp warnings under an old (and AVX2-free) version of mprime. This is odd, because I never saw that with the i7-6700k in ~20 builds!

I even tried fancier thermal paste and adjusting the shroud (the shroud was pre-installed more or less correctly, but I cut one edge off to ensure a really tight fit), to no effect: same high-temp warnings after 10 minutes of mprime.

The mobo has a new BIOS to accommodate Kaby Lake versus the older Skylake, so it’s possible the default temp settings changed. But more likely we are just hitting the thermal limits (again…) of what can safely be put in a 1U case.