An Inferno on the Head of a Pin

Today’s computers place more and more heat-generating transistors in an ever-shrinking space. Your CPU power budget, in watts, might go from:


This is a companion discussion topic for the original entry at https://blog.codinghorror.com/an-inferno-on-the-head-of-a-pin/

That’s three orders of magnitude.

10^-1, 10^0, 10^1, 10^2, 10^3: isn’t that 5 orders of magnitude (or a difference of 4 orders of magnitude)? :slight_smile:


I can’t even begin to fathom how you’d lay out a 1U system with 4x300W GPUs. Consumer desktop cards have gone from 1 to 2 to now usually 2.5 slots wide, each, and if there’s a clean way to fit even a 2-slot card into a “normal” 1U case I haven’t seen it. Where on earth do you put 4 of them?!?

You have yet another option: undervolt the CPU. This requires a motherboard that allows adjusting vcore, but since you build 1U servers, most ATX motherboards will fit. You also stated in a previous blog post that you don’t use ECC memory, so it shouldn’t be a problem to use a standard ATX motherboard.

Undervolting can greatly reduce CPU power consumption; just give it a try at home. I have a Xeon X5650 (6 cores) running at stock frequency with a 0.987 V vcore instead of the stock 1.22 V. Considering vcore is squared when calculating dynamic power dissipation, it’s a huge factor.
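
For a rough sense of scale: dynamic power goes roughly as C·V²·f, so at the same frequency that vcore drop alone works out to roughly a 35% cut in dynamic power. A quick back-of-the-envelope check (pure arithmetic, nothing measured):

```python
# Dynamic CPU power scales roughly with C * V^2 * f, so at a fixed
# frequency the savings track (V_new / V_old)^2.
v_stock = 1.22   # stock vcore, volts
v_under = 0.987  # undervolted vcore, volts

scale = (v_under / v_stock) ** 2
print(f"dynamic power scales to {scale:.1%} of stock")  # ~65.5%
print(f"that is roughly a {1 - scale:.1%} reduction")   # ~34.5%
```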


Jeff: Welcome to the world of thermal management! I’m surprised that there aren’t standard kits for doing what you’re doing; keeping ICs cool is standard practice and even the cheap Chromebooks now have the technology for pulling heat away from the processor:

You’ve got to be really careful with “undervolting” or “overclocking”; they’re outside the standard specs for the processors and are the hardware equivalent of relying on undefined behavior in a compiler: technically anything can happen aside from what the processor manufacturer specifies as normal. When undervolting, you are increasing the probability of bit errors.

Nice read; it reminded me of the good old times, back when I was active in the overclocking and extreme-cooling communities.
There was really a lot going on around squeezing the last couple of extra MHz out of processors, and trying to run them cooler and cooler or make the cooling more effective.

There are a lot of ways to get better cooling than with “standard” coolers. No idea what is on the market these days, as I’ve been out too long. Some folks took apart freezers to use the refrigeration system on their CPUs.
The mainstream solutions were heat-pipe coolers and water cooling, as the best ways to get the heat away from the source (CPU/GPU) and then get rid of it where more space is available.

Sure, undervolting can make the system unstable, but it won’t if you do some minimal testing. Running Prime95 for 24 hours and lowering vcore gradually is the key. Most processors today can take a -0.1 V offset without any effect on stability. I’ve done it for years on workstations running 24/7 without any issue.
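
A minimal sketch of that kind of soak test on Linux, assuming stress-ng is installed and the usual /sys/class/thermal interface is present (the vcore offset itself is adjusted in the BIOS between runs; this just drives the load and watches temperatures):

```python
import subprocess
import time

# Burn all cores for 24 hours (stress-ng is assumed installed;
# Prime95/mprime works just as well if you prefer it).
burn = subprocess.Popen(["stress-ng", "--cpu", "0", "--timeout", "24h"])

try:
    while burn.poll() is None:
        # thermal_zone0 is a common, but not universal, location for the package sensor
        with open("/sys/class/thermal/thermal_zone0/temp") as f:
            temp_c = int(f.read().strip()) / 1000
        print(f"package temperature: {temp_c:.1f} C")
        time.sleep(60)
finally:
    burn.terminate()
```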

Here I am sitting at my desk deep in the bowels of AWS, thinking that the population of people who know this stuff and don’t work in the cloud biz is probably declining monotonically.

@timbray I think it’s one of the downsides of the cloud computing “movement”–the knowledge of this stuff is becoming increasingly concentrated. Another interesting side-effect: I know several talented system administration types who are now having trouble finding work unless they want to relocate to work for a cloud-hosting company!

@codinghorror Love this kind of analysis. It’s not something I have any kind of time for, myself, but it’s still fascinating to see. I also think your point about “not every workload benefits from more cores” can’t be stressed enough!


Although it’s usually easy to just throw hardware at a problem, sometimes proper software architecture can reduce the need for it.

For example, I would use an enterprise cache to offload a lot of work to smaller machines for read-only data that isn’t transactionally important.
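
As a rough sketch of that kind of cache-aside offload, assuming a Redis instance and a hypothetical fetch_report_from_db() standing in for the expensive read-only query:

```python
import json

import redis  # assumes the redis-py client and a reachable Redis instance

cache = redis.Redis(host="localhost", port=6379)

def fetch_report_from_db(report_id):
    # placeholder for the expensive read-only query against the primary database
    return {"id": report_id, "rows": []}

def get_report(report_id, ttl_seconds=300):
    """Cache-aside read for data that isn't transactionally important."""
    key = f"report:{report_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: no database work at all
    data = fetch_report_from_db(report_id)   # cache miss: do the expensive read once
    cache.setex(key, ttl_seconds, json.dumps(data))
    return data
```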

I would segregate document-oriented data and large files from tabular SQL data, putting them into different data stores (rather than dumping everything into one data store).

I would clone/transform the data from the live stores into archival/reporting stores that can be analyzed outside the system.

I would not store log files for long (or at all) on the servers; instead, I’d pass them down the line to a log aggregator, which in turn stores the data in another place where it can be analyzed as well.
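
A minimal sketch of shipping logs off the box rather than keeping them locally, using the Python standard library’s SysLogHandler pointed at a hypothetical aggregator host:

```python
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)

# Send records straight to a remote aggregator (hostname is hypothetical);
# nothing needs to be retained on the application server itself.
logger.addHandler(SysLogHandler(address=("logs.example.internal", 514)))

logger.info("order placed for customer 42")
```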

Analysis is done primarily on an enterprise search system such as Elasticsearch, which can scale out with large amounts of data.

Set up an enterprise service bus and focus more on asynchronous, non-blocking processing of events, so the system scales better and all the applications have a single point of contact. If needed, just buy this as a DataPower appliance so it’s less of a headache later on.
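
A sketch of the fire-and-forget style that a bus enables, using RabbitMQ via pika purely as a stand-in for whatever broker or appliance actually sits in the middle:

```python
import json

import pika  # assumes a RabbitMQ broker standing in for the service bus

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)

# The web request publishes an event and returns immediately;
# downstream consumers process it on their own schedule.
channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=json.dumps({"event": "order_placed", "order_id": 12345}),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```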

If you come from a large enterprise like finance or the public sector, operations have been pushed into silos and made ridiculously slow due to concerns about security and a general lack of drive to upgrade skills once people have a job. That in turn causes slower reaction to business needs and to new technologies, which arrive very quickly. Combine that with the fact that these large enterprises are NOT IT shops, and hence do not really have the resources to understand and keep up with the latest IT trends, and offloading the work to cloud services will be more beneficial for them.

It’s not just IT, though. Think of it this way: how many companies would use FedEx or UPS to send their parcels versus having their own fleet of vehicles and delivery men?

The reality is that companies are realizing that IT, especially on the infrastructure side, is becoming more of a utility than something that needs to be managed in-house. What does need to be managed in house is the DevOps portion, and there one would likely drop the notion of physically administering servers in favor of just managing virtual machines; smaller shops could get away with just Docker images, reducing the need for specialized UNIX administrators in favor of more [eventually] dime-a-dozen Docker configurators on the DevOps teams.

I was thinking the same thing.

I have to imagine that they’re very different from desktop GPUs. Perhaps they’re custom-made GPUs with a better form factor for 1U servers.

Oh, I understand the reasons and even agree with most of them, but there are consequences, and one of them is the concentration we’ve both noticed. There are also, as you indirectly point out, relatively few people who actually understand the magic of efficient transport logistics, because there are companies that specialise in it, and generally the people who work for those companies learn about it.

For those of us who need access to such services, this is a good thing–I know how to be a system administrator, but I’d just as soon focus on writing software and being able to just push that software out to some receptacle that’s ready to receive it and run it, and not worry about running that receptacle.

300 W cards, no; that would be nearly exclusive to 3-4U servers. But I have seen 1U servers with four single-slot 150 W Quadro/FirePro cards (PCIe alone allows 150 W to be pulled through the slot): two in the back, two in the front, with a riser board to connect them.

Have you considered a water cooling solution instead of air? Then you could consolidate your radiators for the entire stack.


I always wanted to do watercooling on a 1U.

I ran into a lot of this when I was playing with lots of BOINC projects. Inspired me to build this.


I don’t think many (any?) SuperMicro server motherboard BIOSes implement the ability to undervolt?

If your server will be idle most of the time the low power states on Haswell and beyond are quite good, so undervolting won’t buy you much there. Under load, undervolting would help… but it would also require a lot of stability testing to verify that you are still stable under heavy loads with reduced voltage.

Which would be fine, if cloud prices and performance could get a bit closer to rack-it-yourself. There is more fuss, absolutely, but if you need super high performance, clouds are awful at this currently. (They are great at “spin me up a thousand of X”, if you can afford it, and if your stuff actually scales that way.)

In a 1U server? Has anyone ever done that, and what would it look like? I’m not aware of any water cooling in 1U servers.


Aha @thw0rted I found a video of the Dell C4130 Nick referenced. Here’s the layout.

It’s SUUUUPER long (deep?), so it can indeed accommodate up to four full size GPUs and two CPUs!

Dell put up a bunch of videos about it, if you want to see more: https://www.youtube.com/results?search_query=dell+poweredge+c4130

I can’t get past your premise that your workload is better suited to fewer cores at a higher clock speed. You linked to discourse.org, which is a Rails application. The only piece of the application architecture that would benefit from clock speed is Redis, but that shouldn’t be where your main bottleneck is anyway.

So what point are you trying to make regarding clock speed and your workload?