
"We suspect that GMP's extremely tight loops around MULX make the Zen 5 cores use much more power than specified, making cooling solutions inadequate."

I feel like if this was heat related, the overall CPU temperature should still somewhat slowly creep up, thereby giving everything enough time for thermal throttling. But their discoloration sure looks like a thermal issue, so I wonder why the safety features of the CPU didn't catch this...



I'm guessing the temperature could increase quite fast (milliseconds or less) in heavy-duty areas, especially when going from scalar to dense vector operations.

My best understanding of the AVX-512 'power license' debacle on Intel CPUs was that the processor was actually watching the instruction stream and computing heuristics to lower core frequency before reaching AVX-512 or dense AVX2 instructions. I guessed they knew or worried that even a short large-vector stint would fry stuff...

Apparently voltage and thermal sensors have vastly improved, and the crazy swings in NVIDIA GPU clocks seem to agree with this :-)


Are we talking "slowly" in a relative sense? A silicon die of this size has a thermal mass (guessing) around 10⁻³ J/K but a power dissipation rate over 200W, so it can rise from room temperature to junction temperature limits almost instantly.
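
To put a rough number on "almost instantly", here is a back-of-the-envelope sketch. It takes the guessed 10⁻³ J/K and 200W above at face value, assumes a 25°C start and a 95°C limit (both my assumptions, not measured values), and ignores all heat flowing out into the heatsink, so it is a worst-case lower bound on the time rather than a real thermal model.

```python
def adiabatic_rise_time(heat_capacity_j_per_k, power_w, t_start_c, t_limit_c):
    """Seconds to heat from t_start_c to t_limit_c if no heat leaves the die.

    Uses dT/dt = P / C, i.e. time = C * (T_limit - T_start) / P.
    """
    if power_w <= 0:
        raise ValueError("power must be positive")
    delta_t = t_limit_c - t_start_c
    return heat_capacity_j_per_k * delta_t / power_w


# Figures from the comment above (guesses), plus assumed start/limit temperatures.
rise = adiabatic_rise_time(
    heat_capacity_j_per_k=1e-3,  # guessed thermal mass of the die
    power_w=200.0,               # dissipation quoted in the comment
    t_start_c=25.0,              # assumed room temperature
    t_limit_c=95.0,              # assumed junction temperature limit
)
print(f"{rise * 1e3:.2f} ms")
```

With those inputs the result is a fraction of a millisecond. A larger thermal mass stretches the time proportionally, but it stays short next to how fast a fan can react.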


People without a background in electronics don't appreciate what modern CPUs and GPUs are doing: the amount of current flowing through these devices is just mind blowing. With adequate cooling, a Ryzen 9 9950X is handling somewhere in the neighborhood of 150-200 amps under high load.


I initially scoffed at the 150-200 amps. But I know core voltage is usually in the neighbourhood of 1V so to draw 200W, you really would have to basically be moving 200A of current. That's wild.


Yup. P=IV is really surprising when you get to high power parts at low core voltages. Needless to say, you need lots of transistors and phases on voltage conversion, and you need lots and lots of plane area.

(And... 200A is the average when dissipating 200W. So how high are the switching currents? ;)


AMD's desktop CPUs are still running at a bit more than 1V; 1.3-1.4V is what you'll see at the high end of the clock speed range. But power draw can easily be in the 250–300W range if you turn on the "PBO" automatic overclocking mode, so 200A is not really the upper bound.
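
For anyone who wants to plug in their own numbers, the arithmetic in this subthread is just I = P / V. A minimal sketch follows; the power and voltage pairs are the figures quoted in the comments, and it assumes all package power is drawn at a single core voltage, which is a simplification since a real CPU has several rails.

```python
def core_current_amps(power_w, core_voltage_v):
    """Average current implied by P = I * V, assuming a single supply rail."""
    if core_voltage_v <= 0:
        raise ValueError("voltage must be positive")
    return power_w / core_voltage_v


# (power in W, voltage in V) pairs taken from the comments above.
scenarios = [(200, 1.0), (250, 1.4), (300, 1.3)]
for power_w, voltage_v in scenarios:
    amps = core_current_amps(power_w, voltage_v)
    print(f"{power_w} W at {voltage_v} V -> {amps:.0f} A average")
```

These are averages only; they say nothing about the transient switching currents mentioned above.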


And when all is said and done, you're pushing that many amps across a piece of silicon roughly the size of your thumbnail.


A spot welder, basically.


What's really wild is that, with all the power scaling features, the regulators have to step from zero to hundreds of amps in microseconds with very little overshoot. The power design for these modern systems is demanding.


They said it took months for each CPU to fail. Both systems used the same inadequate heatsink/fan. Then there are also the lower-end motherboards (they are not "top-quality", the brand means nothing) and the minuscule 450W power supply used in the initial configuration, which are confusingly paired with a 16-core CPU and 64/96GB of RAM.

It doesn't strike me as odd that running an extremely power-heavy load for months continuously on such configurations eventually failed.



