Intel's CPUs Are Failing, ft. Wendell of Level1 Techs

recursive_recursion [they/them]@programming.dev · 4 months ago

Intel's CPUs Are Failing, ft. Wendell of Level1 Techs

schizo@forum.uncomfortable.business · 4 months ago

The last generation has been a total mess for both Intel and AMD.

AMD had motherboards frying CPUs, crazy stupid POST issues due to DDR5 memory training (and my personal build fails to POST like 25% of the time due to this exact same stupid shit), and just generally a less-than-totally-reliable experience compared to previous gens.

Intel has much the same set of problems on their 13th/14th gen stuff: dead chips, memory training issues, instability.

Wonder if it’s just a fluke that both x86 vendors are having a shitty generation at the same time, or if something else is at play.

iopq@lemmy.world · 4 months ago

Because they are pushing their chips even harder. AMD literally pegs them at the maximum temperature these days. It’s basically factory overclocking for both companies. Of course it’s going to run into issues: voltage + temperature fries chips.

bamboo@lemm.ee · 4 months ago

Factory overclocking is a marketing term. Overclocking means running a processor above its specified speed, but if it intentionally ships that way from the factory it is by definition operating within specification.

Melonpoly@lemmy.world · 4 months ago

Fair point, though factory overclocking has been a thing for years with base and boost speeds on Intel and AMD CPUs. I guess they’re just pushing them a little too much.

iopq@lemmy.world · 4 months ago

Sure, but the spec is not in line with what the silicon can take, leading to degradation and stability issues

Buffalox@lemmy.world · 4 months ago

Memory training:
https://www.crucial.com/support/articles-faq-memory/ddr5-memory-training

leave the system powered to complete this process, which in some instances has been seen to take up to 15 minutes.

Have you tried this?

If you don’t let it finish, the system will continue to POST with unstable values.

Victor@lemmy.world · 4 months ago

During this process the system firmware is configuring itself for the newly installed memory. LEDs on the motherboard or computer may or may not be active during this process. On-screen symptoms of this may be a black screen or the system pausing on a manufacturer splash screen.

If this is happening, just leave the system powered to complete this process, which in some instances has been seen to take up to 15 minutes. If this is successful the system will either begin operating normally after the elapsed time, or may require a reboot but will work normally once this is done.

The UX for this seems to be absolute shit. The system seems to hang, and give no indication of something going on? And in the end, the system may need a reboot to complete the process? It better give some indication when it’s complete then, or else.

Buffalox@lemmy.world · 4 months ago

The UX for this seems to be absolute shit.

Absolutely, this is decidedly user hostile design.

FrozenHandle@lemmy.frozeninferno.xyz · 4 months ago

It’s just the easiest way to do this. Memory training is a very early step in the boot process. Firmware only has the CPU cache available as memory, and most hardware in the system isn’t initialized yet. Most of this isn’t even done by the UEFI firmware itself, but by calling a binary blob provided by the CPU manufacturer; for Intel it is called FSP, and for AMD I believe it is AGESA. I’d have to check, but I believe that at the point memory training is running, the PCIe bus has not even been brought up and scanned, so video output in this phase would require extensive reengineering of the early boot process by the CPU manufacturer, firmware vendors, and the board manufacturer. PCIe has DMA, so making that work without memory might be a challenge. There are three easy-to-implement solutions though: POST codes if your mainboard has a display for them, serial output if the board has a serial port (though this needs another device to read the messages), and, cheapest of all, a flashing LED on the board labeled “memory training in progress”.

Buffalox@lemmy.world · 4 months ago

Flashing LED would be great IMO. And a HUGE improvement.

Victor@lemmy.world · 4 months ago

Not to mention hearing about it through word of mouth… Just 🤦‍♂️

Tramort@programming.dev · 4 months ago

Holy crap. Never heard of this. Thanks!

yggstyle@lemmy.world · 4 months ago

My biggest complaint is that there should be a visual indication of this process. Many users are utterly unaware it is going on.

EleventhHour@lemmy.world · 4 months ago

Maybe x86/x64 has reached the end of its development lifecycle, and both companies are at the point where they simply can’t squeeze any more out of it, so every trick they try results in these abnormalities?

I dunno.

Incogni@lemmy.world · edit-2 4 months ago

Regarding the memory training: have you double-checked how much RAM your CPU actually supports, and at what frequencies? For example, even the 7950X3D officially supports only DDR5-3600 when you put in more than 2 sticks of RAM, leading to long memory training, failures to POST, or instability if you enable any form of overclocking in that scenario. I had that problem before, and switching from 4 sticks to 2 fixed everything. Just in case this might be your issue as well.
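
A rough way to sanity-check a build against that kind of limit is sketched below. The table is an assumption for illustration only: the four-DIMM DDR5-3600 figure is the one quoted in the comment above, the two-DIMM DDR5-5200 figure is recalled from AMD's published spec and should be verified, and the names `SUPPORTED_MTS` and `within_spec` are made up for this example.

```python
# Hypothetical lookup of the officially rated DDR5 speed (MT/s) by number
# of populated DIMMs. These values are illustrative assumptions; check the
# CPU vendor's spec page for your exact model before relying on them.
SUPPORTED_MTS = {
    2: 5200,  # assumed rated speed with two DIMMs
    4: 3600,  # the four-DIMM figure quoted in the comment above
}


def within_spec(dimm_count: int, configured_mts: int) -> bool:
    """Return True if the configured speed is at or below the rated speed."""
    if dimm_count not in SUPPORTED_MTS:
        raise ValueError(f"no data for {dimm_count} DIMMs")
    return configured_mts <= SUPPORTED_MTS[dimm_count]


if __name__ == "__main__":
    # Two sticks at stock 4800 MT/s versus four sticks with a 6000 MT/s profile.
    print(within_spec(2, 4800))
    print(within_spec(4, 6000))
```

Under these assumed numbers, a two-stick kit at 4800 MT/s sits inside the rated range, while four sticks running a 6000 MT/s profile do not.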

schizo@forum.uncomfortable.business · 4 months ago

It’s a pair of 16GB 6000MT/s sticks that I just run at stock 4800MT/s, primarily because the BIOS fails to POST every 3rd or so time, shits itself, and resets to defaults. I’ve quit fucking with it because, frankly, it’s fast enough and going into the BIOS requires a 2nd reboot and memory retrain, which will fail 50% of the time, and lead to the BIOS resetting itself, which leads to needing to reconfigure it which…

When the system is up, it’s perfectly stable, and stays fine through sleep states and whatever else until I have to reboot it for whatever reason (updates, mostly).

But honestly, if the memory controller can’t handle dual-channel 4800MT/s RAM, then it’s really really fucked, because that’s the bare minimum in terms of support.

I’d also add I have 3 mobile AMD-based devices with DDR5, none of which exhibit ANY of this nonsense. Makes me think their desktop platform may well be legitimately defective, given how many people have this issue, and how it doesn’t seem to be universal across even their own product stacks.

(And, yes, two of the mobile devices have removable RAM, so it’s not some soldered vs DIMM thing)

olympicyes@lemmy.world · 4 months ago

Look at the heat sinks on Gen 5 SSDs. To me the marginal speed benefit of the platform introduces a lot of problems you have to deal with, like heat. I would’ve preferred they just focus on bandwidth to allow users to get the same performance as Gen 4 with half the PCIe lanes.

MightyCuriosity@sh.itjust.works · 4 months ago

To be fair I joined a couple of months after the release but I have 0 of these issues. Maybe time to update your BIOS?

bruhduh@lemmy.world · 4 months ago

Intel’s problem is that they keep pushing the extensions race, while AMD proved with their Ryzen series that if you keep your instruction set to a minimum, your CPUs will be energy efficient. Even ARM proved this by pushing extensions too far, like Intel, and getting overheating chips.

sugar_in_your_tea@sh.itjust.works · 4 months ago

The overhead of additional instructions isn’t the issue; they often translate those instructions into a smaller set of actual operations. It’s not like they have a special circuit for every instruction; a lot of instructions translate to a pipeline of multiple, modular circuits.

The actual silicon will look more like ARM despite having a very large difference in instruction set sizes.
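
As a toy illustration of that point, here is a minimal sketch of a decoder that maps a larger instruction set onto a small set of shared operations. Every instruction and micro-op name here is invented for the example and does not correspond to any real CPU's internal encoding.

```python
# Toy model: many instructions decode into a handful of shared micro-ops.
# All names below are hypothetical and exist only for illustration.
MICRO_OPS = {"load", "add", "mul", "store"}

DECODE_TABLE = {
    "ADD_REG": ["add"],
    "ADD_MEM": ["load", "add"],               # memory operand needs a load first
    "ADD_TO_MEM": ["load", "add", "store"],   # read-modify-write
    "MUL_ADD_MEM": ["load", "mul", "add"],
}


def decode(program):
    """Translate a list of instructions into a flat list of micro-ops."""
    uops = []
    for instr in program:
        uops.extend(DECODE_TABLE[instr])
    return uops


if __name__ == "__main__":
    uops = decode(["ADD_MEM", "MUL_ADD_MEM", "ADD_TO_MEM"])
    print(uops)
    # Every emitted micro-op comes from the same small set of units.
    print(set(uops) <= MICRO_OPS)
```

Adding a fifth or fiftieth entry to `DECODE_TABLE` grows the decoder, not the set of execution units, which is roughly the argument being made above.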

bruhduh@lemmy.world · 4 months ago

Then why is AMD more efficient than Intel and ARM nowadays?

sugar_in_your_tea@sh.itjust.works · 4 months ago

That depends on what you mean, but here are a few reasonable explanations:

Intel’s chips are still on their Intel 7 process (similar to TSMC’s 7nm process), whereas AMD is using TSMC’s 4nm process, so AMD’s CPUs are 2 nodes ahead; smaller process generally means more transistors in the same area, as well as lower power usage per clock
AMD’s chiplet architecture makes it easier for them to move the CPU bits to a smaller node, and the IO bits can stay on a cheaper node (e.g. AMD uses 4nm for the cores, 6nm for the IO die); this increases yields and dramatically reduces costs, so AMD can invest more in architectural improvements
ARM prioritizes battery life over performance, so performance per watt won’t be great at the high end, but it’ll probably win at the low end; they also don’t make their own chips (just designs), so comparing process nodes is meaningless
AMD focuses on different aspects of computing than either Intel or ARM, so perhaps they’ve just done a better job optimizing for what you care about

Anyway, that’s my take.

TheGrandNagus@lemmy.world · edit-2 4 months ago

And for AMD’s 3D v-cache chips, there’s an enormous energy benefit, as taking stuff from the (much larger than usual) cache is far more energy efficient than constantly going back and forwards to RAM.
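
A back-of-the-envelope way to see why a bigger cache helps is to model average energy per memory access as a weighted mix of cache hits and RAM accesses. The sketch below does that; the energy figures are placeholders picked for illustration, not measurements of any real chip, and `avg_energy_per_access` is a made-up helper name.

```python
def avg_energy_per_access(hit_rate: float, cache_nj: float, dram_nj: float) -> float:
    """Average energy per access (nJ) for a given cache hit rate.

    Simplified model: a hit costs cache_nj, a miss costs dram_nj.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_nj + (1.0 - hit_rate) * dram_nj


if __name__ == "__main__":
    # Placeholder energies (assumptions, not measured values).
    CACHE_NJ, DRAM_NJ = 1.0, 20.0
    for rate in (0.80, 0.95):
        print(rate, avg_energy_per_access(rate, CACHE_NJ, DRAM_NJ))
```

In this simplified model, any increase in hit rate from a larger cache lowers the average energy per access, as long as a RAM access costs more than a cache hit.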

bruhduh@lemmy.world · 4 months ago

Thank you for the detailed explanation.

sauce@lemmy.dbzer0.com · 4 months ago

Correction: Meteor Lake’s (Intel 14th gen) CPU tile is on the Intel 4 process (though admittedly that’s a 7nm EUV process). And they’ve also moved to a chiplet design (CPU, GPU and IO are on 3 different processes).

Dyf_Tfh@lemmy.sdf.org · 4 months ago

This isn’t true anymore: Intel dropped AVX-512 when they moved to a big+small core design, while AMD actually implemented it with Zen 4.
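
If you want to check what your own chip reports, a minimal sketch for Linux is below. It assumes an x86 machine where `/proc/cpuinfo` has a `flags` line and where AVX-512 Foundation shows up as `avx512f`; the helper names `cpu_flags` and `has_avx512` are made up for this example.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no flags line found (e.g. non-x86 or non-Linux)


def has_avx512(flags) -> bool:
    """True if the AVX-512 Foundation flag is present."""
    return "avx512f" in flags


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = cpu_flags(f.read())
    except OSError:
        flags = set()  # file not available on this system
    print("AVX-512 foundation:", has_avx512(flags))
```

Only the first `flags` line is read, which is enough on chips where every core reports the same features.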

Holzkohlen@feddit.de · 4 months ago

Does AMD just keep winning or what? I just don’t want CPUs with a 500 watt TDP.

Nighed@sffa.community · 4 months ago

Both Intel and AMD are running the same instruction set though, are they not? (Cross-licensing x86/x64)

bruhduh@lemmy.world · edit-2 4 months ago

Look up the GCC x86 options (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html): Intel has an instruction set twice as big, and with ARM expanding its instructions too, the RISC architecture isn’t saving it from overheating now, because of the aforementioned, now bloated, instruction set.