AMD and AM5 – Just don’t use EXPO, it might destroy your LUKS devices

Everyone frightened by the title? Okay, it may be a bit exaggerated, but hear me out, this might be an interesting read for you. Have you ever heard about AM5 and DDR5 being unstable, especially with more than one DIMM per channel populated? Most likely yes. This is quite widely known, and basically everyone has accepted using only one DIMM per channel if you want to get even close to the RAM speed advertised for the memory modules. But have you ever heard of EXPO profiles harming your LUKS devices? Me neither, so here goes the story that almost ruined my Sunday…

Sunday evening, time for some gaming. Gaming on Linux has come very far lately, especially thanks to Valve and Proton. For the first time in many years I built myself a gaming rig, but having abandoned Windows entirely some 15 years ago and using Linux exclusively ever since, I of course did this on my gaming rig, too. I'm running Arch Linux btw… The machine itself is quite capable by today's standards, running an AMD Ryzen 9 5900x, an AMD RX 7900 XTX and 4x16GB of RGB DDR5-6000 RAM by Corsair. All in a nice, custom-watercooled see-through case, displaying its pixels on a 5120×1440 ultrawide monitor with FreeSync. So far, nothing special; an expensive computer, but not special whatsoever.

On my initial installation I enabled EXPO on my ASUS X670E mainboard and ran memtest86+ multiple times to find out which memory configuration is stable and does not produce errors or instability. As already noted, my memory kit was advertised with 6000 MT/s, but the highest stable clock I was able to achieve was 4600 MT/s, and I even stepped down one more notch to 4400 MT/s to be on the safe side… or so I thought.

These memory speeds are quite a shame on their own, especially because 5200 MT/s is by today's standards (in mid-2023) basically considered the default memory clock that should never ever cause any problems whatsoever. However, gaming comes second for me after other memory-intensive workloads (3D CAD design, image editing, clustering/virtualization, compiling software), and I knew beforehand about the problems of running 2 DIMMs per channel, but capacity was much more important to me. And to be fair: if I am building a nice-looking computer with RGB all the way and spending some money anyway, then I don't want to look at empty memory slots on the board.

So back to the Sunday I was talking about. I decided to play a round of "ELEX 2" (great game btw), and for some reason my computer froze. This doesn't happen to me very often, but whatever: AMD's Linux GPU drivers are making considerable progress, but aren't necessarily 100% stable all the time. So I rebooted; unfortunately, crashes do sometimes happen.

The PC rebooted, UEFI-Prompt…GRUB-Prompt…Plymouth-LUKS-Prompt…and wrong password…uhm, wait what? Typed again, and again, and again…

I don't tend to panic in such situations, because it doesn't help, and it was very unlikely that the LUKS header had become corrupted, as nothing touches it in normal operation. Besides that, I run my OS from the following setup: 2x NVMe in RAID1 (mdadm) -> LUKS -> LVM, which is very fault-tolerant in the first place.

First try: press ESC (drop to the text-based environment), enter the password: no change, still wrong password. Pressing CTRL+C gives you an exceptionally interesting error message:

The only mention whatsoever that I found of this particular error message was in the source code of the initramfs hook for LUKS in Arch Linux, where it appears at the end of a lengthy if/else chain as a catch-all condition in one of the substeps after providing the password. Not very helpful, unfortunately, and more importantly: it does not really tell you whether LUKS considers the password to be correct.

Before the crash I had only just started playing my video game, so I had not performed an upgrade in the meantime: no new kernel, no new nothing. It had worked 15 minutes before the crash with the exact same setup.

So I was dropped to the emergency shell and continued from there. cryptsetup was not able to unlock my volume. The underlying RAID structure had been auto-assembled in early boot and was consistent. A dump of the header looked fine; checksums and secondary checksums looked fine. So no luck with the emergency console. On to a live medium, then. I hadn't wiped the Arch installer USB stick yet, so I conveniently used that one. Maybe the LUKS userland tooling on my installation had some error, but the live ISO had the exact same environment with which I had assembled the disks and partitions in the first place. But… no luck either.

At this point, 2 hours of esoteric google-fu came into play, until I discovered a single statement that gave me my "this explains it all" moment:

It could be bad ram. cryptsetup with argon2 is sensitive to ram issues

And believe it or not: this was indeed the problem! I think everyone has been in a situation at some point where you are absolutely sure you entered the password correctly, but it didn't work, and on the second try it did. Did you ever suspect that you might actually HAVE entered your password correctly the first time, and your memory just decided to flip a bit? Me neither. So if in doubt, put your password in a file or variable, and hammer luksOpen --test-passphrase with it. If the results are not the same in 100% of the cases, you might want to de-tune your RAM settings… If the password prompt is this sensitive to RAM errors, I wouldn't bet on those errors being totally unable to corrupt your filesystem in the long run…
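The "hammering" idea can be sketched in a few lines of Python. This is only a rough illustration, not the snippet from the linked thread: the function name count_failed_unlocks and its parameters are made up for this example, and it assumes that cryptsetup is installed and that you have sufficient privileges to read the device.

```python
import subprocess
from typing import Callable


def count_failed_unlocks(
    device: str,
    passphrase: bytes,
    attempts: int = 100,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Test the same passphrase `attempts` times; return how often it was rejected.

    Hypothetical helper for illustration. `run` is injectable so the loop
    can be exercised without a real LUKS device.
    """
    # --test-passphrase only verifies the passphrase, nothing gets mapped.
    # "--key-file -" makes cryptsetup read the passphrase from stdin.
    cmd = [
        "cryptsetup", "open", "--type", "luks",
        "--test-passphrase", "--key-file", "-", device,
    ]
    failures = 0
    for _ in range(attempts):
        result = run(cmd, input=passphrase, capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures
```

Called with your real device path and passphrase, a healthy machine should return 0 every single time. Anything else with a passphrase you know to be correct points to the kind of instability described above.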

PS: some sample snippets, and also the relieving hint, both came from here: https://unix.stackexchange.com/questions/746513/cryptsetup-verification-in-luksopen-is-non-deterministic-when-reading-the-passw

Safe computing, everyone.

Edit 2024-06-14:

Meanwhile I dug a few levels deeper into this topic while debugging for my regular day job, and wrote a script for easier debugging, which turned out to be quite useful, so I am sharing it here as well:

This script should run without problems on any Linux as long as dd, pwgen and LUKS (cryptsetup) are installed (limited to Linux only, because LUKS is Linux-exclusive, hence the "L" in the name). In short, it creates a 1GB file and formats it with LUKS. After that it tries to decrypt the volume with the known-working password. With argon2 this check maps huge chunks of RAM (argon2 deliberately trades memory for CPU time), and due to the nature of cryptographic software, any bit flip will result in an error, which this time we are able to observe.
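For readers who want to try the idea themselves, here is a minimal Python sketch of the procedure described above. It is an approximation under stated assumptions, not my original script: the function name luks_ram_check is made up, a sparse file created via truncate stands in for dd, Python's secrets module stands in for pwgen, and depending on your cryptsetup version the calls may need root.

```python
import os
import secrets
import subprocess
import tempfile


def luks_ram_check(rounds=100, size_mib=1024, run=subprocess.run):
    """Format a scratch file as LUKS2/argon2id, then unlock-test it `rounds` times.

    Returns how many rounds rejected the known-good passphrase.
    Hypothetical helper; `run` is injectable for dry runs.
    """
    passphrase = secrets.token_hex(16).encode()  # stand-in for pwgen
    fd, image = tempfile.mkstemp(suffix=".img")
    try:
        with os.fdopen(fd, "wb") as scratch:
            # Sparse scratch file (stand-in for dd).
            scratch.truncate(size_mib * 1024 * 1024)
        # --batch-mode skips the "are you sure" prompt;
        # "--key-file -" reads the passphrase from stdin.
        run(
            ["cryptsetup", "luksFormat", "--batch-mode", "--type", "luks2",
             "--pbkdf", "argon2id", "--key-file", "-", image],
            input=passphrase, check=True,
        )
        test_cmd = ["cryptsetup", "open", "--type", "luks",
                    "--test-passphrase", "--key-file", "-", image]
        failures = 0
        for _ in range(rounds):
            # Each round re-runs the memory-hard argon2 key derivation.
            if run(test_cmd, input=passphrase, capture_output=True).returncode != 0:
                failures += 1
        return failures
    finally:
        os.remove(image)  # always clean up the scratch file
```

Since the passphrase is generated by the function itself and never changes, every non-zero result is a run in which the machine got the key derivation wrong.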

The output will be similar to this:

You should use reasonably high values like 100 or so. This method is way quicker than memtest and is able to find errors that memtest does not (even though these are mostly errors caused by too-high DRAM frequencies, not defective RAM in general). On an X570 mainboard with DDR4 and also fully populated RAM (4 sticks @ 3200 MT/s) I was able to detect an unstable configuration (with XMP enabled) throwing errors in >20% of the runs using the script above, so its use case is not purely academic: it really is able to find instabilities that are not prominent enough to make the system unbootable, but are severe enough to have some kind of real-world impact. I added this script to my Ventoy stick, and from now on will always use it on workstations where one tends to enable one-click overclocking profiles for memory, to ensure their stability.

-zeus
