Everyone frightened by the title? Okay, it may be a bit exaggerated, but hear me out; this might be an interesting read for you. Have you ever heard about AM5 and DDR5 being unstable, especially with more than one DIMM per channel populated? Most likely yes. This is quite widespread knowledge, and basically everyone has accepted using only one DIMM per channel if you want to get even close to the RAM speed advertised for the memory modules. But have you ever heard of EXPO profiles harming your LUKS devices? Me neither, so here goes the story that almost ruined my Sunday…
Sunday evening, time for some gaming. Gaming on Linux has come very far lately, especially thanks to Valve and Proton. For the first time in many years I built myself a gaming rig, but having abandoned Windows entirely some 15 years ago and exclusively using Linux since, I of course did the same on my gaming rig, too. I’m running Arch Linux btw… The machine itself is quite capable by today’s standards, running an AMD Ryzen 9 5900x, an AMD RX 7900 XTX and 4x16GB of RGB DDR5-6000 RAM by Corsair. All in a nice, custom-watercooled see-through case, displaying its pixels on a 5120×1440 ultrawide monitor with FreeSync. So far, nothing special; an expensive computer, but not special whatsoever.
During my initial installation I enabled EXPO on my ASUS X670E mainboard and ran memtest86+ multiple times to find out which memory configuration is stable and does not produce errors. As already noted, my memory kit was advertised at 6000 MT/s, but the highest stable clock I was able to achieve was 4600 MT/s, and I even stepped down one notch further to 4400 MT/s to be on the safe side… or so I thought.
These memory speeds are quite a shame on their own, especially because 5200 MT/s is by today’s standards (mid-2023) basically considered the default memory clock that should never cause any problems whatsoever. However, gaming comes second for me after other memory-intensive workloads (3D CAD design, image editing, clustering/virtualization, software compilation), and I knew about the problems of running 2 DIMMs per channel beforehand, but capacity was much more important to me. And to be fair: if I am building a nice-looking computer with RGB all the way and spending some money anyway, then I don’t want to look at empty memory slots on the board.
So back to the Sunday I was talking about. I decided to play a round of „ELEX 2“ (great game btw), and for some reason my computer froze. This doesn’t happen to me very often, but AMD’s Linux GPU drivers, while making considerable progress, aren’t necessarily 100% stable all the time. So whatever, crashes do sometimes happen, and I rebooted.
The PC rebooted, UEFI prompt… GRUB prompt… Plymouth LUKS prompt… and wrong password… uhm, wait, what? Typed it again, and again, and again…
I don’t tend to panic in such situations, because it doesn’t help, and it was very unlikely that the LUKS header had become corrupted, as nothing touches it in normal operation. Besides that, I run my OS from the following setup: 2x NVMe in RAID1 (mdadm) -> LUKS -> LVM, which is very fault-tolerant in the first place.
First try: press ESC (drop to the text-based environment), enter the password: no change, still wrong password. Pressing CTRL+C gives you an exceptionally interesting error message:
ERROR: password succeeded but cryptroot creation failed, aborting
The only mention of this particular error message that I found was in the source code of the initramfs hook for LUKS in Arch Linux, where it appears at the end of a lengthy if/else chain as a catch-all condition in one of the substeps after the password has been provided. Not very helpful unfortunately, and more importantly: it does not really verify that LUKS thinks the password is actually correct.
I had started playing right after booting, so I did not perform an upgrade in the meantime: no new kernel, no new anything. It had worked 15 minutes before the crash with the exact same setup.
So I was dropped to the emergency shell and continued from there. Cryptsetup was not able to unlock my volume. The underlying RAID structure had been auto-assembled in early boot and was consistent. A dump of the header looked fine; checksums and secondary checksums looked fine. So no luck with the emergency console. On to a live medium, then. I hadn’t wiped the Arch installer USB stick yet, so I conveniently used that one. Maybe the LUKS userland tools on my system were broken, whereas the live ISO had the exact same environment with which I had set up the disk initially. But… no luck either.
At this point, 2 hours of edge-case Google-fu came into play, until I discovered a single statement that gave me my „this explains it all“ moment:
It could be bad ram. cryptsetup with argon2 is sensitive to ram issues
And believe it or not: this was indeed the problem! I think everyone has been in a situation where you are absolutely sure you entered the password correctly, yet it didn’t work, and on the second try it did. Did you ever expect that you might actually be right about this, that you HAVE typed it correctly, and your memory just decided to flip a bit? I certainly did not. So if in doubt, put your password in a file or variable, and hammer cryptsetup luksOpen --test-passphrase with it. If the results are not the same in 100% of the cases, you might want to de-tune your RAM settings… If the password prompt is this sensitive to RAM errors, I wouldn’t bet on those errors being unable to corrupt your filesystem in the long run…
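For illustration, here is a minimal sketch of such a hammering loop in Python. It is my own rough take under a few assumptions, not the snippet from the linked thread: the helper names (build_test_command, hammer, is_consistent) are made up for this example, the device and key file paths are placeholders you have to supply, and the key file must contain the passphrase without a trailing newline, because cryptsetup reads the whole file as the key.

```python
import subprocess
from collections import Counter


def build_test_command(device, keyfile):
    """Build a cryptsetup call that only checks the passphrase.

    --test-passphrase verifies the key without creating a mapping.
    device and keyfile are placeholders supplied by the caller.
    """
    return ["cryptsetup", "open", "--test-passphrase",
            "--key-file", keyfile, device]


def hammer(cmd, attempts=20):
    """Run cmd repeatedly and count how often each exit code occurs."""
    results = Counter()
    for _ in range(attempts):
        proc = subprocess.run(cmd,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL,
                              check=False)
        results[proc.returncode] += 1
    return results


def is_consistent(results):
    """True if every attempt ended with the same exit code."""
    return len(results) == 1
```

Run as root, something like hammer(build_test_command("/dev/md0", "/root/pw.txt")) returns a tally of exit codes, where both paths are only examples. With a correct passphrase and healthy RAM you would expect nothing but exit code 0; a mix of 0 and 2 (cryptsetup's code for a rejected passphrase) with the very same key file is the non-determinism described above. Every attempt runs the full key derivation, so expect it to take a while.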
PS: some sample snippets, and also the relieving hint, both came from here: https://unix.stackexchange.com/questions/746513/cryptsetup-verification-in-luksopen-is-non-deterministic-when-reading-the-passw
Safe computing, everyone.