XFS apparently wasn't the problem. The remaining suspects include the kernel's software RAID driver, LVM2, and the kernel's softlockup detection/reboot code itself. If the kernel is incorrectly detecting a softlockup, then trying to reboot and failing (seems unlikely), that could also explain the symptoms.
Secondhand reports from the colo staff say that, a week ago Sunday, SysRq+T showed the system stuck in software RAID code. However, neither on Friday nor today was I able to get the SysRq hotkeys to work after the crashes. All I got was a message about a detected softlockup, and then one saying it was rebooting in 6 seconds... which it obviously didn't.
For now I've disabled the kernel panic and reboot after a softlockup. Since it wasn't successfully rebooting anyway, there's no sense in having that turned on.
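For anyone who wants to check the same setting on their own box, here's a rough sketch of reading it back. It assumes the kernel exposes the `kernel.softlockup_panic` and `kernel.panic` sysctls under `/proc/sys` (not every kernel version does), and the function names and the `root` parameter are just made up for this illustration. It only reads the values; it doesn't change anything.

```python
from pathlib import Path

# Assumed sysctl names; verify that they exist on the running kernel.
SOFTLOCKUP_PANIC = "kernel/softlockup_panic"  # nonzero = panic on a detected softlockup
PANIC_TIMEOUT = "kernel/panic"  # >0 = seconds before reboot, 0 = never, <0 = at once


def read_sysctl(name, root="/proc/sys"):
    """Return the integer value of a sysctl, or None if the file is absent."""
    try:
        return int((Path(root) / name).read_text().strip())
    except FileNotFoundError:
        return None


def describe_softlockup_policy(root="/proc/sys"):
    """Summarize what the kernel is configured to do on a detected softlockup."""
    panic_on_lockup = read_sysctl(SOFTLOCKUP_PANIC, root)
    panic_timeout = read_sysctl(PANIC_TIMEOUT, root)
    if panic_on_lockup is None:
        return "softlockup_panic not available on this kernel"
    if panic_on_lockup == 0:
        return "softlockup: log a warning only, no panic"
    if not panic_timeout:
        # Covers both a missing kernel.panic file and a value of 0.
        return "softlockup: panic, then halt without rebooting"
    if panic_timeout < 0:
        return "softlockup: panic, then reboot immediately"
    return f"softlockup: panic, then reboot after {panic_timeout}s"


if __name__ == "__main__":
    print(describe_softlockup_policy())
```

With the panic turned off as described above, this should report the warning-only case.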
The attachment upload issues and usercp problems were due to bad permissions on the temporary directories after switching them from XFS to ext3; the permissions on both are now fixed.
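A quick sanity check along these lines would catch that kind of problem after a filesystem switch. This is only a sketch: `check_temp_dir` is a made-up helper, the two checks are my guess at what usually matters for a temp directory rather than the exact fault here, and it needs to run as the same user the web server runs as to mean anything.

```python
import os
import stat


def check_temp_dir(path):
    """Return a list of problems that would make path a poor temp directory."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return [f"{path}: does not exist"]
    if not stat.S_ISDIR(st.st_mode):
        return [f"{path}: not a directory"]

    problems = []
    # Creating files in a directory needs write plus search (execute) permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path}: not writable by the current user")
    # On a world-writable directory, the sticky bit stops users from
    # deleting each other's files.
    if (st.st_mode & stat.S_IWOTH) and not (st.st_mode & stat.S_ISVTX):
        problems.append(f"{path}: world-writable without the sticky bit")
    return problems


if __name__ == "__main__":
    # /tmp is only an example path; substitute the directories in question.
    for problem in check_temp_dir("/tmp"):
        print(problem)
```

An empty list means neither check found anything wrong.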
Originally Posted by peetzakilla
FiringLine needs to move to Mac!
FreeBSD is a possibility. It's a trade-off between the risk of future downtime due to crashes and the certainty of immediate downtime to switch OSes.