We recently talked about booting Linux from Really Big hard drives using GPT and a special boot partition. We thought we’d step back a bit and talk about why this is necessary, and how EFI bootloading differs from the classic BIOS boot.

The case for GPT

As we mentioned previously, you need to use GPT on disks larger than 2TB in size. The limit arises from the 32-bit sector addresses used in the MBR partition table: with 512-byte sectors, 2^32 sectors comes to exactly 2TiB.

In short, GPT lets you create partitions larger than 2TB. MBR partitioning can’t do that. If you’re using a hardware RAID card you can carve out multiple less-than-2TB virtual disks, stick to MBR partitioning on each, and then glue them back together with LVM.

It’s not ideal, but it works well. However, you’re out of luck if you don’t have a hardware RAID card to abstract away the details of your 3TB drives.

How GPT affects booting

MBR partition tables have a fixed format that defines partitions using the CHS addressing scheme, and later using LBA. Because each partition entry stores its starting sector and its length in 32-bit fields, the MBR can’t describe partitions beyond the 2TB mark on a disk with 512-byte sectors – the numbers are too big to fit!

The GUID Partition Table (GPT) format was created to get around those limits. It uses 64-bit sector addresses, so with 512-byte sectors it can handle disks up to 8 ZiB (about 9.4 zettabytes) in size – over nine million million gigabytes.
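To make the arithmetic concrete, here is a small sketch of where the two limits come from. It assumes traditional 512-byte logical sectors; the function name is ours, purely for illustration.

```python
SECTOR_SIZE = 512  # bytes; assumes traditional 512-byte logical sectors


def max_addressable_bytes(lba_bits: int, sector_size: int = SECTOR_SIZE) -> int:
    """Largest capacity addressable with a sector-address field of the given width."""
    return (2 ** lba_bits) * sector_size


mbr_limit = max_addressable_bytes(32)  # MBR entries use 32-bit sector fields
gpt_limit = max_addressable_bytes(64)  # GPT uses 64-bit sector fields

print(mbr_limit // 2**40, "TiB")  # prints "2 TiB"
print(gpt_limit // 2**70, "ZiB")  # prints "8 ZiB"
```

Drives with 4096-byte logical sectors move both limits up by a factor of eight, which is why the sector size is left as a parameter.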

Okay, you say, we’ll use GPT and everything is fine and dandy. You’d be right, but trouble arises when we want to actually boot the system. GPT was intended for use with EFI, a smart replacement for the classic BIOS that can read GPT. The usual bootloader embedded in the MBR isn’t smart enough to do that.

Some documentation out there suggests that you need EFI hardware to boot GPT disks. That’s simply not true; we just need some way to meet in the middle.

We need a smarter bootloader

The bootloader, GRUB in our case, is a bridge between the BIOS and the OS. The BIOS locates the bootloader and hands over control; the bootloader then locates the OS and begins loading it into memory.

We can satisfy the BIOS by putting a valid partition table and bootloader in the first sector of the drive. This partition table (the “protective MBR”) isn’t used to find anything; it just defines a single partition of type 0xEE spanning the entire disk, to ward off non-GPT-aware utilities that would otherwise see the disk as empty.
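As an illustration of what that first sector looks like, here is a minimal sketch that assembles one in memory. It is not how GRUB or any partitioning tool builds it; the function name and the 3TB example size are our own assumptions, and the boot code area is simply left as zeros.

```python
import struct

SECTOR_SIZE = 512
BOOT_CODE_SIZE = 446        # room for executable code at the start of the sector
GPT_PROTECTIVE_TYPE = 0xEE  # partition type that marks the disk as GPT


def build_protective_mbr(total_sectors: int, boot_code: bytes = b"") -> bytes:
    """Return a 512-byte sector holding one disk-spanning protective entry."""
    if len(boot_code) > BOOT_CODE_SIZE:
        raise ValueError("boot code does not fit in 446 bytes")
    # The 32-bit length field cannot describe a big disk, so it is clamped.
    size = min(total_sectors - 1, 0xFFFFFFFF)
    entry = struct.pack(
        "<B3sB3sII",
        0x00,                 # status: not marked bootable
        b"\x00\x02\x00",      # CHS address of the first sector after the MBR
        GPT_PROTECTIVE_TYPE,
        b"\xff\xff\xff",      # CHS end address, maxed out
        1,                    # starting LBA: right after the MBR itself
        size,                 # length in sectors
    )
    table = entry + bytes(16 * 3)  # the other three entries stay empty
    return boot_code.ljust(BOOT_CODE_SIZE, b"\x00") + table + b"\x55\xaa"


# Hypothetical 3TB drive with 512-byte sectors.
mbr = build_protective_mbr(3 * 10**12 // SECTOR_SIZE)
print(len(mbr), hex(mbr[BOOT_CODE_SIZE + 4]))  # prints "512 0xee"
```

On a disk this large the length field hits its 32-bit ceiling, which is exactly the limitation the real partition table has to work around.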

Now we define the real partition table, the GPT. It’s situated just after the first sector and contains the entries we care about, like a stonking great 30TB partition on our newest systems. We put the OS and all our data into GPT partitions; now GRUB just needs to be able to find them.
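For the curious, the GPT header that sits in the second sector can be decoded with a few lines of Python. This is a sketch under stated assumptions: the field layout and the CRC32 check follow the published header format as we understand it, while `build_demo_header` is a hypothetical helper that fabricates a header in memory so the example runs without touching a real disk.

```python
import struct
import zlib
from typing import NamedTuple

GPT_HEADER_FORMAT = "<8sIIIIQQQQ16sQIII"
GPT_HEADER_SIZE = struct.calcsize(GPT_HEADER_FORMAT)  # 92 bytes


class GptHeader(NamedTuple):
    signature: bytes
    revision: int
    header_size: int
    header_crc32: int
    reserved: int
    current_lba: int
    backup_lba: int
    first_usable_lba: int
    last_usable_lba: int
    disk_guid: bytes
    entries_lba: int
    num_entries: int
    entry_size: int
    entries_crc32: int


def parse_gpt_header(sector: bytes) -> GptHeader:
    """Decode and sanity-check the GPT header found at LBA 1."""
    header = GptHeader(*struct.unpack_from(GPT_HEADER_FORMAT, sector))
    if header.signature != b"EFI PART":
        raise ValueError("not a GPT header")
    # The checksum covers the header with its own CRC field zeroed out.
    raw = bytearray(sector[:header.header_size])
    raw[16:20] = bytes(4)
    if zlib.crc32(bytes(raw)) != header.header_crc32:
        raise ValueError("GPT header checksum mismatch")
    return header


def build_demo_header(total_sectors: int) -> bytes:
    """Hypothetical helper: fabricate a plausible header for demonstration."""
    raw = bytearray(struct.pack(
        GPT_HEADER_FORMAT,
        b"EFI PART", 0x00010000, GPT_HEADER_SIZE, 0, 0,
        1, total_sectors - 1,    # this header, and its backup at the disk's end
        34, total_sectors - 34,  # first and last sectors usable by partitions
        bytes(16),               # disk GUID (all zeros in this demo)
        2, 128, 128, 0,          # entry array: start LBA, count, entry size, CRC
    ))
    struct.pack_into("<I", raw, 16, zlib.crc32(bytes(raw)))
    return bytes(raw).ljust(512, b"\x00")


header = parse_gpt_header(build_demo_header(total_sectors=1_000_000))
print(header.first_usable_lba, header.num_entries * header.entry_size)
```

To inspect a real disk you would read 512 bytes starting at byte offset 512 of the block device and pass them to `parse_gpt_header`; the device path and the permissions needed depend on your system.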

The GRUB bootloader code goes into the MBR as usual, but it needs to be able to find and read the GPT in order to boot the OS. This is new functionality in GRUB2 that wasn’t around in “Legacy GRUB”.

Difficulties in finding the OS

The MBR only has room for 446 bytes of executable code, which isn’t much. GRUB puts its “stage 1” code there, which is sufficient to load another chunk of code on the disk and start running it, but not enough to prepare the system and load the whole OS.

Stage 2 is GRUB proper; it has all the smarts necessary to prepare the system for a real OS (read up on the Multiboot Specification if you’re curious). Stage 1 can load stage 2 directly, but more often it goes via “stage 1.5”.

This is because stage 2 usually lives on a filesystem, but stage 1 only knows about numbered disk blocks. Stage 1.5 knows how to read filesystems, so it can locate stage 2 if it happens to move. As long as stage 1.5 stays put, GRUB will eventually locate stage 2 and be able to load the OS.

How to stash stage 1.5 safely

On MBR systems, GRUB’s stage 1.5 usually lives in the gap between the MBR and the first defined partition. This gap exists because the first partition has historically been defined to start after the first track, which is 63 sectors (31.5 KiB) in length. It’s something of a gentlemen’s agreement that this gap is never touched.

No such gap exists when using GPT. The partition entry array is at least 16 KiB in size, which when added to the protective MBR (512B) and GPT header (512B) means the first partition can start as early as sector 34, immediately following the GPT.
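The sector counts in the last two paragraphs can be checked with a little arithmetic. This sketch assumes 512-byte sectors and the minimum-size GPT (128 entries of 128 bytes each); the variable names are ours.

```python
SECTOR_SIZE = 512  # bytes; assumes 512-byte logical sectors

# Classic MBR layout: the first partition starts at sector 63, so sectors
# 1 to 62 (everything after the MBR in the first track) are unclaimed.
MBR_FIRST_PARTITION_LBA = 63
mbr_gap_sectors = MBR_FIRST_PARTITION_LBA - 1
mbr_gap_bytes = mbr_gap_sectors * SECTOR_SIZE

# GPT layout: protective MBR, GPT header, then the partition entry array.
PROTECTIVE_MBR_SECTORS = 1
GPT_HEADER_SECTORS = 1
entry_array_bytes = 128 * 128  # 16 KiB minimum
entry_array_sectors = entry_array_bytes // SECTOR_SIZE
first_usable_lba = PROTECTIVE_MBR_SECTORS + GPT_HEADER_SECTORS + entry_array_sectors

print(mbr_gap_sectors, mbr_gap_bytes // 1024)  # prints "62 31"
print(entry_array_sectors, first_usable_lba)   # prints "32 34"
```

So an MBR disk leaves 31 KiB of unclaimed space for stage 1.5, while a GPT disk can have its first partition butted right up against the table.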

The solution is to define a proper home for stuff like stage 1.5 – it’s called the BIOS Boot Partition (BBP), a GPT partition type explicitly designed for such usage. Arbitrary data can be dumped in the BBP and won’t be touched because it’s a properly defined partition.

Now stage 1 can point to stage 1.5, confident that it won’t be touched. On our systems we’ve opted to place the BBP at sector 2048, which aligns it to a 1-MiB boundary and also leaves room for the GPT to be extended in future if need be.
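A quick way to sanity-check a starting sector against an alignment boundary is sketched below. The helper name is hypothetical and 512-byte sectors are assumed.

```python
SECTOR_SIZE = 512  # bytes; assumes 512-byte logical sectors
MIB = 1024 * 1024


def is_aligned(start_lba: int, boundary: int = MIB, sector_size: int = SECTOR_SIZE) -> bool:
    """True if the byte offset of start_lba is a multiple of boundary."""
    return (start_lba * sector_size) % boundary == 0


bbp_start = 2048          # where we place the BIOS Boot Partition
earliest_gpt_start = 34   # first sector after a minimum-size GPT

print(is_aligned(bbp_start))           # prints "True"
print(is_aligned(earliest_gpt_start))  # prints "False"
print(bbp_start - earliest_gpt_start)  # prints "2014"
```

Those 2014 spare sectors between the end of the minimum-size GPT and the BBP are the headroom that would let the table grow later.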

In future we expect that proper EFI hardware will become commonplace and these problems will disappear. For now, this is the solution, and it’s pretty elegant and clean.