I've been playing a bit with BTRFS lately. The file system checksums both metadata and data, so it can offer some guarantees about data integrity. I decided to take a closer look, hoping to see whether it can actually protect the data we keep on our unreliable hardware.
Preparing test data
For data I'll be using a snapshot of the Raspbian Lite root. I mounted the image and packed the root into a tarball so the test is easy to reproduce, and also computed SHA-1 sums of all the files so they can be verified later:
# losetup -f 2017-03-02-raspbian-jessie-lite.img
# losetup -a
/dev/loop0: [2051]:203934014 (/home/user/rpi-image-build/2017-03-02-raspbian-jessie-lite.img)
# kpartx -av /dev/loop0
add map loop0p1 (254:0): 0 129024 linear 7:0 8192
add map loop0p2 (254:1): 0 2584576 linear 7:0 137216
# mkdir rpi_root
# mount /dev/mapper/loop0p2 rpi_root
# (cd rpi_root && tar zcf ../root.tar.gz . && find . -type f -exec sha1sum {} \; > ../sha1sums)
# umount rpi_root
# kpartx -dv /dev/loop0
del devmap : loop0p2
del devmap : loop0p1
# losetup -d /dev/loop0
This leaves us with root.tar.gz and sha1sums files.
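For reference, the manifest-building step above can be sketched in Python. The make_sha1sums helper below is hypothetical (the post itself uses find and sha1sum); it walks a directory tree and writes the same "hash, two spaces, ./path" lines that sha1sum -c expects:

```python
import hashlib
import os

def make_sha1sums(root, out_path):
    """Walk root and write 'HASH  ./relative/path' lines, sha1sum-style.

    out_path should live outside root, or the manifest would hash itself.
    """
    with open(out_path, 'w') as out:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                if not os.path.isfile(full):  # skip sockets, devices, etc.
                    continue
                h = hashlib.sha1()
                with open(full, 'rb') as f:
                    for chunk in iter(lambda: f.read(65536), b''):
                        h.update(chunk)
                rel = './' + os.path.relpath(full, root)
                # two spaces between hash and name, exactly as sha1sum prints
                out.write('%s  %s\n' % (h.hexdigest(), rel))
```

Unlike the find invocation, this visits files in a deterministic (sorted) order, which makes manifests from different runs easier to diff.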
Test challenge
The challenge is to find and, if possible, repair file system corruption. I created a Python script that finds a random non-empty 4K block on the device, overwrites it with random data, and repeats this 100 times. Here's corrupt.py:
#!/usr/bin/python
import os
import random

f = 'rpi-btrfs-test.img'
fd = open(f, 'rb+')
filesize = os.path.getsize(f)
print('filesize=%d' % filesize)
random_data = bytes(bytearray([random.randint(0, 255) for i in range(4096)]))
count = 0
loopcount = 0
while count < 100:
    # pick a random 4K block; blocks that are all zeroes don't count
    n = random.randint(0, filesize // 4096 - 1)
    fd.seek(n * 4096)
    block = fd.read(4096)
    if any(i != 0 for i in bytearray(block)):
        print('corrupting block %d' % n)
        fd.seek(n * 4096)
        fd.write(random_data)
        assert fd.tell() == (n + 1) * 4096
        count += 1
    loopcount += 1
    if loopcount > 10000:  # safety net if there aren't enough non-empty blocks
        break
fd.close()
Baseline (ext4)
To imitate an SD card I'll use a 4GB raw disk image full of zeroes:
dd if=/dev/zero of=rpi-btrfs-test.img bs=1M count=4096
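As an aside, if writing 4GB of actual zeroes is too slow, an equivalent image can be created as a sparse file. This is a sketch with a hypothetical make_sparse_image helper, not what the dd command above does byte for byte, but the file reads back as all zeroes either way:

```python
import os

def make_sparse_image(path, size_bytes):
    """Create a file of the given size without writing any data blocks."""
    with open(path, 'wb') as f:
        # unwritten ranges read back as zeroes and occupy almost no disk space
        f.truncate(size_bytes)
```

Usage would be make_sparse_image('rpi-btrfs-test.img', 4 * 1024 ** 3); note that a sparse image behaves differently from a fully allocated one once the host file system itself runs out of space.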
Create an ext4 file system:
# mkfs.ext4 -O metadata_csum,64bit rpi-btrfs-test.img
mke2fs 1.43.4 (31-Jan-2017)
Discarding device blocks: done
Creating filesystem with 1048576 4k blocks and 262144 inodes
Filesystem UUID: c8c7b1cc-aebc-44a3-bb6d-790fb3b70577
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
Now we create a mount point, mount the file system and put some data on it:
mkdir mnt
mount -o loop rpi-btrfs-test.img mnt
(cd mnt && tar zxf ../root.tar.gz)
Let's look at the file system:
# df -h
/dev/loop0      3.9G  757M  3.0G  21%  /home/user/rpi-image-build/mnt
Let the corruption begin:
umount mnt
python corrupt.py
Let's check whether the file system has errors:
# fsck.ext4 -y -v -f rpi-btrfs-test.img
e2fsck 1.43.4 (31-Jan-2017)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Directory inode 673, block #0, offset 0: directory has no checksum.
Fix? yes

... lots of printouts of error and repair info

Pass 5: Checking group summary information

rpi-btrfs-test.img: ***** FILE SYSTEM WAS MODIFIED *****

       35610 inodes used (13.58%, out of 262144)
          65 non-contiguous files (0.2%)
           7 non-contiguous directories (0.0%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 30289/1
      226530 blocks used (21.60%, out of 1048576)
           0 bad blocks
           1 large file

       27104 regular files
        3038 directories
          54 character device files
          25 block device files
           0 fifos
         346 links
        5380 symbolic links (5233 fast symbolic links)
           0 sockets
-------------
       35946 files
OK, this file system was definitely borked! Let's mount it:
mount -o loop rpi-btrfs-test.img mnt
What does sha1sum think:
# (cd mnt && sha1sum -c ../sha1sums | grep -v 'OK' )
... lots of files with FAILED
./usr/lib/arm-linux-gnueabihf/libicudata.so.52.1: FAILED
sha1sum: WARNING: 1 listed file could not be read
sha1sum: WARNING: 90 computed checksums did NOT match
Whoops: sha1sum, an ordinary userspace program, was allowed to read the corrupted data without any error. Uncool.
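The check that sha1sum -c performs can be sketched in Python. The verify_sha1sums helper below is hypothetical (the real tool also handles escaping and binary-mode markers); the interesting part is the except branch, which is where BTRFS will later surface I/O errors instead of handing back bad data:

```python
import hashlib
import os

def verify_sha1sums(root, sums_path):
    """Return (mismatched, unreadable) path lists, like sha1sum -c output."""
    failed, unreadable = [], []
    with open(sums_path) as sums:
        for line in sums:
            expected, path = line.rstrip('\n').split('  ', 1)
            h = hashlib.sha1()
            try:
                with open(os.path.join(root, path), 'rb') as f:
                    for chunk in iter(lambda: f.read(65536), b''):
                        h.update(chunk)
            except (IOError, OSError):
                # e.g. EIO when the file system refuses to return bad blocks
                unreadable.append(path)
                continue
            if h.hexdigest() != expected:
                failed.append(path)
    return failed, unreadable
```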
Btrfs Single test
Let's create a BTRFS file system on the image:
# mkfs.btrfs --mixed rpi-btrfs-test.img
btrfs-progs v4.7.3
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:
Node size:          4096
Sector size:        4096
Filesystem size:    4.00GiB
Block group profiles:
  Data+Metadata:    single            8.00MiB
  System:           single            4.00MiB
SSD detected:       no
Incompat features:  mixed-bg, extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1     4.00GiB  rpi-btrfs-test.img
The --mixed switch is needed because the file system is smaller than 16GB, and BTRFS is known to have problems with small devices when it keeps separate block groups (chunks) for data and metadata, which it does by default. This flag makes the file system intermingle data and metadata in the same chunks, which supposedly helps avoid the notorious out-of-space errors.
Now we mount the file system and put some data on it:
mount -o loop rpi-btrfs-test.img mnt
(cd mnt && tar zxf ../root.tar.gz)
Let's look at the file system:
# df -h
/dev/loop0      4.0G  737M  3.3G  18%  /home/user/rpi-image-build/mnt
Let the corruption begin:
umount mnt
python corrupt.py
Let's check the file system:
# mount -o loop rpi-btrfs-test.img mnt
# btrfs scrub start mnt/
scrub started on mnt/, fsid a6301050-0d88-4d86-948d-03380b9434d4 (pid=27478)
# btrfs scrub status mnt/
scrub status for a6301050-0d88-4d86-948d-03380b9434d4
        scrub started at Sun Apr 30 10:17:12 2017 and was aborted after 00:00:00
        total bytes scrubbed: 272.49MiB with 37 errors
        error details: csum=37
        corrected errors: 0, uncorrectable errors: 37, unverified errors: 0
Whoops, there goes data: 37 uncorrectable errors.
What does sha1sum say:
# (cd mnt && sha1sum -c ../sha1sums | grep -v 'OK' )
sha1sum: ./usr/share/perl5/Algorithm/Diff.pm: Input/output error
./usr/share/perl5/Algorithm/Diff.pm: FAILED open or read
... lots more of terminal output similar to above
sha1sum: WARNING: 99 listed files could not be read
Data was lost, but at least the OS refused to let garbage propagate: reading a corrupted file fails with an I/O error instead of silently returning bad data.
Btrfs Dup test
BTRFS can be told to keep redundant copies of data by using the DUP profile. With DUP the file system stores a duplicate copy of every block on the same device, so if the first copy goes bad the second can be used to repair it.
Defining the profile is most easily done when creating the file system:
# mkfs.btrfs --data dup --metadata dup --mixed rpi-btrfs-test.img
btrfs-progs v4.7.3
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:
Node size:          4096
Sector size:        4096
Filesystem size:    4.00GiB
Block group profiles:
  Data+Metadata:    DUP             204.75MiB
  System:           DUP               8.00MiB
SSD detected:       no
Incompat features:  mixed-bg, extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1     4.00GiB  rpi-btrfs-test.img
The DUP entries in the output confirm that the file system will store two copies of everything on the same device.
Again we put some data on it:
mount -o loop rpi-btrfs-test.img mnt
(cd mnt && tar zxf ../root.tar.gz)
Let's look at the file system:
# df -h | grep mnt
/dev/loop0      2.0G  737M  1.3G  36%  /home/user/rpi-image-build/mnt
The file system appears to be only 2GB in size, because two copies of everything have to fit on the underlying 4GB device.
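The halving is simple arithmetic: with DUP every byte is written twice, so a 4GiB device can hold at most 2GiB of unique data (ignoring metadata overhead):

```python
device_bytes = 4 * 1024 ** 3   # the 4 GiB image
copies = 2                     # the DUP profile stores everything twice
usable = device_bytes // copies
print(usable / 1024 ** 3)      # 2.0 GiB, matching what df reports
```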
Again we corrupt the file system:
umount mnt
python corrupt.py
Let's check the file system:
# mount -o loop rpi-btrfs-test.img mnt
# btrfs scrub start mnt/
scrub started on mnt/, fsid 12ffd7a2-b9c4-46ca-b16a-cd07d94d6854 (pid=27863)
# btrfs scrub status mnt/
scrub status for 12ffd7a2-b9c4-46ca-b16a-cd07d94d6854
        scrub started at Sun Apr 30 10:27:31 2017 and finished after 00:00:00
        total bytes scrubbed: 1.41GiB with 99 errors
        error details: csum=99
        corrected errors: 99, uncorrectable errors: 0, unverified errors: 0
Hey, corrected errors! Looks like this could actually work. Checking the sha1sums:
# (cd mnt && sha1sum -c ../sha1sums | grep -v 'OK' )
No output means everything checks out. Next step: see how this holds up on real hardware.