Single device data redundancy with BTRFS

30. Apr. 2017 ob 11:09 pod kategorijami: btrfs filesystems linux

I've been playing a bit with BTRFS lately. The file system has checksums of both metadata and data so it can offer some guarantees about data integrity and I decided to take a closer look at it hoping I could see if it can keep any of the data that we people try keep on our unstable hardware.

Preparing test data

For data I'll be using a snapshot of raspbian lite root. I mounted the image and packed the root into a tar to reproduce test easily and also compute sha1 sums of all the files to be able to verify them:

# losetup -f 2017-03-02-raspbian-jessie-lite.img
# losetup -a
/dev/loop0: [2051]:203934014 (/home/user/rpi-image-build/2017-03-02-raspbian-jessie-lite.img)
# kpartx -av /dev/loop0
add map loop0p1 (254:0): 0 129024 linear 7:0 8192
add map loop0p2 (254:1): 0 2584576 linear 7:0 137216
# mkdir rpi_root
# mount /dev/mapper/loop0p2 rpi_root
# (cd rpi_root && tar zcf ../root.tar.gz . && find . -type f -exec sha1sum {} \; > ../sha1sums)
# umount rpi_root
# kpartx -dv /dev/loop0
del devmap : loop0p2
del devmap : loop0p1
# losetup -d /dev/loop0

This leaves us with root.tar.gz and sha1sums files.

Test challenge

The challenge is to find and repair (if possible) file system corruption. I created a python script that finds a random non-empty 4K block on device and overwrites it with random data and repeats this 100 times. Here's the corrupt.py:

#!/usr/bin/python
import os
import random

f = 'rpi-btrfs-test.img'
fd = open(f, 'rb+')
filesize = os.path.getsize(f)

print('filesize=%d' % filesize)
random_data = str(bytearray([random.randint(0, 255) for i in range(4096)]))

count = 0
loopcount = 0
while count < 100:
    n = random.randint(0, filesize/4096)
    fd.seek(n*4096)
    block = fd.read(4096)

    if any((i != '\x00' for i in block)):
        print('corrupting block %d' % n)

        fd.seek(n*4096)
        fd.write(random_data)

        assert fd.tell() == (n+1)*4096
        count += 1

    loopcount += 1
    if loopcount > 10000:
        break
fd.close()

Baseline (ext4)

To imitate a SD card I'll use a 4GB raw disk image full of zeroes:

dd if=/dev/zero of=rpi-btrfs-test.img bs=1M count=4096

Create a ext4 fs:

# mkfs.ext4 -O metadata_csum,64bit rpi-btrfs-test.img
mke2fs 1.43.4 (31-Jan-2017)
Discarding device blocks: done
Creating filesystem with 1048576 4k blocks and 262144 inodes
Filesystem UUID: c8c7b1cc-aebc-44a3-bb6d-790fb3b70577
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

Now we create a mount point, mount the file system and put some data on it:

mkdir mnt
mount -o loop rpi-btrfs-test.img mnt
(cd mnt && tar zxf ../root.tar.gz)

Lets look at the file system:

# df -h
/dev/loop0      3.9G  757M  3.0G  21% /home/user/rpi-image-build/mnt

Let the corruption begin:

umount mnt
python corrupt.py

Lets check if the file system has errors:

 # fsck.ext4 -y -v -f rpi-btrfs-test.img
 e2fsck 1.43.4 (31-Jan-2017)
 Pass 1: Checking inodes, blocks, and sizes
 Pass 2: Checking directory structure
 Directory inode 673, block #0, offset 0: directory has no checksum.
 Fix? yes

 ... lots of prinouts of error and repair info

 Pass 5: Checking group summary information

 rpi-btrfs-test.img: ***** FILE SYSTEM WAS MODIFIED *****

        35610 inodes used (13.58%, out of 262144)
           65 non-contiguous files (0.2%)
            7 non-contiguous directories (0.0%)
              # of inodes with ind/dind/tind blocks: 0/0/0
              Extent depth histogram: 30289/1
       226530 blocks used (21.60%, out of 1048576)
            0 bad blocks
            1 large file

        27104 regular files
         3038 directories
           54 character device files
           25 block device files
            0 fifos
          346 links
         5380 symbolic links (5233 fast symbolic links)
            0 sockets
-------------
        35946 files

OK, this filesystem was definitely borked! We should mount it:

mount -o loop rpi-btrfs-test.img mnt

What does sha1sum think:

# (cd mnt && sha1sum -c ../sha1sums | grep -v 'OK' )

... lots of files with FAILED

./usr/lib/arm-linux-gnueabihf/libicudata.so.52.1: FAILED
sha1sum: WARNING: 1 listed file could not be read
sha1sum: WARNING: 90 computed checksums did NOT match

Whoops, sha1sum as a userspace program was actually allowed to read invalid data. Uncool.

Btrfs Single test

Lets create a btrfs file system on it:

# mkfs.btrfs --mixed rpi-btrfs-test.img
btrfs-progs v4.7.3
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:
Node size:          4096
Sector size:        4096
Filesystem size:    4.00GiB
Block group profiles:
  Data+Metadata:    single            8.00MiB
  System:           single            4.00MiB
SSD detected:       no
Incompat features:  mixed-bg, extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1     4.00GiB  rpi-btrfs-test.img

The --mixed switch is needed because the file system is smaller than 16GB and BTRFS is known to have some problems with small devices if it tries to have separate block groups (chunks) for data and metadata, which by default it does. This flag sets up the filesystem so it intermingles data and metadata on same chunks. Supposedly this helps avoid the notorious out of space errors.

Now we mount the file system and put some data on it:

mount -o loop rpi-btrfs-test.img mnt
(cd mnt && tar zxf ../root.tar.gz)

Lets look at the file system:

# df -h
/dev/loop0      4.0G  737M  3.3G  18% /home/user/rpi-image-build/mnt

Let the corruption begin:

umount mnt
python corrupt.py

Lets check the file system:

# mount -o loop rpi-btrfs-test.img mnt
# btrfs scrub start mnt/
scrub started on mnt/, fsid a6301050-0d88-4d86-948d-03380b9434d4 (pid=27478)
# btrfs scrub status mnt/
scrub status for a6301050-0d88-4d86-948d-03380b9434d4
    scrub started at Sun Apr 30 10:17:12 2017 and was aborted after 00:00:00
    total bytes scrubbed: 272.49MiB with 37 errors
    error details: csum=37
    corrected errors: 0, uncorrectable errors: 37, unverified errors: 0

Whoops, there goes data: 37 uncorrectable errors.

What does sha1sum say:

# (cd mnt && sha1sum -c ../sha1sums | grep -v 'OK' )
sha1sum: ./usr/share/perl5/Algorithm/Diff.pm: Input/output error
./usr/share/perl5/Algorithm/Diff.pm: FAILED open or read

... lots more of terminal output similar to above

sha1sum: WARNING: 99 listed files could not be read

Data was lost, but at least the OS decided it will not allow garbage to propagate.

Btrfs Dup test

BTRFS has a way to tell it to make redundant copies of data by using DUP profile. DUP makes a duplicate copy on same device in case the first copy goes bad.

Defining the profile is most easily done at the time of creating the file system:

# mkfs.btrfs --data dup --metadata dup --mixed rpi-btrfs-test.img
btrfs-progs v4.7.3
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:
Node size:          4096
Sector size:        4096
Filesystem size:    4.00GiB
Block group profiles:
  Data+Metadata:    DUP             204.75MiB
  System:           DUP               8.00MiB
SSD detected:       no
Incompat features:  mixed-bg, extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1     4.00GiB  rpi-btrfs-test.img

The DUP denotes that file system will store two copies of data on same device.

Again we put some data on it:

mount -o loop rpi-btrfs-test.img mnt
(cd mnt && tar zxf ../root.tar.gz)

Lets look at the file system:

# df -h | grep mnt
/dev/loop0      2.0G  737M  1.3G  36% /home/user/rpi-image-build/mnt

The file system appears only 2GB of size, because there are two copies to be stored on underlying device.

Again we corrupt the file system:

umount mnt
python corrupt.py

Lets check the file system:

# mount -o loop rpi-btrfs-test.img mnt
# btrfs scrub start mnt/
scrub started on mnt/, fsid 12ffd7a2-b9c4-46ca-b16a-cd07d94d6854 (pid=27863)
# btrfs scrub status mnt/
scrub status for 12ffd7a2-b9c4-46ca-b16a-cd07d94d6854
    scrub started at Sun Apr 30 10:27:31 2017 and finished after 00:00:00
    total bytes scrubbed: 1.41GiB with 99 errors
    error details: csum=99
    corrected errors: 99, uncorrectable errors: 0, unverified errors: 0

Hey, corrected errors, looks like this could work ... checking the sha1sums:

# (cd mnt && sha1sum -c ../sha1sums | grep -v 'OK' )

Looks fine. Need to check how this works on real hardware.