[Dirvish] reiserfs or not?

foner-dirvish at media.mit.edu
Sat Jan 14 23:12:12 UTC 2006


    Date: Sat, 14 Jan 2006 20:33:02 +0000
    From: Arjen Meek <arjen at xyx.nl>

    Can anyone give me any useful input as to whether ReiserFS is better for
    a Dirvish backup filesystem than, say, ext2/3 with lots of inodes and
    small blocks? The system in question was built for Dirvish so the
    filesystem will be used only for backups; in this case of a set of
    remote filesystems amounting to ~100 GB.

My advice is to stay as far away from either version (3 or 4) of
ReiserFS as possible.  For more information, see [1] and, for even
more technical details, [2] (which is referenced after the first few
paragraphs of [1], just before all the forwarded traffic).  Plenty of
people value it for its speed, but I find its speed advantages to be
negligible for a -real- workload that isn't a Usenet news server, and
its fragility (especially in the face of kernel panics or machine
hardware resets) to be far more important---in my opinion, the -first-
priority of a filesystem is to -store files-, and the -second- is
performance.  Hans Reiser believes differently.

I considered, e.g., JFS and XFS, but I wanted the ability to
dynamically resize the filesystem in -both- directions (larger
or smaller) depending on what I did to the underlying LVM2 partitions,
so I stuck with ext3fs, after spending a long time running tests and
convincing myself that the predominant fear everyone voiced about it
for this application (running out of inodes) was unlikely to be a
problem for me.  I configured my bank partition like so:

  mke2fs -v -j -m 1 -T largefile -i 40960 /dev/the-partition

(Note in particular the ratio of inodes to blocks, set here by -i to
one inode per 40960 bytes---neither as large as it could be, nor as
small, derived from a couple of trials with my mix of file sizes, which
is probably a pretty typical mix.  Note also that
I only reserved 1% of the filesystem for root; since dirvish runs run
as root -anyway-, there seemed little point in building in lots of
reserved blocks, and having that partition fill up won't fill up root
anyway.)
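
As a rough sanity check on the -i value, the number of inodes mke2fs
allocates can be approximated as the filesystem size divided by the
bytes-per-inode figure.  A minimal Python sketch; the function name and
the decimal 300 GB size are illustrative assumptions, and the real count
will differ somewhat because mke2fs rounds per block group:

```python
def approx_inode_count(fs_bytes, bytes_per_inode):
    """Approximate the inode count for `mke2fs -i bytes_per_inode`.

    This ignores per-block-group rounding, so treat it as an estimate.
    """
    if bytes_per_inode <= 0:
        raise ValueError("bytes_per_inode must be positive")
    return fs_bytes // bytes_per_inode


if __name__ == "__main__":
    fs_bytes = 300 * 10**9          # assumed: a 300 GB disk, decimal units
    print(approx_inode_count(fs_bytes, 40960))
```

For a disk of that size this works out to roughly 7.3 million inodes.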

With these parameters and my workload, this means that I back up about
300GB of files (really, about 340GB, but 40GB of those are squeezed
out via faster-dupemerge) onto a 300GB disk; this is about 2 million
files.  The bank partition is about 100% full at the moment, and 50%
of its inodes are in use.  Since my rate of new-file creation is
relatively low, and since the files that are hardlinked between each
snapshot don't take up any extra inodes (which is, essentially, what
a hard link -means-), I'm fairly confident that I won't run out of
inodes until I'm out of disk space.
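
One way to keep an eye on the two figures quoted above (space used
versus inodes used) is to read them from the filesystem directly.  A
minimal Python sketch, assuming a Unix system where os.statvfs is
available; the function name and the default path are illustrative:

```python
import os
import sys


def usage_fractions(path):
    """Return (space_used, inodes_used) as fractions between 0 and 1
    for the filesystem that holds `path`."""
    st = os.statvfs(path)
    # Some filesystems report zero totals; avoid dividing by zero.
    space_used = 1.0 - st.f_bfree / st.f_blocks if st.f_blocks else 0.0
    inodes_used = 1.0 - st.f_ffree / st.f_files if st.f_files else 0.0
    return space_used, inodes_used


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    space, inodes = usage_fractions(target)
    print("space used:  %.1f%%" % (space * 100))
    print("inodes used: %.1f%%" % (inodes * 100))
```

If the inode fraction stays well below the space fraction as snapshots
accumulate, the disk will fill before the inode table does.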

This is based on having on the order of 50 extant snapshots at any
given time, e.g., a month of dailies, and a few weeklies/monthlies/
yearlies.  If you plan to have hundreds or thousands of snapshots,
the math may work out a bit differently.
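
To see how the math shifts with the snapshot count, here is a
deliberately simplified Python model.  It assumes hardlinked snapshots
where unchanged files share one inode, every new or changed file needs
a fresh inode, and each snapshot recreates its own directory tree
(directories cannot be hardlinked).  All parameter names and any numbers
you feed it are hypothetical, not measurements from my setup:

```python
def estimate_inodes(files, dirs, changed_per_snapshot, snapshots):
    """Estimate total inodes used by `snapshots` hardlinked snapshots.

    files                -- regular files in the first snapshot
    dirs                 -- directories in each snapshot (recreated every time)
    changed_per_snapshot -- new or changed files in each later snapshot
    """
    if snapshots <= 0:
        return 0
    return files + dirs * snapshots + changed_per_snapshot * (snapshots - 1)


if __name__ == "__main__":
    # Hypothetical workload, compared at 50 and 1000 snapshots.
    for n in (50, 1000):
        print(n, estimate_inodes(1000000, 100000, 5000, n))
```

The per-snapshot directory term is the one that grows fastest, which is
why a much larger snapshot count deserves a second look.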

The one pitfall:  If you happen to be unfortunate enough to have one
file with a -lot- of hardlinks to it (e.g., thousands), every time you
create a new snapshot, you'll add that many thousands of hardlinks to
that file (of course).  In my case, I had a single file with about
4000 hardlinks to it, because I had a directory with about 1000 of the
same file in it (courtesy of some installation that dropped 1000
identical files that all said, "this documentation is superseded;
please see this other place instead"), which then all got merged into
one file with 1000 hardlinks to it from a faster-dupemerge run, which
then wound up merging that directory with 4 identical copies elsewhere
on my filesystem.  (These were backups of 4 machines which all had this
directory in 'em.)  This saved lots of space, but meant that after
about 7 snapshots were created, the 8th blew out trying to create even
more hardlinks to that file, because ext3fs has a limit of 32000 (not
2^15!) hardlinks to any given file.  Some other filesystems don't have
this limit.  In my case, I solved this instead by tarring up that
single directory (which I never used anyway) everywhere it occurred,
and compressing it (which compressed extremely well, because it was
1000 copies of the same data), and then backing up -that-.  So instead
of 1000 hardlinks to that file (and thus 4000 across the 4 copies I
had) per snapshot, I had 4 hardlinks per snapshot---and I'm not going
to have 8000 snapshots on disk at any given time... :)
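
The arithmetic above, plus a way to spot such files before they bite,
can be sketched in Python.  This is an illustration only: the function
names and the threshold are my own assumptions, and the 32000 limit is
simply the ext3fs figure quoted above.

```python
import os
import stat

EXT3_LINK_MAX = 32000  # per-file hardlink limit quoted in the text


def snapshots_until_limit(links_per_snapshot, link_max=EXT3_LINK_MAX):
    """Upper bound on snapshots before one file exceeds the link limit."""
    if links_per_snapshot <= 0:
        raise ValueError("links_per_snapshot must be positive")
    return link_max // links_per_snapshot


def heavily_linked_files(root, threshold):
    """Yield (link_count, path) for regular files under `root` whose link
    count is at least `threshold`, reporting each inode only once."""
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue
            key = (st.st_dev, st.st_ino)
            if st.st_nlink >= threshold and key not in seen:
                seen.add(key)
                yield st.st_nlink, path


if __name__ == "__main__":
    print(snapshots_until_limit(4000))  # the 1000-files-times-4 case
    print(snapshots_until_limit(4))     # after tarring up the directory
    for nlink, path in heavily_linked_files(".", 1000):
        print(nlink, path)
```

With exactly 4000 links per snapshot this gives 8, and with 4 it gives
8000.  My count was only "about 4000", so a slightly higher real figure
would account for the 8th snapshot being the one that failed.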

[1] http://foner.www.media.mit.edu/people/foner/Sys/reiserfs-considered-harmful.txt
[2] http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc


