[Dirvish] backup + nagios

Mateusz Pospieszny mateusz at bellsouth.net
Sun Aug 21 20:11:16 UTC 2005


Hello
I've been using dirvish for at least a year now, possibly longer.

I have a few machines being backed up on a remote network. After the
initial testing I stopped paying attention to the daily backups, and one
day the backup drive failed. dirvish started hanging on the backup job
without ever completing it (I couldn't even kill it). This went on for a
week without any job completing. After replacing the drive and rebuilding
the backup partition as a software RAID 1, I wrote the attached 2 scripts
to get nagios (http://www.nagios.org/) to monitor the completion of the
daily backups.

They probably have some bugs, and the backup script itself should
probably just redirect its whole output into a file instead of
redirecting at every command, but it works :-)

I figured it wouldn't hurt if somebody else took a look to see if there
are any problems with these scripts.

backup.sh runs daily at 5 am. Usually all the backup jobs are completed
by 5:30 am at the latest. I have nagios set up to check the results of
the backups starting at 6 am, for an hour. This gives the jobs plenty of
time to complete.

backup.sh does all the backup work, writing a line of the form
"[-result-]: <exit status of the previous job> <type of the job>" into
/var/log/dirvish.log after each job.
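
For anyone who cannot fetch the attachment, the logging pattern described above could look roughly like the sketch below. This is not the attached backup.sh: the helper name run_logged, the LOG variable and the job type names "expire" and "runall" are assumptions made for illustration. Only the "[-result-]: <exit status> <type>" line format and the log path come from the description above.

```shell
#!/bin/sh
# Hypothetical sketch of the logging pattern, not the attached backup.sh.
# LOG defaults to the path named in the post; override it for testing.
LOG="${LOG:-/var/log/dirvish.log}"

# run_logged TYPE COMMAND [ARGS...]
# Runs COMMAND, appends its output to $LOG, then appends a line of the
# form "[-result-]: <exit status> <type>" so a checker can find it later.
run_logged() {
    job_type="$1"
    shift
    status=0
    "$@" >>"$LOG" 2>&1 || status=$?
    echo "[-result-]: $status $job_type" >>"$LOG"
    return "$status"
}

# Example use (commented out so the sketch has no side effects):
# run_logged expire dirvish-expire
# run_logged runall dirvish-runall
```

Wrapping each job in one helper keeps the redirection in a single place, which is the cleanup the post says the real script could use.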

check_dirvish.awk is then a simple awk script that gathers all lines
containing '[-result-]' and works out the status of the backup job.

It then returns "OK", "WARNING" or "CRITICAL" to nagios. It also
extracts the job completion time and the expire/runall exit statuses, so
that if there is a problem I can figure out which job has failed (since
dirvish-runall returns the number of a failed job).
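
A minimal sketch of such a check, written as a shell function around awk, might look like this. It is not the attached check_dirvish.awk. The function name check_dirvish_log and the mapping of results to states are assumptions: all statuses zero gives OK, any non-zero status gives CRITICAL, and no result lines at all gives WARNING. It also assumes each result line starts with the marker and that the log holds only the latest run. The exit codes 0, 1 and 2 are the standard nagios plugin codes for OK, WARNING and CRITICAL.

```shell
#!/bin/sh
# Hypothetical sketch, not the attached check_dirvish.awk.
# check_dirvish_log FILE
# Prints one status line for nagios and returns 0 (OK), 1 (WARNING)
# or 2 (CRITICAL).
check_dirvish_log() {
    awk '
        # Result lines look like: "[-result-]: <status> <type>"
        index($0, "[-result-]:") == 1 {
            seen++
            summary = summary " " $3 "=" $2
            if ($2 != 0) failed++
        }
        END {
            if (seen == 0) { print "WARNING: no result lines found"; exit 1 }
            if (failed > 0) { print "CRITICAL:" summary; exit 2 }
            print "OK:" summary
            exit 0
        }
    ' "$1"
}

# Example use (commented out):
# check_dirvish_log /var/log/dirvish.log; exit $?
```

The per-job summary in the status line is what lets you see from the nagios page which of the jobs returned a non-zero status.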


Finally, the command for check_dirvish is defined for nagios as follows:

# 'check_dirvish' command
define command{
        command_name    check_dirvish
        command_line    $USER1$/contrib/check_dirvish.awk /var/log/dirvish.log
        }


Hope this can help somebody :-)


P.S. I have about 52GB of combined data in dirvish, keeping the last 2
weeks' worth of daily backups, then one backup per month after that.
This is from 4 servers, two of them storing a large number of small
files (Maildir format inboxes). This seems to generate a lot of usage on
the backup drives. I've already had two 200GB IDE drives die on me.
Thankfully my backup partition is on a mirrored software RAID now.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: backup.sh
Type: application/x-shellscript
Size: 750 bytes
Desc: not available
URL: <http://lists.dirvish.org/pipermail/dirvish/attachments/20050821/774a370a/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: check_dirvish.awk
Type: application/x-awk
Size: 538 bytes
Desc: not available
URL: <http://lists.dirvish.org/pipermail/dirvish/attachments/20050821/774a370a/attachment-0001.bin>

