
Re: Do you do HSM?



> 1. How many user accounts per mailbox server

 ~800 on one server.

> 2. How much mailbox quota per account

 Unlimited.  Largest mailbox is 58G. Our top 25 are all 10G+.
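
  In Zimbra terms, "unlimited" just means a quota of zero.  If you want
  the same, something along these lines should do it; the account name
  here is only an example:

    su - zimbra
    zmprov modifyAccount user@example.com zimbraMailQuota 0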

> 3. Storage architecture: do you use different storage for zimbra root
> /opt/zimbra, and zimbra backup /opt/zimbra/backup?

  Size   Volume
  20G    /opt/zimbra
  50G    /opt/zimbra/db
  100G   /opt/zimbra/index
  20G    /opt/zimbra/log
  100G   /opt/zimbra/redolog
  450G   /opt/zimbra/store
  1T     /opt/zimbra/store2
  4T     /opt/zimbra/backup

  All of the above volumes, except for store2 and backup, are on a
  Fujitsu Eternus 4000 model 500 connected via multi-pathed Fibre
  Channel.  They are built from four dedicated 6-disk RAID10 arrays,
  fed into LVM.  The volumes are created as 4-way stripes across those
  four RAID10 arrays.
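
  For anyone wanting to replicate that layout, here's a minimal sketch
  of the LVM side.  The multipath device names are made up for
  illustration; the VG name just matches the example output in the
  attached script:

    # Feed the four RAID10 LUNs into LVM
    pvcreate /dev/mapper/mpath{a,b,c,d}
    vgcreate mb2_t1 /dev/mapper/mpath{a,b,c,d}

    # Create each volume as a 4-way stripe across the four arrays
    lvcreate -L 450G -i 4 -n store mb2_t1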

  /opt/zimbra/store2 is made up of shared RAID5 arrays from the same
  Fujitsu.  They are fed into LVM and simply concatenated.  We'll be
  creating a store3 soon, but all future HSM volumes will be 500G or
  less.  With ext3, a volume as large as store2 (1T) exposes us to too
  much potential downtime if an fsck is ever needed.  We also run the
  attached script (e2croncheck) against each volume once a month, both
  to help ensure we don't get hit with an untimely forced fsck at boot
  and to identify potential filesystem issues early on.  I've also
  taken the liberty of attaching the script we use to take LVM
  snapshots prior to upgrades (snapshot).
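
  For reference, the monthly checks can be driven from a small cron
  file along these lines.  The schedule and device names here are
  illustrative, not our exact entries:

    # /etc/cron.d/e2croncheck -- stagger one volume per night
    00 03 1 * *  root  /usr/local/sbin/e2croncheck /dev/mapper/mb2_t1-store
    00 03 2 * *  root  /usr/local/sbin/e2croncheck /dev/mapper/mb2_t1-db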

  /opt/zimbra/backup is an NFS share, connected via 1Gb Ethernet, off a
  Sun 7410 clustered pair with read and write cache.  On the 7410s,
  this filesystem is tuned to cache metadata only.  Using a 7-day
  auto-group backup scheme, a nightly backup typically takes ~2 hours.
  This volume used to live on a NEXSAN SATABeast with concatenated
  RAID6 volumes, connected via multi-pathed Fibre Channel, and backups
  ranged between 4 and 10 hours.  Since moving to the 7410s with cache,
  backup time has not only dropped, but has become much more
  consistent.
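
  For the curious, a backup volume like that would be mounted with an
  /etc/fstab entry along these lines.  The server name and mount
  options are assumptions for illustration, not our exact
  configuration:

    nas7410:/export/zimbra-backup  /opt/zimbra/backup  nfs  rw,hard,intr,tcp,rsize=32768,wsize=32768  0 0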

  We have a cron job that kicks off HSM once a week at 9pm on Saturday
  and aborts it at 11:59pm if it's still running, so the backups aren't
  competing with HSM for I/O.  If it's no longer running, the abort is
  harmless.  (The exact cron entries are shown under question 8 below.)


> 4. Do you have dedicated network connection to the HSM storage?

  HSM storage is FC-connected: not dedicated, but switched.  Bandwidth
  is fine.


> Will it be a performance problem if both backup and the secondary HSM
> volume share the same network connection?

  We try to not run HSM jobs and backup jobs at the same time to avoid
  contention.


> 5. Impact with backups

  We try to not run HSM jobs and backup jobs at the same time to avoid
  contention.


> 6. Impact to server
>
> Obviously moving data from primary disk to secondary disk will have
> impact to server performance. Do you let your HSM process run
> continuously?

  No. The impact is not quite as bad as backups, but noticeable.

> 7. What HSM Age do you use? If you adjusted it from the default 30
> days, why?

  60 days.  We try to stay around 50% used on our tier1 store, and
  that's about the right number for us right now.  Truthfully, though,
  items older than even 30 days are accessed very infrequently in our
  environment.  Because so little user traffic hits the tier2 store,
  having it on slower disk is essentially unnoticeable to users.
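
  The age itself is a server attribute, so something like the
  following should set it; substitute your own mailbox server name,
  and verify the attribute name against your Zimbra version's docs:

    su - zimbra
    zmprov modifyServer mail.example.com zimbraHsmAge 60d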


> 8. When you initially turn on HSM, how long did it take for a complete
> run? How long does it take after the initial run?

  You don't have to let it run to completion.  You can abort it at any
  time and restart it again later.  I'd recommend running it via cron
  during low-usage hours.  We just do it once a week, but if you have a
  lot to migrate initially, you could run it for a few hours a day, off
  hours, until your data is migrated.  Here are the commands we run out
  of cron (/etc/cron.d/zimbra-hsm -- a locally created cron file):

  00 21 * * 6  zimbra  /opt/zimbra/bin/zmhsm --start >/dev/null
  59 23 * * 6  zimbra  /opt/zimbra/bin/zmhsm --abort >/dev/null



-- 
Brian Elliott Finley
Manager, Application Platforms - Infrastructure and Operations
Computing and Information Systems,  Argonne National Laboratory
Office: +1 630.252.4742  Mobile: +1 630.447.9108
#!/bin/bash
#
# e2croncheck -- run e2fsck automatically out of /etc/cron.weekly
#
# This script is intended to be run by the system administrator 
# periodically from the command line, or to be run once a week
# or so by the cron daemon to check a mounted filesystem (normally
# the root filesystem, but it could be used to check other filesystems
# that are always mounted when the system is booted).
#
# Make sure you customize "EMAIL" to be your e-mail address.
#
# Written by Theodore Ts'o, Copyright 2007, 2008, 2009.
#
# This file may be redistributed under the terms of the 
# GNU Public License, version 2.
#
# 2010.05.07 Brian Elliott Finley
# * Accept command line arguments
# * Accept a device file as argument and figure out VG and LV info
# * Detect and exit if device is _not_ an LVM device
#
# 2010.05.10 Brian Elliott Finley
# * Don't use "set -e".  It can leave straggler snapshots lying around.
# * Add comments about use of "lvremove -f"
# * Improve detection of VG and LV info using major,minor numbers
#
# 2010.05.10 Brian Elliott Finley
# * Calculate a snapshot size that is 10% of the LV size
# * Test to be sure there is enough free space in the VG to hold the snapshot


EMAIL=cis-linux-admins@lists.anl.gov
#EMAIL=finley@anl.gov
SUFFIX=fsck-snap
#
# Percentage of LV size required for snapshot
SNAP_PERCENT=10

usage() {
    echo "Usage:  e2croncheck <DEVICE>"
    echo "        (e2croncheck /dev/mapper/system-tmp)"
}

DEV=$1
if [ "x$DEV" == "x" ]; then
    usage
    exit 1
fi

# For a block device, fields 5 and 6 of "ls -l" are the major number
# (with its trailing comma) and the minor number, e.g. "254," and "25",
# which concatenate to "254,25" -- matching the lvs output format below.
MAJOR_MINOR="$(/bin/ls -l $DEV | awk '{print $5 $6}')"
LVS_ENTRY=$(/sbin/lvs --units m --nosuffix -o vg_name,vg_free,lv_name,lv_size,lv_kernel_major,lv_kernel_minor --separator "," | egrep ",$MAJOR_MINOR\$")
#
# Example output:
#   VG,VFree,LV,LSize,KMaj,KMin
#   mb2_t1,247024.00,bin,20480.00,254,29
#   mb2_t1,247024.00,db,51200.00,254,25
if [ "x$LVS_ENTRY" == "x" ]; then
    echo "The DEVICE you specified does not appear to be a logical volume."
    usage
    exit 1
fi

VG=$(echo $LVS_ENTRY | awk -F"," '{print $1}')
VFree=$(echo $LVS_ENTRY | awk -F"," '{print $2}')
LV=$(echo $LVS_ENTRY | awk -F"," '{print $3}')
LSize=$(echo $LVS_ENTRY | awk -F"," '{print $4}')

which bc >/dev/null 2>&1
if [ $? != 0 ]; then
    echo
    echo "Please install the 'bc' package.  This script ($0) requires it."
    echo
    exit 1
fi
# Multiply by the percentage and divide by 100, rather than prepending
# a decimal point, so values of SNAP_PERCENT other than two digits
# (e.g. 5) are handled correctly.
SNAPSIZE=$(echo "$LSize * $SNAP_PERCENT / 100" | bc )
DOES_IT_FIT=$(echo "$SNAPSIZE <= $VFree" | bc )
# bc prints 1 if the snapshot fits in the VG's free space, 0 if not
if [ $DOES_IT_FIT -ne "1" ]; then
    echo
    echo "Not enough free space (${VFree}M) in the volume group to create a ${SNAPSIZE}M snapshot."
    echo
    exit 1
fi


##
## Uncomment to help in debug...
##
#echo $MAJOR_MINOR
#echo $LVS_ENTRY
#echo $VG
#echo $VFree
#echo $LV
#echo $LSize
#echo $SNAPSIZE

TMPFILE=`mktemp -t e2fsck.log.XXXXXXXXXX`

OPTS="-Fttv -C0"
#OPTS="-Fttv -E fragcheck"

START="$(date +'%Y%m%d%H%M%S')"
lvcreate -s -L ${SNAPSIZE} -n "${LV}-${SUFFIX}" "${VG}/${LV}"
# Run a preen pass (-p) first, then a forced full check (-fy), so the
# snapshot gets a complete scrub even when the preen pass is clean.
if nice logsave -as $TMPFILE e2fsck -p $OPTS "/dev/${VG}/${LV}-${SUFFIX}" && \
   nice logsave -as $TMPFILE e2fsck -fy $OPTS "/dev/${VG}/${LV}-${SUFFIX}" ; then
  echo '>>> Background scrubbing succeeded! <<<'
  # The snapshot passed, so reset the mount count and stamp the
  # last-check time on the real volume.
  tune2fs -C 0 -T "${START}" "/dev/${VG}/${LV}"
else
  echo '>>> Background scrubbing failed! Reboot to fsck soon! <<<'
  # Force an fsck at next boot by faking a huge mount count and an
  # ancient last-check time.
  tune2fs -C 16000 -T "19000101" "/dev/${VG}/${LV}"
  if test -n "$EMAIL"; then

    # Either sendEmail or mail can be used to send messages.  sendEmail
    # is more flexible, but may not be installed on your systems (yet).
    # If you want to use sendEmail, make sure it's installed via 
    #   'aptitude install sendEmail' or manually from
    #   http://www.caspian.dotconf.net/menu/Software/SendEmail/
    sendEmail -u "E2fsck of /dev/${VG}/${LV} failed!" -o message-file=$TMPFILE -f root@$(hostname -f) -t $EMAIL
    #mail -s "E2fsck of /dev/${VG}/${LV} failed!" $EMAIL < $TMPFILE
  fi
fi

rm $TMPFILE

# "lvremove -f" is used instead of "lvremove", as the latter will prompt
# for a "y/n?" response.
#
# This looks more dangerous than it is.  "lvremove -f" will remove an
# 'active' volume, but will not remove a 'mounted' volume.  So in order
# for this to go wrong, and remove the wrong volume unintentionally,
#   * snapshot name variables in this script would have to be messed up
#   * the resulting messed up snapshot name would have to match the
#     name of another volume (highly improbable, but possible)
#   * and that other volume would have to not be mounted
#
# So, this can be considered safe unless you keep important volumes
# unmounted, and they have names like "$VG/-" or  "$VG/$LV-" or
# "$VG/-$SUFFIX". -BEF-
lvremove -f "${VG}/${LV}-${SUFFIX}" 


#!/bin/bash
#
# snapshot
#
# 2010.05.10 Brian Elliott Finley
# * Created based on e2croncheck by Ted Ts'o
#

SUFFIX=maintenance-snap
#
# Percentage of LV size required for snapshot
SNAP_PERCENT=10

usage() {
    echo "Usage:  $0 create  <DEVICE>"
    echo "        $0 release <DEVICE>"
}

OPERATION=$1
if [ "x$OPERATION" == "x" ]; then
    usage
    exit 1
fi

DEV=$2
if [ "x$DEV" == "x" ]; then
    usage
    exit 1
fi

# Fields 5 and 6 of "ls -l" on a block device yield the major number
# (with its trailing comma) and the minor number, giving e.g. "254,25"
# to match the lvs output format below.
MAJOR_MINOR="$(/bin/ls -l $DEV | awk '{print $5 $6}')"
LVS_ENTRY=$(/sbin/lvs --units m --nosuffix -o vg_name,vg_free,lv_name,lv_size,lv_kernel_major,lv_kernel_minor --separator "," | egrep ",$MAJOR_MINOR\$")
#
# Example output:
#   VG,VFree,LV,LSize,KMaj,KMin
#   mb2_t1,247024.00,bin,20480.00,254,29
#   mb2_t1,247024.00,db,51200.00,254,25
if [ "x$LVS_ENTRY" == "x" ]; then
    echo "The DEVICE you specified does not appear to be a logical volume."
    usage
    exit 1
fi

VG=$(echo $LVS_ENTRY | awk -F"," '{print $1}')
VFree=$(echo $LVS_ENTRY | awk -F"," '{print $2}')
LV=$(echo $LVS_ENTRY | awk -F"," '{print $3}')
LSize=$(echo $LVS_ENTRY | awk -F"," '{print $4}')

create_snapshot() {

    # Gotta make sure the snapshot module is loaded
    /sbin/modprobe dm_snapshot

    which bc >/dev/null 2>&1
    if [ $? != 0 ]; then
        echo
        echo "Please install the 'bc' package.  This script ($0) requires it."
        echo
        exit 1
    fi
    # Multiply by the percentage and divide by 100 so values of
    # SNAP_PERCENT other than two digits (e.g. 5) work correctly.
    SNAPSIZE=$(echo "$LSize * $SNAP_PERCENT / 100" | bc )
    DOES_IT_FIT=$(echo "$SNAPSIZE <= $VFree" | bc )
    # bc prints 1 if the snapshot fits in the VG's free space, 0 if not
    if [ $DOES_IT_FIT -ne "1" ]; then
        echo
        echo "Not enough free space (${VFree}M) in the volume group to create a ${SNAPSIZE}M snapshot."
        echo
        exit 1
    fi
    
    ##
    ## Uncomment to help in debug...
    ##
    #echo $MAJOR_MINOR
    #echo $LVS_ENTRY
    #echo $VG
    #echo $VFree
    #echo $LV
    #echo $LSize
    #echo $SNAPSIZE

    # Print the lvcreate command for the record, then run it
    echo lvcreate -s -L ${SNAPSIZE} -n "${LV}-${SUFFIX}" "${VG}/${LV}"
    lvcreate -s -L ${SNAPSIZE} -n "${LV}-${SUFFIX}" "${VG}/${LV}"
}

release_snapshot() {

    # "lvremove -f" is used instead of "lvremove", as the latter will prompt
    # for a "y/n?" response.
    #
    # This looks more dangerous than it is.  "lvremove -f" will remove an
    # 'active' volume, but will not remove a 'mounted' volume.  So in order
    # for this to go wrong, and remove the wrong volume unintentionally,
    #   * snapshot name variables in this script would have to be messed up
    #   * the resulting messed up snapshot name would have to match the
    #     name of another volume (highly improbable, but possible)
    #   * and that other volume would have to not be mounted
    #
    # So, this can be considered safe unless you keep important volumes
    # unmounted, and they have names like "$VG/-" or  "$VG/$LV-" or
    # "$VG/-$SUFFIX". -BEF-
    
    lvs "${VG}/${LV}-${SUFFIX}" >/dev/null
    if [ $? -ne 0 ]; then
        echo "No snapshot for $DEV exists."
        exit 0
    fi
    echo lvremove -f "${VG}/${LV}-${SUFFIX}" 
    lvremove -f "${VG}/${LV}-${SUFFIX}" 
}

case "$1" in
	create)
        create_snapshot
		;;
	release)
        release_snapshot
		;;
	*)
		usage
		exit 1
		;;
esac
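
# Example usage, wrapped around a Zimbra upgrade (the device name is
# illustrative):
#
#   snapshot create  /dev/mapper/mb2_t1-db      # before the upgrade
#   ... perform and verify the upgrade ...
#   snapshot release /dev/mapper/mb2_t1-db      # once you're satisfied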