
rsync

There are many commands for copying a directory in Linux, and on a current Linux distribution the differences between them are very small. All of them can preserve links, timestamps, ownership, and sparse files.

I tested them by copying a Linux kernel source tree. I ran each command twice and kept the lower result.
The original directory size is 639660032 bytes. Without the sparse option, all methods produce a copy of exactly 675446784 bytes.

           Non-sparse                                Sparse
rsync      rsync -a src /tmp                         rsync -a -S src /tmp
cpio       find src -depth | cpio -pdm /tmp          find src -depth | cpio -pdm --sparse /tmp
cp         cp -a --sparse=never src /tmp             cp -a --sparse=always src /tmp
tar        tar -c src | tar -x -C /tmp               tar -c -S src | tar -x -C /tmp
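
Each variant can be timed the same way; as a quick sketch, using the same src and /tmp destination as in the table:

$ time rsync -a src /tmp
$ time cp -a --sparse=never src /tmp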

SCP: Secure Copy

Secure Copy is just like the cp command, but secure. More importantly, it has the ability to send files to remote servers via SSH!

Copy a file to a remote server:

# Copy a file:
$ scp /path/to/source/file.ext username@hostname.com:/path/to/destination/file.ext

# Copy a directory:
$ scp -r /path/to/source/dir username@server-host.com:/path/to/destination

This will attempt to connect to hostname.com as user username. It will ask you for a password if there’s no SSH key set up (or for your key’s passphrase if the key isn’t password-less). Once the connection is authenticated, the file will be copied to the remote server.

Since this works just like SSH (using SSH, in fact), we can add flags normally used with the SSH command as well. For example, you can add the -v and/or -vvv to get various levels of verbosity in output about the connection attempt and file transfer.
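
For example, verbose output for the same (placeholder) copy as above:

$ scp -v /path/to/source/file.ext username@hostname.com:/path/to/destination/file.ext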

You can also use the -i (identity file) flag to specify an SSH identity file to use:

$ scp -i ~/.ssh/some_identity.pem /path/to/source/file.ext username@hostname:/path/to/destination/file.ext

Here are some other useful flags:

  • -p (lowercase) – Preserves modification times, access times, and modes from the original file
  • -P – Choose an alternate port
  • -c (lowercase) – Choose a cipher other than the default to use for encryption
  • -C – Compress files before copying, for faster upload speeds (already compressed files are not compressed further)
  • -l – Limit the bandwidth used, specified in kilobits per second (8 bits to a byte!).
    • e.g. Limit to 50 KB/s: scp -l 400 ~/file.ext user@host.com:~/file.ext
  • -q – Quiet output
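
As a sketch combining a few of these flags (the port number is made up): copy over port 2222 with compression, limited to roughly 100 KB/s (800 kilobits per second):

$ scp -P 2222 -C -l 800 /path/to/source/file.ext username@hostname.com:/path/to/destination/file.ext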

Rsync: Sync Files Across Hosts

Rsync is another secure way to transfer files. Rsync can detect the differences between source and destination files, which lets it save bandwidth and time when transferring files.

Just like scp, rsync can use SSH to connect to remote hosts and send/receive files from them. Mostly the same rules and SSH-related flags apply to rsync as well.

Copy files to a remote server:

# Copy a file
$ rsync /path/to/source/file.ext username@hostname.com:/path/to/destination/file.ext

# Copy a directory:
$ rsync -r /path/to/source/dir username@hostname.com:/path/to/destination/dir

To use a specific SSH identity file and/or SSH port, we need to do a little more work. We’ll use the -e flag, which lets us choose/modify the remote shell program used to send files.

# Send files over SSH on port 8888 using a specific identity file:
$ rsync -e 'ssh -p 8888 -i /home/username/.ssh/some_identity.pem' /source/file.ext username@hostname:/destination/file.ext

Here are some other common flags to use:

  • -v – Verbose output
  • -z – Compress files
  • -c – Compare files by checksum instead of by modification time and size
  • -r – Recursive
  • -S – Handle sparse files efficiently
  • Symlinks:
    • -l – Copy symlinks as symlinks
    • -L – Transform symlink into referent file/dir (copy the actual file)
  • -p – Preserve permissions
  • -h – Output numbers in a human-readable format
  • --exclude="" – Files to exclude
    • e.g. Exclude the .git directory: --exclude=".git"

There are many other options as well – you can do a LOT with rsync!
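
As a sketch combining several of the flags above (with the same placeholder paths): a recursive, compressed, checksum-based, human-readable copy that skips the .git directory:

$ rsync -rzcvh --exclude=".git" /path/to/source/dir/ username@hostname.com:/path/to/destination/dir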

Do a Dry-Run:

I often do a dry-run of rsync to preview which files will be copied over. This is useful for making sure your flags are correct and that you won’t overwrite files you don’t wish to.

For this, we can use the -n or --dry-run flag:

# Copy the current directory
$ rsync -vzcrSLhp --dry-run ./ username@hostname.com:/var/www/some-site.com
#> building file list ... done
#> ... list of directories/files and some meta data here ...

Resume a Stalled Transfer:

Once in a while a large file transfer might stall or fail (while either using scp or rsync). We can actually use rsync to finish a file transfer!

For this, we can use the --partial flag, which tells rsync to keep partially transferred files instead of deleting them, and to finish the transfer on a subsequent run:

$ rsync --partial --progress largefile.ext username@hostname:/path/to/largefile.ext
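
rsync also accepts -P, which is shorthand for --partial --progress:

$ rsync -P largefile.ext username@hostname:/path/to/largefile.ext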

The Archive Option:

There’s also a -a or --archive option, which is a handy shortcut for the options -rlptgoD:

  • -r – Copy recursively
  • -l – Copy symlinks as symlinks
  • -p – Preserve permissions
  • -t – Preserve modification times
  • -g – Preserve group
  • -o – Preserve owner (User needs to have permission to change owner)
  • -D – Preserve special/device files. Same as --devices --specials. (User needs permissions to do so)
# Copy using the archive option and print some stats
$ rsync -a --stats /source/dir/path username@hostname:/destination/dir/path
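
One rsync detail worth remembering here: a trailing slash on the source path means "copy the contents of the directory" rather than the directory itself:

# Copies the directory "path" itself into /destination/dir:
$ rsync -a /source/dir/path username@hostname:/destination/dir
# Copies only the contents of "path" directly into /destination/dir:
$ rsync -a /source/dir/path/ username@hostname:/destination/dir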


1) tar over netcat (with pigz compression)

On the source (this side listens on port 8888):

tar -cf - /backup/ | pv | pigz | nc -l 8888

On the destination:

nc master.active.ai 8888 | pv | pigz -d | tar -xf - -C /
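
Start the listening (source) side first; the transfer begins once the destination connects.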

2) tar over ssh with lz4 compression

time tar -cf - /backup/ | pv | lz4 -B4 | ssh -c aes128-ctr root@192.168.1.73 "lz4 -d | tar -x -C /backup"

3) copy files using netcat
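
A minimal uncompressed sketch (the port 8888, the host name desthost, and the restore path are assumptions):

# On the receiving side, listen on port 8888 and unpack:
nc -l 8888 | tar -xf - -C /restore/path

# On the sending side, stream the directory to the receiver:
tar -cf - /backup/ | nc desthost 8888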

4) rsync over ssh

50 MB/s

rsync -aHAXWxv --numeric-ids --no-i-r --info=progress2 -e "ssh -T -c chacha20-poly1305@openssh.com,aes192-cbc -o Compression=no -x" /backup/ root@192.168.1.73:/backup/

The same command, timed:

time rsync -aHAXWxv --numeric-ids --no-i-r --info=progress2 -e "ssh -T -c chacha20-poly1305@openssh.com,aes192-cbc -o Compression=no -x" /backup/ root@192.168.1.73:/backup/


When copying to the local file system I always use the following rsync options:

# rsync -avhW --no-compress --progress /src/ /dst/

Here’s my reasoning:

  • -a is for archive, which preserves ownership, permissions, etc.
  • -v is for verbose, so I can see what's happening (optional)
  • -h is for human-readable, so the transfer rate and file sizes are easier to read (optional)
  • -W is for copying whole files only, without the delta-transfer algorithm, which should reduce CPU load
  • --no-compress, as there's no lack of bandwidth between local devices
  • --progress so I can see the progress of large files (optional)

5) tar over ssh

70 MB/s

time tar -cvf - /backup/* | ssh -T -c chacha20-poly1305@openssh.com,aes192-cbc -o Compression=no -x root@192.168.1.73 "tar -xf - -C /"

time tar -cvf - /backup/* | pv | ssh -T -c chacha20-poly1305@openssh.com,aes192-cbc -o Compression=no -x root@192.168.1.73 "tar -xf - -C /"

time tar -cpSf - /backup/* | pv | ssh -T -c chacha20-poly1305@openssh.com,aes192-cbc -o Compression=no -x root@192.168.1.73 "tar -xf - -C /"

6) split into chunks with tar + gzip + split

tar -cvf - ubuntu.iso | gzip -9 - | split -b 10M -d - ./disk/ubuntu.tar.gz.
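
To reassemble and unpack the chunks on the receiving end (paths mirror the command above):

cat ./disk/ubuntu.tar.gz.* | gunzip | tar -xvf -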



#!/bin/bash
# SETUP OPTIONS
export SRCDIR="/folder/path"
export DESTDIR="/folder2/path"
export THREADS="8"

# RECREATE THE DIRECTORY STRUCTURE FIRST (directories only, no files)
rsync -zr -f"+ */" -f"- *" "$SRCDIR"/ "$DESTDIR"/
# THE FOLLOWING MAY BE FASTER BUT IS NOT AS FLEXIBLE
# cd "$SRCDIR"; find . -type d -print0 | cpio -0pdm "$DESTDIR"/

# FIND ALL FILES AND PASS THEM TO MULTIPLE PARALLEL RSYNC PROCESSES
cd "$SRCDIR" && find . ! -type d -print0 | xargs -0 -n1 -P"$THREADS" -I% rsync -az % "$DESTDIR"/%

# IF YOU WANT TO LIMIT THE IO PRIORITY,
# PREPEND THE FOLLOWING TO THE rsync AND cd/find COMMANDS ABOVE:
#   ionice -c2

# REMOTE VARIANT: SAME TWO STEPS, PUSHED OVER SSH TO remotehost
rsync -zr -f"+ */" -f"- *" -e 'ssh -c arcfour' "$SRCDIR"/ remotehost:/"$DESTDIR"/ \
&& \
cd "$SRCDIR" && find . ! -type d -print0 | xargs -0 -n1 -P"$THREADS" -I% rsync -az -e 'ssh -c arcfour' % remotehost:/"$DESTDIR"/%
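
Note that the arcfour cipher has been removed from modern OpenSSH, so on current systems -c arcfour (here and in the sync_brick.sh script below) would need to be replaced with a supported cipher such as aes128-ctr or chacha20-poly1305@openssh.com.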

Parallelizing rsync

Last week I had a massive hardware failure on one of the GlusterFS storage nodes in the ILRI, Kenya Research Computing cluster: two drives failed simultaneously on the underlying RAID5. As RAID5 can only withstand one drive failure, the entire 31TB array was toast. FML.

After replacing the failed disks, rebuilding the array, and formatting my bricks, I decided I would use rsync to pre-seed my bricks from the good node before bringing glusterd back up.

tl;dr: rsync is amazing, but it’s single threaded and struggles when you tell it to sync large directory hierarchies. Here’s how you can speed it up.

rsync #fail

I figured syncing the brick hierarchy from the good node to the bad node was simple enough, so I stopped the glusterd service on the bad node and invoked (on the bad node, pulling from the good node, storage0):

# rsync -aAXv --delete --exclude=.glusterfs storage0:/path/to/bricks/homes/ /path/to/bricks/homes/

After a day or so I noticed I had only copied ~1.5TB (over 1 hop on a dedicated 10GbE switch!), and I realized something must be wrong. I attached to the rsync process with strace -p and saw a bunch of system calls in one particular user’s directory. I dug deeper:

# find /path/to/bricks/homes/ukenyatta/maker/genN_datastore/ -type d | wc -l
1398640

So this one particular directory in one user’s home contained over a million other directories and $god knows how many files, and this command itself took several hours to finish! To make matters worse, careful trial and error inspection of other user home directories revealed more massive directory structures as well.

What we’ve learned:

  • rsync is single threaded
  • rsync generates a list of files to be synced before it starts the sync
  • MAKER creates a ton of output files/directories

It’s pretty clear (now) that a recursive rsync on my huge directory hierarchy is out of the question!

rsync #winning

I had a look around and saw lots of people complaining about rsync being “slow” and others suggesting tips to speed it up. One very promising strategy was described on this wiki and there’s a great discussion in the comments.

Basically, he describes a clever use of find and xargs to split up the problem set into smaller pieces that rsync can process more quickly.

sync_brick.sh

So here’s my adaptation of his script for the purpose of syncing failed GlusterFS bricks, sync_brick.sh:

#!/usr/bin/env bash
# borrowed / adapted from: https://wiki.ncsa.illinois.edu/display/~wglick/Parallel+Rsync

# RSYNC SETUP
RSYNC_PROG=/usr/bin/rsync
# note the important use of --relative to use relative paths so we don't have to specify the exact path on dest
RSYNC_OPTS="-aAXv --numeric-ids --progress --human-readable --delete --exclude=.glusterfs --relative"
export RSYNC_RSH="ssh -T -c arcfour -o Compression=no -x"

# ENV SETUP
SRCDIR=/path/to/good/brick
DESTDIR=/path/to/bad/brick
# Recommend to match # of CPUs
THREADS=4
BAD_NODE=server1

cd $SRCDIR

# COPY
# note the combination of -print0 and -0!
find . -mindepth 1 -maxdepth 1 -print0 | \
    xargs -0 -n1 -P$THREADS -I% \
        $RSYNC_PROG $RSYNC_OPTS "%" $BAD_NODE:$DESTDIR

Pay attention to the source/destination paths, the number of THREADS, and the BAD_NODE name, then you should be ready to roll.

The Magic, Explained

It’s a bit of magic, but here are the important parts:

  • The -aAXv options to rsync tell it to archive, preserve ACLs, and preserve eXtended attributes. Extended attributes are critically important in GlusterFS >= 3.3, and also if you’re using SELinux.
  • The --exclude=.glusterfs option to rsync tells it to ignore this directory at the root of the brick, as the self-heal daemon, glustershd, will rebuild it based on the files’ extended attributes once we restart the glusterd service.
  • The --relative option to rsync is so we don’t have to bother constructing the destination path, as rsync will imply the path is relative to our destination’s top.
  • The RSYNC_RSH options influence rsync‘s use of SSH, basically telling it to use very weak encryption and disable any unnecessary features for non-interactive sessions (tty, X11, etc).
  • Using find with -mindepth 1 and -maxdepth 1 just means we concentrate on files/directories 1 level below each directory in our immediate hierarchy.
  • Using xargs with -n1 and -P tells it to use 1 argument per command line, and to launch $THREADS number of processes at a time.
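
As a rough illustration of the fan-out, with echo standing in so it only prints the rsync commands that would run (THREADS=4 here is just an example; the paths and host name come from the script above):

$ cd /path/to/good/brick
$ find . -mindepth 1 -maxdepth 1 -print0 | xargs -0 -n1 -P4 -I% echo rsync -a % server1:/path/to/bad/brick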
