There are many commands for copying a directory in Linux, and on a current Linux distribution the differences between them are small: all of them can preserve hard links, timestamps, and ownership, and all of them can handle sparse files.
To compare them, I copied a Linux kernel source tree with each command. I ran each command twice and kept the lower (faster) result.
The original directory is 639,660,032 bytes. Without a sparse option, every method produces a copy of exactly the same size: 675,446,784 bytes.
| | Non-sparse | Sparse |
| --- | --- | --- |
| rsync | `rsync -a src /tmp` | `rsync -a -S src /tmp` |
| cpio | `find src -depth \| cpio -pdm /tmp` | `find src -depth \| cpio -pdm --sparse /tmp` |
| cp | `cp -a --sparse=never src /tmp` | `cp -a --sparse=always src /tmp` |
| tar | `tar -c src \| tar -x -C /tmp` | `tar -c -S src \| tar -x -C /tmp` |
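If you want to reproduce the comparison, a minimal sketch of the benchmark might look like this (paths are examples; `du --apparent-size` reports the logical size, while plain `du` shows how much space sparse files actually occupy on disk):
# Time one method, then compare logical vs. allocated size of the copy
$ time rsync -a -S src /tmp
$ du -s --apparent-size --block-size=1 /tmp/src
$ du -s --block-size=1 /tmp/src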
SCP: Secure Copy
Secure Copy (`scp`) is just like the `cp` command, but secure. More importantly, it has the ability to send files to remote servers via SSH!
Copy a file to a remote server:
# Copy a file:
$ scp /path/to/source/file.ext username@hostname.com:/path/to/destination/file.ext
# Copy a directory:
$ scp -r /path/to/source/dir username@server-host.com:/path/to/destination
This will attempt to connect to hostname.com as user `username`. It will ask you for a password if there's no SSH key setup (or if you don't have a password-less SSH key setup between the two computers). If the connection is authenticated, the file will be copied to the remote server.
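If you'd rather not type a password each time, a key-based setup is the usual fix. A minimal sketch, assuming OpenSSH on both ends:
# Generate a key pair (skip if you already have one), then install the public key on the server
$ ssh-keygen -t ed25519
$ ssh-copy-id username@hostname.com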
Since this works just like SSH (using SSH, in fact), we can add flags normally used with the SSH command as well. For example, you can add `-v` and/or `-vvv` to get various levels of verbosity in output about the connection attempt and file transfer.
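For example, a single `-v` prints the handshake and transfer details:
$ scp -v /path/to/source/file.ext username@hostname.com:/path/to/destination/file.ext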
You can also use the `-i` (identity file) flag to specify an SSH identity file to use:
$ scp -i ~/.ssh/some_identity.pem /path/to/source/file.ext username@hostname:/path/to/destination/file.ext
Here are some other useful flags:
- `-p` (lowercase) – Preserves modification times, access times, and modes from the original file
- `-P` – Choose an alternate port
- `-c` (lowercase) – Choose a cipher other than the default AES-128 for encryption
- `-C` – Compress files before copying, for faster upload speeds (already compressed files are not compressed further)
- `-l` – Limit bandwidth used, in kilobits per second (8 bits to a byte!)
  - e.g. Limit to 50 KB/s: `scp -l 400 ~/file.ext user@host.com:~/file.ext`
- `-q` – Quiet output
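These flags can be combined like any other; for instance, here's a sketch that copies over an alternate port with compression while preserving timestamps (the port number is just an example):
$ scp -P 2222 -p -C /path/to/source/file.ext username@hostname.com:/path/to/destination/file.ext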
Rsync: Sync Files Across Hosts
Rsync is another secure way to transfer files. Rsync has the ability to detect file differences, giving it the opportunity to save bandwidth and time when transferring files.
Just like `scp`, `rsync` can use SSH to connect to remote hosts and send/receive files from them. The same rules (mostly) and SSH-related flags apply for `rsync` as well.
Copy files to a remote server:
# Copy a file
$ rsync /path/to/source/file.ext username@hostname.com:/path/to/destination/file.ext
# Copy a directory:
$ rsync -r /path/to/source/dir username@hostname.com:/path/to/destination/dir
To use a specific SSH identity file and/or SSH port, we need to do a little more work. We'll use the `-e` flag, which lets us choose/modify the remote shell program used to send files.
# Send files over SSH on port 8888 using a specific identity file:
$ rsync -e 'ssh -p 8888 -i /home/username/.ssh/some_identity.pem' /source/file.ext username@hostname:/destination/file.ext
Here are some other common flags to use:
- `-v` – Verbose output
- `-z` – Compress files
- `-c` – Compare files based on checksum instead of mod-time (create/modified timestamp) and size
- `-r` – Recursive
- `-S` – Handle sparse files efficiently
- Symlinks:
  - `-l` – Copy symlinks as symlinks
  - `-L` – Transform a symlink into its referent file/dir (copy the actual file)
- `-p` – Preserve permissions
- `-h` – Output numbers in a human-readable format
- `--exclude=""` – Files to exclude
  - e.g. Exclude the .git directory: `--exclude=".git"`
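Combining a few of these, here's a sketch of a recursive, compressed, checksum-based copy that skips the .git directory (paths are examples):
$ rsync -rzch --exclude=".git" /path/to/source/dir username@hostname.com:/path/to/destination/dir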
There are many other options as well – you can do a LOT with rsync!
Do a Dry-Run:
I often do a dry-run of rsync to preview what files will be copied over. This is useful for making sure your flags are correct and that you won't overwrite files you don't wish to.
For this, we can use the `-n` or `--dry-run` flag:
# Copy the current directory
$ rsync -vzcrSLhp --dry-run ./ username@hostname.com:/var/www/some-site.com
#> building file list ... done
#> ... list of directories/files and some meta data here ...
Resume a Stalled Transfer:
Once in a while a large file transfer might stall or fail (while using either `scp` or `rsync`). We can actually use rsync to finish a file transfer!
For this, we can use the `--partial` flag, which tells rsync to keep partially transferred files rather than delete them, and to attempt to finish the transfer on the next attempt:
$ rsync --partial --progress largefile.ext username@hostname:/path/to/largefile.ext
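Note that `-P` is shorthand for `--partial --progress`, so the command above can be shortened:
$ rsync -P largefile.ext username@hostname:/path/to/largefile.ext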
The Archive Option:
There's also a `-a` or `--archive` option, which is a handy shortcut for the options `-rlptgoD`:
- `-r` – Copy recursively
- `-l` – Copy symlinks as symlinks
- `-p` – Preserve permissions
- `-t` – Preserve modification times
- `-g` – Preserve group
- `-o` – Preserve owner (user needs to have permission to change owner)
- `-D` – Preserve special/device files; same as `--devices --specials` (user needs permissions to do so)
# Copy using the archive option and print some stats
$ rsync -a --stats /source/dir/path username@hostname:/destination/dir/path
1) Copy with tar, pigz, and netcat.
On the source:
tar -cf - /backup/ | pv | pigz | nc -l 8888
On the destination:
nc master.active.ai 8888 | pv | pigz -d | tar xf - -C /
2) Copy with tar and lz4 over SSH:
time tar -c /backup/ | pv | lz4 -B4 | ssh -c aes128-ctr root@192.168.1.73 "lz4 -d | tar -xC /backup"
3) Copy files using netcat (see technique 1 above).
4) rsync (~50 MB/s):
time rsync -aHAXWxv --numeric-ids --no-i-r --info=progress2 -e "ssh -T -c chacha20-poly1305@openssh.com,aes192-cbc -o Compression=no -x" /backup/ root@192.168.1.73:/backup/
When copying to the local file system I always use the following rsync options:
# rsync -avhW --no-compress --progress /src/ /dst/
Here's my reasoning:
- `-a` is for archive, which preserves ownership, permissions, etc.
- `-v` is for verbose, so I can see what's happening (optional)
- `-h` is for human-readable, so the transfer rate and file sizes are easier to read (optional)
- `-W` is for copying whole files only, without the delta-xfer algorithm, which should reduce CPU load
- `--no-compress`, as there's no lack of bandwidth between local devices
- `--progress`, so I can see the progress of large files (optional)
~70 MB/s
5) tar over SSH:
time tar cvf - /backup/* | ssh -T -c chacha20-poly1305@openssh.com,aes192-cbc -o Compression=no -x root@192.168.1.73 "tar xf - -C /"
The same, with a progress meter:
time tar cvf - /backup/* | pv | ssh -T -c chacha20-poly1305@openssh.com,aes192-cbc -o Compression=no -x root@192.168.1.73 "tar xf - -C /"
And preserving permissions and sparse files:
time tar -cpSf - /backup/* | pv | ssh -T -c chacha20-poly1305@openssh.com,aes192-cbc -o Compression=no -x root@192.168.1.73 "tar xf - -C /"
6) Split a large archive into 10M chunks while copying:
tar cvf - ubuntu.iso | gzip -9 - | split -b 10M -d - ./disk/ubuntu.tar.gz.
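On the receiving end, the chunks can be stitched back together and unpacked; a sketch, assuming the chunks landed in ./disk/:
# split -d produces numeric suffixes, so the shell glob expands them in order
$ cat ./disk/ubuntu.tar.gz.* | gzip -d | tar xvf -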
7) Parallel rsync using find and xargs:
#!/bin/bash
# SETUP OPTIONS
export SRCDIR="/folder/path"
export DESTDIR="/folder2/path"
export THREADS="8"
# RSYNC DIRECTORY STRUCTURE
rsync -zr -f"+ */" -f"- *" $SRCDIR/ $DESTDIR/
# FOLLOWING MAYBE FASTER BUT NOT AS FLEXIBLE
# cd $SRCDIR; find . -type d -print0 | cpio -0pdm $DESTDIR/
# FIND ALL FILES AND PASS THEM TO MULTIPLE RSYNC PROCESSES
cd $SRCDIR && find . ! -type d -print0 | xargs -0 -n1 -P$THREADS -I% rsync -az % $DESTDIR/%
# IF YOU WANT TO LIMIT THE IO PRIORITY,
# PREPEND THE FOLLOWING TO THE rsync & cd/find COMMANDS ABOVE:
# ionice -c2
# THE SAME APPROACH, BUT AGAINST A REMOTE HOST OVER SSH
rsync -zr -f"+ */" -f"- *" -e 'ssh -c arcfour' $SRCDIR/ remotehost:/$DESTDIR/ \
&& \
cd $SRCDIR && find . ! -type d -print0 | xargs -0 -n1 -P$THREADS -I% rsync -az -e 'ssh -c arcfour' % remotehost:/$DESTDIR/%
Parallelizing rsync
Last week I had a massive hardware failure on one of the GlusterFS storage nodes in the ILRI, Kenya Research Computing cluster: two drives failed simultaneously on the underlying RAID5. As RAID5 can only withstand one drive failure, the entire 31TB array was toast. FML.
After replacing the failed disks, rebuilding the array, and formatting my bricks, I decided I would use `rsync` to pre-seed my bricks from the good node before bringing `glusterd` back up.
tl;dr: `rsync` is amazing, but it's single threaded and struggles when you tell it to sync large directory hierarchies. Here's how you can speed it up.
rsync #fail
I figured syncing the brick hierarchy from the good node to the bad node was simple enough, so I stopped the `glusterd` service on the bad node and invoked:
# rsync -aAXv --delete --exclude=.glusterfs storage0:/path/to/bricks/homes/ storage1:/path/to/bricks/homes/
After a day or so I noticed I had only copied ~1.5TB (over 1 hop on a dedicated 10GbE switch!), and I realized something must be wrong. I attached to the `rsync` process with `strace -p` and saw a bunch of system calls in one particular user's directory. I dug deeper:
# find /path/to/bricks/homes/ukenyatta/maker/genN_datastore/ -type d | wc -l
1398640
So this one particular directory in one user’s home contained over a million other directories and $god knows how many files, and this command itself took several hours to finish! To make matters worse, careful trial and error inspection of other user home directories revealed more massive directory structures as well.
What we've learned:
- `rsync` is single threaded
- `rsync` generates a list of files to be synced before it starts the sync
- MAKER creates a ton of output files/directories
It's pretty clear (now) that a recursive `rsync` on my huge directory hierarchy is out of the question!
rsync #winning
I had a look around and saw lots of people complaining about `rsync` being "slow" and others suggesting tips to speed it up. One very promising strategy was described on this wiki and there's a great discussion in the comments.
Basically, he describes a clever use of `find` and `xargs` to split up the problem set into smaller pieces that `rsync` can process more quickly.
sync_brick.sh
So here's my adaptation of his script for the purpose of syncing failed GlusterFS bricks, `sync_brick.sh`:
#!/usr/bin/env bash
# borrowed / adapted from: https://wiki.ncsa.illinois.edu/display/~wglick/Parallel+Rsync
# RSYNC SETUP
RSYNC_PROG=/usr/bin/rsync
# note the important use of --relative to use relative paths so we don't have to specify the exact path on dest
RSYNC_OPTS="-aAXv --numeric-ids --progress --human-readable --delete --exclude=.glusterfs --relative"
export RSYNC_RSH="ssh -T -c arcfour -o Compression=no -x"
# ENV SETUP
SRCDIR=/path/to/good/brick
DESTDIR=/path/to/bad/brick
# Recommend to match # of CPUs
THREADS=4
BAD_NODE=server1
cd $SRCDIR
# COPY
# note the combination of -print0 and -0!
find . -mindepth 1 -maxdepth 1 -print0 | \
xargs -0 -n1 -P$THREADS -I% \
$RSYNC_PROG $RSYNC_OPTS "%" $BAD_NODE:$DESTDIR
Pay attention to the source/destination paths, the number of `THREADS`, and the `BAD_NODE` name, then you should be ready to roll.
The Magic, Explained
It's a bit of magic, but here are the important parts:
- The `-aAXv` options to `rsync` tell it to archive, preserve ACLs, and preserve eXtended attributes. Extended attributes are critically important in GlusterFS >= 3.3, and also if you're using SELinux.
- The `--exclude=.glusterfs` option to `rsync` tells it to ignore this directory at the root of the brick, as the self-heal daemon, `glustershd`, will rebuild it based on the files' extended attributes once we restart the `glusterd` service.
- The `--relative` option to `rsync` is so we don't have to bother constructing the destination path, as `rsync` will imply the path is relative to our destination's top.
- The `RSYNC_RSH` options influence `rsync`'s use of SSH, basically telling it to use very weak encryption and disable any unnecessary features for non-interactive sessions (tty, X11, etc).
- Using `find` with `-mindepth 1` and `-maxdepth 1` just means we concentrate on files/directories 1 level below each directory in our immediate hierarchy.
- Using `xargs` with `-n1` and `-P` tells it to use 1 argument per command line, and to launch `$THREADS` number of processes at a time.