Amazon's Simple Storage Service (S3) has a lot to like. It's cheap, it scales from a little bit of data to as much as you want, and you can use it to distribute files publicly or just to store your private data. Let's look at how you can take advantage of Amazon S3 on Linux.
Amazon S3 isn't what you'd want to use for storing just a little bit of personal data. For that, you might want to use Dropbox, SpiderOak, ownCloud, or SparkleShare. Which one you choose depends on how much data you have, your tolerance for non-free software, and which features you prefer. For my work files, I use Dropbox, in large part because of its LAN sync feature.
But S3 is really good if you need to make backups of a large amount of data, or of smaller amounts that need an offsite copy. It's also good if you want to host files for public distribution and don't have a server, or if you need to offload file sharing because of capacity issues. Maybe you just want to use it to host a blog, cheaply. S3 also has some nifty features for content distribution and data storage from multiple regions, which we'll get into another time.
Getting the Tools
You can use S3 in a number of ways on Linux, depending on how you’d like to manage your backups. If you look around, you’ll find a bunch of tools that support S3, including:
- S3 Tools: a command line utility that, as the name implies, focuses on Amazon S3.
- Duplicity: a command line backup tool that has S3 support, but also supports several other methods of transferring files.
- Deja Dup: a fairly simple GNOME app for backups, which has S3 support thanks to Duplicity.
- Dragon Disk: a freeware (but not free software) utility that provides more fine-grained control over backups to S3. It also supports Google Cloud Storage and other cloud storage services.
For the purposes of this article, I’m going to focus on S3 Tools. If you’re a GNOME user, it should take very little effort to set up Deja Dup for S3. We’ll tackle Duplicity and Dragon Disk another time.
S3 Tools
You might find S3 Tools in your distribution’s repositories. If not, the S3 Tools folks have package repositories and have support for several versions of Red Hat, CentOS, Fedora, openSUSE, SUSE Linux Enterprise, Debian, and Ubuntu. You’ll also find instructions on adding the tools on the package repositories page.
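If your distribution already packages it, installing from the distro repositories is the quickest route. As a minimal sketch, assuming the package is called s3cmd on your distribution (check your package manager to be sure):
sudo apt-get install s3cmd    # Debian and Ubuntu
sudo yum install s3cmd        # Fedora, CentOS, and other RPM-based distributions (after adding the s3tools repository if needed)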
Once you have S3 Tools installed, you need to configure it with your Amazon S3 credentials. If you haven’t signed up for them yet, hit the Sign Up button at the top of the S3 overview page. You’ll also want to look at the pricing, which starts at $0.125 per GB per month.
The pricing calculator can help you get an idea of how much it would cost to store your data in S3. For example, if you're storing 100GB in S3, it would run about $12.50 per month, before any costs for data transfer out of S3. Transfer into S3 is free. Amazon also charges for GET/PUT requests and so forth, so if you're using S3 to serve up content, the pricing is going to be higher.
Back to the tools. You need to configure s3cmd (the command line utility from the S3 Tools project) like so:
s3cmd --configure
It will walk you through adding your Amazon credentials and GPG information if you want to encrypt files while stored on S3. Amazon’s storage is supposed to be private, but you should always assume that data stored on remote servers is potentially visible to others. Since I’m storing information that has no real need for privacy (WordPress backups, MP3s, photos that I’d happily publish online anyway) I don’t worry overmuch about encrypting for storage on S3.
There's another advantage of forgoing GPG encryption, which is that s3cmd can use an rsync-like algorithm for syncing files instead of just re-copying everything.
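The answers you give to s3cmd --configure are saved to ~/.s3cfg, which you can also edit by hand later. Here's a minimal sketch of that file, assuming a recent s3cmd (the exact set of options varies by version):
[default]
access_key = YOUR_ACCESS_KEY_ID
secret_key = YOUR_SECRET_ACCESS_KEY
# GPG passphrase used when you upload with encryption; leave it empty if you skip encryption
gpg_passphrase =
use_https = True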
Now to copy files and use s3cmd sync. You'll find that the s3cmd syntax mimics standard *nix commands. Want to see what is being stored in your S3 account? Use s3cmd ls to show all your buckets. (Amazon calls 'em buckets instead of directories.) Want to copy between buckets? Use s3cmd cp s3://bucket1/file s3://bucket2/file; note that buckets are always specified with the syntax s3://bucketname. To put files in a bucket, use s3cmd put filename s3://bucket. To get files, use s3cmd get s3://bucket/filename localfile. To upload directories, you need to add the --recursive option.
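A few concrete examples (the bucket and file names here are just placeholders):
# Upload a single file, then download it under a new name
s3cmd put report.pdf s3://my-bucket/
s3cmd get s3://my-bucket/report.pdf report-copy.pdf
# Upload a whole directory tree
s3cmd put --recursive photos/ s3://my-bucket/photos/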
But if you want to sync files and save yourself some trouble down the road, there's the sync command. It's dead simple to use:
s3cmd sync directory s3://bucket/
The first time, it will copy up all the files. On later runs it will only copy files that don't already exist on Amazon S3. If you also want to get rid of files on S3 that you have removed locally, add the --delete-removed option. Because that can accidentally delete files you still want, test it with the --dry-run option first.
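For example, to preview and then apply a destructive sync (again with a placeholder bucket name):
# Show what would be uploaded and deleted without changing anything
s3cmd sync --dry-run --delete-removed ~/backups/ s3://my-bucket/backups/
# Run the same sync for real once the output looks right
s3cmd sync --delete-removed ~/backups/ s3://my-bucket/backups/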
It's pretty simple to use s3cmd, and you should look at its man page as well. It even has some support for the CloudFront CDN service if you need that. Happy syncing!
Download
- Log in as the superuser (root) and launch a terminal.
- Go to /etc/yum.repos.d (you can use the ftp or wget commands to fetch the repo file).
- Download the s3tools.repo file for your distribution. For example, run wget http://s3tools.org/repo/CentOS_5/s3tools.repo if you're on CentOS 5. (The full command sequence is shown after these steps.)
- Run yum install s3cmd if you don't have the s3cmd RPM package installed yet.
- Run yum upgrade s3cmd if you already have the s3cmd RPM installed and want a newer version.
- You will be asked to accept a new GPG key; answer yes (you may be asked twice).
- That's it. From then on, every time you run yum upgrade you'll automatically get the latest s3cmd for your system.
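Put together, the whole sequence looks like this on CentOS 5; substitute the repo file for your own distribution:
cd /etc/yum.repos.d
# Fetch the repository definition for your distribution
wget http://s3tools.org/repo/CentOS_5/s3tools.repo
# Install (or later upgrade) the tool
yum install s3cmd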
s3cmd
s3cmd is a free Linux command line tool for uploading and downloading data to and from your Amazon S3 account.
Download and install s3tools manually or do what I did and add their package repository to your package manager for a much easier install.
After installing s3cmd, configure it by running the following command:
# s3cmd --configure
Enter the Access Key ID and Secret Access Key discussed earlier, and use the default settings for the rest of the options unless you know otherwise.
If you haven't already created a bucket, you can do that now with s3cmd:
# s3cmd mb s3://unique-bucket-name
List your current buckets to make sure you successfully created one:
# s3cmd ls
2010-10-30 02:15 s3://your-bucket-name
You can now upload, list, and download content:
# s3cmd put somefile.txt s3://your-bucket-name/somefile.txt
somefile.txt -> s3://your-bucket-name/somefile.txt [1 of 1]
17835 of 17835 100% in 0s 35.79 kB/s done
# s3cmd ls s3://your-bucket-name
2010-10-30 02:20 17835 s3://your-bucket-name/somefile.txt
# s3cmd get s3://your-bucket-name/somefile.txt somefile-2.txt
s3://your-bucket-name/somefile.txt -> somefile-2.txt [1 of 1]
17835 of 17835 100% in 0s 39.77 kB/s done
A much better and more advanced method of backing up your data is to use ‘sync’ instead of ‘put’ or ‘get’. Read more about how I use sync in the next section.
Automate backup with a shell script and cron job
Below is a sample of the shell script I wrote to back up one of my servers:
#!/bin/sh
# Synchronize /root with S3
s3cmd sync --recursive /root/ s3://my-bucket-name/root/
# Synchronize /home with S3
s3cmd sync --recursive /home/ s3://my-bucket-name/home/
# Synchronize crontabs with S3
s3cmd sync /var/spool/cron/ s3://my-bucket-name/cron/
# Synchronize /var/www/vhosts with S3
s3cmd sync --exclude 'mydomain.com/some-directory/*.jpg' --recursive /var/www/vhosts/ s3://my-bucket-name/vhosts/
# Dump all MySQL databases, upload the dump to S3, then remove the local copy
mysqldump -u root --password=mysqlpassword --all-databases --result-file=/root/all-databases.sql
s3cmd put /root/all-databases.sql s3://my-bucket-name/mysql/
rm -f /root/all-databases.sql
I use 's3cmd sync --recursive /root/ s3://my-bucket-name/root/' and 's3cmd sync --recursive /home/ s3://my-bucket-name/home/' to synchronize all data in the local /root and /home directories, including their subdirectories, with S3. I use 'sync' instead of 'put' because I do not always know exactly what files are stored in these folders. I want everything backed up, including any new files created in the future.
With 's3cmd sync /var/spool/cron/ s3://my-bucket-name/cron/' I omit '--recursive' because I do not care about any subdirectories (there aren't any).
With "s3cmd sync --exclude 'mydomain.com/some-directory/*.jpg' --recursive /var/www/vhosts/ s3://my-bucket-name/vhosts/" I synchronize /var/www/vhosts but exclude all jpg files inside a particular directory, because they are replaced very frequently by new versions and are unimportant to me once they are a few minutes old.
Using mysqldump I export all databases to a text file that can easily be used to recreate them if needed. I upload the newly created file using 's3cmd put /root/all-databases.sql s3://my-bucket-name/mysql/' and then delete the local copy.
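Should you ever need to restore, the reverse is just as simple. As a sketch, using the same bucket layout as the script above (the local path is up to you): the first command pulls the dump back down from S3, and the second feeds it to MySQL.
# s3cmd get s3://my-bucket-name/mysql/all-databases.sql /root/all-databases.sql
# mysql -u root --password=mysqlpassword < /root/all-databases.sql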
To read more about sync and its options, such as '--dry-run', '--skip-existing', and '--delete-removed', see http://s3tools.org/s3cmd-sync.
Create a cron job to execute your shell script as often as you like. Now you can be less worried about losing all your important data.
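For instance, assuming you saved the script as /root/s3-backup.sh (a made-up path for illustration) and made it executable, running crontab -e as root and adding a line like this would run it nightly at 3:00 a.m. and log the output:
# minute hour day-of-month month day-of-week command
0 3 * * * /root/s3-backup.sh >> /var/log/s3-backup.log 2>&1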