Recently we were tasked with the following problem: one of the websites we host was going to be featured prominently on a national television program. It is not a high-traffic website, averaging probably a couple thousand hits a month, and it is hosted on a shared server. As we hadn't seen the TV program and didn't know what sort of mention the website was going to get, we essentially had no idea how much traffic to expect.
In the past, the game plan under normal circumstances would have been something along the lines of: throw up a bunch of x-large Amazon instances and move the site, which contains a number of static and dynamic elements, onto the cloud. Unfortunately, this isn't something we do very often, so it's a somewhat painful manual process that takes a fair number of hours (and is not something I look forward to, as it's prone to human error), and we don't have stock EC2 images up to date with all the required software. So by the time the sites are up, tested, and load balanced, there is a significant expense to the client - both in the cost of the instances and in the consulting time to get them functioning properly. This client didn't have the budget for that, and as we didn't know how much traffic was going to come in, we didn't want to waste their money if the rampaging horde of internet users never arrived.
So we came up with a solution that would be cheap but, for the most part at least, would be able to handle whatever traffic came their way - put the static pages of the site on Amazon S3 and run the website from there, but bring users back to the DB-driven back-end for the pages that needed it (logins, user accounts, contributions, volunteering, etc.) - all of which we were able to do with just a couple of hours of time and a couple of quick and dirty scripts. Here's what we did, assuming a Linux/Apache2 hosting environment:
First, the preliminary steps:
Get yourself an S3 account, download the s3sync.rb scripts, and get them configured properly. See the included README - it's pretty simple and mostly involves configuring ~/.s3conf/s3config.yml
Then, move all the file and image assets on the live site to S3 (how you do this depends on your setup - many CMSs have some S3 module or support; on a static site, s3sync.rb and sed should do the trick). This isn't actually necessary for the rest of what's below, but as the point is scaling to handle increased traffic, it'll do its part by greatly reducing the number of requests that even hit the live server.
Now the meat of the process:
1. Download a static copy of your site. There's an export script built into our CMS Webiva but I believe using wget to mirror the site should get you there too:
$ wget -m http://sample.com
Note: Depending on how your site downloads - if the pages aren't named .html you'll have to modify the commands and scripts that follow - there's no reason S3 can't serve pages ending with a .php extension with a text/html mime type.
Note 2: This is more of a tutorial than a set of drop-in-place scripts - don't expect to follow these steps exactly as your own site may be set up differently.
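Before running the later steps, it can help to see what the mirror actually contains. The following is a minimal sketch, not part of our original scripts: count_by_extension is a hypothetical helper that tallies the files wget saved by extension, so you can tell whether pages ended up as .html, .php, or with no extension at all.

```ruby
require "find"

# Count the files under a mirrored directory by extension, so you can see
# whether pages were saved as .html, .php, or with no extension.
# (Hypothetical helper for illustration; not part of s3sync.)
def count_by_extension(root)
  counts = Hash.new(0)
  Find.find(root) do |path|
    next unless File.file?(path)
    ext = File.extname(path)
    counts[ext.empty? ? "(none)" : ext.downcase] += 1
  end
  counts
end

# Example call, assuming wget saved the mirror into ./sample.com:
#   count_by_extension("sample.com").each { |ext, n| puts "#{ext}: #{n}" }
```

If most pages show up under something other than .html, the find patterns in the steps below need to be adjusted to match.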
2. We need to replace any references to the root homepage with index.html - a couple of quick sed calls searching for '/' and "/" did the trick for us:
$ find . -name "*.html" -exec sed -i s/\'\\/\'/\'\\/index.html\'/g {} \;
$ find . -name "*.html" -exec sed -i s/\"\\/\"/\"\\/index.html\"/g {} \;
This is necessary because Amazon S3 doesn't support index documents - a request for / won't automatically serve index.html. We can get around this issue for other pages on the site but not for the homepage.
3. Create an S3 bucket with s3cmd.rb, named after the full sub-domain that you will be hosting on (in our example, static.sample.com):
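If the shell escaping in those sed calls is hard to follow, the same replacement can be sketched in plain Ruby. This is only an illustration of what the two commands do, assuming the mirrored pages end in .html; rewrite_root_links and rewrite_root_links_in_tree are hypothetical names.

```ruby
# Rewrite quoted links to the site root ('/' or "/") so they point at
# /index.html instead. The backreference \1 makes sure the closing quote
# matches the opening one.
def rewrite_root_links(html)
  html.gsub(%r{(['"])/\1}) { "#{$1}/index.html#{$1}" }
end

# Apply the rewrite to every .html file under root, in place.
# Files are read and written as raw bytes so odd encodings don't raise.
def rewrite_root_links_in_tree(root)
  Dir.glob(File.join(root, "**", "*.html")).each do |path|
    original = File.binread(path)
    updated = rewrite_root_links(original)
    File.binwrite(path, updated) unless updated == original
  end
end
```

Links such as "/about" are left alone, because the pattern only matches a quoted string that is exactly a single slash.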
$ s3cmd.rb createbucket static.sample.com
4. Drop the static javascripts and stylesheets onto our s3 bucket - this applies for files that are hosted locally that your static site will need:
$ cd public/
$ s3sync.rb -v -p -r javascripts/ static.sample.com:javascripts/
$ s3sync.rb -v -p -r stylesheets/ static.sample.com:stylesheets/
5. Next I wrote a quick script (url_replace.rb) to rewrite links to specific pages that have forms or dynamic content so they point all the way back to the real site. This makes any relative link to something like the /contribute page always go back to the live site. The script just searches for certain single-quoted ('...') or double-quoted ("...") relative URLs and replaces them with full http://sitename.com/url URLs. This worked well enough for us, but depending on your site's content you might need to modify it. The script needs to be called once for each page with dynamic content, passing the domain to redirect to and the relative URL of the page.
Here's the script:
#!/usr/bin/env ruby
domain = ARGV[0]; url = ARGV[1]
url = url.gsub("/","\\\\\\/")
`find . -name \"*.html\" -exec sed -i s/\\'#{url}\\'/\\'http:\\\\/\\\\/#{domain}#{url}\\'/g {} \\;`
`find . -name \"*.html\" -exec sed -i s/\\\"#{url}\\\"/\\\"http:\\\\/\\\\/#{domain}#{url}\\\"/g {} \\;`
Now here's a sample usage:
$ ./url_replace.rb sample.com /email_signup
$ ./url_replace.rb sample.com /create_event
$ ./url_replace.rb sample.com /contribute
$ ./url_replace.rb sample.com /contact
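The nested escaping in url_replace.rb (Ruby, then the shell, then sed) is easy to get wrong when adapting it. As a rough alternative, the same substitution can be done entirely in Ruby. This is a sketch under the assumption that the live site answers at sample.com; absolutize_link is a hypothetical name, not something from s3sync.

```ruby
# Replace single- or double-quoted occurrences of a relative URL with an
# absolute URL on the live site, for example
#   "/contribute"  becomes  "http://sample.com/contribute"
# Only exact matches are touched: "/contribute/more" is left as it is.
def absolutize_link(html, domain, path)
  quoted = /(['"])#{Regexp.escape(path)}\1/
  html.gsub(quoted) { "#{$1}http://#{domain}#{path}#{$1}" }
end

# Example call for one page's contents:
#   html = absolutize_link(html, "sample.com", "/contribute")
```

Because the path goes through Regexp.escape, characters such as dots or question marks in a URL are matched literally rather than as regex syntax.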
6. Finally I wrote another quick script (s3put.rb) to copy all the exported HTML files over to S3. Since S3 doesn't support index files (in fact, S3 doesn't really support directories; everything is just a key), the script needs to give them directory-like names but make sure they get sent to the client as text/html files (this is why I couldn't just use s3sync.rb to do the heavy lifting):
Here's the script:
#!/usr/bin/env ruby
bucket = ARGV[0]; file = ARGV[1..-1].join(' ')
# Files are coming in from a find command, so they are all prefixed with a ./ (i.e. ./dir/filename.html)
# Special case for index.html files - let them work as /directoryname and /directoryname/
puts("Handling: #{file}")
if file =~ /^\.\/(.+)\/index\.html$/
filebase = $1
`s3cmd.rb put \"#{bucket}:#{filebase}\" \"#{file}\" x-amz-acl:public-read content-type:text/html`
`s3cmd.rb put \"#{bucket}:#{filebase}/\" \"#{file}\" x-amz-acl:public-read content-type:text/html`
end
# copy the file normally
filepath = file[2..-1]
`s3cmd.rb put \"#{bucket}:#{filepath}\" \"#{file}\" x-amz-acl:public-read content-type:text/html`
And then a find command piped to xargs should do the trick (note that the script above is counting on a "find ." command, so all file paths coming in have a leading ./ - you can adjust as necessary).
$ find . -name "*.html" -print | xargs -L 1 ./s3put.rb static.sample.com
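To make the key naming in s3put.rb easier to check before uploading anything, here is a small sketch that only computes the keys and the s3cmd.rb argument lists without running them. s3_keys_for and put_commands are hypothetical helpers; the s3cmd.rb arguments are the same ones used in the script above.

```ruby
# Headers sent with every upload, as in s3put.rb.
PUT_HEADERS = ["x-amz-acl:public-read", "content-type:text/html"].freeze

# Given a path as printed by `find .` (e.g. "./about/index.html"), return
# the S3 keys the page should be stored under. An index.html inside a
# directory is also stored as "dir" and "dir/" so both URLs work.
def s3_keys_for(file)
  path = file.sub(%r{\A\./}, "")
  keys = []
  if (m = path.match(%r{\A(.+)/index\.html\z}))
    keys << m[1] << "#{m[1]}/"
  end
  keys << path
end

# Build one argument list per key. Passing these to system(*cmd) avoids
# shell quoting problems with unusual file names.
def put_commands(bucket, file)
  s3_keys_for(file).map do |key|
    ["s3cmd.rb", "put", "#{bucket}:#{key}", file, *PUT_HEADERS]
  end
end

# To actually upload a file:
#   put_commands("static.sample.com", file).each { |cmd| system(*cmd) }
```

Printing the result of put_commands for a few files is a cheap dry run before pushing the whole site.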
7. Now do s3sync for any other static files that need to be hosted (in my case, all static files were in the system/ directory)
$ s3sync.rb -v -p -r system/ static.sample.com:system/
8. Now, to take advantage of the fact that any sub-domain prefix to the s3.amazonaws.com domain is treated as a bucket name, we need to add a quick CNAME to our site's DNS to get a nice-looking static domain - in our case we want static.sample.com to serve the static site. Add something like the following to your BIND setup or configure your DNS with whatever you use:
static.sample.com CNAME static.sample.com.s3.amazonaws.com
Once that propagates, http://static.sample.com/index.html will take you to the homepage of the static site.
You can test your site out at that URL and make sure everything is good to go. Any links to dynamic pages should take you back to the live site.
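If the static domain doesn't load, one thing worth ruling out is that the CNAME hasn't propagated yet. The following sketch uses Ruby's standard resolv library to look the record up; the helper names are hypothetical, and cname_targets performs a live DNS query, so its answer depends on the resolver you are using.

```ruby
require "resolv"

# Normalize a DNS name for comparison: lowercase, no trailing dot.
def normalize_dns_name(name)
  name.to_s.downcase.chomp(".")
end

# Hostname that a CNAME for the given bucket should point at.
def expected_s3_target(bucket)
  normalize_dns_name("#{bucket}.s3.amazonaws.com")
end

# True if any of the CNAME targets matches the expected S3 hostname.
def cname_points_at_bucket?(targets, bucket)
  targets.map { |t| normalize_dns_name(t) }.include?(expected_s3_target(bucket))
end

# Look up the CNAME records for host (performs a live DNS query).
def cname_targets(host)
  Resolv::DNS.open do |dns|
    dns.getresources(host, Resolv::DNS::Resource::IN::CNAME).map { |r| r.name.to_s }
  end
end

# Example call:
#   host = "static.sample.com"
#   puts cname_points_at_bucket?(cname_targets(host), host)
```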
Once it's time to deploy the static site:
Add some rewrite rules to the Apache2 config of the live site to kick it over to static.sample.com on the appropriate pages. For example, to kick over the homepage and anything under /about, /news, /blog, and /partners you can use:
RewriteRule ^/$ http://static.sample.com/index.html [NC,QSA,L]
RewriteRule ^/(about|news|blog|partners)(.*)$ http://static.sample.com/$1$2 [NC,QSA,L]
...Any other rules you need...
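Rewrite patterns are easy to get subtly wrong, so it can be worth modelling them before touching the live config. This sketch mirrors the intent of the rules (homepage plus the four sections) in Ruby so a list of paths can be checked quickly; static_redirect_for is a hypothetical helper and only approximates what mod_rewrite does.

```ruby
STATIC_HOST = "http://static.sample.com"

# Return the static URL a request path would be redirected to, or nil if
# the request should stay on the live site.
def static_redirect_for(path)
  if path == "/"
    "#{STATIC_HOST}/index.html"
  elsif (m = path.match(%r{\A/(about|news|blog|partners)(.*)\z}i))
    "#{STATIC_HOST}/#{m[1]}#{m[2]}"
  end
end

# Example: print the outcome for a handful of paths.
%w[/ /about/team /contribute /newsletter].each do |path|
  puts "#{path} -> #{static_redirect_for(path) || 'stays on live site'}"
end
```

Note that /newsletter is redirected too, since the section pattern doesn't require a slash after the section name; tighten it if your site has paths like that which need to stay dynamic.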
And we're done - the majority of users will see only the static site unless they try to do something active on the site.
Update 9/20: As someone pointed out, pushing from S3 to CloudFront will give you much better latency and is a good idea for public-facing hosted files, at the cost of being unable to modify your site instantly (if you discover a typo in some HTML, for example). A good compromise would be to turn CloudFront on for your bucket, and then do a sed replacement for any static media files (leaving your .html files on S3). Since the URL for the media files is much less important, leaving them on the default CloudFront domain name, as ugly as it is, won't really hurt.
Now what happens once the rush is over and people have linked to the static site - but the static site contains old content? Not a problem: because we used the CNAME feature of S3, we can kick the S3 bucket to the curb and just add another virtual host on the live server to redirect people to the real site:
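The media-file replacement mentioned in the update could look something like the sketch below. It assumes media is referenced with root-relative quoted URLs under system/, javascripts/, and stylesheets/ (the directories used in the steps above), and d1234example.cloudfront.net is a made-up placeholder for whatever domain CloudFront assigns to your distribution.

```ruby
# Point root-relative media links (e.g. "/system/logo.png") at a CDN host,
# leaving links to HTML pages untouched.
def point_media_at_cdn(html, cdn_host, prefixes = %w[system javascripts stylesheets])
  pattern = %r{(['"])/(#{Regexp.union(prefixes).source})/}
  html.gsub(pattern) { "#{$1}http://#{cdn_host}/#{$2}/" }
end

# Example call (the CloudFront domain here is a placeholder):
#   html = point_media_at_cdn(html, "d1234example.cloudfront.net")
```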
<VirtualHost *:80>
# adjust the address/port above to match your server
ServerName static.sample.com
RewriteEngine On
RewriteRule ^/(.*)$ http://sample.com/$1 [R=permanent,NC,QSA,L]
...
</VirtualHost>
Now point that CNAME back to your live server and all the static links will redirect to the live site.
Overall, not a huge deal and we saved the client a bunch of money. Now we just have to hope too many people don't try to contribute all at once on the live server.
=====================
Followup - the TV spot was cut to a quick 5 minute spot without a mention of the website and we didn't even need the extra bandwidth of the static hosted site. Oh well, one more for the bag of tricks, I'm sure the experience will come in handy.
Comments
Alternatively, you may also want to check http://drydrop.binaryage.com for hosting a static website on Google App Engine and using GitHub to push changes.
Something to consider is that there are several options for mounting an S3 bucket to a local mountpoint on the server.
This would let you ditch all the extra commands for s3sync and instead just do copies, making things simpler.
@Antonin - drydrop looks like another good alternative (also inexpensive), the main goal was pulling from a non-static site and having a static copy work mostly the same - I think you could probably throw a similar set of scripts together to make that work on app engine.
@Michael - s3sync is nice because it checks for modified files using stored checksums and only updates files that have changed, but perhaps a mounted bucket with rsync could do the same (although it might end up taking longer because it's probably reading a lot more data from S3)