Hotlinking is obnoxious. Hotlinking is theft. Hotlinking is getting harder and harder to prevent. Any web developer worth their salt has run into this problem - what do you do when shady people are willing to steal your work? Keep in mind, we're not talking about open source projects or contributing to a cause - we're basically talking about a jack move. While most developers have decent options with htaccess or server config files, the options are pretty limited once you're serving files from a CDN. Since Cloudfront hotlinking is a battle I've put some time into, I'll walk through the preventative options I've come up with so hopefully you can avoid this annoyance on your future projects.
Cloudfront Hotlinking is Seriously Obnoxious
Put briefly, hotlinking is when someone uses your image URL on their site. Some folks may think that since an image is on your server, someone else's site shouldn't be able to display it - but those people would be wrong. The truth is that if you can see the image and there aren't any preventative measures in place, anyone can use that URL for themselves. This is annoying for two main reasons: 1. Someone is using an image you created and 2. Someone is using your server to host it. This is basically like a car thief asking you to drop your own car off. Think about the worst-case scenario - what if someone builds a website displaying 5,000 of your images? That's gonna put a serious hurt on your server, and if you're serving those files from a CDN you're going to be pretty angry when the monthly bill arrives.
While it shouldn't be too difficult for CDN providers to address this on a server level, it is a step most haven't taken yet. Whether the issue is an internal conflict of interest or lack of motivation, addressing Cloudfront hotlinking has become a bit of a drag for web developers. While working on a recent project using a Cloudfront CDN, we stumbled across Amazon's answer to this problem. A little Googling will show you that the word hasn't quite spread yet, so I'll do a quick walkthrough on Amazon WAF and show you how to get it up and running in short order.
Customize Cloudfront for Optimum Performance and Security
The project that brought about this issue was pretty commonplace, but I'll use my own website to illustrate the steps below. I have a Joomla! core pushing images to Amazon S3. Those images are then pushed over to Cloudfront, which is one of the most popular CDN options on the market. Here's the list of the requirements I had while applying this fix...
- We need to hold files on S3 so we can retain a custom CDN domain name
- Those files should be pushed to a limited Cloudfront network for speedy delivery
- All files (JS/CSS/images) should be reachable by Googlebot & Bingbot
- Images should be reachable by social networks so images are attached to shares
For this example we're using my business website haeckdesign.com. We need the images to come from cdn.haeckdesign.com, since keeping everything under one apex domain is preferable. We need the images to be viewable by everyone, but those images shouldn't be able to be served or extracted from an external location. Just to throw another wrench in the works - we need to make sure JavaScript, CSS stylesheets, and images are crawlable by popular search engines so we abide by all the rules and get some image results. If this were on a basic Apache server we could just run a simple match redirect, but since the files live off server we've got to use Amazon's network to fix this for us - enter WAF, Amazon's Web Application Firewall.
Introducing the WAF Firewall
Science proves that one easy way to sound really smart is to drop a series of acronyms... so with that in mind, let's clarify our goal. We're aiming to define a WAF ruleset to restrict an existing AWS Cloudfront, which is mirroring S3 content, optimizing our CDN distribution while ensuring white-listed access for optimal SEO performance (...am I smart yet?). Back to reality - log in to AWS and from the control panel, look for Amazon WAF on the main product menu. Once you're on the AWS WAF page it'll just take a couple of steps to get your firewall up. I'd personally suggest starting by creating your string match conditions & filters. Here are the basic steps, and after that I'll cover what I did for this particular project.
Step One: Determine your best filtering method. So figure out the least obtrusive way to achieve your goal. The order does matter - so funnel your rules down like you would with Regex redirects.
Step Two: Create a New Web ACL. This is the umbrella that all of your rules are going to fall under. Each rule is a group of filters and each filter is a group of match conditions.
Step Three: Create your Rules & Conditions. In this case, we used "string match conditions" to test request headers (WAF inspects the incoming request, not the response), but IP addresses and other parts of the request can also be used to filter.
Step Four: Tie your rule to your Cloudfront distribution. You can set that when creating your ACL or afterward by viewing your Cloudfront distribution, hitting edit, and clicking the WAF setting right below the title.
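To make the "order does matter" point from Step One a little more concrete, here's a minimal Python sketch of how I picture a Web ACL working: rules are checked top to bottom, the first one that matches decides the outcome, and anything that matches nothing falls through to the default action. This is only a mental model, not Amazon's implementation. The class names, the rule names (block-images, allow-own-site), and the request dictionary layout are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A request is modelled as a plain dict, e.g. {"uri": "...", "referer": "..."}.
# This layout is an assumption for the sketch, not a WAF data structure.
Request = Dict[str, str]


@dataclass
class Rule:
    name: str
    matches: Callable[[Request], bool]  # stands in for a group of match conditions
    action: str                         # "ALLOW" or "BLOCK"


@dataclass
class WebACL:
    rules: List[Rule]      # checked in list order
    default_action: str    # used when no rule matches

    def evaluate(self, request: Request) -> str:
        for rule in self.rules:
            if rule.matches(request):
                return rule.action  # first matching rule wins
        return self.default_action


# Two hypothetical rules to show why ordering matters.
block_images = Rule("block-images", lambda r: "/images/" in r["uri"], "BLOCK")
allow_own_site = Rule(
    "allow-own-site", lambda r: "haeckdesign.com" in r.get("referer", ""), "ALLOW"
)

request = {"uri": "/images/logo.svg", "referer": "https://haeckdesign.com/"}

allow_first = WebACL([allow_own_site, block_images], default_action="ALLOW")
block_first = WebACL([block_images, allow_own_site], default_action="ALLOW")
```

With the same request, allow_first.evaluate(request) lets the image through while block_first.evaluate(request) blocks it, which is why it pays to funnel your rules deliberately.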
Although generating a WAF firewall isn't technically difficult, it can be tricky since you need to think from a different perspective. This can take you anywhere from 5 minutes to a half-hour, but the majority of the time will be spent running through hypothetical "If I were a .png file..." kind of questions. After a little brainstorming, I came up with (what I currently believe to be) the most logical approach. Like most CDN users, I'm pushing heavily used media from Cloudfront... this means CSS, JS, and images. This should be similar to the way Cloudfront is commonly used as a CDN for websites, but make sure you are adjusting the rules to fit the needs of your exact situation.
I find analogies useful when brainstorming processes, so I've used the classic Seinfeld "Soup Nazi" episode to center my logic around. At the risk of sounding outlandish, here it is... "Any passerby should feel safe, but if you're going to enter the soup shop ("/images/") and have the nerve to order Turkey Chili... You better do it right". Running with this concept, I have two match conditions I'll be generating. The first match condition (named SoupShop) will look for URI requests that contain "/images/". The second match condition (named TurkeyChili) checks the Referer header (yes, the HTTP header really is spelled with one "r") and allows requests from my website's domain name and the websites I want to get through. In this case, I whitelisted "bing.com, google.com, facebook.com, pinterest.com, twitter.com, stumbleupon.com". Since I want Googlebot & Bingbot to get through, I also let through any request whose User-Agent header contains "Googlebot" or "Bingbot". While I'm sure this isn't the most optimal approach (since using a "contains" match rarely is), this should help most developers get off to a running start.
Now to tie it all together, I create a new rule (named SoupProtocol) that includes both match conditions. The first condition says to be restrictive of any requests to my "/images/" folder and the second says to allow the whitelisted domains through. Put another way, the rule catches a request that hits "/images/" and does not satisfy TurkeyChili. Then I create a new ACL that uses the "SoupProtocol" rule to block those requests, then defaults everything else to pass right through. Again - this may not be the exact best approach (and if you know a better method, please include it in the comments), but it achieves the goals listed above and does so with minimal requirements from WAF itself.
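If it helps to see the SoupProtocol logic written down, here's a rough Python sketch of the decision as I understand it. This isn't WAF configuration and it isn't how Amazon evaluates things internally; the function names are invented for illustration, and the lowercase comparison is my own assumption. It uses plain "contains" checks on purpose to mirror the string match conditions above, which also means it shares their looseness (a referer that merely contains "google.com" somewhere would pass).

```python
# Domains allowed in the Referer header, taken from the whitelist above
# plus the site's own domain.
ALLOWED_REFERERS = (
    "haeckdesign.com",
    "bing.com",
    "google.com",
    "facebook.com",
    "pinterest.com",
    "twitter.com",
    "stumbleupon.com",
)

# Crawler names allowed in the User-Agent header.
ALLOWED_BOTS = ("googlebot", "bingbot")


def soup_shop(uri: str) -> bool:
    """SoupShop: is the request headed into the /images/ folder?"""
    return "/images/" in uri


def turkey_chili(referer: str, user_agent: str) -> bool:
    """TurkeyChili: did the request 'order correctly' (whitelisted referer or bot)?"""
    referer = referer.lower()        # lowercasing is an assumption of this sketch
    user_agent = user_agent.lower()
    referer_ok = any(domain in referer for domain in ALLOWED_REFERERS)
    bot_ok = any(bot in user_agent for bot in ALLOWED_BOTS)
    return referer_ok or bot_ok


def soup_protocol(uri: str, referer: str = "", user_agent: str = "") -> str:
    """Block image requests that fail TurkeyChili; everything else passes."""
    if soup_shop(uri) and not turkey_chili(referer, user_agent):
        return "BLOCK"
    return "ALLOW"
```

Running a few hypothetical requests through soup_protocol (an image with no referer, an image requested from haeckdesign.com, a stylesheet from anywhere) is a cheap way to do the "If I were a .png file..." brainstorming before touching the AWS console.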
Testing Your WAF Firewall
Now that we have everything set up, let's make sure it's performing as needed so we can put on the finishing touches. Running a quick curl while the firewall is still distributing to Cloudfront (so before the rules have taken effect), I expect to see a 200 for a basic image request. In this example, I request my logo with a curl request that looks like this...
curl -I https://cdn.haeckdesign.com/images/haeck-design-logo.svg
This curl response is exactly what you should expect from any query before putting a firewall up. In technical terms, a 200 response means that everything is "all clear" on the call and response.
After giving WAF a few minutes to distribute itself throughout Cloudfront, I'm going to run that command again. In this case, I'm expecting to see a 403 Forbidden. As you'll notice, that's just what we got. Don't celebrate yet though - all that test really means is that we got an error upon retrieval. We still need to ensure that the image is both publicly viewable and also completely accessible to the search engine bots and social networks we whitelisted earlier.
One way to ensure that it's publicly viewable is simply to visit the site yourself. After that checks out we're going to run one final curl that mimics the request the file will get when called from Googlebot. That can be achieved with the following terminal command...
curl --user-agent "Googlebot/2.1 (+http://www.google.com/bot.html)" -v "https://cdn.haeckdesign.com/images/haeck-design-logo-raleigh-nc.svg"
As long as we see a 200 response, we should feel safe that we've successfully done the job. HIGH FIVE!! If you're the paranoid type, you could always submit the file to Google Webmaster's Fetch as Google tool or Facebook's Linter. Everything should get through with no hassles. Once you've completed testing your WAF firewall, you should certainly keep an eye out for any errors being triggered in Google Webmaster or elsewhere if you use another third-party tracking tool. It may also be a good idea to check back into AWS in a couple of days to review the WAF log and make sure you're not incurring ridiculous fees by having an overactive firewall. If something does look off, don't forget that you can just log in to your Cloudfront distribution and temporarily pause your WAF filter fairly easily.
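If you'd rather not retype those curl commands every time you tweak a rule, the same checks can be scripted. Here's a small Python sketch using only the standard library. The function names (head_status, run_checks) and the three test cases are my own, and the expected status codes assume your rules behave like the ones described above: 403 with no referer, 200 with your own site as the referer, 200 for a Googlebot user agent.

```python
import urllib.error
import urllib.request

GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def head_status(url, headers=None):
    """Send a HEAD request (like curl -I) and return the HTTP status code."""
    request = urllib.request.Request(url, headers=headers or {}, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return error.code


def run_checks(url, site_referer, fetch=head_status):
    """Return (label, expected, actual, passed) for each test case."""
    cases = [
        ("no referer (hotlink-style request)", {}, 403),
        ("referer is my own site", {"Referer": site_referer}, 200),
        ("Googlebot user agent", {"User-Agent": GOOGLEBOT_UA}, 200),
    ]
    results = []
    for label, headers, expected in cases:
        actual = fetch(url, headers)
        results.append((label, expected, actual, actual == expected))
    return results


# Example usage (this hits the network, so it is left commented out):
# for label, expected, actual, passed in run_checks(
#     "https://cdn.haeckdesign.com/images/haeck-design-logo-raleigh-nc.svg",
#     "https://haeckdesign.com/",
# ):
#     print(f"{'PASS' if passed else 'FAIL'}: {label} (expected {expected}, got {actual})")
```

The fetch parameter is only there so the checks can be exercised with a stand-in function instead of a live request. Keep in mind that a script like this doesn't replace checking the page in a real browser, since browsers decide for themselves when to send a Referer header.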
Not a Perfect Fix... But awfully close.
Patch fixes are rarely perfect and using WAF is no exception. While it does prevent hotlinking, there is one notable drawback - the cost. After running WAF for a month I checked back to see the bill estimates, expecting to see a negligible cost... I was disappointed. While it was only a buck or two, the underlying structure is what bothered me. It appears as though AWS is charging per hit - which is not ideal. That's similar to a bouncer charging you per "troublemaker" kicked out. If you run with that concept over time you'll probably notice that the focus isn't really on providing security, it's more on reporting incidents. I may be expecting too much, but we're not trying to do anything overly complicated here. That said - these are the current WAF rates and Amazon has historically tended towards lowering pricing after rolling out infrastructure. I hope WAF won't be an exception to that rule. Ultimately a bucket policy setup similar to S3 might be the solution, but I do understand that the most important function of Cloudfront is providing media quickly - so worrying about restricting that should be a lower priority on the task-list.
One other point I'd suggest keeping in mind is that some distribution tools should have access. Google, Bing, and popular social networks are the obvious ones, but if you rely on feed readers and news apps like Flipboard or Apple News - you may want to do some testing to ensure this filter doesn't get in the way. This is a "case-by-case" thing, but I do suggest you give it a look before marking the job finished.
All things considered, using WAF to prevent Cloudfront hotlinking is the best option going and certainly suitable for production use. AWS is constantly changing though, so if you have some ways to optimize this approach please feel free to include them in the comments or on your favorite social network. If you've found this post useful, share it with the buttons below, and thanks for stopping by.