We occasionally see major traffic spikes from robots and spiders on the site. While the traffic is a good thing, it is lower priority, and unfortunately there is no way to tell Apache to work more slowly on some requests. If we were a smaller site, our Varnish configuration would be sufficient, since every page could be cached for anonymous access. Unfortunately, we have a database of 20,000+ publicly accessible nodes, and spiders want to see all of them. When the wrong factors line up, the result is a minor DoS attack.
Here are some of our rejected options:
- Apache's mod_qos module would allow us to simply drop the excess traffic, but that is not what we want to happen.
- We can stick with the basic install and our current resources, but that means all traffic slows way down during one of these spikes.
- We can scale the site as a whole. Unfortunately, some spiders will eat whatever we feed them, so we would add cost without really solving the problem.
Our end solution is to segment this traffic onto a separate EC2 instance. Using some basic VCL, we direct all anonymous traffic for a particular node type (i.e., URL prefix) to a specific Apache instance. Voila! When that low-priority traffic spikes, it only affects other low-priority hits to the site. Since human visitors are likely to consistently hit a smaller subset of those pages, they still benefit from the Varnish caching on the front end, so they will not experience the full slowdown that spiders hit on the first request to each page.
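In VCL terms, the routing can be sketched roughly as follows. This is an illustrative sketch, not our production configuration: the backend addresses, the `/node` URL prefix, and the `SESS` cookie check (a common Drupal session cookie pattern) are all assumptions standing in for the real values.

```vcl
vcl 4.0;

# Primary Apache instance serving normal traffic.
backend default {
    .host = "10.0.0.10";  # hypothetical address
    .port = "80";
}

# Separate EC2 instance that absorbs low-priority spider traffic.
backend spider_pool {
    .host = "10.0.0.20";  # hypothetical address
    .port = "80";
}

sub vcl_recv {
    # Anonymous requests carry no session cookie. Route anonymous hits
    # on the public node pages to the low-priority backend; everything
    # else stays on the primary instance.
    if (req.url ~ "^/node" && req.http.Cookie !~ "SESS") {
        set req.backend_hint = spider_pool;
    } else {
        set req.backend_hint = default;
    }
}
```

With this split, a spider crawling all 20,000+ node pages can only saturate the dedicated instance, while logged-in users and other page types continue to hit the primary backend unaffected.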