Getting rid of RSS slammers

October 12th, 2005 by Samuel Tardieu

A few weeks ago, I noticed that some people were fetching my RSS feed once every minute. The load on the web server was already high, so I looked for a much cheaper solution on my side and found one: redirect them to the RSScache service through an Apache redirection.

This morning, I read that Daniel Glazman had the same problem and I suggested (in a private email, as he forbids comments on his blog) that he do the same. After discussing it for a while, we thought it could be a good idea to automate the process.

I wrote a small Python script called rssabuse.py which parses your web server access log, tries to detect the abusers for the previous day and rewrites part of your .htaccess so that abusers are redirected transparently to RSScache. OK, they may get extra advertisements in the feed, so what? This is their problem, not yours. An HTTP redirection is much less costly than serving a full feed, and they can still follow your blog activity. This should work with many blog engines (WordPress or DotClear, for example), provided that you can use Apache’s mod_rewrite in your .htaccess.

The idea is to put something like this in your .htaccess:

RewriteEngine on
RewriteBase /blog
# rssabuse section
RewriteCond %{REMOTE_ADDR} 0.0.0.0  [replaced later by this script]
RewriteRule ^(feed.*)$ http://my.rsscache.com/www.rfc1149.net/blog/$1 [R,L]
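When several hosts have to be redirected, mod_rewrite lets consecutive RewriteCond lines be chained with the [OR] flag. The sketch below shows one way such a list of conditions could be generated. The function name rewrite_conditions is hypothetical, it only handles IP addresses, and it is not taken from rssabuse.py, whose actual output may differ.

```python
import re


def rewrite_conditions(addresses):
    """Return RewriteCond lines that match any of the given IP addresses."""
    ordered = sorted(set(addresses))
    # With no abusers, keep a condition that no real client can match,
    # mirroring the 0.0.0.0 placeholder of the example above.
    if not ordered:
        ordered = ["0.0.0.0"]
    lines = []
    for index, address in enumerate(ordered):
        # Escape the dots and anchor the pattern so that 1.2.3.4
        # does not also match 11.2.3.40.
        pattern = "^" + re.escape(address) + "$"
        # Every condition except the last one is chained with [OR].
        flag = " [OR]" if index < len(ordered) - 1 else ""
        lines.append("RewriteCond %{REMOTE_ADDR} " + pattern + flag)
    return lines


if __name__ == "__main__":
    # Documentation-only addresses, used here purely as an illustration.
    print("\n".join(rewrite_conditions(["192.0.2.7", "198.51.100.23"])))
```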

Then, every night, shortly after midnight, you launch the script (through a crontab for example):

rssabuse.py /home/log/apache/access.log '^/blog/feed' 100 /home/sam/blog/.htaccess

(100 means 96 times a day, that is once every 15 minutes, plus a few hits to be on the safe side)

The script counts the accesses whose path matches the regular expression ^/blog/feed and redirects the hosts (identified by name or address) that abuse your feeds to RSScache by rewriting your .htaccess file. You should see your server load decrease as the abusers are kept away.
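The counting step can be sketched roughly as follows. This is an illustration, not the code of rssabuse.py: it assumes the Apache common or combined log format, the names LOG_LINE and find_abusers are hypothetical, and it treats a host as an abuser when it reaches the threshold (the real script may use a strict comparison instead).

```python
import re
from collections import Counter

# Assumed Apache common/combined log format:
# host ident user [day/Mon/year:time zone] "METHOD path protocol" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^:\]]+):[^\]]*\] "\S+ (\S+)')


def find_abusers(lines, path_regex, threshold, day):
    """Return {host: hits} for hosts with at least `threshold` hits on `day`.

    `day` uses the notation found in the log, for example "11/Oct/2005".
    """
    path = re.compile(path_regex)
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like log entries
        host, date, requested = match.groups()
        if date == day and path.search(requested):
            hits[host] += 1
    # Assumption: reaching the threshold is enough to be flagged.
    return {host: count for host, count in hits.items() if count >= threshold}
```

With the command line shown above, this would correspond to something like find_abusers(open("/home/log/apache/access.log"), "^/blog/feed", 100, day), where day is the previous day's date.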

A note for the technical junkies: the script will try very hard to make the file update atomic so that no hit to your web server can see a partial or missing .htaccess.
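For the curious, here is a rough sketch of such an update, following the three-step strategy described in the comments below (atomic rename, then staging a copy next to the target, then overwriting in place). The function name replace_file is hypothetical and this is not the code of the script's own rename_safely function. It assumes that the target file already exists.

```python
import os
import shutil
import tempfile


def replace_file(temp_path, target_path):
    """Move temp_path over target_path, preferring atomic strategies."""
    # Give the new file the permissions of the one it replaces: temporary
    # files are usually created readable by their owner only.
    shutil.copymode(target_path, temp_path)
    try:
        # 1. Atomic when both paths live on the same filesystem.
        os.replace(temp_path, target_path)
        return
    except OSError:
        pass
    staged = None
    try:
        # 2. Stage a copy next to the target, then rename it atomically.
        directory = os.path.dirname(os.path.abspath(target_path))
        fd, staged = tempfile.mkstemp(dir=directory)
        os.close(fd)
        shutil.copy(temp_path, staged)  # copies data and permission bits
        os.replace(staged, target_path)
    except OSError:
        if staged is not None and os.path.exists(staged):
            os.remove(staged)
        # 3. Last resort, not atomic: overwrite the existing file in place
        #    without creating a new one.
        with open(temp_path, "rb") as source, open(target_path, "r+b") as target:
            target.write(source.read())
            target.truncate()
    os.remove(temp_path)
```

Only the first two steps are atomic; the third one is a fallback for the case where the file is writable but its directory is not.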

rssabuse.py is made available under the GNU General Public License version 2.

  • Version 1.0: initial release
  • Version 1.1: the list of abusers is available on standard output so that you can see that it is working
  • Version 1.2: fix a bug in date computation and output more helpful statistics with the number of accesses that caused a host to be blocked

18 Responses to “Getting rid of RSS slammers”

  1. Thomas Says:

    There’s a not missing after their problem;
    is a chronological personal web site still a blog if you cannot leave comments?
    I think the crontab entry needs to refer to rssabuse.py

    Feel free to delete this comment ;-)

  2. Thomas Says:

    Why don’t you generate the new file under a temporary name in the same directory as the old one unconditionally?

  3. Samuel Tardieu Says:

    Thanks Thomas, I’ve fixed the two typos in the post.

    Concerning your remark about comments, I would tend to agree with you. However, some people think that comments are not appropriate. As I told Daniel today, I do not find blogs where comments are disabled very attractive, especially when you cannot even use trackbacks to post followups on your own blog.

  4. Samuel Tardieu Says:

    Thomas: because you may be allowed to write into the file but disallowed to create a new file in the same directory.

  5. Pierre Says:

    Why don’t you simply generate a static page containing the RSS stuff? You could update it once or twice an hour by cron, or even purely on an as-needed basis: after all, you’re the one who’s in the best position to do that. Serving a static page is much less costly than serving a dynamic page.

  6. Samuel Tardieu Says:

    Pierre: sure, that would be better.

    But there are already some optimizations in WordPress, such as sending back a 304 if the feed has not been modified, provided that the client is intelligent enough. And I want to punish abusers as well as alleviate the load on my web server :-)

  7. Thomas Says:

    Sam, you seem to have become moderate :-) I distinctly remember a time when you would have argued that such web sites weren’t blogs at all!

    Your rename_safely function does assume that it can create the temp file in the proper directory.

  8. Samuel Tardieu Says:

    Thomas: well, I had to agree with other people on a common definition for blog. For me, blogs without comments and trackbacks are not real blogs, but if I am the only one to use this definition, the communication will hardly be easy.

    The rename_safely function doesn’t assume that it can create a file in the target directory. It first tries to atomically rename the temporary file into the proper one. If that fails (a cross-device rename, for example), it tries to create a file in the target directory and atomically rename it into the target file. As a last resort, it opens the target file for writing (without creating a new one) and copies the content of the temporary file into it.

    Of course, I assume that Python is properly configured so that the tempfile module can create temporary files, typically in /tmp (the size is not an issue as .htaccess files tend to be very small).

  9. Matthieu Says:

    Hi,
    the idea seems great, but I can’t stop thinking about my own case.
    I’m in a big company (~200,000 employees) and we go through a reverse proxy to reach the Internet, so when my coworkers or I hit your website you will see a single IP address…

  10. Samuel Tardieu Says:

    Matthieu: so the proxy should be caching the feed information, right?

  11. Daniel Glazman Says:

    Samuel: my blog has no open comments/trackbacks because I was fed up with insults, trolls and other forms of intrusion into **MY** personal diary. I publish for myself, not for others. I just do not care about the way people call my web site. They can call it “blog” or “foobar”, only the contents matter.

  12. Freako Says:

    You know that you can earn money from the ads in your RSScache feed? So, it’s payback time with those abusers!

  13. Samuel Tardieu Says:

    Freako: good idea, I just activated it!

  14. Clarky's Corner Says:

    Odds and ends

    Samuel Tardieu proposes a rather interesting solution to fight people who drain bandwidth by using RSS clients that do not respect a minimum waiting time between two refreshes…

  15. Brenda Quiles Says:

    Samuel, Thanks! Great work! :)

  16. Tangy Says:

    Thank you so much for this! I have a couple of sites that get just jacked on their RSS feeds from slammers… I will definitely be implementing this solution.

    cheers!

  17. Chicago Realto Says:

    Thanks for the post! I have two new websites and I have noted the same problem, so now I will try your way to solve it this weekend.

  18. seo ranking Says:

    Great articles & Nice a site

Creative Commons License