An avian carrier's blog – Ruby Atom feed

Ruby programming language
  1. Who did resurrect will-spam-for-food? (2012-01-22)

    A loooong time ago (was it 15 years ago?), two friends and I created the will-spam-for-food.eu.org DNSBL, also known as WSFF. WSFF was a honeypot-based system whose aim was to prevent mass spam from reaching its victims by catching and blocking the sender's IP address early in the process. The system was first written in Ruby, a very young language at that time, then rewritten in Python because using threads in Ruby on 64-bit SparcLinux was very hazardous then and led to frequent crashes.

    A few years later, we no longer had time to do the routine WSFF maintenance, and decided to shut down the blacklist. We even unregistered the domain name to make sure that no one would continue to use a stale copy of the blacklist. All went well, until today: I received several emails from site administrators complaining that their site had been added to the WSFF blacklist and asking for removal. I am still waiting for full reports in order to understand what is currently happening.

    Let me be clear: the WSFF blacklist does not exist anymore and has not existed for years. Whoever tells you that you have been added to this blacklist is either lying or running a badly configured email system. Sending removal requests is useless, as we cannot remove you from a non-existent blacklist.

    Note: I will redirect the old contact URL to this post so that system administrators can see this.

    Update 1 (2012-01-22 10:00 UTC): all traces point to MXToolBox, a company that monitors blacklists for its customers. I have contacted them on Twitter and on their two contact email addresses to let them know they are crying wolf. If you have received such a bogus notification, do not hesitate to send them this page's address.

    Update 2 (2012-01-22 17:30 UTC): according to the commenter Kristy C below, MXToolBox stated that they would be removing WSFF from their list.

    Update 3 (2012-01-23 15:00 UTC): an engineer at MxToolBox commented below that WSFF has been disabled in their tool.

  2. Feed and relative links (2010-12-27)

    Yesterday, the Factor section of this blog was added to Planet Factor. Soon after, Jon Harper noticed that some links in one of my posts were incorrectly directed onto the Planet Factor site.

    In fact, it is perfectly allowed to have relative links in an Atom feed, both in the structured part and in the HTML one. The URL resolution mechanism starts from the feed address, unless it is overridden by one or more xml:base attributes in the feed itself, according to the XML Base specification. Unfortunately, as of today the Factor feed parser does not handle relative URLs at all and leaves them unchanged.
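
    As a rough illustration of that resolution rule, here is a minimal sketch using Ruby's standard uri library. The resolve helper and the example.org addresses are made up for the example, and a real feed may nest several xml:base values, which this sketch does not handle.

```ruby
require 'uri'

# Resolve `link` against the feed address, or against an xml:base value
# when one is in scope. The xml:base may itself be relative to the feed.
# (Hypothetical helper, not part of any feed library.)
def resolve(link, feed_url, xml_base = nil)
  base = URI.parse(feed_url)
  base = base.merge(xml_base) if xml_base
  base.merge(link).to_s
end

feed = 'http://example.org/blog/feeds/factor.atom'   # assumed feed address

puts resolve('../2010/12/post.html', feed)
puts resolve('images/stack.png', feed, 'http://example.org/blog/2010/12/')
puts resolve('http://factorcode.org/', feed)          # already absolute
```

    An absolute link goes through unchanged, so the same call can be applied to every link in an entry.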

    It needs fixing of course, but since it is unlikely that the Factor feed parser is the only parser with such a bug, I tweaked this site's feed generation. Each post goes through the following filter before getting into the Atom feed:

    require 'jekyll'
    require 'rexml/document'
    require 'uri'
    
    module AbsoluteLinks
    
      BASE = URI.parse(Jekyll.configuration({})['url'])
    
      # The complete list should be cite, classid,
      # codebase, data, href, longdesc, src, and usemap
      # but we only use a few of them.
      TOFIX = ['cite', 'href', 'src']
    
      def fix_link(post, attr)
        # Visit every element carrying the attribute and make its URL absolute.
        post.each_element("//*[@#{attr}]") { |e|
          origin = e.attributes[attr]
          e.attributes[attr] = BASE.merge(origin).to_s
        }
      end
    
      def absolute_links(input)
        post = REXML::Document.new("<post>#{input}</post>").root
        TOFIX.each {|attr| fix_link(post, attr)}
        post.to_s[6..-8]
      end
    
    end
    

    Only cite, href and src are handled here instead of the whole list given in the comments. REXML (a Ruby XML library) is slow enough that I avoid looking for every attribute. A SAX-based parser may be more appropriate here since it would need only a single pass over the document.
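
    To give an idea of what the single-pass approach could look like, here is a sketch built on REXML's own stream (SAX-like) interface. It only collects the absolute URLs; rewriting the post would also require re-serializing every event. The LinkCollector class and the example.com base address are assumptions made for the example.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'uri'

# Collects, in one pass, the absolute form of every URL-carrying attribute.
# (Hypothetical class, written for this illustration only.)
class LinkCollector
  include REXML::StreamListener

  URL_ATTRIBUTES = %w[cite classid codebase data href longdesc src usemap].freeze

  attr_reader :links

  def initialize(base)
    @base = URI.parse(base)
    @links = []
  end

  # Called once per opening tag with the tag name and its attributes.
  def tag_start(_name, attrs)
    attrs.each do |key, value|
      @links << @base.merge(value).to_s if URL_ATTRIBUTES.include?(key)
    end
  end
end

collector = LinkCollector.new('http://example.com/blog/')   # assumed base URL
source = '<post><a href="../x">x</a><img src="y.png"/></post>'
REXML::Document.parse_stream(source, collector)
puts collector.links
```

    All eight attributes are checked during the same pass, so handling the full list costs no extra traversal.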

    Also, I used an ugly hack to have REXML parse the post content as one element where there are several paragraphs and sections. The content gets encapsulated into a <post/> XML tag which gets brutally removed at the end by a crude string manipulation.
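
    A slightly less brutal way to drop the wrapper might be to serialize the children of the wrapping element instead of slicing the string. The inner_xml helper below is only a sketch, not what this site uses:

```ruby
require 'rexml/document'

# Parse a fragment inside a temporary <post> wrapper, then serialize the
# wrapper's children only, so no string offsets are needed.
# (Hypothetical helper name.)
def inner_xml(fragment)
  root = REXML::Document.new("<post>#{fragment}</post>").root
  root.children.map(&:to_s).join
end

puts inner_xml('<p>one</p><p>two <a href="/x">link</a></p>')
```

    The wrapper is still needed for parsing, but the result no longer depends on the length of the tag name.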

    Now, someone should fix the Factor feed parser and let it properly handle relative URLs. This is more complicated than it sounds, as it requires parsing and changing the (possibly semi-valid) HTML content from the feed entries.

  3. Something nice about every language I use (2010-12-09)

    I'll follow Dave Ray and will try to say something nice about a bunch of programming languages I use or have used seriously:

    • Ada – The only language I would trust my life to.

    • C – It gets things done easily in a controlled space when resources are scarce. I use it in many embedded situations, often with FreeRTOS.

    • C++ – Its templating system with specialization beats everything I know. When I worked on Urbi at Gostai, I had a lot of pleasure using it.

    • Erlang – The language to use to develop distributable parallel applications. I wrote many programs for research projects with it.

    • Factor – One of the languages I feel the most comfortable with. I really like the reverse polish notation and the powerful combinators. I use it for many personal and teaching projects.

    • Forth – Forth is one of the languages that I have liked since the first time I heard about it. Its conciseness, simplicity, grammar and ease of implementation beat almost everything when it comes to size on very small embedded systems. I used it to write a Forth compiler targeting the Microchip PIC16Fxxx microcontroller family.

    • Haskell – I started using it when I had to send patches for Darcs. I really love monads, and I also love explaining them in class. My window manager configuration is also written in Haskell.

    • J – It is unbeatable if you have RSI and need to type as few characters as possible for a task that can be applied to a whole array. I use it mostly to solve Project Euler problems.

    • Java – Well, everyone knows it so it may be used to explain a simple concept. Is that nice enough?

    • Javascript – Javascript lets us do things in the browser I would not have imagined five years ago. For example, this web page is static but includes Twitter updates and comments, thanks to Javascript. On the server side, I use it within a CouchDB database where I store a whole web application; it dynamically generates iCalendar views for multiple people from data gathered at TVrage.com using their XML API.

    • Python – I can hack anything in a few minutes and still be able to read it later. I wrote a Forth compiler for the Microchip PIC18Fxxx microcontroller family with it.

    • Ruby – Feels like Python, only more functional and cleaner. I would use it more if I had not been bitten by threading instability on Sparc64Linux in the past (for the will-spam-for-food.eu.org service we ran with Pierre Beyssac and Thomas Quinot). Ruby helps me run this blog.

    • Scala – At last, a useful, powerful and pleasant-to-use language targeting the Java virtual machine. I used it to write my HarassMe Android application.

    I probably forgot some languages in the list. However, if I use them, I am sure I can say something nice about them.

  4. Jekyll and live feeds update (2010-11-28)

    Before I used Jekyll, Wordpress was running my blog. One thing I noticed while using Wordpress was that Google and other blog search engines were fetching my new posts a few seconds after I published them.

    To achieve this, Wordpress uses two different mechanisms:

    1. It sends a ping to some services which in turn fetch your feeds. Some concentrators such as ping-o-matic allow you to ping them, and they in turn ping various search engines for you so that you don't have to. Then each search engine decides whether or not it will crawl your blog again.

    2. Wordpress also uses the recent pubsubhubbub protocol (what a lovely name!). In your feed, you declare the address of a hub where interested parties can send subscription requests. Then, when a new article is published on your blog, Wordpress sends a ping to the hub, and the hub retrieves your feed. If the feed has changed, it is sent to the subscribers using a callback address they registered when they subscribed. This way, interested services such as Google do not have to retrieve the feed themselves, as it will get pushed to them when it contains new items.

    It is easy to enhance a Jekyll blog with the pubsubhubbub system, because:

    • there exist public open pubsubhubbub hubs, such as the well-known https://pubsubhubbub.appspot.com;
    • you may send the ping message from anywhere, not necessarily from the server.

    The first thing to do is to add hub information to your Atom or RSS feeds. For an Atom feed, you may add the following to the feed section:

    <feed xmlns="http://www.w3.org/2005/Atom">
      <link rel="hub" href="https://pubsubhubbub.appspot.com"/>
      ...
    </feed>
    

    while an RSS feed would contain

    <rss xmlns:atom="http://www.w3.org/2005/Atom">
      <channel>
        <atom:link rel="hub" href="https://pubsubhubbub.appspot.com"/>
        ...
      </channel>
    </rss>
    

    Then you may want to ensure that you can tell the hub that your feed has some fresh interesting content by pinging it. If you don't, your feed will be retrieved at regular intervals, but you will lose the benefit of using pubsubhubbub. If you are using rake for your development, you may want to create a :ping task which will send the ping when you run it:

    desc 'Ping pubsubhubbub server.'
    task :ping do
      require 'cgi'
      require 'net/http'
      puts 'Pinging pubsubhubbub server'
      # Form body: hub.mode=publish plus the URL-encoded address of the updated feed.
      data = 'hub.mode=publish&hub.url=' + CGI.escape('http://address.of.your/feed/')
      http = Net::HTTP.new('pubsubhubbub.appspot.com', 80)
      # Net::HTTP#post returns a single response object on current Ruby versions.
      resp = http.post('/publish',
                       data,
                       {'Content-Type' => 'application/x-www-form-urlencoded'})
    
      # The hub answers 204 No Content when the ping is accepted.
      puts "Ping error: #{resp.code}, #{resp.body}" unless resp.code == "204"
    end
    

    If you prefer to use make, then a similar target using wget or curl would do the job. The only thing you need to do is send a POST request to http://pubsubhubbub.appspot.com/publish with a URL-encoded form containing the following two fields:

    • hub.mode: the literal string publish.
    • hub.url: the URL of your updated feed. This can be repeated multiple times if several feeds have been updated at once.
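
    For instance, a form body announcing two updated feeds at once could be built as in the following sketch. The publish_body helper and the example.org feed addresses are assumptions for the example; URI.encode_www_form comes from Ruby's standard library.

```ruby
require 'uri'

# Build the URL-encoded form body for a publish ping, repeating the
# hub.url field once per updated feed. (Hypothetical helper name.)
def publish_body(feed_urls)
  fields = [['hub.mode', 'publish']] + feed_urls.map { |url| ['hub.url', url] }
  URI.encode_www_form(fields)
end

puts publish_body(['http://example.org/feed/en.atom',
                   'http://example.org/feed/fr.atom'])
```

    The resulting string can be sent as the body of the POST request, with the application/x-www-form-urlencoded content type.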

    Note that in real life, my rake rule is much more complex: since I have separate feeds for the two languages I use on this blog, as well as one feed per tag, my Rakefile contains code to check whether posts have been updated in the last 24 hours, and all the feeds that might have changed (and only these) will be signalled to the hub.

    What can you do with those realtime updates? You can start using services such as twitterfeed to post Twitter notices of your blog posts right after they appear on your site, or you can use PuSH Bot to get live updates in your XMPP stream (in Google Talk for example). This is really as easy as pie; there is no reason your blog should not be using it right now.
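
    The selection logic could look roughly like the sketch below. It is not my actual Rakefile: the Post struct, the feed URL layout and the helper names are all invented for the illustration.

```ruby
# Hypothetical post record: language, list of tags, last modification time.
Post = Struct.new(:lang, :tags, :updated_at)

# Feeds in which a post appears: its language feed and one feed per tag.
# (The URL layout is an assumption.)
def feeds_for(post, site)
  ["#{site}/feeds/#{post.lang}.atom"] +
    post.tags.map { |tag| "#{site}/feeds/tag/#{tag}.atom" }
end

# Feeds that may have changed: those of posts updated within the window
# (24 hours by default), without duplicates.
def feeds_to_ping(posts, site, now = Time.now, window = 24 * 60 * 60)
  posts.select { |post| now - post.updated_at <= window }
       .flat_map { |post| feeds_for(post, site) }
       .uniq
end

now = Time.now
posts = [Post.new('en', %w[ruby jekyll], now - 3600),
         Post.new('fr', %w[ruby], now - 3 * 24 * 60 * 60)]
puts feeds_to_ping(posts, 'http://example.org', now)
```

    Each resulting address would then become one hub.url field of the ping.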

    How will I publish this very post? I will just do

    rake install ping
    

    and be done with it.

  5. There must be a better way (2010-11-24)

    Since I now use Jekyll to generate this web site, I had to find a way to convert tag names into nice ASCII-only-lowercase symbols. For example, Free Software would become free-software and Éducation would become education.

    One solution I came up with is a slugify filter which uses the unicode Ruby gem. After converting the string to lower case and decomposing æ and œ to ae and oe respectively, it uses the Unicode normalization form KD (NFKD), which separates base characters from their accent marks as shown in this figure. Then only plain ASCII word characters, spaces and hyphens are kept, and runs of spaces and hyphens are replaced by a single hyphen.

    # -*- coding: utf-8 -*-
    module Slugify
    
      require 'unicode'
    
      def slugify(input)
        t = Unicode::nfkd(input.downcase.gsub('æ', 'ae').gsub('œ', 'oe'))
        t.gsub(/[^\w\s-]/, '').gsub(/[\s-]+/, '-').downcase
      end
    
    end
    

    This way, I can link to the tag page using <a href="/blog/tag/{{ tag | slugify }}">{{ tag }}</a> without fearing that some software will choke on the URL. It works well and I am now satisfied with this function, so I removed the questions that were there in previous versions of this post. The only thing I dislike is the double downcase call, due to the fact that some characters cannot be downcased without knowing more about the language used.
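
    For comparison, the same steps can be sketched without the unicode gem, assuming a Ruby recent enough to provide String#unicode_normalize and a Unicode-aware downcase (2.4 or later). This is a variant written for illustration, not the filter this site runs; the character class is spelled out in ASCII so that it does not depend on how \w treats non-ASCII letters.

```ruby
# Variant of the slugify filter relying only on the standard library.
def slugify(input)
  # Lower-case first, expand the two ligatures, then decompose accented
  # characters into a base letter followed by combining marks (NFKD).
  t = input.downcase.gsub('æ', 'ae').gsub('œ', 'oe').unicode_normalize(:nfkd)
  # Drop everything except ASCII word characters, whitespace and hyphens,
  # then collapse runs of whitespace and hyphens into a single hyphen.
  t.gsub(/[^a-z0-9_\s-]/, '').gsub(/[\s-]+/, '-')
end

puts slugify('Free Software')
puts slugify('Éducation')
```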

    Edit: updated to match the name and behaviour of Django's slugify as per Ricardo Buring's comment, with additional "æ" to "ae" and "œ" to "oe" translations.