Feed and relative links

Samuel Tardieu, 2010-12-27

Yesterday, the Factor section of this blog was added to Planet Factor. Soon after, Jon Harper noticed that some links in one of my posts were incorrectly directed onto the Planet Factor site.

In fact, relative links are perfectly allowed in an Atom feed, both in the structured part and in the HTML content. URL resolution starts from the feed address, unless it is overridden by one or more xml:base attributes in the feed itself, as described in the XML Base specification. Unfortunately, as of today the Factor feed parser does not handle relative URLs at all and leaves them unchanged.
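To make the resolution rule concrete, here is a minimal sketch using Ruby's standard uri library. The resolve_link helper, the feed address and the xml:base value are all made up for the illustration; a real feed consumer would also have to track which xml:base is in scope for each element.

```ruby
require 'uri'

# Hypothetical helper: resolve +link+ against the feed address, or against
# xml:base when one is given (xml:base may itself be relative to the feed).
def resolve_link(feed_url, link, xml_base = nil)
  base = URI.parse(feed_url)
  base = base.merge(xml_base) if xml_base
  base.merge(link).to_s
end

# Example values, not taken from any real feed.
feed = 'http://example.org/blog/factor/atom.xml'
puts resolve_link(feed, 'images/diagram.png')
puts resolve_link(feed, 'images/diagram.png', '/archive/2010/')
```

The first call resolves the link next to the feed document, the second one under the directory named by xml:base.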

It needs fixing of course, but since the Factor feed parser is unlikely to be the only parser with such a bug, I tweaked this site's feed generation. Each post goes through the following filter before getting into the Atom feed:

require 'jekyll'
require 'rexml/document'
require 'uri'

module AbsoluteLinks

  BASE = URI.parse(Jekyll.configuration({})['url'])

  # The complete list should be cite, classid,
  # codebase, data, href, longdesc, src, and usemap
  # but we only use a few of them.
  TOFIX = ['cite', 'href', 'src']

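  # Rewrite every occurrence of attribute +attr+ as an absolute URL.
  def fix_link(post, attr)
    post.each_element("//*[@#{attr}]") { |e|
      origin = e.attributes[attr]
      # URI#merge returns a URI object; store its string form.
      e.attributes[attr] = BASE.merge(origin).to_s
    }
  end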

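  def absolute_links(input)
    # Wrap the content so that REXML sees a single root element.
    post = REXML::Document.new("<post>#{input}</post>").root
    TOFIX.each {|attr| fix_link(post, attr)}
    # Drop the leading "<post>" and the trailing "</post>".
    post.to_s[6..-8]
  end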

end

Only cite, href and src are handled here instead of the whole list given in the comment. Each attribute costs a full traversal of the tree, and REXML (the Ruby XML library) is slow enough that it is not worth searching for attributes this site never uses. A SAX-based parser may be more appropriate here since it would need only a single pass over the content.

Also, I used an ugly hack to have REXML parse the post content as a single element even though it contains several paragraphs and sections. The content gets wrapped in a <post> element, which is brutally removed at the end by a crude string manipulation.
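Both remarks could be addressed without leaving REXML. The sketch below is not a SAX parser and is not what this site runs: it still builds the tree, but walks it once and checks every attribute on each element, then serializes the children of the wrapper instead of slicing the string. The SKETCH_BASE address and the method name are assumptions made for the example.

```ruby
require 'rexml/document'
require 'uri'

# Hypothetical site address; the real filter reads it from the Jekyll configuration.
SKETCH_BASE = URI.parse('http://example.org/blog/')
SKETCH_ATTRS = %w[cite href src]

def absolute_links_single_pass(input)
  # The wrapper is still needed so that REXML sees a single root element.
  post = REXML::Document.new("<post>#{input}</post>").root
  # One walk over all descendant elements, testing every attribute each time.
  post.each_recursive do |element|
    SKETCH_ATTRS.each do |attr|
      origin = element.attributes[attr]
      element.attributes[attr] = SKETCH_BASE.merge(origin).to_s if origin
    end
  end
  # Serialize the children only, so the wrapper never reaches the output.
  post.children.map(&:to_s).join
end

puts absolute_links_single_pass('<p>See <a href="posts/one.html">this post</a>.</p>')
```

Whether this is actually faster than three XPath queries would have to be measured. It mostly removes the dependency on the exact length of the wrapper tag.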

Now, someone should fix the Factor feed parser so that it properly handles relative URLs. This is more complicated than it sounds, as it requires parsing and changing the (possibly semi-valid) HTML content from the feed entries.