To peer review or to not peer review?

Samuel Tardieu, 2006-12-26

As an experienced programmer, I participate in many Free Software projects when time permits. I am committed to a few projects, and I frequently submit patches to random projects that I happen to bump into. I also understand the dynamics of free software: when a bug stands in my way, I often fix it myself rather than waiting for another contributor (who may have her own priorities and agenda) to fix it. Same when I badly need a feature.

In this post, I will compare the submission process of two changes I made to free software recently:

a new watchdog driver for the Linux kernel;
a fix for a critical flaw in SIP message handling in the Asterisk telephony system.

Linux device driver

I first posted my new device driver code as a patch (a difference between the current Linux source code and the modified one) on the linux-kernel mailing-list. Shortly after that, some people publicly answered my mail and offered remarks and criticisms about my changes. Most of the advice was well targeted and I modified my patch accordingly. Some of the remarks were a bit off because people commenting on the code hadn’t read the device datasheet and were confused by some names used therein and mirrored in the driver; I explained the situation and why I would not act upon those remarks. One point about a possible concurrent access was discussed and resolved after a few technical exchanges. I then posted a modified patch for everyone to comment on. This latter patch was then acked (i.e., blessed) by a major developer.

Various parts of the Linux kernel are maintained by different people. The device I was addressing was a watchdog (a piece of hardware that forcibly reboots your computer if the operating system fails to say “I’m still alive” on a regular basis), so the watchdog subsystem maintainer took responsibility and integrated it into his own development tree, so that people willing to test this new driver could do so easily. After some time, once the new driver had shown no visible disturbance of the rest of the kernel, it was pulled by Linus Torvalds into the main Linux kernel tree and was released as part of Linux 2.6.19.
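
For readers unfamiliar with watchdogs, the mechanism can be modelled in a few lines of Python. This is only a software simulation under my own assumptions, not the driver I submitted: the names SimulatedWatchdog and run are hypothetical, and the clock advances in whole seconds.

```python
class SimulatedWatchdog:
    """Hypothetical software model of a hardware watchdog timer."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # allowed silence before a forced reboot
        self.last_ping = 0.0         # time of the last "I'm still alive"
        self.rebooted = False

    def ping(self, now):
        """The operating system signals that it is still alive."""
        if not self.rebooted:
            self.last_ping = now

    def tick(self, now):
        """Check the timer; trigger the reboot if the pings have stopped."""
        if now - self.last_ping > self.timeout_s:
            self.rebooted = True
        return self.rebooted


def run(ping_times, timeout_s, end_time):
    """Step a simulated clock one second at a time.

    Returns the second at which the watchdog fired, or None if it never did.
    """
    wd = SimulatedWatchdog(timeout_s)
    pings = set(ping_times)
    for now in range(end_time + 1):
        if now in pings:
            wd.ping(float(now))
        if wd.tick(float(now)):
            return now
    return None
```

With a 10-second timeout, run(range(0, 60, 5), 10, 59) never fires because a ping arrives every 5 seconds, whereas run([0, 5, 10], 10, 60) models a system that hangs after its third ping and gets rebooted once more than 10 seconds of silence have elapsed.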

Note that when the watchdog subsystem maintainer integrated my new driver into his tree, he was already quite confident that the driver was clean as it had been carefully read and commented on by several other developers. The integration within his tree rather than into the main Linux kernel ensured that all the watchdog drivers can play nicely together.

Asterisk flaw in the SIP engine

Free Telecom is the second most important ADSL provider in France. They provide a triple-play service over ADSL: IP, telephony and television. The telephony service can be accessed either using an analog phone connected to their ADSL modem or using a SIP connection to their server. On the server side, Free Telecom chose to use a solution by Cirpack, made from boxes able to handle several thousand simultaneous SIP sessions.

When the Cirpack server was upgraded at the beginning of December, all Asterisk boxes using Free Telecom as their SIP provider immediately stopped working: the voice was not going through anymore. This problem was reported on a forum by an Asterisk user a few hours after the upgrade and promptly analyzed by a Cirpack engineer: it appeared to be a flaw in Asterisk SIP handling. The engineer rolled back the Free Telecom server to the previous revision and sent me an email with the description of the problem. Why me? Because we know each other as we studied together, and he knew I was using Asterisk to connect to the Free Telecom SIP server and that I was likely to quickly investigate and fix the problem.

A few hours later, I produced two short fixes for Asterisk and was able to test them against a Cirpack server running the new firmware. Everything went fine and the problem was fixed. I posted the patches to the Asterisk bug tracking system and, less than four hours later, added full debugging information with and without the patches at the request of a manager so that it was clear what the problem was and how the patch fixed it.

I also sent several emails to the Asterisk developers mailing-list to underline the importance of the flaw. As long as the flaw is not fixed, any upgrade made by a VoIP provider may break all its Asterisk clients without any easy workaround. To describe the flaw briefly, an unpatched Asterisk doesn’t understand perfectly valid SIP headers and interprets them in a totally wrong way, causing the subsequent traffic to be sent to the wrong place.

Asterisk 1.4.0 was released 19 days after I explained this critical flaw and posted the patches to correct it. Not only were the patches not included in the release, but as far as I can tell no peer review has occurred on the patches. The only request made by a manager was that some developers, who have not yet answered, test the patch.

Also, at some point, this very same manager added a relationship between this problem and another one without any comment to explain this alleged relationship. As far as I can tell, the two bugs are unrelated: the only thing they have in common is that both concern SIP message processing, but one is about SIP header syntax and the other one about the SIP engine internal state machine.

At this point, it is worth noting that I do not feel bad about Asterisk because my patches were not included in the latest release; what I criticize here is what I consider a lack of feedback on user-contributed fixes and a lack of interaction between developers.

Comparing the two processes

Proposed changes to the Linux kernel are posted on a public mailing-list as plain text, where anyone is free to comment on them. The plain-text format makes it easy to intersperse the relevant code portion with the comments. One or several structured discussions follow, each one addressing one aspect of the proposed patch. New versions of the patch may then be proposed and discussed until the patch is finally blessed (acked) by one or more fellow developers. Note that this process happens in an email client, without any compilation taking place at this stage. Technical flaws may be found by code reading and discussion rather than by testing whether the code seems to trigger a bug or not. Also, if the code would benefit from extra documentation, such documentation will be requested publicly by other developers.
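
To show why plain text lends itself to this kind of review, here is a small Python sketch that builds a patch with the standard difflib module and then quotes it the way a mail reply would, with a reviewer comment placed directly under the line it discusses. The helper names make_patch and inline_review, the file name driver.c and the sample lines are all made up for the illustration; real kernel patches are produced with the usual diff or git tools, not with this script.

```python
import difflib


def make_patch(old_lines, new_lines, path):
    """Return a unified diff between two versions of a file, as a list of lines."""
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile="a/" + path, tofile="b/" + path,
        lineterm=""))


def inline_review(patch_lines, comments):
    """Quote each patch line with "> " as a mail client would.

    `comments` maps a patch line index to the remark the reviewer
    wants to insert right after that quoted line (hypothetical format).
    """
    reply = []
    for i, line in enumerate(patch_lines):
        reply.append("> " + line)
        if i in comments:
            reply.extend(["", comments[i], ""])
    return reply


# Made-up example: a one-line change and a one-line review remark.
old = ["int timeout = 60;", "start_timer(timeout);"]
new = ["int timeout = 30;", "start_timer(timeout);"]
patch = make_patch(old, new, "driver.c")
reply = inline_review(patch, {4: "Why 30? Does the datasheet say so?"})
print("\n".join(reply))
```

The remark ends up right below the quoted "+" line it refers to, which is exactly what makes a threaded discussion about one aspect of a patch easy to follow.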

Proposed changes to Asterisk are posted onto the Asterisk bug tracking system maintained by Digium (the original authors and the current maintainers of Asterisk). A disclaimer also needs to be filled in by contributors, as Digium wants to be able to make a proprietary version of Asterisk, while others may only distribute it as GPL software. I have the impression that the patches are not peer reviewed: the use of a bug tracking system doesn’t ease such a code review process, compared to a mailing-list as in the Linux kernel patches case. I am also under the impression that patches are tested rather than being read first. If enough developers report that the patch hasn’t visibly broken their system, the patch may eventually be integrated.

Also, parts of Asterisk sometimes undergo major rewrites without any attempt to explain what exactly has been changed. For the Linux kernel, this would be unacceptable: a series of incremental patches would have to be submitted on the mailing-list, with a step-by-step justification of why things need to be changed. When incremental patches are not doable, because changes depend on each other, separate patches that need to be applied at the same time will still be required so that individual changes are reviewable by other developers.

As you may have guessed at this stage, I much prefer the Linux kernel way of doing it. The peer review system exposes proposed changes to several pairs of hackers’ eyes. The patches and the subsequent discussions also teach potential contributors what they need to send and how they need to present it. This iterative process not only generates better code but also shows good practices to other programmers.

I would really like other large software projects, such as Asterisk, to adopt it to increase code quality and interaction between developers.