Review of Foundations of Python Network Programming

Abstract

A review of the book Foundations of Python Network Programming by John Goerzen.

Disclosure

I have known John Goerzen through Debian and his OfflineIMAP project since long before this book came out. I'm probably biased to think of him as a talented person. I also believe I played a role in introducing him to Twisted.

I'm also one of the people working on the Twisted framework, and would probably say good things about any book that mentioned Twisted.

I would also like to state that I am not in the real target audience for this book -- I've done my fair share of kernel-space networking in C, I've read my fair share of TCP RFCs, and I have been writing poll loops for a long time.

The goal I set for myself was not to write a glowing fluffy review, give a link to Amazon and increase sales -- even if I was bribed with a shiny book -- but to really dig into this book and see how good it is. Blame it on me being an average pessimistic Finn, or something.

First Impressions

The book looks neat and tidy; the cover is colorful and abstract but at the same time professional looking. The content layout is spacious and clear, sometimes almost spartan.

The first thing that really caught my eye was the fonts. The book uses quite distinctive fonts -- comparing it to an O'Reilly book I had at hand, I can honestly say I think the body text is more readable in Foundations. The font used in headings and the table of contents is a lot less readable for me -- especially the letter Q and the comma. The Q looks like an O with dirt under it (not even touching the letter itself), so what I first read from the table of contents was SOL in Python, as in So-Out-of-Luck.

An Overview

The first hundred pages -- one fifth of the total book -- are a good, quick introduction to networking basics such as TCP, UDP and DNS (not the protocol, but the query API and things like hostname spoof protection). The last twenty of those pages rush through brief explanations of a number of more obscure networking topics, such as half-open sockets, string termination, network byte order, broadcast and so on.

After a brief 50-page trip through XML and XML-RPC, the book starts to lose its tutorial style and becomes more useful as a reference as well. In 30 pages, parsing and generating email messages is explained very clearly. SMTP and POP each get roughly 10 pages, which seems adequate for covering everything important.

The author's experience with IMAP shows in the next 50 pages. I think these 50 pages alone are worth the purchase. This is the part of the book I was personally most interested in, and I am not disappointed.

A good 20 pages are spent on FTP. Personally, I would rather have seen more on HTTP. FTP is dead to me, and I hope it understands to stop moving soon. Database access gets another 20 pages, and so does SSL.

Different server-side programming frameworks are discussed next. At 10-20 pages each, the author takes us through a whirlwind tour of SocketServer, SimpleXMLRPCServer, CGI, and mod_python. I have no clue why Twisted isn't on this list.

The rest of the book is about different ways to achieve the goal of serving multiple clients at once. Whereas in the earlier chapters forking and threading were things a library did for you, now you get to understand what was happening under the hood.

Separate processes, threads and asynchronous event loops are each discussed in roughly 25 pages, with the last one introducing Twisted briefly by reimplementing an earlier example with Twisted. Of course, practically everything in the IMAP chapter was specifically about Twisted's IMAP library, so concepts such as Deferred were already discussed there.

The Python in the Book

I think that in any book that is so related to a programming language, the examples included matter a lot for the readability of the whole. On the whole, the examples were readable, kept as light as they possibly can be while still covering the subject area well enough. However, there were a few things that kept distracting me.

The Python examples included at least one repeating pattern I consider non-pythonic. It is small, but it did keep annoying me throughout the book.

# Part of `gopherclient.py` from Chapter 1, page 10.
while 1:
    buf = s.recv(2048)
    if not len(buf):
        break
    sys.stdout.write(buf)

Why would one use an explicit len() call in the above? The string returned by s.recv() can be directly tested for emptiness, by just saying if not buf: ...
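As a minimal sketch of that loop with the direct emptiness test (Python 3 syntax, unlike the book's examples): `read_until_closed` is a hypothetical helper name of mine, and a local socket pair stands in for the real gopher connection.

```python
import socket

def read_until_closed(sock, bufsize=2048):
    """Collect everything the peer sends until it closes the connection."""
    chunks = []
    while True:
        buf = sock.recv(bufsize)
        if not buf:              # recv() returns an empty value at end of stream
            break
        chunks.append(buf)
    return b"".join(chunks)

# socketpair() stands in for a real client/server connection.
writer, reader = socket.socketpair()
writer.sendall(b"hello, gopher\r\n")
writer.close()
received = read_until_closed(reader)
reader.close()
```

The empty value is falsy on its own, so the len() call adds nothing.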

Another thing I noted was the heavy use of global variables in some of the examples. Most of the uses seemed like they would be more readable and pythonic if there were an object explicitly storing the state. (e.g. page 106)

There are a lot of var == None comparisons in the book. I prefer var is None, and consider that more pythonic.
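The preference is not purely cosmetic. Here is a small sketch (Python 3 syntax) of how the two tests can disagree; `AlwaysEqual` is a made-up class for illustration, not something from the book.

```python
class AlwaysEqual(object):
    """Hypothetical class whose __eq__ claims equality with everything."""
    def __eq__(self, other):
        return True

var = AlwaysEqual()
equals_none = (var == None)   # True: the object's __eq__ was consulted
is_none = (var is None)       # False: identity cannot be overridden
```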

One thing that also caught my eye was the use of "The requested document '%s' was not found" in a string, instead of the better-behaving "...document %r was..." (page 345).
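To show what "better-behaving" means here, a quick sketch in Python 3 syntax; the `path` value is a made-up request path with trailing whitespace and a newline.

```python
path = "index.html \n"   # hypothetical request path, not from the book

with_s = "The requested document '%s' was not found" % path
with_r = "The requested document %r was not found" % path
```

With %s the raw newline ends up in the message and splits it across two lines; with %r the value is quoted and escaped, so odd input stays visible and contained.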

The book contains multiple references to using pickle in a network protocol. Sometimes, but definitely not always, the author remembers to warn of security concerns. Using pickle for anything not directly controlled by the process itself is so horribly bad I feel the book should have never even mentioned the possibility. If the author wanted to have a non-XML-RPC example in a more DIY sense, he could have pointed to e.g. twisted.spread.jelly. (e.g. pages 159, 361)
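For readers who have not seen why this is dangerous, here is a deliberately harmless sketch (Python 3 syntax). The `Innocent` class is hypothetical, and json is used only as a stand-in for any data-only format; it is my substitution, not the jelly module mentioned above.

```python
import json
import pickle

class Innocent(object):
    """Hypothetical payload: __reduce__ names the callable to run on load."""
    def __reduce__(self):
        # A hostile peer could name something destructive here;
        # int() is a harmless stand-in.
        return (int, ("42",))

wire_bytes = pickle.dumps(Innocent())
loaded = pickle.loads(wire_bytes)      # calls int("42"), as chosen by the sender

# A data-only format can describe values but cannot name code to execute.
decoded = json.loads('{"uid": 42}')
```

The receiver asked for an object and got whatever the sender's chosen callable returned, which is the whole problem.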

Whereas the book often explicitly cautions not to try to e.g. store files in memory, this thinking is not practised thoroughly. Here's part of an example that gathers a list of message UIDs in memory while it fetches the messages via IMAP, and a small refactoring of it that trades memory use for increased protocol traffic -- other tradeoffs, such as gathering 100 UIDs before sending a command, are equally possible.

# Part of `tdownload-and-delete.py` from Chapter 12, page 252.
def handleuids(self, uids):
    self.uidlist = MessageSet()
    dlist = []
    destfd = open(sys.argv[3], "at")
    for num, data in uids.items():
        uid = data['UID']
        d = self.proto.fetchSpecific(uid, uid = 1, peek = 1)
        d.addCallback(self.gotmessage, destfd, uid)
        dlist.append(d)
    dl = defer.DeferredList(dlist)
    dl.addCallback(lambda x, fd: fd.close(), destfd)
    return dl

def gotmessage(self, data, destfd, uid):
    print "Received message UID", uid
    for key, value in data.items():
        print "Writing message", key
        i = value[0].index('BODY') + 2
        msg = email.message_from_string(value[0][i])
        destfd.write(msg.as_string(unixfrom = 1))
        destfd.write("\n")
    self.uidlist.add(int(uid))

def deletemessages(self, data = None):
    print "Deleting messages", str(self.uidlist)
    d = self.proto.addFlags(str(self.uidlist), ["\\Deleted"], uid = 1)
    d.addCallback(lambda x: self.proto.expunge())
    return d

# My version of `tdownload-and-delete.py`
def handleuids(self, uids):
    dlist = []
    destfd = open(sys.argv[3], "at")
    for num, data in uids.items():
        uid = data['UID']
        d = self.proto.fetchSpecific(uid, uid = 1, peek = 1)
        d.addCallback(self.gotmessage, destfd, uid)
        dlist.append(d)
    dl = defer.DeferredList(dlist)
    dl.addCallback(lambda x, fd: fd.close(), destfd)
    return dl

def gotmessage(self, data, destfd, uid):
    print "Received message UID", uid
    for key, value in data.items():
        print "Writing message", key
        i = value[0].index('BODY') + 2
        msg = email.message_from_string(value[0][i])
        destfd.write(msg.as_string(unixfrom = 1))
        destfd.write("\n")
    d = self.proto.addFlags(uid, ["\\Deleted"], uid = 1)
    return d

def deletemessages(self, data = None):
    print "Deleting messages"
    d = self.proto.expunge()
    return d
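The middle ground mentioned above, gathering 100 UIDs before sending a command, could be sketched roughly like this (Python 3 syntax). The `batches` helper is a hypothetical name of mine, and the comma-joined strings only approximate what would be passed on as a message set.

```python
def batches(uids, size=100):
    """Yield successive lists of at most `size` UIDs."""
    batch = []
    for uid in uids:
        batch.append(uid)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch              # the final, possibly shorter, batch

# Each batch would become one message-set string for a flag-setting command.
message_sets = [",".join(str(u) for u in b) for b in batches(range(1, 251))]
```

Memory use is then bounded by the batch size, and the number of extra commands drops by the same factor.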

I was also a bit disappointed by the lack of any visual helpers in the layout of the example code. I would have appreciated moderate use of bold, italics and so on. I also found I had more trouble following the indentation than usual -- maybe a containing box around the code would have given the eye something to fixate on. The left margin is so far away, trying to quickly discern relative indentations of blocks of code gets tiring quite fast.

The examples are all available online, which is a great help -- while reading a book is usually more comfortable, and most of all more flexible, than reading on a computer, for code I found I vastly prefer the colorized output and search functionalities of my emacs.

The Twisted in the Book

The Twisted code in the book is not of as good quality as the plain Python code. There are mishandled Deferreds, race conditions and whatnot. Let me try to illustrate by showing you the first example from the chapter discussing Twisted:

#!/usr/bin/env python
# `tconn.py` from Chapter 12, page 226.
# Basic connection with Twisted - Chapter 12 - tconn.py
# Note: This example assumes you have Twisted 1.1.0 or above installed.

from twisted.internet import defer, reactor, protocol
from twisted.protocols.imap4 import IMAP4Client
import sys

class IMAPClient(IMAP4Client):
    def connectionMade(self):
        print "I have successfully connected to the server!"
        d = self.getCapabilities()
        d.addCallback(self.gotcapabilities)

    def gotcapabilities(self, caps):
        if caps == None:
            print "Server did not return a capability list."
        else:
            for key, value in caps.items():
                print "%s: %s" % (key, str(value))

        # This is the last thing, so stop the reactor.
        self.logout()
        reactor.stop()
        
class IMAPFactory(protocol.ClientFactory):
    protocol = IMAPClient

    def clientConnectionFailed(self, connector, reason):
        print "Client connection failed:", reason
        reactor.stop()

reactor.connectTCP(sys.argv[1], 143, IMAPFactory())
reactor.run()

That's a reasonably small program, but nonetheless, alarm bells are ringing in my head: there are no errbacks anywhere, and the reactor is only stopped on a successful code path. And self.logout()'s return value, which is a Deferred, is thrown away and the reactor stopped immediately. There's no guarantee the logout message even got as far as the socket; the program just shuts down.

Let's fix that right now! And while we're at it, let's make it a bit more Twisted:

#!/usr/bin/env python
# `my-tconn.py`
# Basic connection with Twisted - Chapter 12 - tconn.py
# Note: This example assumes you have Twisted 1.1.0 or above installed.

from twisted.internet import defer, reactor, protocol, error
from twisted.protocols.imap4 import IMAP4Client
import sys

class IMAPGetCapabilities(IMAP4Client):
    def _doLogout(self, r):
        d = self.logout()
        d.addCallback(lambda _: r)
        return d

    def connectionMade(self):
        d = self.getCapabilities()
        d.addBoth(self._doLogout)
        d.chainDeferred(self.factory.deferred)

class IMAPGetCapabilitiesFactory(protocol.ClientFactory):
    protocol = IMAPGetCapabilities

    def __init__(self):
        self.deferred = defer.Deferred()

    def clientConnectionFailed(self, connector, reason):
        self.deferred.errback(reason)

    def clientConnectionLost(self, connector, reason):
        if reason.check(error.ConnectionDone):
            # only an error if premature
            if not self.deferred.called:
                self.deferred.errback(reason)
        else:
            self.deferred.errback(reason)


def _showCapabilities(caps):
    if not caps:
        print "Server did not return a capability list."
    else:
        for key, value in caps.items():
            print "%s: %s" % (key, str(value))

def _showError(reason):
    print reason.getErrorMessage()

f = IMAPGetCapabilitiesFactory()
f.deferred.addCallback(_showCapabilities)
f.deferred.addErrback(_showError)
f.deferred.addBoth(lambda _: reactor.stop())
reactor.connectTCP(sys.argv[1], 143, f)
reactor.run()

While I fixed the bugs I happened to see in that example, I also took that opportunity to separate the application logic from the protocol implementation, only stop the reactor in one place (the main program body), and generally clean up the program.

Let me also assure you this is not an isolated incident. Here are some more examples:

  • Page 232, example t-error.py:

    If the connection is lost, the program hangs.

    The loginerror function should start with failure.trap(imap4.IMAP4Exception).

  • Page 247, example tdownload.py:

    The Deferreds in the DeferredList have no final errbacks. There should probably be a consumeErrors=True argument given to the DeferredList.

    There is no error checking on the result of the DeferredList.

  • Page 257, example tstructure.py:

    There is no error checking on the result of the DeferredList.

    In case of error, the example dies a misleading death trying to access the items attribute of a Failure object. [1]
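The missing check on a DeferredList result can be sketched without Twisted itself (Python 3 syntax). A DeferredList fires with a list of (success, result) pairs; the `split_results` helper and the sample data below are my own stand-ins, and real code would see Failure objects rather than bare exceptions.

```python
def split_results(results):
    """Separate a DeferredList-style result list into values and failures."""
    values, failures = [], []
    for success, result in results:
        if success:
            values.append(result)
        else:
            failures.append(result)
    return values, failures

# Stand-in data: one fetch succeeded, one failed.
sample = [(True, {"1": "body"}), (False, IOError("connection lost"))]
values, failures = split_results(sample)
```

Only the entries in `values` are safe to treat as fetched data; calling .items() on everything is what produces the misleading death.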

All in all, I am happy to see Twisted get some of the spotlight, but I guess it is too big to be handled as a subtopic of a single chapter.

The Networking in the Book

Overall, I feel the book has succeeded very well in its goal of introducing networking to people. There are some places, though, that state invalid things or that can lead the reader into assuming things that are not valid. I have full confidence the author is familiar with these concepts, and that the only reasons for these points of confusion are rushing to finish, blindness to one's own text, and trying to keep the book reasonably short. Still, I feel sad to see such things slip by.

Here are some examples:

  • After showing an example that binds to 127.0.0.1:51423 and not just 0.0.0.0:51423, the author states:

    If you have a host with an IP address other than 127.0.0.1, you could normally connect to port 51423 on that address; now you cannot.

    This sentence looks suspiciously like the result of replacing some IP address the author's machine happened to have with 127.0.0.1. It is quite confusing, considering that every host has the IP address 127.0.0.1, in addition to any other addresses it may have. What the author apparently tried to say is, of course, that other hosts cannot connect to the service (page 103).

  • The book lets novice readers assume TCP would preserve message boundaries. Here's a snippet from a longer example:

    # Part of `pollclient.py` from Chapter 5, page 106.
    data = s.recv(4096)
    if not len(data):
        print("\rRemote end closed connection; exiting.")
        break
    # Only one item in here -- if there's anything, it's for us.
    sys.stdout.write("\rReceived: " + data)
    sys.stdout.flush()
    

    Once again, it is easy to believe more detailed explanations were skipped due to space restrictions, but I personally have seen way too many programs that assume read(2) returns full lines. The fact that this behaviour is highly likely when testing with manual input, using a line-buffering client application, only makes the situation worse.

    # My version of `pollclient.py`.
    data = s.recv(4096)
    if not data:
        print "Remote end closed connection; exiting."
        break
    print "Received: %r" % data
    
  • Page 87 refers to any protocol with binary content (as opposed to ASCII) as "C-based".

  • When speaking of CGI-generated HTTP content, the author points out that many scripts do not supply a Content-Length header, and that in this case HTTP signals end of file by closing the connection. He goes on to say that "there's no way to detect a truncated file" (page 125).

    I would really love it if people would finally embrace HTTP 1.1 and Transfer-Encoding: chunked, which was created for exactly this purpose. Getting people to realize a solution exists is half the battle, so I would have appreciated a brief mention of chunking here.
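On the message-boundary point in the list above: a program that wants lines has to assemble them itself. Here is a rough sketch in Python 3 syntax; `iter_lines` is a hypothetical helper of mine, and a local socket pair with a tiny receive size stands in for a network that splits the data awkwardly.

```python
import socket

def iter_lines(sock, bufsize=4096):
    """Yield complete lines from a stream socket, however recv() splits them."""
    pending = b""
    while True:
        data = sock.recv(bufsize)
        if not data:
            break
        pending += data
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            yield line
    if pending:
        yield pending            # trailing data without a final newline

sender, receiver = socket.socketpair()
# Writes that deliberately do not line up with the line boundaries.
sender.sendall(b"first li")
sender.sendall(b"ne\nsecond line\nthi")
sender.sendall(b"rd\n")
sender.close()
lines = list(iter_lines(receiver, bufsize=5))
receiver.close()
```

The receiver gets the same three lines no matter how the bytes were chopped up on the way.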

I would also have appreciated a brief explanation of how the peer sees a s.shutdown() of a socket, as this is a topic that is often misunderstood (page 88).
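What the peer sees can be shown in a few lines (Python 3 syntax). This is only a sketch using a local socket pair in place of a real connection, and it covers just the SHUT_WR case.

```python
import socket

a, b = socket.socketpair()
a.sendall(b"request")
a.shutdown(socket.SHUT_WR)      # a promises to send nothing more

first = b.recv(4096)            # the data sent before the shutdown
eof = b.recv(4096)              # empty: the peer sees end-of-stream
b.sendall(b"reply")             # the other direction is still open
reply = a.recv(4096)
a.close()
b.close()
```

The peer sees an ordinary end of stream, not an error, and can still answer on the same connection.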

Conclusion

I know this review may seem harsh to Americans with overly white smiles. I'm picking individual items and pointing out the problems with them. But do not misunderstand me -- I read through the book quite carefully, and these were the only things I wasn't totally happy with.

This book is 99% good, and the only reason it isn't 100% is its wide scope. Which, then again, is also a good thing.

I'm not a big book buyer, I have no shelf full of references, but I am happy to have this book on my shelf. I will happily recommend it to friends looking for a generic Python networking book.

Suggestions for Errata

I kept notes while reading the book, and also wrote down things that seemed like typos or other minor errors to me. Here's a list of those, mostly to help the publisher should they choose to print a second edition of the book:

  • Page 79, when talking about DNS query type ANY, states things like:

    there's a special case for the query type ANY in that it sometimes misses MX records (and others) if they aren't requested first

    and:

    ...it only returns information cached by your local name servers, which may be incomplete

    This unnecessarily clouds the workings of DNS with magic. Merely pointing out that ANY means "whatever information a cache or authoritative server may have" (as opposed to all information) would be sufficient.

  • Pages 106-108, examples pollclient.py and selectclient.py:

    For some reason, these examples use print("..."), where everything else is more pythonic and leaves out the parens.

  • Page 132, example ctitle.py:

    Could handle hexadecimal character references like &#xAE;, and use unichr to support values >255.

  • Page 181:

    "Different languages use different meanings for the same character"

    While this is technically true (German ä and Finnish ä are different things), what was probably meant here is something like:

    "Different languages are written with different characters, and different character sets are used to represent hundreds of alternative characters with only 256 integer values."

  • Page 185, example mime_gen_alt.py:

    Includes some of the ugliest HTML tagsoup I've seen in a while.

    There's really no excuse not to use correct XHTML, or at least balanced tags, these days.

  • Page 211, third line from the bottom:

    "read-only" should probably be "read only", the context is about marking messages read, and doing that only when it's downloaded.

  • Page 231:

    "This means that the result of loggedin() -- or the last Deferred that it returns -- is passed to stopreactor()."

    -- should say

    "...or the result of the last..."

  • Page 259, at the bottom:

    "Since these both return callbacks", probably meant "Deferreds".

  • Page 270, at the bottom:

    Missing period from obj.copy.

  • Page 330:

    "build-in" should probably be "built-in".

  • Page 333:

    The example osslverify.py refers to variable cnverifie instead of cnverified.

  • Page 367:

    The example is called cgi.py, but the URL in the text is http://localhost:8765/test.py.

  • Page 370:

    ScriptAlias is missing spaces.

  • Page 427:

    There is a race condition in the SIGCHLD handling on platforms where the handler is reset on triggering. I did not expect a full treatment of SIGCHLD problems, but I did expect a warning to tread lightly.
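For the ctitle.py item above (page 132), handling both decimal and hexadecimal character references takes only a few lines. This is a sketch in Python 3 syntax, where chr() plays the role unichr() has in Python 2; the function name and the sample string are mine.

```python
import re

# Matches &#233; (decimal) as well as &#xE9; (hexadecimal).
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def decode_char_refs(text):
    """Replace decimal and hexadecimal numeric character references."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)         # unichr() in the Python 2 the book targets
    return _CHAR_REF.sub(replace, text)

title = decode_char_refs("Twisted&#xAE; &#8364; caf&#233;")
```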


  1. Admittedly, the DeferredList API is nonintuitive and has led many people into writing bad code. Unfortunately, no one has been able to formulate a better API for combining multiple Deferreds.