Thursday, September 4, 2008

Push vs. Poll

It seems pretty obvious at first, but this is no simple matter. For years, I was in the poll camp. The idea was that with long-running processes polling for updates every 20 to 60 minutes, the cost of setting up a connection, performing the request and tearing the connection down would be less than the cost of keeping a connection open on the server.

Well, there are several problems with this scenario.

The first is that under such conditions, the total number of bytes needed to make the requests usually ends up higher than the total number of bytes needed for responses that merely say no new data is available since the last poll. A typical GET request carries a few hundred bytes of headers, while an empty "nothing new" response can be far smaller.

The second is that it is much harder to limit incoming requests to the capacity of your servers. Given n client processes out in the field, entropy will spread their polls out over time, but as n grows, the number of polls hitting the server at any given moment grows right along with it. Using push, we can feed updates to clients at a rate that the server can handle. Since the server is responsible for contacting the clients, it can do so at its own pace, completely sidestepping the “Thundering Herd” problem.

Third, not all bytes are of equal cost. In this increasingly heterogeneous computing landscape, the cost of a byte on a high-speed DSL line isn't the same as the cost of a byte on a cellular EDGE network. In the scenario above, sending requests that get empty responses is a nearly free operation on a domestic high-speed link, but that doesn't hold true for a user on a cellular network.

Fourth, an update that needs to be pushed is completely independent of any updates that came before it or will come after it, meaning that it is easy to distribute this work across a farm. With polling, we would need to query the database for each and every poll; with push, we push the result of running a single query to multiple clients (see the sketch below). This argument is much more domain specific, since your data might not allow for it at all, but it is perfectly possible on some data sets.

Fifth, web applications are not kiosks; they can't get away with polling only once every hour. And even for kiosks, push is still superior, since it allows for instant monitoring of every client in the field.
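
To make that fourth point concrete, here is a sketch of the difference in database work. Every name in it is hypothetical: changesSince, latestChangeSet and the subscriber registry are invented for illustration, not taken from any particular framework.

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    public class ChangeSetDistributor {
        // Stand-ins for the real data layer and client connections.
        interface Database {
            String changesSince(long lastSeen); // hypothetical
            String latestChangeSet();           // hypothetical
        }
        interface Subscriber {
            long lastSeen();
            void send(String changeSet);
        }

        private final Database database;
        private final List<Subscriber> subscribers =
                new CopyOnWriteArrayList<Subscriber>();

        ChangeSetDistributor(Database database) {
            this.database = database;
        }

        void register(Subscriber subscriber) {
            subscribers.add(subscriber);
        }

        // Poll model: every single poll pays for its own database query.
        String handlePoll(Subscriber client) {
            return database.changesSince(client.lastSeen());
        }

        // Push model: one query per change set, fanned out to every
        // subscriber; the fan-out itself is trivial to spread across a farm.
        void handleCommit() {
            String changeSet = database.latestChangeSet();
            for (Subscriber subscriber : subscribers) {
                subscriber.send(changeSet);
            }
        }
    }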

The term coined for server-side push is comet, or reverse Ajax, and it is what you use to reach your clients, web application or not, to achieve a server-side push. The problem, however, is that too often the solution ends up looking like this: the client sends a request, the server holds it open until new data arrives or a generous timeout expires, and as soon as the client gets its response it tears everything down and sends a fresh request.
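
As a minimal illustration of that shape, here is a long-poll handler sketched on the JDK's built-in HttpServer rather than on any comet toolkit; the /poll path, the shared update queue and the 30-second timeout are all placeholder choices.

    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpHandler;
    import com.sun.net.httpserver.HttpServer;

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class LongPollServer {
        // Updates waiting to be handed to the next poll (placeholder plumbing).
        static final BlockingQueue<String> updates =
                new LinkedBlockingQueue<String>();

        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/poll", new HttpHandler() {
                public void handle(HttpExchange exchange) throws IOException {
                    String update = null;
                    try {
                        // Park the request until data shows up or 30 seconds pass.
                        update = updates.poll(30, TimeUnit.SECONDS);
                    } catch (InterruptedException ignored) {
                        Thread.currentThread().interrupt();
                    }
                    if (update == null) {
                        // Timed out with nothing new: empty-handed response.
                        exchange.sendResponseHeaders(204, -1);
                        exchange.close();
                    } else {
                        byte[] body = update.getBytes("UTF-8");
                        exchange.sendResponseHeaders(200, body.length);
                        OutputStream out = exchange.getResponseBody();
                        out.write(body);
                        out.close();
                    }
                    // Either way, the client now reconnects and we start over:
                    // the request churn we were trying to avoid is still there.
                }
            });
            // Each parked poll occupies a whole thread for up to 30 seconds.
            server.setExecutor(Executors.newCachedThreadPool());
            server.start();
        }
    }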

But this is not optimal. First, it does not get rid of the extra request traffic that we are trying to avoid; this is not really server-side push so much as a constant poll where the response timeout is extremely high. With comet, the idea is to get to a point where you're shaping your traffic around a single long-lived connection per client, with the server writing each update down that connection the moment it happens.


Not only does this get rid of our unwanted traffic, it truly does provide a scalable way for the server to publish change sets to clients without having to exert itself.
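
Sticking with the plain JDK HttpServer from the sketch above (a real deployment would reach for a comet-capable server instead), the push shape might look like the following. It leans on the assumption that the exchange stays open after the handler returns until it is explicitly closed, and broadcast is a hypothetical entry point called whenever a change set is ready.

    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpHandler;
    import com.sun.net.httpserver.HttpServer;

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.Charset;
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.Executors;

    public class PushServer {
        // One open response stream per connected client.
        static final List<OutputStream> clients =
                new CopyOnWriteArrayList<OutputStream>();

        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/updates", new HttpHandler() {
                public void handle(HttpExchange exchange) throws IOException {
                    // A response length of 0 selects chunked encoding, so the
                    // connection stays open and we can keep writing to it.
                    exchange.sendResponseHeaders(200, 0);
                    clients.add(exchange.getResponseBody());
                }
            });
            server.setExecutor(Executors.newCachedThreadPool());
            server.start();
        }

        // Called whenever a change set is ready; the server sets the pace.
        static void broadcast(String changeSet) {
            byte[] bytes = (changeSet + "\n").getBytes(Charset.forName("UTF-8"));
            for (OutputStream client : clients) {
                try {
                    client.write(bytes);
                    client.flush();
                } catch (IOException dropped) {
                    // The connection died underneath us; forget this client.
                    clients.remove(client);
                }
            }
        }
    }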

Now the bad news: achieving comet isn't just about behaving a certain way at the network level. Reaching this level of sophistication means that the server maintains work queues of the data it needs to send to clients. That isn't too bad in itself, but remember that not every client will be synchronized with the latest updates when new updates go out. Your clients will still need to reach the synchronized state using regular requests before they can get updates via comet.
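
Here is a minimal sketch of that bookkeeping, with hypothetical names throughout: a history map serves the catch-up requests, and each synchronized client gets a work queue that the comet layer drains.

    import java.util.Collection;
    import java.util.NavigableMap;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.LinkedBlockingQueue;

    public class UpdateBroker {
        // Work queue per connected client, drained by the comet layer.
        private final ConcurrentMap<String, BlockingQueue<String>> queues =
                new ConcurrentHashMap<String, BlockingQueue<String>>();
        // Updates by sequence number, for clients that need to catch up.
        private final NavigableMap<Long, String> history =
                new ConcurrentSkipListMap<Long, String>();
        private long sequence = 0;

        // Regular request/response path: everything after the client's
        // last seen sequence number brings it to the synchronized state.
        public Collection<String> catchUp(long lastSeen) {
            return history.tailMap(lastSeen, false).values();
        }

        // Once a client is synchronized, it gets a queue and is served by comet.
        public BlockingQueue<String> subscribe(String clientId) {
            BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
            queues.put(clientId, queue);
            return queue;
        }

        public void unsubscribe(String clientId) {
            queues.remove(clientId);
        }

        // A new change set: record it, then enqueue it for every subscriber.
        public synchronized void publish(String update) {
            history.put(++sequence, update);
            for (BlockingQueue<String> queue : queues.values()) {
                queue.offer(update);
            }
        }
    }

The safe order, by the way, is to subscribe first and then catch up, so that an update published in between lands in the client's queue instead of falling through the crack.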

Finally, I would like to mention that there are several tools and kits and whatnot to add comet support to your application; you don't need to do it yourself. I have found Grizzly, the HTTP server that ships by default in GlassFish, to be an excellent choice for all my comet needs. Grizzly is a really impressive piece of code, using non-blocking I/O everywhere with a minimal number of threads to achieve maximum throughput. You'll still need to handle things like dropped connections and other exceptional conditions, but at least the hard part, getting a scalable comet implementation going in the first place, will be done. Furthermore, the plan for Grizzly 2 calls for enhancing the server with the new asynchronous I/O facilities being introduced in Java SE 7, which should make this already impressive beast scale to new heights.
