Another Mike

Run Terracotta Jobs Across the Cluster

4/28/2009

In my I ntroduction to Terracotta , I alluded to the possibility of running jobs across the cluster but I didn't really explain how. My first attempts involved using the tim-messaging package, but there wasn't much there geared towards running the same job across all nodes. It mainly addresses dividing jobs across many workers (local or remote).

There are a few cases where the same job on all nodes is useful. The first time I ran into it was trying to pull some custom statistics from each node. Each node kept a moving average of requests per second for a specific user action I wanted to track. The other use case was notifying long poll HTTP clients. Server push could be initiated from one node but the client may be listening on another. Since Socket instances aren't shareable, I needed a simple way to server push on all nodes.

I cast around until I found the old tclib forge project that predated many of the current TIMs. I modified the below class starting from that code.

To solve the problem of knowing what nodes are in the cluster, I simply have them register() themselves at startup. A ServletContextListener works great for that. That solves the problem of submitting jobs before the node is ready.

Of course, then you have to worry about nodes that may suddenly disappear. One can't count on them unregistering themselves since they might crash or something. To fix that I have each node hold a lock the entire time they're running. One of the necessities that Terracotta must have had to address early on is cleaning up cluster-wide locks when a client exists. To test if a node is still around, all the caller has to do is attempt to acquire the lock. If it's successful, then that node somehow exited it's run loop.

In practice this works pretty well. It's much simpler than the JMX solution, although it doesn't bother trying to resubmit jobs or some of the other neat features of tim-messaging. The tryLock() below was plenty fast enough for my purposes. Of course, it's easy to envision adding a thread to clean up the queues instead of forcing the caller to wait for the tests. Really, that would depend on the size of your cluster.

Depending on it's use, you might tweak the thread pool used for this. This code also depends on the tim-annotations project but you can easily configure tc-config.xml.