Thursday, July 21, 2011

The Great Worker Pile Up Problem: Solved!


In our configuration of the Distributor pattern, we have the Distributor clustered across multiple VMs, and we run Workers on separate nodes in the same cluster. We do this so that we can deploy the same image to all nodes; with this configuration any Distributor or Worker can move from one node to another in case of a failure. With all this sliding around between nodes, we ran into one main issue.

Everything worked great up until we had the big pile-up. What would happen is we would intentionally drop all the nodes in succession until we got down to one node. Usually this node would have the Distributor already there, and Worker1 would slide over. No problems. When Worker2 slid over to that node, all of a sudden one of the Workers would take itself out of business, all the work would move to the other Worker, and then we'd have a couple of messages disappear. Disappear temporarily, that is; eventually they would come back and be processed.

We initially thought this was due to MSDTC since we saw messages that weren't getting ack'd hanging out. After chasing this awhile we then figured it must be some networking issue. Like just about any clustered VM, we have 2 NICs configured on each VM: one for the SAN and one for everything else. Come to find out, when the last Worker slid over, it was picking up the NIC for the SAN and not the one for the general network. Therefore the messages weren't getting ack'd, but eventually they'd find their way back.
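If you're chasing a similar symptom, a quick sanity check is to compare what the machine name resolves to against the subnets you expect. This is just a minimal sketch of that check, not the tooling we used at the time; the SAN subnet below is a made-up example, so swap in whatever your storage network actually uses.

```python
import ipaddress
import socket

# Hypothetical SAN subnet -- replace with your storage network's range.
SAN_SUBNET = ipaddress.ip_network("10.0.0.0/24")

def resolved_addresses(hostname):
    """Return the set of IPv4 addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return {info[4][0] for info in infos}

def check_binding():
    """Flag any address that falls on the SAN instead of the general network."""
    host = socket.gethostname()
    for addr in sorted(resolved_addresses(host)):
        on_san = ipaddress.ip_address(addr) in SAN_SUBNET
        label = "SAN -- messages won't get ack'd over this one" if on_san else "general network"
        print(f"{host} -> {addr} [{label}]")

if __name__ == "__main__":
    check_binding()
```

If the last Worker to slide over shows up on the SAN side, you've found the same pile-up we did.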

We tried setting the priority of the NICs and so on, to no avail. After speaking with MS, we ended up pinning the IPs of the local machines in the registry to the correct network interface. We had to do this for the clustered services as well. Now when the nodes slide about, the local IPs don't change, and we have a script that the services depend on that updates the clustered IPs if necessary (like when you change data centers).
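We haven't published the script itself, but the shape of it is simple: resolve the clustered name, compare it against the address it should have on the general network, and re-pin it if they differ. Here is a rough sketch of that idea in Python; the registry path, service name, and IP are all placeholders (the real key depends on which component's binding you pin), not the actual values we use.

```python
import socket
import winreg  # Windows-only; part of the standard library

# Placeholder values -- the real registry location depends on which
# component's binding you are pinning, and on your own network layout.
PINNED_KEY = r"SOFTWARE\OurCompany\NetworkPins"   # hypothetical key
CLUSTERED_NAME = "orders-distributor"             # hypothetical clustered service name
EXPECTED_IP = "192.168.10.42"                     # IP on the general (non-SAN) network

def current_pin():
    """Read the currently pinned IP for the clustered name, if any."""
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, PINNED_KEY) as key:
            value, _ = winreg.QueryValueEx(key, CLUSTERED_NAME)
            return value
    except FileNotFoundError:
        return None

def update_pin_if_needed():
    """Re-pin the clustered name when it no longer matches the expected IP."""
    resolved = socket.gethostbyname(CLUSTERED_NAME)
    if resolved == EXPECTED_IP and current_pin() == EXPECTED_IP:
        print("Pin is already correct; nothing to do.")
        return
    with winreg.CreateKey(winreg.HKEY_LOCAL_MACHINE, PINNED_KEY) as key:
        winreg.SetValueEx(key, CLUSTERED_NAME, 0, winreg.REG_SZ, EXPECTED_IP)
    print(f"Pinned {CLUSTERED_NAME} to {EXPECTED_IP} (was resolving to {resolved}).")

if __name__ == "__main__":
    update_pin_if_needed()
```

Making the clustered services depend on a step like this means the check runs before they come online, so a node that slides over never starts up against a stale address.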

I'd like to thank our team for being persistent and finally tracking this one down. Now we can deploy the same image to all nodes and have them move about freely. Everything works as designed, with full load balancing.