Monday, October 08, 2007

Another voice on low-latency computing

I haven't blogged about the Univa-UD merger (mostly because it confuses me a bit, though that may be because I haven't talked firsthand to any of the players). But I liked a piece by Rich Wellner of the new Univa UD. I found it here on Grid Today, but it was originally published over on Grid Gurus.

Rich takes on the fallacy that utilization rate is the most important measure of a grid's (or cluster's) effectiveness. He cites a theme he hears "over and over again: 'How can grid computing help us meet our goal of 80 percent utilization?'"

As Rich points out, having a grid running at 80% does nothing to help your business directly. Quoting again: "How does 80 percent create a new chip? How does 80 percent get financial results or insurance calculations done more quickly?"

The questions are rhetorical, because the answer is obvious: it doesn't. The goal of a grid is not to use your hardware more (or more efficiently): it is to get your answers faster. Create that chip faster. Get those results more quickly.

Sometimes, the best way to answer questions is to take them to logical extremes. What's the best way to increase your utilization? Why, it's obvious: reduce the number of machines. Make the cluster smaller, and its utilization will go up. Only have 75% utilization on your 200-node cluster? Throw away half of the machines--sure, your wait times during peaks will more than double, but your utilization may hit 100%! Will your users be happy now?
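To put rough numbers on that thought experiment, here is a minimal Python sketch. It assumes the simplest possible model, which is mine and not Rich's: demand is a steady stream of work equal to 75% of the original 200 nodes, and anything the cluster can't absorb piles up as backlog. The function names `utilization` and `backlog_growth` are made up for illustration.

```python
def utilization(demand_nodes: float, cluster_nodes: int) -> float:
    """Fraction of the cluster kept busy, capped at 100%."""
    return min(demand_nodes / cluster_nodes, 1.0)


def backlog_growth(demand_nodes: float, cluster_nodes: int) -> float:
    """Node-hours of unserved work that pile up every hour (assumed steady demand)."""
    return max(demand_nodes - cluster_nodes, 0.0)


# Assumption: 75% utilization on 200 nodes means about 150 nodes' worth of steady demand.
demand = 0.75 * 200

for nodes in (200, 100):
    u = utilization(demand, nodes)
    b = backlog_growth(demand, nodes)
    print(f"{nodes} nodes: utilization {u:.0%}, backlog grows by {b:.0f} node-hours per hour")
```

In this toy model the halved cluster does reach 100% utilization, but only because 50 node-hours of work go unserved every hour. Real workloads are bursty, so the actual wait times depend on the shape of the peaks, which this sketch ignores.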

In truth, I've never had a customer ask a question like that. A much more common question: how can I reduce the time it will take to run my end-of-trading day jobs?

That fits exactly what Rich sees:

"For most businesses, it's queue time and latency that matters more than utilization rates. Latency is the time that your most expensive resources -- your scientists, designers, engineers, economists and other researchers -- are waiting for results from the system."
That's what real-world users are concerned with. "If I add 100 nodes to my grid, how will that affect wait times during the day? How will it affect processing times on my most important jobs?"

As an aside, here's something I've noticed in potential customers. Sometimes, I'll have someone call up and say "We've got all of these CPUs that sit around all night doing nothing, can you guys help us use them?" Of course, the answer is "Yes," but there's not a great likelihood of a sale there. They have hardware they could use more efficiently, but they don't have a need.

Sometimes, someone calls up and says "I'm running analysis jobs that take 27 hours on a single machine and I need them to run in under an hour--can you guys help?" Again, the answer is yes--but now there's a very good likelihood of a sale, because there is an actual need.
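For a back-of-the-envelope feel for that second call, here is a small Python sketch of how many machines a 27-hour job might need to finish in under an hour. It assumes the job splits cleanly across machines, and the `efficiency` parameter (the fraction of ideal speedup you actually get) is a hypothetical knob I've added for illustration, not a measured value.

```python
def min_nodes(single_machine_hours: float, target_hours: float,
              efficiency: float = 1.0) -> int:
    """Smallest node count whose estimated runtime is strictly under the target.

    Assumes runtime = single_machine_hours / (nodes * efficiency), where
    efficiency is an assumed fraction of ideal linear speedup (0 < efficiency <= 1).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    nodes = 1
    # Keep adding machines until the estimated runtime drops below the target.
    while single_machine_hours / (nodes * efficiency) >= target_hours:
        nodes += 1
    return nodes


print(min_nodes(27, 1))                    # perfect scaling
print(min_nodes(27, 1, efficiency=0.75))   # assumed 75% parallel efficiency
```

Under these assumptions the answer is 28 machines with perfect scaling and 37 at 75% efficiency. Any serial portion of the job would push the number higher, or make the one-hour target unreachable altogether.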

Photo credit: Jane M. Sawyer