Tuesday, 9 February 2010

... how many jobs‽

During the course of trying to track down a problem with one user's jobs, we ran gqstat.

Which started collecting status on a job. And then another. And more.

Then it kept going.

Turned out that there were around 1500 jobs (each with their own set of subjobs) which had status to collect. This took around 10 minutes to gather...

A little bit sub-optimal methinks. I've done some profiling since then, and it looks like the dominating factor in the time taken is the connection setup / tear down - no doubt due in large part to the SSL certificate overhead. gqstat collects each job individually, so we incur that hit for every job.

In a way, this is a good problem to have - it shows that the tool is sufficiently usable for an end user to submit thousands of jobs, and it validates the illusionary shared filesystem approach. Had that illusion not been created, the monitoring tools would have had to be used more.

Of course, just because that fits one person's workload doesn't mean that it's going to be great for everyone, so it's not the end of the road yet. Still, a nice milestone to hit.

Anyway, back to this performance issue - thankfully I had it report when it collects job information, so it was clear _what_ was happening. This does suggest an improvement, however - collect a group of jobs, and then display them, then collect, display etc.
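As a rough sketch of that collect-then-display loop (not gqstat's actual code): the `collect` and `display` callables and the batch size of 50 are placeholders I've made up for illustration, and `collect` is assumed to return a dict mapping each job ID to its status.

```python
from itertools import islice


def batches(job_ids, size):
    """Yield successive lists of at most `size` job IDs."""
    iterator = iter(job_ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def collect_and_display(job_ids, collect, display, size=50):
    """Collect one batch, display it, then move on to the next batch.

    `collect` and `display` are hypothetical callables:
    collect(list_of_ids) -> {job_id: status}, display(job_id, status).
    Returns the number of jobs displayed.
    """
    shown = 0
    for chunk in batches(job_ids, size):
        statuses = collect(chunk)
        for job_id in chunk:
            # Jobs missing from the reply are shown as "Unknown"
            display(job_id, statuses.get(job_id, "Unknown"))
            shown += 1
    return shown
```

The point of the shape is that the user sees the first batch as soon as it arrives, rather than waiting for the whole run to finish.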

The speed hit from the SSL setup/tear down can be mitigated if we can collect information about several jobs at once. gqstat keeps the status on disc, so I'd need to be able to separate the status data for each job after that, but I think this is straightforward. It'll involve creating a temporary file listing the JobIDs of several jobs, and then querying that. Of course, the ideal way would be to have a nice Python API for querying job information... Given that it's gLite 3.2 now, it might be worth giving the Python API another shot, when I get some free time (ha!).
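A minimal sketch of the two ends of that plan, under some loud assumptions: the query step itself is left out, the function names are mine, and the `marker` default is a guess at what the per-job header line in the combined output looks like, so it would need adjusting to whatever the status command really prints.

```python
import os
import tempfile


def write_job_id_file(job_ids):
    """Write one JobID per line to a temporary file and return its path.

    The caller is responsible for removing the file afterwards.
    """
    fd, path = tempfile.mkstemp(prefix="gqstat-", suffix=".jobids", text=True)
    with os.fdopen(fd, "w") as handle:
        for job_id in job_ids:
            handle.write(job_id + "\n")
    return path


def split_status_by_job(output, marker="Status info for the Job :"):
    """Split combined status text into {job_id: text_block}.

    ASSUMPTION: each job's block starts with a line beginning with
    `marker`, followed by the job ID. Lines before the first marker
    are ignored.
    """
    blocks = {}
    current = None
    for line in output.splitlines():
        if line.startswith(marker):
            current = line[len(marker):].strip()
            blocks[current] = []
        if current is not None:
            blocks[current].append(line)
    return {job_id: "\n".join(lines) for job_id, lines in blocks.items()}
```

Each block returned by `split_status_by_job` could then be written to the per-job status file on disc, so the rest of the tool would not need to know the jobs were queried together.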

So, in summary, there are performance issues with gqstat at around 1000 jobs in flight, and I have a plan for how to deal with that. As to _when_ ... I think that'll have to wait till I get the data staging sorted out, putting it in the 1.6 timeframe.

2 comments:

  1. Did you ever use the Python API and see if that helped? We have an SSL certificate and that seems to be a bottleneck for certain queries we are running at the moment on our server!

  2. Had another look at doing it in Python ... and it turns out that it's basically too painful right now.

    Any particular way of doing things needs to use authenticated SSL handshaking - and with Python before 2.6, that's not possible out of the box. So it died before it started, because otherwise I'd have a dependency that is not normally installed. And for a performance enhancement, that's not an acceptable trade-off.

    At some point, I'll write it with intelligent fallback; but given that I can't get it working at all off a standard gLite UI, I'm going to focus on more useful things in the short term.
