Monday 6 December 2010

1.5.2 bugfix release

Fixed a problem with gqdel, and disabled deep resubmission with direct stage out to GridFTP servers.

Deep resubmission can be triggered if there is a problem in the final stage-out, and if that happens you end up with an uncertain state on the remote server. So deep resubmission is disabled, making the job fail fast instead.

I'm having a couple of problems accessing the SVN repository right now; so whilst I have more features written, they're being held up by procedural issues.

The repo contains the new RPM, and a tarball is also available.

Monday 2 August 2010

Not dead! And 1.5.1

I've not updated since the Uppsala conference, mostly due to unexpectedly high levels of work. As the title subtly hints, I'm not dead yet (in spite of a few things).

Much more relevantly, there's a 1.5.1 release containing some modest restructuring. This is primarily to make it fit in better with Python conventions, so that I could use the more powerful packaging they offer.

In particular, there's an RPM repository set up at http://www.scotgrid.ac.uk/gqsub/repo which contains RPMs for gqsub. There's also a .tar.gz bundle that works much like the previous versions. If you're a system administrator, you probably want the RPM; if you're an end user, the .tar.gz bundle can be unpacked into your own directory and used from there. There is no difference in functionality between the two.
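
For yum-based systems, pointing at the repository is an ordinary repo definition along these lines (a sketch only; the file name and options here are illustrative, so check the gqsub page for the canonical instructions):

# /etc/yum.repos.d/gqsub.repo (hypothetical file name)
[gqsub]
name=gqsub
baseurl=http://www.scotgrid.ac.uk/gqsub/repo
enabled=1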

Beyond that, work continues on the various features, albeit with some slippage.

Wednesday 14 April 2010

Array jobs and ordered arrays

The current multiple job interface - the array job semantics - was pulled from SGE/Torque, without a lot of reflection.

This is not necessarily a bad thing, given that it matched existing interfaces, and provided functionality that was needed by users. However, it offers only a single model of sub-job interaction - the trivially parallel case.

Within the home context of qsub, this is not unreasonable. If there are other dependency models, they could be handled separately, with low overhead. For example, an initial calculation needed by subsequent jobs would be run first, a final gather step run afterwards, etc. In general, if a user had complex job chaining needs, they could always get qsub installed on the cluster, and have jobs launch jobs. That's not available in Grid contexts, and the latency on jobs makes doing initial/final jobs more painful.

So, I'm thinking about four specific modes of operation for array jobs. The first is the current behaviour, which will remain the default: a bag of tasks (unordered). The next two are slight variations on that - first-first and last-last. First-first guarantees that the indicated task runs to completion before any other task starts; last-last guarantees that the final task won't start until all the other tasks have completed. These would typically be used with the task-index environment variables, so the same script has all the job control built in and selects its behaviour at run time - see the sketch below. Clearly, first-first and last-last could both be used at the same time.
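
As a sketch of how that might look in practice (this assumes the SGE-style SGE_TASK_ID environment variable and qsub's -t array syntax; the exact names gqsub exposes may differ, and the task scripts are invented):

#!/bin/bash
#GQSUB -t 1-10
if [ "$SGE_TASK_ID" -eq 1 ]; then
    # first-first: this task completes before any other starts
    ./prepare_inputs
elif [ "$SGE_TASK_ID" -eq 10 ]; then
    # last-last: this task starts only after all the others finish
    ./gather_results
else
    ./process_chunk "$SGE_TASK_ID"
fi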

The final sequencing mode would be strict-ordering. Each task is performed in order, and each task completes before the next one starts. This allows any task to refer to the results of previous tasks. Obviously, this also implies first-first and last-last as a side effect.

I can compile these sequencing modes down to the WMS's DAG jobs, as they are all clearly subsets of a Directed Acyclic Graph of tasks. My gut feeling is that these modes cover the most common sequencing patterns, without having to handle complex specifications - although I think I need to see if I can dig out some data on that.
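
For a three-task array with strict ordering, the generated job might compile down to something like this (a sketch from memory of the WMS DAG JDL syntax; node names and file layout are invented, and attribute spellings should be checked against the JDL specification):

[
Type = "dag";
nodes = [
  task1 = [ file = "task1.jdl"; ];
  task2 = [ file = "task2.jdl"; ];
  task3 = [ file = "task3.jdl"; ];
];
dependencies = { { task1, task2 }, { task2, task3 } };
]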

At the moment, I'm tagging these job ordering features as unscheduled - so many ideas, and I'm working on them ordered by user feedback. If you'd use these ordered sub-tasks, let me know, and I'll bump them up the list!

Sunday 11 April 2010

EGEE User Forum

Arrived in Uppsala, Sweden for the EGEE User Forum. There's a poster about the data handling side of things, and I'll be demoing gqsub from the UKI stand at various points. Feel free to grab me to talk about gqsub, and other things if you're around!

Given that I've just got all the ideas that came from the last User Forum implemented, it's time to fill that buffer back up again...

Monday 29 March 2010

CREAM Engine

As Stuart mentioned, I'm putting a CREAM engine together. A basic first draft has been checked in, and I'm in the process of writing a set of tests specifically for CREAM rather than amending the original test set (so I can make sure I haven't broken gLite submission, as well as checking the CREAM one works).

Thursday 18 March 2010

1.5.0 release, and further developments

That's the 1.5.0 release up on the gqsub page.

This handles data staging to and from Storage Elements. It's a little ... direct ... in a few cases; if you are staging multiple files and they have multiple replicas, it isn't guaranteed to use the optimal replica. This is, however, a fairly minor issue - if you have to handle that much data then we're quite a bit beyond something that could run on a local cluster. Nevertheless, I will work on that part, and tighten it up. As it stands, it's comparable with other user tools that weren't written for a particular VO.

In parallel with that, Morag is working on a second submission engine on the back end - this one for direct submission to CREAM. It's not very different, which makes it an excellent second target, and a solid step on the way to backend independence. In particular, direct submission to CREAM requires a GridFTP server, or something similar, in order to get the data back, so per-backend requirements need to be handled. She's already cleaned up some of the job handling code; so look for direct CREAM submission (no WMS needed) in around the 1.6-1.7 timeframe.

Now, on to more data handling code ...

Thursday 4 March 2010

Data transfers - now with workyness

As of the current SVN tree, data handling from SEs is supported. (It would have been yesterday, but there was a blip in the SVN servers.)

The general method of specification is:

#GQSUB -W stagein=jobName@sourceName

or

#GQSUB -W stageout=jobName@targetName

The syntax is derived from the PBS/Torque/SGE model for specifying stagein/out files, with a few extensions.

The jobName is the name the file should have at the time of execution on the worker node. The sourceName and targetName specify where to get the file from / put it to. If the jobName is omitted, then the file arrives with the same name as it had remotely.

The valid ways of requesting a file are:

simpleFileName
local/path/to/file
    A file local to the submission machine.

host:file
host:path/to/file
    An SCP destination, which the submission machine can access.

gsiftp://host/path/file
    A GridFTP URL.

srm://host/path/to/file
    An SRM Storage Element.

lfn:/path/to/file
    A Logical File Name (via LFC).

For the SRM and LFC modes, these are converted at submission time into GridFTP URLs. This is fine if the data is only in one place, but I'll change that later to automatically pull from the nearest replica.
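
Putting it all together, a job's staging might look like this (the paths, SE, and programs here are all invented for illustration):

#!/bin/bash
# Pull the input down by logical name; it arrives as input.dat
#GQSUB -W stagein=input.dat@lfn:/grid/myvo/run42/input.dat
# Push the results directly to a Storage Element afterwards
#GQSUB -W stageout=results.tar@srm://se.example.org/myvo/run42/results.tar
./analyse input.dat
tar cf results.tar output/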

Unfortunately, determining the topologically nearest SE is ... tricky. Most of the usual tools (ping times, packet pairing for bandwidth assessment, etc.) are unreliable, as ICMP is not guaranteed. So I think I'll have to get down and dirty with TCP/IP and HTTPS, and use those to make the estimates. Fun ... in an ironic sort of way.
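
One ICMP-free estimator is to time a bare TCP connect to a port the SE is known to listen on. A quick sketch with curl (the hostname is invented; 8443 is the conventional SRM-over-HTTPS port):

curl -k -s -o /dev/null -w '%{time_connect}\n' https://se.example.org:8443/

That prints just the connect time in seconds - a workable first-order proxy for 'nearness'.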

Still, it's there, it works, and it does the job.

I'm going to do some more testing and stressing before I wrap up the 1.5 release - probably next week.

Oh, look - our cluster is quiet right now. Time to find out what breaks when I throw thousands of jobs at it...

Thursday 25 February 2010

Buckets and sieves

Deep into data management at the moment - hence buckets of data.

One of the more annoying aspects of working with computers is the leaky abstraction - and that's where the sieve comes in.

A leaky abstraction is a particularly cruel thing - it appears to promise things that don't hold true. For example, if something acts like a list that you can add things to, you expect to be able to add anything to it. If it didn't allow items starting with the word 'aardvark', most people would never notice. The moment an entomologist gets involved, the abstraction breaks apart.

The particular problem I've been having is that when staging in files, there's an established syntax for naming them which allows them to have different names on the worker node than on the source filesystem. So what should happen if you want several files with the same source filename, but distinct names on the worker node?

Well, at the moment, you get a right mess. Files are staged in to a local file with the same filename as the remote one. Then, if you need it, gqsub will rename them. If there are two files with the same remote name, the last one specified clobbers the others. That's a problem the gLite middleware has had for a while (mostly as it doesn't support renaming files at all), but it's a leak - it breaks the mental model we're trying to support here.

In addition, what happens if the user wants two copies of the same file, under different names, on the worker node? It's perhaps a rare case, but I can think of a few uses (mostly where the file is mutated by the work, and it's the differences that are important, not the absolute data). In this case, we have to consider where we do the duplication, and how. I plumped for doing it on the worker node - most filesystems will be able to do copy-on-write, and the network transfer is likely to be more expensive than the local copy.
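
To make both cases concrete (the hosts and filenames here are invented):

# Two different remote files sharing a basename - without the rename step,
# the second stage-in would clobber the first:
#GQSUB -W stagein=config_a.dat@host1.example.org:runs/a/config.dat
#GQSUB -W stagein=config_b.dat@host2.example.org:runs/b/config.dat

# The same remote file wanted twice under different names - transferred once,
# then duplicated locally on the worker node:
#GQSUB -W stagein=before.dat@srm://se.example.org/myvo/state.dat
#GQSUB -W stagein=after.dat@srm://se.example.org/myvo/state.dat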

Anyway, this is mostly about the issues, as a way of thinking about the resolutions.

Ultimately, I'd like the abstractions presented to be as leak-free as possible, so that people can reason about what will happen without having to make reference to the underlying system.

Monday 15 February 2010

Data, data everywhere ...

and not a drop to download.

(I know it doesn't scan right - work with me here...)

Currently working on the data management aspects; aiming to make data as easy to work with as compute. And as efficient as possible...

There's a certain amount of mismatch between the various data tools in gLite / LCG. It's as if one group went with round wheels and the other built roads of inverted catenaries. Both fine ideas, but not when used together...

Within the Job Description Language, we can specify hard requirements on data files that we want to be 'close' to [0]. However, we can't get the job wrapper to handle staging in those files for us - so the job has to do that itself.
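
For reference, the JDL side of 'close to the data' looks roughly like this (sketched from the WMS JDL attributes; the catalogue type and file name here are illustrative):

DataRequirements = {
  [
    DataCatalogType = "DLI";
    InputData = { "lfn:/grid/myvo/run42/input.dat" };
  ]
};
DataAccessProtocol = { "gsiftp" };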

The primary tools [1] for handling data are the lcg-utils, which are pretty good. They work nearly transparently with gsiftp (single file) URLs, srm (storage element) URLs, and lfn (logical file name) URIs - so as you move up the hierarchy, the tools stay very similar. They're also auto-load-balancing, picking a random replica each time when the same data is in multiple locations.
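
For example, fetching a file by logical name onto the local machine looks something like this (the VO and paths are invented; check lcg-cp's man page for the exact options on your UI):

lcg-cp --vo myvo lfn:/grid/myvo/run42/input.dat file:$PWD/input.dat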

Alas, these two don't play well together - if we use the JDL DataRequirements, we'll be on a worker node 'close' to a Storage Element that has the data; but using lcg-utils naively, we'll pull the data from somewhere random. That's not good, given we already know we're close to one copy.

So the plan is, in the short term, to do it that way, in all its simple-to-write, inefficient glory. Once I have a working infrastructure for data staging, I'll refine it. There are interesting trade-offs between the load on the Logical File Catalogue, the runtime on the worker node, and reliability. But the first stab will be one that 'works', in the sense of getting the job and the data into one place.


[0] Although if you specify a set of files that are not all present in a single Storage Element, then nowhere matches. Not the most helpful ...
[1] Read: most usable ones. You can do the same thing in more complicated ways, if you really want to...

Tuesday 9 February 2010

... how many jobs‽

During the course of trying to track down a problem with one user's jobs, we ran gqstat.

Which started collecting status on a job. And then another. And more.

Then it kept going.

Turned out that there were around 1500 jobs (each with their own set of subjobs) which had status to collect. This took around 10 minutes to gather...

A little bit sub-optimal, methinks. I've done some profiling since then, and it looks like the dominating factor in the time taken is the connection setup / tear-down - no doubt due in large part to the SSL certificate overhead. gqstat collects each job individually, so we incur that hit for every job.

In a way, this is a good problem to have - it shows that the tool is sufficiently usable for an end user to submit thousands of jobs, and it validates the illusory shared-filesystem approach. Had that illusion not been created, then the monitoring tools would have to be used more.

Of course, just because it fits one person's workload doesn't mean it's going to be great for everyone, so it's not the end of the road yet. Still, a nice milestone to hit.

Anyway, back to this performance issue - thankfully I had gqstat report as it collected each job's information, so it was clear _what_ was happening. This also suggests an improvement - collect a group of jobs, then display them; then collect, display, and so on.

The speed hit from the SSL setup / tear-down can be mitigated if we can collect information about several jobs at once. gqstat keeps the status on disc, so I'd need to be able to separate the status data for each job afterwards, but I think this is straightforward. It'll involve creating a temporary file listing the JobIDs of several jobs, and then querying that. Of course, the ideal way would be a nice Python API for querying job information... Given that it's gLite 3.2 now, it might be worth giving the Python API another shot, when I get some free time (ha!).
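
The batched query itself would be along these lines (assuming the -i input-file option the WMS status commands provide for reading several job IDs from one file, plus --noint to skip the interactive chooser; the temporary file name is invented):

glite-wms-job-status --noint -i /tmp/gqstat-batch-ids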

So, in summary, there are performance issues with gqstat at around 1000 jobs in flight, and I have a plan for dealing with them. As to _when_ ... I think that'll have to wait till I get the data staging sorted out, putting it in the 1.6 timeframe.

Friday 5 February 2010

1.4.1 release, and MyProxy

A small update over yesterday's 1.4.0, fixing problems with existing default.jdl files - mostly as a way of retiring that feature, which had been used to enable MyProxy proxy renewals.

However, the 1.4 separation work was done with the express purpose of supporting MyProxy natively. So, if you're in the situation where you want long-running jobs, here's how to enable that:

credentials = MyProxy


in your gqsubrc file.

And then forget about it. When credentials need refreshing, you will be prompted for your certificate passphrase. Let the computer work out what's needed, and get on with the research ... Hrm, that would make a good motto for gqsub.

This does assume that the user interface machine is set up for MyProxy appropriately.

The other alternative for credentials is 'PlainVoms', which means using an ordinary proxy certificate.

At some point, I'll extend this to allow for the voms-proxy-from-proxy mechanism used on some UIs - that will make it simpler to use on non-institutional UI machines.

Thursday 4 February 2010

Beginning in the middle

After all, beginning at the start is such a cliché...

So this is a blog primarily to talk about gqsub, a hunk of Python code that exists to provide a better interface to Grid job submission and management. I'll be talking about the development process, and showing some of the ways in which it can be used.

gqsub has been around for a while now, currently on the 1.4.0 release. I'll put a post up shortly talking about the latest changes in 1.4.0.

Previously I've stuck a couple of posts about it on Scotgrid on Fire!, but it's really a bit separate from the Scotgrid work - and I don't want to overly clutter that blog with the minutiae of gqsub development.