gqsub dev blog: Data, data everywhere ...

and not a drop to download.

(I know it doesn't scan right - work with me here...)

Currently working on the data management aspects; aiming to make data as easy to work with as compute. And as efficient as possible...

There's a certain amount of mis-match with the various data tools in gLite / LCG. It's as if one group went with round wheels and the other build roads of inverted catenaries. Both fine ideas, but not when used together...

Within the Job Description Language, we can specify hard requirements on data files that we want to be 'close' to [0]. However, we can't get the job wrapper to handle staging in those files for us - so the job has to do that itself.

The primary tools [1] for handling data is the lcg-utils; which are pretty good. They work nearly transparently with gsiftp (single file) urls, srm (storage element) urls, and lfn (logical file name) uri's - so as you move through the hierachy, the tools are very similar. They're also auto load balancing, so picking a random replica each time for cases when the same data is in multiple locations.

Alas, these two don't play well - if we use the JDL DataRequirements, we'll be on an worker node 'close' to a Storage Element that has the data; but using lcg-utils naively, we'll pull the data from somewhere random. That's not good, given we already know we're close to one.

So the plan is, in the short term, do it that way, in all it's simple to write, inefficent glory. Once I have a working infrastructure for data staging, then I'll refine it. There's interesting trade-off's in the load on the Logical File Catalogue, versus runtime on the worker node, against reliability. But the first stab will one that 'works', in the sense of gets the job and the data in one place.

[0] Although if you specify a set of files that are not all present in a single Storage Element, then nowhere matches. Not the most helpful ...
[1] Read: most usable ones. You can do the same thing in more complicated ways, if you really want to...

gqsub dev blog

Monday, 15 February 2010

Data, data everywhere ...

No comments:

Post a Comment

Blog Archive

Contributors