Thursday, 25 February 2010

Buckets and sieves

Deep into data management at the moment - hence buckets of data.

One of the more annoying aspects of working with computers is the leaky abstraction; and that's where the sieve comes in.

A leaky abstraction is a particularly cruel thing - it appears to promise things that don't hold true. For example, if something acts like a list that you can add things to, you expect to be able to add things to it. If it didn't allow things that started with the word 'aardvark', most people wouldn't notice it. The moment an entomologist gets involved, and the abstraction breaks apart.

The particular problem I've been having is that when staging in files, there's an established syntax for naming files, which allows them to be different names on the worker node than on the source filesystem. So what should happen if you want several files with the same source filename, but distinct names on the worker node?

Well, at the moment, you get a right mess. Files are staged in to a local file with the same filename as the remote. Then, if you need it, gqsub will rename them. If there's two files with the same remote name, then the last specified one clobbers the others. That's a problem that the gLite middleware has had for a while (mostly as it doesn't support renaming files at all), but it's a leak - it breaks the mental model we're trying to support here.

In addition, what happens if the user wants 2 copies of the same file, under different names, on the worker node? It's perhaps a rare case, but I can think of a few cases (mostly where the file is mutated by the work, and it's the differences that are important, not the absolute data). In this case, we have to consider where do we do the duplication, and how. I plumped for doing on the worker node - most filesystems will be able to do copy-on-write, and the network transfer is likely to be more expensive than the local copy.

Anyway, this is mostly about the issues, as a way of thinking about the resolutions.

Ultimatly, I'd like the abstractions presented to be as leak free as possible, so that people can reason about what will happen without having to make reference to the underlying system.

No comments:

Post a Comment