Data repositories

MSIAC has a thoughtful analysis of data repositories for simulation, concluding that small, tightly controlled and focussed repositories are best. Current US practice is to require specific data repositories to be built and used for major military contracts.

There are several excellent internet journals devoted to simulation. The latest issue of the MSIAC Journal has an article on data repositories. (“M&S Repositories – Lessons Observed” by Jerry Feinberg, Gary Misch, and Laurie Talbot of Alion Science and Technology. Copyright 2008, SISO, Inc.)

This looks at what it calls a “plethora of repositories in general, and many M&S repositories in specific”, and asks whether data repositories actually help simulators. They sound like a good idea:
– more efficient use of resources
– data only needs to be assessed once

Against this:
– “the lack of a business model encouraging their population, use, and content sharing”, and the reluctance to share data without safeguards
– the need to review the repository carefully to ensure that the data (and its format) is suitable.

The article describes three US military data repositories:
– MSRR (Modeling and Simulation Resource Repository). Started in 1993, this is “only a catalogue of resources… the current MSRR System provides retrieval of metadata descriptions only of modeling and simulation resources.”
– OAML (Oceanographic and Atmospheric Master Library). Started in 1984, it includes “ocean models, electromagnetic (EM) models, acoustic models, meteorological models, and a category called other models. OAML also contains ocean databases, meteorological databases, acoustic databases, and electromagnetic databases.”
– ACAT I IDEs (“Integrated Digital Environments (or IDEs) that have been developed for most recent ACAT I procurements (also known as Major Defense Acquisition Programs).”) These are linked and limited to single defence programmes, such as the Joint Strike Fighter. Contractors are required to populate and use them.

The authors develop a framework for evaluating repositories, based on:
– justification (eg cost savings)
– policy support (eg enforceability)
– scope (eg what and how much is in it)
– access (how many users, and how easy is it to use?)
– implementation (“how the repository is populated, how the information contained in the repository is validated, how the repository is managed, and how the repository is funded.”)

The authors then compare MSRR, OAML and ACAT IDEs on these criteria.

They derive some interesting lessons:
– repositories sound like a good idea – if you build it, they will come – but this doesn’t always work in practice.
– the key to success is a clear policy mandating the use of the repository (which OAML and the IDEs have and MSRR doesn’t)
– a clear and narrow scope makes a repository easier to develop and maintain
– ease of access is more important than the number of authorised users
– repositories need a single authority which validates and checks all data and its formatting or configuration, and sets standards – you can’t rely on the data providers to do this.

In conclusion, the article hints that MSIAC is also testing a centrally controlled wiki approach, about which it promises a later paper. It quotes a British knight: “…you can’t always get what you want, but if you try sometimes, you just might find, you get what you need”.

I don’t think Sir Michael Jagger had data in mind, but it’s a fair summing up. This has been on my mind recently because of some work I’m doing for a client which involves storing a small amount of data. Another company is also running programmes which store similar data. The systems aren’t interoperable, but there is a common point of contact – much of the data relates to locations, which of course have coordinates. One ideal might be a web service approach, in which each programme can respond to an XML-RPC query from any of the others which says, in effect, tell me all you know about lat x / long y (or: identify me, and tell me all I’m allowed to know, or need to know, of what you know about it).
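To make that concrete, here is a minimal sketch of what such a query service could look like, using Python’s standard-library XML-RPC support. The method name (describe_location), the in-memory store and the 0.01-degree matching tolerance are all my own illustrative assumptions, not anything from a real programme, and the “identify me / what am I allowed to know” access-control side is left out entirely.

```python
# Toy "tell me all you know about lat x / long y" responder, built on
# Python's standard-library XML-RPC server. Everything here is illustrative.
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical records this programme happens to hold, keyed by coordinates.
RECORDS = {
    (51.5074, -0.1278): ["site survey, 2007", "access road unsuitable for heavy vehicles"],
    (52.2053, 0.1218): ["environmental assessment held by partner programme"],
}

def describe_location(lat, lon):
    """Return everything we know about points within a crude 0.01-degree box."""
    matches = []
    for (rlat, rlon), notes in RECORDS.items():
        if abs(rlat - lat) < 0.01 and abs(rlon - lon) < 0.01:
            matches.extend(notes)
    return matches  # an empty list means "nothing known here"

if __name__ == "__main__":
    server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
    server.register_function(describe_location)
    server.serve_forever()
```

Any of the other programmes could then ask the question over the wire, without sharing a database or agreeing a common schema:

```python
from xmlrpc.client import ServerProxy

proxy = ServerProxy("http://localhost:8000/")
print(proxy.describe_location(51.5074, -0.1278))
```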

I suppose the ideal would be for them to share a database, but since they are all doing slightly different things with the same information this would lead to formatting and content conflicts (or to each app having to wrestle with a larger and more complex database, as well as a large number of significant code changes). The client finds itself storing very large amounts of data, often repetitive, scattered amongst text documents, with limited version control. Using this typically requires human extraction and re-formatting or re-purposing, and each piece of data has to be re-validated in case it has been overtaken. It’s not that it’s a legacy system, it’s the only system! But in my experience this is entirely typical of current data management.
