Model code base and configuration metadata
... plans for extending the Cera data model with metadata describing numerical models
Model Metadata in Cera
The Cera metadata model (MDM), and therefore the WDCC database (DB) run by M&D/DKRZ, focuses on metadata (MD) describing the data included in its archive, whether observed or generated by numerical models.
The Cera MDM gives little detail on the method used to generate the data, e.g. the numerical models themselves.
Information about the method used to generate model data is given as free text in the project or experiment description fields. This information can be retrieved from the DB and displayed; however, because it is free text, the specific pieces of information it may contain cannot be searched. In addition, there is no way to guarantee that required information is included.
Another way to include information about data generation methods in the WDCC is through keywords attached to projects or experiments. A keyword can be a model name, e.g. HADCM3. For a user with expert knowledge of that particular model, the keyword implies a good deal of additional information. Without this knowledge, the user has to fall back on other information sources, e.g. the internet. A keyword can also indicate a project name, e.g. IPCC-AR4. Again, this may imply a range of additional information to experts, but not to non-expert users.
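The difference between free-text descriptions and searchable fields can be illustrated with a small sketch. It is not the Cera schema: the Experiment class, the model_md dictionary, the experiment names, and the field name ocean_vertical_levels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    """Hypothetical, simplified stand-in for an experiment entry."""
    name: str
    description: str  # free text, as in the current description fields
    keywords: List[str] = field(default_factory=list)
    # Assumed structured extension: one value per named property.
    model_md: Dict[str, str] = field(default_factory=dict)


def find_by_free_text(experiments, phrase):
    """Substring search: succeeds only if the author used this exact wording."""
    return [e.name for e in experiments if phrase.lower() in e.description.lower()]


def find_by_field(experiments, key, value):
    """Field search: independent of how the description happens to be phrased."""
    return [e.name for e in experiments if e.model_md.get(key) == value]


# Two made-up entries describing the same property in different words.
experiments = [
    Experiment("exp_a", "Coupled run; the ocean uses a 20-level grid.",
               ["MODEL_X"], {"ocean_vertical_levels": "20"}),
    Experiment("exp_b", "Coupled run with twenty vertical levels in the ocean.",
               ["MODEL_X"], {"ocean_vertical_levels": "20"}),
]

free_text_hits = find_by_free_text(experiments, "20-level")
field_hits = find_by_field(experiments, "ocean_vertical_levels", "20")
```

In this sketch the free-text search finds only exp_a, because exp_b spells the number out, while the field search finds both entries.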
Purpose of Model Metadata
MD about the numerical model with which a data file has been generated may serve several purposes.
One purpose is to enable the reproduction or reconstruction of scientific results. This has two facets.
Firstly, it allows later examination of the correctness of results and of the way they were obtained, as laid down by the Max Planck Society in Regeln zur Sicherung guter wissenschaftlicher Praxis. These 'rules to ensure good scientific practice' say that sufficient and complete documentation allowing the reproduction of data has to be accessible to people with a vested interest for at least 10 years. Ten years is also the period for which primary data should be archived for later checks and control. With regard to data archived in the WDCC, this means that all publicly accessible data should be provided with publicly accessible documentation.
Reproducibility is also an issue for individual scientists, who are often unable to reproduce their own results after some time because they don't have the right tools to safely archive documentation in a way that allows them to retrieve it when necessary and relate it to the correct data sets.
Secondly, model-describing MD help data users to understand how data have been generated. This becomes an important issue in projects where multi-model data are analyzed by scientists who do not know the numerical models in detail (e.g. ?MIP projects, IPCC). For these users it may be sufficient to know which schemes have been used and perhaps the values of characteristic parameters. While this will not enable them to reproduce model output, it may put them in a position to understand differences between the results of different models.
Requirements
MD which serve the first of the above-noted purposes must include information about the full source code of the programs used, as well as information about all configuration aspects at compilation or run time.
The former may take the form of the source code itself or of a pointer to the repository where the code is archived. In the latter case, the source code archive must have a lifetime at least as long as that of the data archive. This has not been the case in the past if we compare the lifetimes of WDCC data sets with those of the source code archives (e.g. at the Max Planck Institute for Meteorology: nupdate in the early 1990s, ClearCase in the first half of this decade, SVN now). Information on configuration details, such as conditional compilation control flags or namelist input, is presumably lost even more quickly than the source code itself. Other critical MD on numerical experiments are the information about initial, boundary, and other forcing data.
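The items listed above (source code or a pointer to it, compile-time and run-time configuration, forcing data) can be thought of as one record per model run. The following is a minimal sketch of such a record with a completeness check. The class RunProvenance, its field names, and the repository URL are hypothetical and not part of the Cera MDM.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RunProvenance:
    """Hypothetical record of what is needed to reproduce one model run."""
    source_repository: str = ""   # pointer to the archived source code
    source_revision: str = ""     # exact revision that was compiled
    compile_flags: List[str] = field(default_factory=list)   # e.g. conditional compilation flags
    namelist_files: List[str] = field(default_factory=list)  # run-time configuration
    forcing_data: List[str] = field(default_factory=list)    # initial, boundary, other forcing


def missing_fields(record):
    """Return the names of all fields that are still empty, in declaration order."""
    return [name for name, value in asdict(record).items() if not value]


# Made-up example: the source pointer is known, the configuration is not.
record = RunProvenance(
    source_repository="svn://example.org/model/trunk",  # placeholder URL
    source_revision="1234",
)
still_missing = missing_fields(record)
```

A check of this kind addresses the point that free-text descriptions give no guarantee that required information is included: a record could be refused until missing_fields returns an empty list.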
Thornton et al. (2005) discuss the requirements on a model source code archive serving the above purposes in Archiving Numerical Models of Biogeochemical Dynamics.
An extension of the Cera MDM to include model describing MD must be able to satisfy the needs of two kinds of users.
The first kind tends to download data produced with the same (well-known) model and is therefore mainly interested in using the MD as a kind of notebook recording the exact details of the run that produced the data (e.g. the COSMOS community).
The second kind of user wants to download data from multiple models without being an expert on those models, and therefore needs descriptive model MD that make it possible to differentiate between models by their formulation of processes (e.g. scientists undertaking IPCC studies). These MD should be searchable.
It must be kept in mind that Cera is hosting the WDCC with the mission to collect, store, and disseminate data for climate research to serve the scientific community.
Any activities defining new MD for the Cera DB should have an obvious benefit to the users of the WDCC and should not disturb the users' activities.
Therefore new MD must integrate seamlessly into the existing model, and must be very stable.
The option under discussion is therefore to include model-describing MD through optional MODULES or LOCAL EXTENSIONS of the kind already used in the Cera data model. One such module, for example, describes the model input at run time.
The definition of such MODULES or EXTENSIONS for more comprehensive model MD is embedded in discussions with other international initiatives on model MD, such as PRISM, NMM, FLUME, and NetCDF/CF, in order to arrive at an implementation compatible with other MDMs and international standards. The most outstanding and comprehensive model-describing MDM seems to be that of the NMM workgroup embedded in the GO-ESSP.
A different approach is the concept of Specific MD or Additional Information. It allows model MD to be included technically in the Cera DB without creating unstable structures, a risk that arises because there is so far no internationally agreed standard for numerical model MD.
In contrast to the Catalogue MD, which can be used for browsing, searching, and retrieval activities, and therefore should have low complexity and high stability, the Specific MD may have an unstable structure and can more readily be adapted to new agreed-on (international) standards.
They will not enter the general Cera DB table space. However, they can be downloaded and displayed on the DB GUI.
This concept presumably serves mainly the first purpose of model MD mentioned above, i.e. providing the means to reproduce data.
The content will mainly be 'lump' data, i.e. data lumped together as they are used during download, compilation, or running of a model (e.g. namelists, compile/run scripts, source code). These content data can be binary files, text files, or XML-formatted files. The method also enables the Cera MD model to host other model MD, e.g. NMM data, since the complete NMM XML or XSD specification files can be treated as 'lump' elements as well.
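One conceivable way to handle such 'lump' data is to pack the files of a run into a single archive together with a checksum manifest, so that the lump can be stored as one opaque object and checked after download. The sketch below assumes a gzip-compressed tar archive and SHA-256 checksums. The function names, the file names, and the manifest name MANIFEST.json are hypothetical and not part of Cera.

```python
import hashlib
import io
import json
import tarfile
from typing import Dict


def build_lump(files: Dict[str, bytes]) -> bytes:
    """Pack files (name -> content) plus a checksum manifest into one tar.gz blob."""
    manifest = {name: hashlib.sha256(content).hexdigest()
                for name, content in sorted(files.items())}
    entries = dict(files)
    entries["MANIFEST.json"] = json.dumps(manifest, indent=2).encode("utf-8")

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, content in sorted(entries.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def verify_lump(blob: bytes) -> bool:
    """Check every file in the lump against the checksum stored in its manifest."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        manifest = json.load(archive.extractfile("MANIFEST.json"))
        for name, expected in manifest.items():
            content = archive.extractfile(name).read()
            if hashlib.sha256(content).hexdigest() != expected:
                return False
    return True


# Made-up contents standing in for a namelist and a run script.
lump = build_lump({
    "namelist.run": b"&runctl\n  nlev = 31\n/\n",
    "run_model.sh": b"#!/bin/sh\n./model.x\n",
})
lump_is_intact = verify_lump(lump)
```

Because the archive is treated as a single object, its internal layout can change with future standards without touching the catalogue tables, which matches the intention of keeping Specific MD outside the general table space.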
Acceptance
Besides the specification of an MDM, the question remains of how to specify the correct values describing a model and the configuration used when the data were generated. Filling in MD forms means an additional workload, which is not readily accepted. A way has to be found to provide the information without putting extra workload on the scientists or on the people who fill in and check MD at M&D. This can be done by taking information already known to other systems, e.g. the SCE and SRE, and filling in the forms at compile and run time.
After the automatic filling and before the MD enter the Cera DB, the user can be given the opportunity to add further information by hand (information that cannot be retrieved during compilation and running), preferably through a GUI.
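The two steps, automatic filling followed by optional manual additions, might look roughly as follows. This is a sketch only: it does not use any SCE or SRE interface, the function and key names are hypothetical, and the rule that manual entries may not overwrite automatically recorded values is an assumption made here, not a stated requirement.

```python
import platform
from datetime import datetime, timezone


def harvest_run_metadata(compile_flags, namelist_paths):
    """Collect what the compile/run scripts already know, without user input."""
    return {
        "host": platform.node(),
        "system": platform.system(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "compile_flags": list(compile_flags),
        "namelist_files": list(namelist_paths),
    }


def add_manual_entries(harvested, manual):
    """Merge hand-entered fields; automatically recorded values stay untouched."""
    merged = dict(harvested)
    for key, value in manual.items():
        if key in harvested:
            raise KeyError(f"{key!r} was recorded automatically and is not overwritten")
        merged[key] = value
    return merged


# Made-up values standing in for what a compile and run script would pass on.
harvested = harvest_run_metadata(["-DCOUPLED"], ["namelist.run"])
final_md = add_manual_entries(harvested, {"purpose": "sensitivity test"})
```

The manual step then only covers what cannot be known at compile or run time, which keeps the extra workload small.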
It would also be possible to write scripts which translate information that is available at run time but meaningful to experts only (e.g. namelists) into descriptive information useful for the whole scientific community. XSLT scripts can be used to transform or scatter the namelist or XML element entries into any XML schema.
Since the complete information about the model is stored in the 'lump' elements, this can be done at any time, enabling the scientists to flexibly provide information required in future projects (e.g. IPCC).
This procedure may help to increase acceptance, completeness, and correctness of model MD.
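The translation of expert-only run-time information into descriptive MD mentioned above can be sketched as follows. For brevity the sketch uses plain Python in place of XSLT. It handles only a simple subset of the Fortran namelist syntax (one assignment per line, no '!' inside string values), and the group name runctl, the variable names, the mapping table, and the XML element names are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_namelist(text):
    """Parse a simple namelist subset into {group: {variable: value}}."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("&"):
            current = groups.setdefault(line[1:].strip().lower(), {})
        elif line == "/":
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip().lower()] = value.strip().rstrip(",").strip()
    return groups


# Assumed mapping from (group, variable) to a descriptive name and units.
DESCRIPTIONS = {
    ("runctl", "delta_time"): ("time_step", "seconds"),
    ("runctl", "nlev"): ("vertical_levels", None),
}


def describe(groups):
    """Build a descriptive XML tree from the namelist entries listed in DESCRIPTIONS."""
    root = ET.Element("model_description")
    for (group, key), (label, units) in DESCRIPTIONS.items():
        value = groups.get(group, {}).get(key)
        if value is None:
            continue
        element = ET.SubElement(root, "property", name=label)
        if units:
            element.set("units", units)
        element.text = value
    return root


SAMPLE = """
&runctl
  delta_time = 1800,   ! time step
  nlev = 31
/
"""

description = describe(parse_namelist(SAMPLE))
description_xml = ET.tostring(description, encoding="unicode")
```

Since the namelist itself stays in the 'lump' element, a mapping table of this kind can be extended later and the translation rerun, without asking the scientists for the information again.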
Presentations by M&D on the subject
Nov 2005, Toulouse PRISM Support Initiative meeting on metadata
May 2006, Exeter PRISM Support Initiative meeting on metadata
May 2006, Lueneburg COSMOS meeting





