Job Portability

UM Job Portability

Author: Adam Clayton

Additional comments from:

  • JD: John Donners
  • PLV: Pier-Luigi Vidale
  • AC: Adam Clayton

Executive Summary

Effective collaboration between the Hadley Centre, CGAM/HiGEM and the UJCC (UK-Japan Climate Collaboration) group based at the Earth Simulator Center in Japan depends on us being able to exchange UM jobs and data easily, and reliably adapt them to work on the various computer platforms. At present, the procedure for importing UM jobs is unnecessarily complicated and prone to error. With some coordination between groups the task could be made much easier.

The groups outside the Hadley Centre are currently in the process of installing a portable version of UM 6.1, a model version which will not be superseded at the Met Office for at least a year. Since the core HadGEM jobs at the Hadley Centre will shortly be upgraded to this version, this is the ideal time to get organised. The main suggestions put forward in this document are as follows:

  1. At the Hadley Centre, organise all the input data (dumps, ancillary data, source and script mods, prestash files, etc.) required for the core HadGEM jobs under a single directory that can be copied or mirrored to the other platforms.

     JD: I propose not to change the organisation of input data at first, but only to make a list of common environment variables. These can point to different locations on different platforms. Input data does not need to be reorganised, and jobs can still be shared as long as the correct path for each environment variable is used.

     PLV: I agree completely that we should have a common set of environment variables. I disagree with the idea that input data do not need to be organised in a consistent way. While it is true that with environment variables the user need not worry where the files ultimately are, this is likely to create a problem with the administration of file versions. Ideally, dumps and ancillaries are NOT supposed to change much in time, but I can cite a simple and recent example: the soil temperatures in the dumps for HadGEM are being updated because they contained values that were too cold. I am not sure that version control is in place for these, but it would be important to have a regular system for updating, importing and validating all these input files. Maintaining all this across different platforms which have wildly different structures is not an easy task. Again, it is possible that technology can help and that, in the end, it will not matter much where files are, but the issue here is not merely on the user side, it is also on the administrator’s side.

     AC: The fact is that the data at the Met Office desperately needs to be organised. At the moment, it is scattered far and wide under many accounts whose owners are, it seems, simply trusted not to change or move anything. If the data is organised together in one place, under one account, things are much easier to manage and control. In that case, we only need a handful of environment variables pointing to the top-level directories.

  2. Do the same for core HiGEM data on HPCx, and core NUGEM data on the Earth Simulator.
  3. As far as possible, remove hard-wired paths from UMUI jobs that are to be shared (such as the core HadGEM jobs at the Hadley Centre), and use environment variables to point to the main data locations.
  4. Once they become available, include the small number of 6.1 mods required for portability within core HadGEM jobs at the Hadley Centre.
  5. Pass on increases to path lengths etc. in the portable 6.1 UMUI release to the 6.1 UMUI installation at the Met Office.

     JD: If the common environment variables are moved to the general .umsetvars file, the problem of path lengths is reduced, because these are defined outside the UMUI.

     AC: One slight problem with this is that people may wonder where the environment variables are being set, so some kind of comment would be required in the ‘File and Directory Naming’ panel. I would prefer not to set things up in the .umsetvars file. If data is moved, and if we only have environment variables for the top-level directories, it’s a small job to alter one or two paths in the ‘File and Directory Naming’ panel. It could also be done automatically by the UMUI administrator by altering the basis files in the UMUI database.

     JD: The advantage of moving these environment variables to the .umsetvars file is exactly that it is no longer necessary to worry about problems when the administrator decides to move data around. Only the .umsetvars file needs to be changed, not ALL basis files. Also, when exchanging jobs, it is not necessary to change the environment variables. We can include a list of preset environment variables in the UMUI window.
  6. Maintain a UNIX script to help with sharing of UM jobs.

Introduction

With the HiGEM and Earth Simulator projects now well under way, there is a growing need to share UM jobs between different groups using different computer platforms. The main players are as follows:

  Group          Platform 
  Hadley Centre  Met Office SX-6/8 
  CGAM/HiGEM     HPCx/Newton 
  UJCC           Earth Simulator SX-6

Of course, one cannot pick up a job from someone working on a different platform and expect to be able to run it immediately on one’s own platform. The main reasons for this are as follows:

  1. UM installation differences.
  2. Platform and user dependent settings and mods.
  3. Location of input data: dumps, ancillary files, mods, hand edits, prestash files, etc.

Thus, a certain amount of work is always required to translate a job between different platforms. This can be a laborious and error-prone task, but with some coordination I believe it could be made much simpler. All of the above groups will shortly be concentrating their efforts on UM vn6.1, which will not be superseded for at least a year. This is therefore the ideal time to get organised.

The following is an attempt to highlight the main issues, and suggest some solutions. Once we have come to some agreement, we can write and distribute a guide on how to set up and swap UMUI jobs, and how to manage input data.

Vn6.1 UM/UMUI Installations

The Hadley Centre will shortly be moving their core HadGEM jobs to UM vn6.1. Meanwhile, the participating groups outside the Hadley Centre are currently installing a portable version of 6.1, which when complete will become the basis for all development work.

These UM installations - in particular the script and program libraries - will be essentially identical. However, jobs run outside the Hadley Centre will need to include a series of portability mods to the scripts and the C/Fortran source code. The script mods are not really an issue, but it would make sense to add the small number of source mods required for portability to the core HadGEM jobs. This is a simple task, and it will remove any need to resolve clashes with the portability mods when importing jobs from the Hadley Centre.

The vn6.1 UMUI installations outside the Met Office will include some additional options for the job submission panel. This means that when jobs are swapped with the Hadley Centre, there will be some warning messages about unspecified variables. These are harmless, so not really an issue. However, users outside the Hadley Centre have also found it necessary to increase the maximum number of Fortran mods, and some of the maximum path lengths (for example, for prestash files). In order to avoid problems with job portability, the same changes should be made to the Met Office UMUI installation. Again, this is a trivial task, and in any case will be beneficial for users at the Met Office.

The UMUI installation at the Met Office is not fixed. Sometimes changes are made after the initial installation to accommodate new platforms etc. If possible, this should be done in such a way that job portability is maintained. If this is not possible, the other groups should be informed of the changes so that local installations can be adjusted accordingly.

PLV: I did not understand this point. Could you make it clearer?

AC: I am thinking of the change made to the job submission panel at the Met Office to support the new SX8. Of course, it was done in such a way that UKMO users would not be affected. However, for users of other platforms, the update broke their jobs. It’s just a matter of making the UMUI team at the Met Office aware that there are people outside the Met Office who wish to import jobs from the Met Office.

Organisation of Data

Each UM job requires a series of input files; for example:

  • UMUI hand edits and pre-STASH files.
  • Script and source mods.
  • Start dumps.
  • Ancillary files.
  • Miscellaneous namelists.

In the core HadGEM jobs at the Hadley Centre, these data files are scattered throughout a number of individual user accounts, with virtually no protection against accidental deletion or modification. Even without considering the job portability issues, this lack of control is risky and needs to be addressed.

When a job is exported from one group to another, the necessary input data files need to be identified in some way, and if necessary transferred to the receiving platform. In practice, much of the input data will be available already on the receiving platform, but in theory one should check that data that appears to correspond really is identical. To do this, of course, it is necessary to transfer the data anyway, which is wasteful and time-consuming.

In order to overcome this problem, we clearly need to maintain copies of the core data on all participating platforms. In UMUI jobs, we can then point to the local locations of the data using top-level environment variables.

JD: Because of the use of common environment variables, there is no need for a common directory structure. In the long term, people will automatically start using the same directory structure (as it is easier). In the short term, this would only mean a lot of work (moving around files, adjusting all jobs, CHAOS…), without any real advantage.

PLV: I agree that this would be a little chaotic at the beginning, but I fear that it might get much more chaotic in the long term. While this plan is painful now, we are only just starting, so the investment now is small. If we decide to do this in a year’s time, THEN it would be a lot more chaotic.

AC: Firstly, I don’t think the plan is painful for the Hadley Centre, and the advantage for them of having the data better organised is considerable. Also, this doesn’t need to be done for old jobs; just for new standard 6.1 jobs. My suggestion for the Hadley Centre is as follows:

  1. If they are not already available, set up specific accounts on the UMUI platform (Linux) and the UM platform (NEC) for HadGEM (or, more generally, Hadley Centre models).
  2. Starting just with the latest 6.1 HadGEM job, organise ALL the UMUI and UM input data required for the job under these accounts following an agreed structure.
  3. Put the 6.1 UMUI job under the HadGEM account in the UMUI, setting things up according to the recommendations. Having tested it, then pass it on to us so we can check it ports OK.
  4. Set up a Wiki page at the Met Office to document standard HadGEM jobs. If such a page exists already, I haven’t been able to find it!
  5. For later standard jobs, do likewise.

JD: As has been mentioned, it is of course better if the directory structure is the same on all systems. However, I think this might not always be possible or desirable. For example, on the ES, we do not want to put hand edits and start dumps within the same directory: small files should be on one disk, and large ones on another. Also, at the ES all files necessary to run the UM are available from moon, while in Reading the files are spread between the local machines and Newton/HPCx. It might simply not be possible to use the same data structure everywhere. That is why it is vital to agree on the environment variables and desirable to agree on the directory structure.

My suggestion is that data structures are maintained for each of the main model branches; currently HadGEM, HiGEM and NUGEM (the name for the models that will be developed in Japan). These data structures should each be managed by the relevant group, with other groups either mirroring the data in its entirety, or copying the data only as required. So the initial suggestion is for the Hadley Centre to set up and maintain the core HadGEM data, CGAM to manage the core HiGEM data, and the Earth Simulator group to manage NUGEM data. One way of organising this data might be as follows:

UM data:

  ancil/atmos/n96
  ancil/ocean
  comp_overrides/met_office
  comp_overrides/necsx
  comp_overrides/hpcx
  comp_overrides/es (Earth Simulator)
  dumps/atmos/n96
  dumps/ocean
  namelists
  spectral
  mods/scripts/vn6.1
  mods/source/vn6.1

UMUI data:

  hand_edits/vn6.1
  prestash/vn6.1

So, for example, at the Hadley Centre the core HadGEM data could be organised under directories called HadGEM_CoreData_um and HadGEM_CoreData_umui on the UM and UMUI platforms respectively, and if there was a need to do some HiGEM runs at the Hadley Centre, the HiGEM data maintained at CGAM could be mirrored or partially mirrored under directories called HiGEM_CoreData_um and HiGEM_CoreData_umui.
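
As a rough illustration of the layout above, the following Python sketch creates an empty skeleton of the suggested directories. The function name `create_core_tree` and its arguments are hypothetical; only the subdirectory names and the `<model>_CoreData_<um|umui>` naming come from the proposal, and nothing here is part of the UM or UMUI.

```python
from pathlib import Path

# Subdirectories proposed in the text for the UM and UMUI platforms.
UM_SUBDIRS = [
    "ancil/atmos/n96", "ancil/ocean",
    "comp_overrides/met_office", "comp_overrides/necsx",
    "comp_overrides/hpcx", "comp_overrides/es",
    "dumps/atmos/n96", "dumps/ocean",
    "namelists", "spectral",
    "mods/scripts/vn6.1", "mods/source/vn6.1",
]
UMUI_SUBDIRS = ["hand_edits/vn6.1", "prestash/vn6.1"]


def create_core_tree(parent, model, kind):
    """Create <parent>/<model>_CoreData_<kind> with the proposed subdirectories.

    kind is "um" or "umui". Existing directories are left alone.
    Returns the root of the tree.
    """
    subdirs = {"um": UM_SUBDIRS, "umui": UMUI_SUBDIRS}[kind]
    root = Path(parent) / f"{model}_CoreData_{kind}"
    for sub in subdirs:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

A group mirroring another group's data would call this once per model and platform, for example `create_core_tree("/some/data/area", "HiGEM", "um")`, where the parent path is whatever suits the local system.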

One of the issues with porting jobs is to decide which mods and compile overrides are platform or location dependent and make the necessary adjustments. This could be made easier by organising the relevant mods under suitably named subdirectories, as indicated above for the comp_overrides directory.

One of the rules for these data structures is that any file contained within them must not be altered or removed without a very good reason (such as large amounts of disk space being used by now redundant files). If a particular file is superseded by a revised version, therefore, the revised file must either be given a new name or be put in a different directory. For example, mods that are lodged under $UMDIR/vn6.2/mods at the Met Office, and therefore still under development, could be placed in directories named according to the RCS version number: revision 1.3 of $UMDIR/vn6.2/mods/source/orh0602/orhf0602.mf77 could be copied to ‘mods/source/vn6.1/hadgem/orh0602/1.3/orhf0602.mf77’.

JD: For this, we really need an online CVS server (unfortunately, RCS can’t do this), accessible by http or ssh. This would ensure that all mods, and their complete history, are archived. The job file can be saved together with the versioning of all mods, ancillary files, compile overrides, … This ensures that at a later stage, the job, together with all the correct files, can be retrieved. You could even be warned when there are new versions of mods that you use, and you can decide whether you want to use them or not. Again, this server should use the same common environment variables. Everybody should have read access, but write access should be limited to prevent chaos.

PLV: I agree here, completely. We need to look into the possibility of implementing such a server.

AC: Sounds good, but I expect this would take some time to set up and probably has political implications that would need to be ironed out in advance, so we should view it as a long-term goal. We should start with something simple and get going with it quickly.

JD: Integrate the CVS server into the UM/UMUI. Just point script mods/source mods to revision x.xx of some online CVS file. To integrate this with everything that is already implemented, I propose to use another script based on my job portability one ;) The UMUI generates a list of files (both for the local and remote machine) that need to be fetched from the CVS server. To prevent overloading of this server, the script checks if the file is already available, and otherwise copies it from the CVS server to a local filesystem. All UM scripts just use the local copy of the CVS file. The advantages are that a file (or, more specifically, a revision of a file) in a CVS archive cannot be moved, renamed or deleted. Of course, users can easily be notified of updates to mods they use. As this system is voluntary, this doesn’t need to be political. In summary: developers can easily serve their mods to everyone, users can be certain that their mod will still be available in a year’s time, and jobs are portable.
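
The "fetch only if not already available" step that JD describes could be sketched in Python as below. This is an illustration under assumptions, not JD's script: the function names are hypothetical, and the actual retrieval from the server is left to a caller-supplied `fetch` function because the document does not specify how the server would be accessed. The local path puts the revision number in its own directory, following the naming example given above.

```python
from pathlib import Path


def local_revision_path(cache_root, mod_path, revision):
    """Return the local path for one revision of a mod.

    The revision number becomes a directory, e.g.
    orh0602/orhf0602.mf77 at 1.3 -> orh0602/1.3/orhf0602.mf77.
    """
    mod_path = Path(mod_path)
    return Path(cache_root) / mod_path.parent / revision / mod_path.name


def ensure_local_copy(cache_root, mod_path, revision, fetch):
    """Fetch a revision only if it is not already held locally.

    fetch(mod_path, revision) is supplied by the caller and returns the file
    contents as bytes; how it talks to the server is deliberately left open.
    An existing local copy is never overwritten.
    """
    target = local_revision_path(cache_root, mod_path, revision)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(mod_path, revision))
    return target
```

Because each revision has its own path and is written once, a job that refers to a given revision keeps seeing the same file even after newer revisions are fetched.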

Job Setup

Assuming data has been organised as above, in UMUI jobs one could then set up pointers in “File and Directory Naming” to the common data. The remaining data that is not common will then be easy to identify.

JD: Once we have agreed on common environment variables, these can be moved from the UMUI to the general .umsetvars file. This ensures that the correct variables are always used, and that the common environment variables need not be edited in the UMUI when transferring a job. Directories can be moved around as long as the .umsetvars file is updated accordingly. Also, the problem of path lengths is reduced as these are defined outside the UMUI.

An example setup in the “File and Directory Naming” panel for a HiGEM job might then be something like:

DATADIR=/S/data003/m0203
UM_DATAW=$DATADIR/$RUNID
UM_DATAM=$DATADIR/$RUNID
HADGEM_CORE=/S/data003/m0151/HadGEM_CoreData_um
HIGEM_CORE=/S/data003/m0151/HiGEM_CoreData_um
UM_SOURCE_MODS_HADGEM=$HADGEM_CORE/mods/source/vn6.1
UM_SOURCE_MODS_HIGEM=$HIGEM_CORE/mods/source/vn6.1
UM_SOURCE_MODS_LOCAL1=~m0203/srcmods/vn6.1
UM_SOURCE_MODS_LOCAL2=~m0204/mods/vn6.1
UM_SCRIPT_MODS_HADGEM=$HADGEM_CORE/mods/scripts/vn6.1
UM_SCRIPT_MODS_HIGEM=$HIGEM_CORE/mods/scripts/vn6.1
UM_SCRIPT_MODS_LOCAL1=~m0203/scriptmods/vn6.1
UM_SCRIPT_MODS_LOCAL2=~m0204/scrmods/vn6.1
UM_COMP_OVERRIDES_HADGEM=$HADGEM_CORE/comp_overrides
UM_ANCIL_HIGEM=$HIGEM_CORE/ancil
UM_ANCIL_LOCAL=/S/data003/m0203/ancil
UM_DUMPS_HIGEM=$HIGEM_CORE/dumps
UM_DUMPS_LOCAL=/S/data003/m0203/dumps/test1

JD: The *LOCAL environment variables do not need to be agreed on by everyone, as these are, indeed, local. There is no need to be too prescriptive here. The point is that there should be agreed variables pointing to the core data locations, and any other variables referring to core data should be defined in terms of them.
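
To illustrate why defining variables in terms of a few top-level ones helps, here is a minimal Python sketch that expands the same job settings against two sets of top-level locations. It is not how the UMUI or the UM scripts process these variables; `resolve_settings` is a hypothetical helper, and the second site's paths are invented for the example.

```python
from string import Template


def resolve_settings(lines, platform_vars):
    """Expand NAME=value lines, letting later values refer to earlier names.

    platform_vars holds the per-platform top-level locations. Names that are
    never defined (such as $RUNID here) are left untouched.
    """
    env = dict(platform_vars)
    for line in lines:
        name, _, value = line.partition("=")
        env[name.strip()] = Template(value.strip()).safe_substitute(env)
    return env


# Settings that travel with the job, unchanged between platforms.
job_lines = [
    "UM_ANCIL_HIGEM=$HIGEM_CORE/ancil",
    "UM_DUMPS_HIGEM=$HIGEM_CORE/dumps",
    "UM_DATAW=$DATADIR/$RUNID",
]

# Top-level locations; only these differ between sites.
# site_b uses hypothetical paths, invented for this example.
site_a = {"HIGEM_CORE": "/S/data003/m0151/HiGEM_CoreData_um",
          "DATADIR": "/S/data003/m0203"}
site_b = {"HIGEM_CORE": "/work/shared/HiGEM_CoreData_um",
          "DATADIR": "/work/scratch/user1"}

for site in (site_a, site_b):
    print(resolve_settings(job_lines, site)["UM_ANCIL_HIGEM"])
```

Only the two top-level entries change between sites; every derived path follows from them.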

All data filepaths in the job should then use these environment variables. Hard-wired paths should only then be necessary for the UMUI hand edits and pre-STASH files, so users will have to visit the relevant panels and make the necessary changes.
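
A receiving group could list which settings still hold literal paths with something like the Python sketch below. The helper `hard_wired_paths` is hypothetical and works on a plain dictionary of names and values; it does not read UMUI basis files, whose format is not described here.

```python
def hard_wired_paths(settings, top_level=()):
    """Return the names whose values are literal paths.

    settings maps variable names to values. A value counts as hard-wired if
    it starts with "/" or "~" rather than with a $VARIABLE reference. Names
    listed in top_level are expected to be literal and are skipped.
    """
    return sorted(
        name
        for name, value in settings.items()
        if name not in top_level and value.startswith(("/", "~"))
    )


# A few entries taken from the example panel above.
settings = {
    "DATADIR": "/S/data003/m0203",
    "HIGEM_CORE": "/S/data003/m0151/HiGEM_CoreData_um",
    "UM_ANCIL_HIGEM": "$HIGEM_CORE/ancil",
    "UM_ANCIL_LOCAL": "/S/data003/m0203/ancil",
    "UM_SCRIPT_MODS_LOCAL1": "~m0203/scriptmods/vn6.1",
}

print(hard_wired_paths(settings, top_level=("DATADIR", "HIGEM_CORE")))
```

The names it reports are the local, non-core locations that the receiving user would need to review by hand.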

UNIX Scripts to Aid Job Sharing

The process of sharing jobs between groups using different computer platforms could be aided by some UNIX scripts. For example, John Donners (part of the UJCC group based in Japan) has written a script to package together the data files required by a UMUI job.

The script is generated when processing the job in the UMUI. Four *.in* files are written to the output directory, listing: 1) the environment variables that are used in the job, 2) files on the local system (where the UMUI resides), 3) files on the remote system (where the UM runs), and 4) ancillary files. The script uses these files to archive all files (hand edits, userSTASHmaster files, source mods, script mods, checksums of ancillary files, ...) that are necessary for the job. The archive can then be used as a backup of the job (about 200 kB when zipped) or to transfer it to another machine. When transferring a job, you should edit the *.out* files, which give the locations for all files on the new system. If some files are already available on the destination system, just point to these files. The script checks whether files are available, compares them to the archived ones and shows the differences if the files aren’t the same. If files are not yet available, the script copies them and creates any directories that are needed. The option -n can be used to check the installation process without actually writing anything. The final stage is to write a new basis file with the new locations and environment variables.
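
The check, compare and copy behaviour described above, including a dry run in the spirit of the -n option, might look roughly like this in Python. This is an independent sketch, not John Donners' script: the function names, the report strings and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_files(mapping, dry_run=False):
    """Install archived files at their new locations.

    mapping pairs each archived file with its destination path. Returns a
    dict of destination -> "same", "differs", "copied" or "would copy".
    Existing files are never overwritten; differences are only reported.
    With dry_run=True nothing is written.
    """
    report = {}
    for source, dest in mapping.items():
        source, dest = Path(source), Path(dest)
        if dest.exists():
            same = sha256_of(source) == sha256_of(dest)
            report[str(dest)] = "same" if same else "differs"
        elif dry_run:
            report[str(dest)] = "would copy"
        else:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, dest)
            report[str(dest)] = "copied"
    return report
```

Running it first with `dry_run=True` shows which files would be copied and which existing files differ from the archive, before anything on the destination system is touched.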

This script, and other helpful scripts, should be added to the shared data structure somewhere so that participating groups can make use of them, and contribute to their development.

Page last modified on April 25, 2006, at 10:09 AM