UM Job Portability

Author: Adam Clayton
Additional comments from:
Executive Summary

Effective collaboration between the Hadley Centre, CGAM/HiGEM and the UJCC (UK-Japan Climate Collaboration) group based at the Earth Simulator Center in Japan depends on our being able to exchange UM jobs and data easily, and to adapt them reliably to the various computer platforms. At present, the procedure for importing UM jobs is unnecessarily complicated and prone to error. With some coordination between groups the task could be made much easier.

The groups outside the Hadley Centre are currently installing a portable version of UM 6.1, a model version which will not be superseded at the Met Office for at least a year. Since the core HadGEM jobs at the Hadley Centre will shortly be upgraded to this version, this is the ideal time to get organised. The main suggestions put forward in this document are as follows:
Introduction

With the HiGEM and Earth Simulator projects now well under way, there is a growing need to share UM jobs between different groups using different computer platforms. The main players are as follows:

Group           Platform
Hadley Centre   Met Office SX-6/8
CGAM/HiGEM      HPCX/Newton
UJCC            Earth Simulator SX-6

Of course, one cannot pick up a job from someone working on a different platform and expect to be able to run it immediately on one’s own platform. The main reasons for this are as follows:
Thus, a certain amount of work is always required to translate a job between different platforms. This can be a laborious and error-prone task, but with some coordination I believe it could be made much simpler. All of the above groups will shortly be concentrating their efforts on UM vn6.1, which will not be superseded for at least a year. This is therefore the ideal time to get organised.

The following is an attempt to highlight the main issues and suggest some solutions. Once we have come to some agreement, we can write and distribute a guide on how to set up and swap UMUI jobs, and how to manage input data.

Vn6.1 UM/UMUI Installations

The Hadley Centre will shortly be moving their core HadGEM jobs to UM vn6.1. Meanwhile, the participating groups outside the Hadley Centre are currently installing a portable version of 6.1, which when complete will become the basis for all development work. These UM installations, in particular the script and program libraries, will be essentially identical. However, jobs run outside the Hadley Centre will need to include a series of portability mods to the scripts and the C/Fortran source code. The script mods are not really an issue, but it would make sense to add the small number of source mods required for portability to the core HadGEM jobs. This is a simple task, and it will remove any need to resolve clashes with the portability mods when importing jobs from the Hadley Centre.

The vn6.1 UMUI installations outside the Met Office will include some additional options for the job submission panel. This means that when jobs are swapped with the Hadley Centre, there will be some warning messages about unspecified variables. These warnings are harmless, so this is not really an issue. However, users outside the Hadley Centre have also found it necessary to increase the maximum number of Fortran mods, and also some of the maximum path lengths, for example for prestash files.
To avoid problems with job portability, the same changes should be made to the Met Office UMUI installation. Again, this is a trivial task, and in any case it will benefit users at the Met Office.

The UMUI installation at the Met Office is not fixed. Changes are sometimes made after the initial installation, for example to accommodate new platforms. If possible, this should be done in such a way that job portability is maintained. If this is not possible, the other groups should be informed of the changes so that local installations can be adjusted accordingly.

PLV: I did not understand this point. Could you make it clearer?

AC: I am thinking of the change made to the job submission panel at the Met Office to support the new SX-8. Of course, it was done in such a way that UKMO users would not be affected. However, for users of other platforms, the update broke their jobs. It’s just a matter of making the UMUI team at the Met Office aware that there are people outside the Met Office who wish to import jobs from the Met Office.

Organisation of Data

Each UM job requires a series of input files; for example:
In the core HadGEM jobs at the Hadley Centre, these data files are scattered throughout a number of individual user accounts, with virtually no protection against accidental deletion or modification. Even without considering the job portability issues, this lack of control is risky and needs to be addressed.

When a job is exported from one group to another, the necessary input data files need to be identified in some way, and if necessary transferred to the receiving platform. In practice, much of the input data will already be available on the receiving platform, but in theory one should check that data that appears to correspond really is identical. To do this, of course, it is necessary to transfer the data anyway, which is wasteful and time-consuming.

To overcome this problem, we clearly need to maintain copies of the core data on all participating platforms. In UMUI jobs, we can then point to the local locations of the data using top-level environment variables.

JD: Because of the use of common environment variables, there is no need for a common directory structure. In the long term, people will automatically start using the same directory structure (as it is easier). In the short term, this would only mean a lot of work (moving around files, adjusting all jobs, CHAOS…), without any real advantage.

PLV: I agree that this would be a little chaotic at the beginning, but I fear that it might get much more chaotic in the long term. While this plan is painful now, we are only just starting, so the investment now is small. If we decide to do this in a year’s time, THEN it would be a lot more chaotic.

AC: Firstly, I don’t think the plan is painful for the Hadley Centre, and the advantage for them of having the data better organised is considerable. Also, this doesn’t need to be done for old jobs; just for new standard 6.1 jobs.

My suggestion for the Hadley Centre is as follows:
My suggestion is that data structures are maintained for each of the main model branches; currently HadGEM, HiGEM and NUGEM (the name for the models that will be developed in Japan). These data structures should each be managed by the relevant group, with other groups either mirroring the data in its entirety, or copying the data only as required. So the initial suggestion is for the Hadley Centre to set up and maintain the core HadGEM data, CGAM to manage the core HiGEM data, and the Earth Simulator group to manage NUGEM data.

One way of organising this data might be as follows:

UM data:
  ancil/atmos/n96
  ancil/ocean
  comp_overrides/met_office
  comp_overrides/necsx
  comp_overrides/hpcx
  comp_overrides/es (Earth Simulator)
  dumps/atmos/n96
  dumps/ocean
  namelists
  spectral
  mods/scripts/vn6.1
  mods/source/vn6.1

UMUI data:
  hand_edits/vn6.1
  prestash/vn6.1

So, for example, at the Hadley Centre the core HadGEM data could be organised under directories called HadGEM_CoreData_um and HadGEM_CoreData_umui on the UM and UMUI platforms respectively. If there were a need to do some HiGEM runs at the Hadley Centre, the HiGEM data maintained at CGAM could be mirrored or partially mirrored under directories called HiGEM_CoreData_um and HiGEM_CoreData_umui.

One of the issues with porting jobs is deciding which mods and compile overrides are platform or location dependent, and making the necessary adjustments. This could be made easier by organising the relevant mods under suitably named subdirectories, as indicated above for the comp_overrides directory.

One of the rules for these data structures is that any file contained within them must not be altered or removed without a very good reason (such as large amounts of disk space being used by now redundant files). If a particular file is superseded by a revised version, therefore, the revised file must either be given a new name or be put in a different directory.
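The rule that files in a core data structure must not be altered or removed could be checked mechanically. The following is a minimal sketch only, assuming a checksum manifest is recorded for each data tree; the function names and the choice of SHA-256 are illustrative assumptions and are not part of any existing UM tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root, '/'-separated) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_or_missing(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List files from the old manifest that have since been altered or removed.

    Files present only in the new manifest are additions, which the rule allows.
    """
    return sorted(name for name, digest in old.items() if new.get(name) != digest)
```

A group mirroring another group's data could also compare manifests between platforms, which would show whether two copies are identical without transferring the data files themselves.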
For example, mods that are lodged under $UMDIR/vn6.2/mods at the Met Office, and therefore still under development, could be placed in directories named according to the RCS version number. For example, revision 1.3 of $UMDIR/vn6.2/mods/source/orh0602/orhf0602.mf77 could be copied to ‘mods/source/vn6.1/hadgem/orh0602/1.3/orhf0602.mf77’.

JD: For this, we really need an online CVS server (unfortunately, RCS can’t do this), accessible by http or ssh. This would ensure that all mods, and their complete history, are archived. The job file can be saved together with the versioning of all mods, ancillary files, compile overrides, … This ensures that at a later stage the job, together with all the correct files, can be retrieved. You could even be warned when there are new versions of mods that you use, and you can decide whether you want to use them or not. Again, this server should use the same common environment variables. Everybody should have read access, but write access should be limited to prevent chaos.

PLV: I agree here, completely. We need to look into the possibility of implementing such a server.

AC: Sounds good, but I expect this would take some time to set up and probably has political implications that would need to be ironed out in advance, so we should view it as a long-term goal. We should start with something simple and get going with it quickly.

JD: Integrate the CVS server into the UM/UMUI. Just point script mods/source mods to revision x.xx of some online CVS file. To integrate this with everything that is already implemented, I propose to use another script based on my job portability one ;) The UMUI generates a list of files (both for the local and remote machine) that need to be fetched from the CVS server. To prevent overloading of this server, the script checks whether the file is already available, and otherwise copies it from the CVS server to a local filesystem. All UM scripts just use the local copy of the CVS file.
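The revision-named directory layout and the "fetch only if not already available" behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not an existing script: the central archive is modelled as a plain directory (archive_root) rather than a real CVS server, and both function names are hypothetical.

```python
import shutil
from pathlib import Path, PurePosixPath


def versioned_mod_path(um_version: str, branch: str, mod_set: str,
                       revision: str, filename: str) -> PurePosixPath:
    """Build the revision-specific location of a source mod in the data structure."""
    return PurePosixPath("mods", "source", um_version, branch, mod_set, revision, filename)


def fetch_if_missing(archive_root: Path, local_root: Path, relative: PurePosixPath) -> Path:
    """Copy a file from the archive to the local filesystem unless a copy already exists.

    Returns the local path, which is what the UM scripts would then use.
    """
    target = local_root / relative
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(archive_root / relative, target)
    return target
```

For instance, versioned_mod_path("vn6.1", "hadgem", "orh0602", "1.3", "orhf0602.mf77") gives the path quoted in the example above. Because each revision lives under its own directory, an existing local copy never needs to be refreshed.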
The advantages are that a file (or, more specifically, a revision of a file) in a CVS archive cannot be moved, renamed or deleted. Users can also easily be notified of updates to the mods they use. As this system is voluntary, it doesn’t need to be political. Advantages: developers easily serve their mods to everyone; users can be certain that their mod is available, even in a year’s time; jobs are portable.

Job Setup

Assuming data has been organised as above, in UMUI jobs one could then set up pointers in “File and Directory Naming” to the common data. The remaining data that is not common will then be easy to identify.

JD: Once we have agreed on common environment variables, these can be moved from the UMUI to the general .umsetvars file. This ensures that the correct variables are always used, and that the common environment variables need not be edited in the UMUI when transferring a job. Directories can be moved around as long as the .umsetvars is updated accordingly. Also, the problem of path lengths is reduced, as these are defined outside the UMUI.

An example setup in the “File and Directory Naming” panel for a HiGEM job might then be something like:

DATADIR=/S/data003/m0203
JD: The *LOCAL environment variables do not need to be agreed on by everyone, as these are, indeed, local.

There is no need to be too prescriptive here. The point is that there should be agreed variables pointing to the core data locations, and any other variables referring to core data should be defined in terms of them. All data filepaths in the job should then use these environment variables. Hard-wired paths should then only be necessary for the UMUI hand edits and pre-STASH files, so users will have to visit the relevant panels and make the necessary changes.

UNIX Scripts to Aid Job Sharing

The process of sharing jobs between groups using different computer platforms could be aided by some UNIX scripts. For example, John Donners (part of the UJCC group based in Japan) has written a script to package together the data files required by a UMUI job.

The script is generated when processing the job in the UMUI. There are four *.in* files in the output directory, covering: 1) the environment variables that are used in the job, 2) files on the local system (where the UMUI resides), 3) files on the remote system (where the UM runs), and 4) ancillary files. The script uses these files to archive all files (hand edits, userSTASHmaster files, source mods, script mods, checksums of ancillary files, ..) that are necessary for the job. The archive can then be used as a backup of the job (about 200kb when zipped) or to transfer it to another machine.

When transferring a job, you should edit the *.out* files. These give the locations for all files on the new system. If some files are already available on the destination system, just point to these files. The script checks whether files are available, compares them to the archived ones and shows the differences if the files aren’t the same. The script copies the files and generates directories if files are not available yet. The option -n can be used to check the installation process without actually writing anything.
The final stage is to write a new basis file with the new locations and environment variables.

This script, and other helpful scripts, should be added to the shared data structure somewhere so that participating groups can make use of them, and contribute to their development.
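The check-compare-copy behaviour described for the installation step, including a dry run in the spirit of the -n option, might look like the sketch below. This is not John Donners’ script: the function name, the mapping argument (standing in for the edited *.out* files) and the status strings are all assumptions made for illustration, and the sketch only reports that files differ rather than showing the differences.

```python
import filecmp
import shutil
from pathlib import Path


def install_files(archive_dir: Path, mapping: dict[str, Path],
                  dry_run: bool = False) -> dict[str, str]:
    """Install archived files at their new locations and report what happened.

    mapping gives, for each archived file name, its destination on the new system.
    Existing files are never overwritten; they are only compared with the archive.
    """
    report = {}
    for name, destination in mapping.items():
        source = archive_dir / name
        if not destination.exists():
            if not dry_run:
                # Create any missing directories, then copy the archived file.
                destination.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(source, destination)
            report[name] = "would copy" if dry_run else "copied"
        elif filecmp.cmp(source, destination, shallow=False):
            report[name] = "identical"
        else:
            report[name] = "differs"
    return report
```

Running it first with dry_run=True gives a report of what would be copied and which existing files differ from the archive, without writing anything to the destination.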