Chapter 5. Running a case

Table of Contents
Customizing runtime settings
Load balancing a case
The Run
Testing a case

To run a case, the user must submit the batch script $CASE.$MACH.run. The user must also modify env_run.xml to suit their particular needs.

env_run.xml contains variables which may be modified during the course of a model run. These variables comprise coupler namelist settings for the model stop time, model restart frequency, coupler history frequency and a flag that marks the run as a continuation run. In general, the user needs only to set the variables $STOP_OPTION and $STOP_N. The other coupler settings will then be given consistent and reasonable default values. These default settings guarantee that restart files are produced at the end of the model run.
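As a rough illustration of what "setting $STOP_OPTION and $STOP_N" amounts to, the sketch below updates two values in a small XML snippet. The `<entry id="..." value="..."/>` layout, the `SAMPLE_ENV_RUN` text, and the helper names `set_entries` and `get_entries` are assumptions made for this example, not a description of the real env_run.xml schema or of the CCSM scripts.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt: the <entry id=... value=...> layout is an assumption
# made for this sketch, not a statement about the real env_run.xml schema.
SAMPLE_ENV_RUN = """<config_definition>
  <entry id="STOP_OPTION" value="ndays" />
  <entry id="STOP_N" value="5" />
  <entry id="CONTINUE_RUN" value="FALSE" />
</config_definition>"""


def set_entries(xml_text, updates):
    """Return xml_text with the value attribute of each named entry replaced."""
    root = ET.fromstring(xml_text)
    remaining = dict(updates)
    for entry in root.iter("entry"):
        name = entry.get("id")
        if name in remaining:
            entry.set("value", str(remaining.pop(name)))
    if remaining:
        # Refuse to silently ignore a misspelled variable name.
        raise KeyError("unknown variables: " + ", ".join(sorted(remaining)))
    return ET.tostring(root, encoding="unicode")


def get_entries(xml_text):
    """Return a {variable name: value string} mapping for every entry."""
    root = ET.fromstring(xml_text)
    return {e.get("id"): e.get("value") for e in root.iter("entry")}


updated = set_entries(SAMPLE_ENV_RUN, {"STOP_OPTION": "nmonths", "STOP_N": 6})
print(get_entries(updated))
```

Only the two stop variables are touched; every other entry keeps its existing value, which mirrors the advice that the remaining coupler settings can be left at their defaults.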

Customizing runtime settings

As mentioned above, variables that control runtime settings are found in env_run.xml. In the following, we focus on the handling of run control (e.g. length of run, continuing a run) and output data. We also give a more detailed description of CCSM restarts.

Setting run control variables

Before a job is submitted to the batch system, the user must first check that the batch submission lines in $CASE.$MACH.run are appropriate. These lines should be checked and modified as needed for account numbers, time limits, and stdout/stderr file names. The user should then modify env_run.xml to set the key run-time settings, as outlined below:

CONTINUE_RUN

Determines if the run is a restart run. Set to FALSE when initializing a startup, branch or hybrid case. Set to TRUE when continuing a run. (logical)

When you first begin a branch, hybrid or startup run, CONTINUE_RUN must be set to FALSE. Once a run has completed successfully and produced a restart file, you will need to change CONTINUE_RUN to TRUE for the remainder of your run. Details of performing model restarts are provided below.

RESUBMIT

Enables the model to automatically resubmit a new run. To chain multiple runs, set RESUBMIT to a value greater than 0; RESUBMIT will then be decremented and the case resubmitted. Automatic resubmission stops when the RESUBMIT value reaches 0.

Long CCSM runs can easily outstrip supercomputer queue time limits. For this reason, a case is usually run as a series of jobs, each restarting where the previous finished.

STOP_OPTION

Ending simulation time.

Valid values are: [none, never, nsteps, nstep, nseconds, nsecond, nminutes, nminute, nhours, nhour, ndays, nday, nmonths, nmonth, nyears, nyear, date, ifdays0, end] (char)

STOP_N

Provides a numerical count for $STOP_OPTION. (integer)

STOP_DATE

Alternative yyyymmdd date option, negative value implies off. (integer)

REST_OPTION

Restart write interval.

Valid values are: [none, never, nsteps, nstep, nseconds, nsecond, nminutes, nminute, nhours, nhour, ndays, nday, nmonths, nmonth, nyears, nyear, date, ifdays0, end] (char)

REST_N

Number of intervals to write a restart. (integer)

REST_DATE

Model date to write restart, yyyymmdd

By default,


STOP_OPTION = ndays
STOP_N = 5
STOP_DATE = -999

The default setting is only appropriate for initial testing. Before a longer run is started, update the stop times based on the case throughput and batch queue limits. For example, if the model runs 5 model years/day, set RESUBMIT=30, STOP_OPTION=nyears, and STOP_N=5. The model will then run in five-year increments, and stop after 30 resubmissions.
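The arithmetic behind such a choice can be sketched as follows. The functions `plan_run` and `jobs_in_chain`, the `safety` margin, and the 24-hour queue limit are hypothetical names and values introduced for illustration. The job count follows the RESUBMIT description above (one initial submission plus one resubmission per decrement) and is an interpretation, not a statement about the run script's internals.

```python
import math


def plan_run(total_years, years_per_wallclock_day, queue_limit_hours, safety=0.8):
    """Suggest STOP_OPTION, STOP_N and RESUBMIT for a run of total_years.

    safety is a hypothetical margin: the fraction of the queue limit
    that a single job is allowed to use.
    """
    # Model years that fit in one batch job.
    years_per_job = years_per_wallclock_day * queue_limit_hours / 24.0 * safety
    stop_n = int(math.floor(years_per_job))
    if stop_n < 1:
        raise ValueError("less than one model year fits in a job; "
                         "use a finer STOP_OPTION such as nmonths or ndays")
    jobs = math.ceil(total_years / stop_n)
    # The first job is submitted by hand; the rest are resubmissions.
    return {"STOP_OPTION": "nyears", "STOP_N": stop_n, "RESUBMIT": jobs - 1}


def jobs_in_chain(resubmit):
    """Count jobs: the first submission plus one per decrement of RESUBMIT."""
    jobs = 1
    while resubmit > 0:
        resubmit -= 1
        jobs += 1
    return jobs


# Reproduce the settings quoted in the text: 5 model years per wall-clock day,
# a full 24-hour job (no safety margin), 155 model years in total.
plan = plan_run(155, 5, 24, safety=1.0)
print(plan)
print(jobs_in_chain(plan["RESUBMIT"]) * plan["STOP_N"], "model years")
```

Under these assumptions, RESUBMIT=30 with STOP_N=5 yields 31 jobs in total. A margin below 1.0 trades a few extra submissions for a lower risk of a job being killed at the queue limit before it writes its restart files.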

Output data

Each CCSM component produces its own output datasets consisting of history, restart and output log files. Component history files are in netCDF format whereas component restart files may be in netCDF or binary format and are used to either exactly restart the model or to serve as initial conditions for other model cases.

Archiving is a phase of a CCSM model run in which the generated output data is moved from $RUNDIR (normally $EXEROOT/run) to a local disk area (short-term archiving) and subsequently to a long-term storage system (long-term archiving). It has no impact on the production run except to clean up disk space and help manage user quotas. Short-term and long-term archiving environment variables are set in the env_mach_specific file. Although the two are implemented independently in the scripts, the long-term archiver can only be activated when the short-term archiver is turned on. In env_run.xml, several variables control the behavior of short-term and long-term archiving. These are described below.

LOGDIR

Extra copies of the component log files will be saved here.

DOUT_S

If TRUE, short-term archiving will be turned on.

DOUT_S_ROOT

Root directory for short-term archiving. This directory must be visible to compute nodes.

DOUT_S_SAVE_INT_REST_FILES

If TRUE, perform short-term archiving on all interim restart files, not just those at the end of the run. By default, this value is FALSE. This setting is for expert users only and is not documented further in this guide.

DOUT_L_MS

If TRUE, perform long-term archiving on the output data.

DOUT_L_MSROOT

Root directory on mass store system for long-term data archives.

DOUT_L_HTAR

If TRUE, the long-term archiver will store history data in annual tar files.

DOUT_L_RCP

If TRUE, long-term archiving is done via the rcp command (this is not currently supported).

DOUT_L_RCP_ROOT

Root directory for long-term archiving on the rcp remote machine (this is not currently supported).

Several important points need to be made about archiving:

  • By default, short-term archiving is enabled and long-term archiving is disabled.

  • All output data is initially written to $RUNDIR.

  • Unless a user explicitly turns off short-term archiving, files will be moved to $DOUT_S_ROOT at the end of a successful model run.

  • If long-term archiving is enabled, files will be moved to $DOUT_L_MSROOT by $CASE.$MACH.l_archive, which is run as a separate batch job after the successful completion of a model run.

  • Users should generally turn off short-term archiving when developing new CCSM code.

  • If long-term archiving is not enabled, users must monitor quotas and usage in the $DOUT_S_ROOT/ directory and should manually clean up these areas on a frequent basis.
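The points above can be summarized as a small decision function. This is a sketch under stated assumptions: `archive_destinations` is a hypothetical helper, and raising an error when long-term archiving is requested without short-term archiving is a design choice of this example. The text only says that the short-term archiver must be on for the long-term archiver to be activated.

```python
def archive_destinations(dout_s=True, dout_l_ms=False,
                         rundir="$RUNDIR",
                         dout_s_root="$DOUT_S_ROOT",
                         dout_l_msroot="$DOUT_L_MSROOT"):
    """Return, in order, the locations that output data passes through.

    The defaults mirror the documented defaults: short-term archiving on,
    long-term archiving off.
    """
    if dout_l_ms and not dout_s:
        # Choice made for this sketch: treat the unsupported combination
        # as an error instead of silently skipping long-term archiving.
        raise ValueError("DOUT_L_MS=TRUE requires DOUT_S=TRUE")
    path = [rundir]                  # all output is first written here
    if dout_s:
        path.append(dout_s_root)     # moved after a successful run
    if dout_l_ms:
        path.append(dout_l_msroot)   # moved by the separate l_archive job
    return path


print(archive_destinations())
print(archive_destinations(dout_s=True, dout_l_ms=True))
print(archive_destinations(dout_s=False))
```

The last element of the returned list is where data accumulates, and therefore the area whose quota needs watching.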

Standard output generated from each CCSM component is saved in a "log file" for each component in $RUNDIR. Each time the model is run, a single coordinated datestamp is incorporated in the filenames of all output log files associated with that run. This common datestamp is generated by the run script and is of the form YYMMDD-hhmmss, where YYMMDD is the year, month and day and hhmmss is the hour, minute and second at which the run began (e.g. ocn.log.040526-082714). Log files are also copied to a user-specified directory using the variable $LOGDIR in env_run.xml. The default is a 'logs' subdirectory beneath the case directory.
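The datestamp convention can be sketched in a few lines. The helpers `log_name` and `parse_log_name` are hypothetical and only mirror the YYMMDD-hhmmss pattern described above; they are not part of the run script. Note that a two-digit year is ambiguous, and Python's `%y` resolves 04 to 2004.

```python
from datetime import datetime

STAMP_FORMAT = "%y%m%d-%H%M%S"  # YYMMDD-hhmmss


def log_name(component, started):
    """Build a log file name such as ocn.log.040526-082714."""
    return f"{component}.log.{started.strftime(STAMP_FORMAT)}"


def parse_log_name(name):
    """Split a log file name into (component, run start time)."""
    component, _, stamp = name.partition(".log.")
    return component, datetime.strptime(stamp, STAMP_FORMAT)


start = datetime(2004, 5, 26, 8, 27, 14)
# Every component log from one run shares the same stamp.
for comp in ("atm", "ocn", "cpl"):
    print(log_name(comp, start))
print(parse_log_name("ocn.log.040526-082714"))
```

Because the stamp is shared across components, grouping log files by the parsed start time recovers the set of logs belonging to one run.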

By default, each component also periodically writes history files (usually monthly) in netCDF format and also writes netCDF or binary restart files in the $RUNDIR directory. The history and log files are controlled independently by each component. History output control (i.e. output fields and frequency) is set in the Buildconf/$component.buildnml.csh files.

The raw history data does not lend itself well to easy time-series analysis. For example, CAM writes one or more large netCDF history file(s) at each requested output period. While this behavior is optimal for model execution, it makes it difficult to analyze time series of individual variables without having to access the entire data volume. Thus, the raw data from major model integrations is usually postprocessed into more user-friendly configurations, such as single files containing long time-series of each output field, and made available to the community.
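The reorganization from "one file per output period" to "one series per field" can be illustrated without any netCDF library. In this sketch each history file is stood in for by a plain dictionary; `to_time_series`, the field names, and the numeric values are placeholders chosen for the example, not model output.

```python
def to_time_series(time_slices):
    """Regroup per-period records into one chronological series per field.

    time_slices: iterable of (date_label, {field_name: value}) pairs,
    one pair per history output period.
    Returns {field_name: [(date_label, value), ...]}.
    """
    series = {}
    # Sort by date label only, so the result is chronological.
    for date, fields in sorted(time_slices, key=lambda s: s[0]):
        for name, value in fields.items():
            series.setdefault(name, []).append((date, value))
    return series


# Placeholder values standing in for three monthly history files.
monthly = [
    ("0001-02", {"TS": 287.4, "PRECT": 3.0}),
    ("0001-01", {"TS": 287.1, "PRECT": 2.9}),
    ("0001-03", {"TS": 288.0, "PRECT": 3.2}),
]
series = to_time_series(monthly)
print(series["TS"])
```

After regrouping, analyzing one field touches only that field's series, which is the motivation given above for postprocessing the raw history files.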

For example, with the following settings


DOUT_S = TRUE
DOUT_S_ROOT = /ptmp/$user/archive
DOUT_L_MS = TRUE
DOUT_L_MSROOT = /USER/csm/b40.B2000

the run will, upon completion, automatically submit $CASE.$MACH.l_archive to the queue to archive the data. The system is not bulletproof, and the user will want to verify at regular intervals that the archived data is complete, particularly during long-running jobs.