To run a case, the user submits the batch script $CASE.$MACH.run. In addition, the user needs to modify env_run.xml for their particular needs.
env_run.xml contains variables which may be modified during the course of a model run. These variables comprise coupler namelist settings for the model stop time, model restart frequency, coupler history frequency, and a flag that determines whether the run should be flagged as a continuation run. In general, the user only needs to set the variables $STOP_OPTION and $STOP_N. The other coupler settings will then be given consistent and reasonable default values. These default settings guarantee that restart files are produced at the end of the model run.
As mentioned above, variables that control runtime settings are found in env_run.xml. In the following, we focus on the handling of run control (e.g. length of run, continuing a run) and output data. We also give a more detailed description of CESM restarts.
Before a job is submitted to the batch system, the user first needs to check that the batch submission lines in $CASE.$MACH.run are appropriate. These lines should be checked and modified as needed for appropriate account numbers, time limits, and stdout/stderr file names. The user should then modify env_run.xml to set the key run-time settings, as outlined below:
CONTINUE_RUN
Determines if the run is a restart run. Set to FALSE when initializing a startup, branch or hybrid case. Set to TRUE when continuing a run. (logical)
When you first begin a branch, hybrid or startup run, CONTINUE_RUN must be set to FALSE. When you successfully run and get a restart file, you will need to change CONTINUE_RUN to TRUE for the remainder of your run. Details of performing model restarts are provided below.
RESUBMIT
Enables the model to automatically resubmit a new run. To get multiple runs, set RESUBMIT to a value greater than 0; each time the case completes, RESUBMIT is decremented and the case is resubmitted. The case stops automatically resubmitting when the RESUBMIT value reaches 0.
Long CESM runs can easily outstrip supercomputer queue time limits. For this reason, a case is usually run as a series of jobs, each restarting where the previous finished.
STOP_OPTION
Ending simulation time.
Valid values are: [none, never, nsteps, nstep, nseconds, nsecond, nminutes, nminute, nhours, nhour, ndays, nday, nmonths, nmonth, nyears, nyear, date, ifdays0, end] (char)
STOP_N
Provides a numerical count for $STOP_OPTION. (integer)
STOP_DATE
Alternative yyyymmdd date option, negative value implies off. (integer)
REST_OPTION
Restart write interval.
Valid values are: [none, never, nsteps, nstep, nseconds, nsecond, nminutes, nminute, nhours, nhour, ndays, nday, nmonths, nmonth, nyears, nyear, date, ifdays0, end] (char)
REST_N
Number of intervals to write a restart. (integer)
REST_DATE
Model date to write restart, yyyymmdd
By default:

STOP_OPTION = ndays
STOP_N = 5
STOP_DATE = -999
The default setting is only appropriate for initial testing. Before a longer run is started, update the stop times based on the case throughput and batch queue limits. For example, if the model runs 5 model years/day, set RESUBMIT=30, STOP_OPTION=nyears, and STOP_N=5. The model will then run in five-year increments, and stop after 30 submissions.
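As a concrete sketch, these settings can be applied from the case directory with the xmlchange utility; the flag syntax shown assumes the CESM1-era scripts and may differ in other releases:

```shell
# Hypothetical sketch: a 150-year run split into 30 five-year segments.
# Assumes the xmlchange utility shipped in the case directory; flag
# syntax may vary between CESM releases.
./xmlchange -file env_run.xml -id STOP_OPTION -val nyears
./xmlchange -file env_run.xml -id STOP_N -val 5
./xmlchange -file env_run.xml -id RESUBMIT -val 30

# Once the first segment completes successfully and restart files exist,
# the remaining segments must be flagged as continuation runs:
./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
```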
Each CESM component produces its own output datasets consisting of history, restart, and output log files. Component history files are in netCDF format, whereas component restart files may be in netCDF or binary format; restart files are used either to exactly restart the model or to serve as initial conditions for other model cases.
Most CESM component IO is handled by the Parallel IO library. This library is controlled by settings in the file env_run.xml. For each of the settings described below there is also a component-specific setting that can be used to override the CESM-wide default. A value of -99 in these component-specific variables indicates that the default CESM-wide setting will be used. If an out-of-range value is used for any component, the model will revert to a suitable default. The actual values used for each component are written to the cesm.log file near the beginning of the model run.
PIO_NUMTASKS
Sets the number of component tasks to be used in the interface to lower-level IO components; -1 indicates that the library will select a suitable default value. Using a larger number of IO tasks generally reduces the per-task memory requirements but may reduce IO performance because the data is divided into suboptimal block sizes. Note that OCN_PIO_NUMTASKS overrides the system-wide default value in most configurations.
ATM_PIO_NUMTASKS, CPL_PIO_NUMTASKS, GLC_PIO_NUMTASKS, ICE_PIO_NUMTASKS, LND_PIO_NUMTASKS, OCN_PIO_NUMTASKS
Component specific settings to override system wide defaults
PIO_ROOT
Sets the root task of the PIO subsystem relative to the root task of the model component. In most cases this value is set to 1, but due to limitations in the POP model OCN_PIO_ROOT must be set to 0.
ATM_PIO_ROOT, CPL_PIO_ROOT, GLC_PIO_ROOT, ICE_PIO_ROOT, LND_PIO_ROOT, OCN_PIO_ROOT
Component specific settings to override system wide defaults
PIO_STRIDE
Sets the offset between one IO task and the next for a given model component. Typically one would set either PIO_NUMTASKS or PIO_STRIDE and allow the model to set a reasonable default for the other variable.
ATM_PIO_STRIDE, CPL_PIO_STRIDE, GLC_PIO_STRIDE, ICE_PIO_STRIDE, LND_PIO_STRIDE, OCN_PIO_STRIDE
Component specific settings to override system wide defaults
PIO_TYPENAME
Sets the low-level library that PIO should interface with. Possible values (depending on the available backend libraries) are netcdf, pnetcdf, netcdf4p and netcdf4c. netcdf is the default and requires the model to be linked with a netCDF3 or netCDF4 library. pnetcdf requires the parallel-netCDF library and may provide better performance than netcdf depending on a number of factors, including platform and model decomposition. netcdf4p (parallel) and netcdf4c (compressed) require a netCDF4 library compiled with parallel HDF5. These options are not yet considered robust and should be used with caution.
ATM_PIO_TYPENAME, CPL_PIO_TYPENAME, GLC_PIO_TYPENAME, ICE_PIO_TYPENAME, LND_PIO_TYPENAME, OCN_PIO_TYPENAME
Component specific settings to override system wide defaults
PIO_DEBUG_LEVEL
Sets a flag for verbose debug output from the pio layer. Recommended for expert use only.
PIO_ASYNC_INTERFACE
This variable is reserved for future use and must currently be set to FALSE.
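For example, to try the parallel-netCDF backend with an explicit number of IO tasks while leaving a component on the CESM-wide default, one might set the variables as sketched below (whether pnetcdf actually helps depends on the platform and decomposition; the xmlchange flag syntax is the CESM1-era form):

```shell
# Hypothetical sketch: route PIO through the parallel netCDF backend
# with 16 IO tasks. Assumes the xmlchange utility in the case directory
# and a model build linked against the parallel-netCDF library.
./xmlchange -file env_run.xml -id PIO_TYPENAME -val pnetcdf
./xmlchange -file env_run.xml -id PIO_NUMTASKS -val 16
# -99 means "use the CESM-wide default" for a component-specific setting:
./xmlchange -file env_run.xml -id ATM_PIO_NUMTASKS -val -99
```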
Archiving is a phase of a CESM model run where the generated output data is moved from $RUNDIR (normally $EXEROOT/run) to a local disk area (short-term archiving) and subsequently to a long-term storage system (long-term archiving). It has no impact on the production run except to clean up disk space and help manage user quotas. Short and long-term archiving environment variables are set in the env_mach_specific file. Although short-term and long-term archiving are implemented independently in the scripts, there is a dependence between the two since the short-term archiver must be turned on in order for the long-term archiver to be activated. In env_run.xml, several variables control the behavior of short and long-term archiving. These are described below.
LOGDIR
Extra copies of the component log files will be saved here.
DOUT_S
If TRUE, short term archiving will be turned on.
DOUT_S_ROOT
Root directory for short term archiving. This directory must be visible to compute nodes.
DOUT_S_SAVE_INT_REST_FILES
If TRUE, perform short-term archiving on all interim restart files, not just those at the end of the run. By default, this value is FALSE. This setting is for expert users only and is not documented further in this guide.
DOUT_L_MS
If TRUE, perform long-term archiving on the output data.
DOUT_L_MSROOT
Root directory on mass store system for long-term data archives.
DOUT_L_HTAR
If TRUE, the long-term archiver will store history data in annual tar files.
DOUT_L_RCP
If TRUE, long-term archiving is done via the rcp command (this is not currently supported).
DOUT_L_RCP_ROOT
Root directory for long-term archiving on the remote rcp machine (not currently supported).
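Putting these together, enabling both archiving phases might look like the sketch below; remember that the long-term archiver requires short-term archiving to be on. The paths and flag syntax are illustrative only:

```shell
# Hypothetical sketch: turn on short- and long-term archiving.
# The long-term archiver is only activated when DOUT_S is TRUE.
# "mycase" and the paths are placeholder values, not defaults.
./xmlchange -file env_run.xml -id DOUT_S -val TRUE
./xmlchange -file env_run.xml -id DOUT_S_ROOT -val /ptmp/$USER/archive
./xmlchange -file env_run.xml -id DOUT_L_MS -val TRUE
./xmlchange -file env_run.xml -id DOUT_L_MSROOT -val /USER/csm/mycase
```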
Several important points need to be made about archiving:
By default, short-term archiving is enabled and long-term archiving is disabled.
All output data is initially written to $RUNDIR.
Unless a user explicitly turns off short-term archiving, files will be moved to $DOUT_S_ROOT at the end of a successful model run.
If long-term archiving is enabled, files will be moved to $DOUT_L_MSROOT by $CASE.$MACH.l_archive, which is run as a separate batch job after the successful completion of a model run.
Users should generally turn off short-term archiving when developing new CESM code.
If long-term archiving is not enabled, users must monitor quotas and usage in the $DOUT_S_ROOT/ directory and should manually clean up these areas on a frequent basis.
Standard output generated from each CESM component is saved in a "log file" for each component in $RUNDIR. Each time the model is run, a single coordinated datestamp is incorporated in the filenames of all output log files associated with that run. This common datestamp is generated by the run script and is of the form YYMMDD-hhmmss, where YYMMDD are the year, month, and day and hhmmss are the hour, minute, and second that the run began (e.g. ocn.log.040526-082714). Log files are also copied to a user-specified directory using the variable $LOGDIR in env_run.xml. The default is a 'logs' subdirectory beneath the case directory.
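The datestamp convention can be illustrated with a short shell sketch; the actual run script may construct it differently:

```shell
# Illustrative sketch of the YYMMDD-hhmmss datestamp shared by all log
# files from one run (not the actual CESM run-script implementation).
stamp=$(date +%y%m%d-%H%M%S)
for comp in atm lnd ice ocn cpl; do
    echo "${comp}.log.${stamp}"
done
```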
By default, each component also periodically writes history files (usually monthly) in netCDF format, as well as netCDF or binary restart files, in the $RUNDIR directory. The history and log files are controlled independently by each component. History output control (i.e. output fields and frequency) is set in the Buildconf/$component.buildnml.csh files.
The raw history data does not lend itself well to easy time-series analysis. For example, CAM writes one or more large netCDF history file(s) at each requested output period. While this behavior is optimal for model execution, it makes it difficult to analyze time series of individual variables without having to access the entire data volume. Thus, the raw data from major model integrations is usually postprocessed into more user-friendly configurations, such as single files containing long time-series of each output field, and made available to the community.
As an example, consider the following settings:

DOUT_S = TRUE
DOUT_S_ROOT = /ptmp/$user/archive
DOUT_L_MS = TRUE
DOUT_L_MSROOT = /USER/csm/b40.B2000
With these settings, the run will automatically submit the $CASE.$MACH.l_archive script to the queue upon completion to archive the data. The system is not bulletproof, and the user will want to verify at regular intervals that the archived data is complete, particularly during long-running jobs.