The Run

Setting the time limits

Before you run the job, make sure the batch queue variables are set correctly for the run you are targeting. Currently this is done by manually editing $CASE.$MACH.run. Check the batch queue submission lines carefully and make sure that the account numbers, time limits, and stdout file names are appropriate. To choose a time limit, look for "Model Throughput" in the ccsm_timing.$CASE.$datestamp files, where you will find output like the following:


Overall Metrics:
Model Cost: 327.14 pe-hrs/simulated_year (scale= 0.50)
Model Throughput: 4.70 simulated_years/day

The model throughput is the estimated number of model years that you can run in a wallclock day. Based on this, you can make the most of the $CASE.$MACH.run queue limit by adjusting $STOP_OPTION and $STOP_N in env_run.xml. For example, say a model's throughput is 4.7 simulated_years/day and the run is on bluefire, where the maximum runtime limit is 6 hours. Then 4.7 model years/24 hours * 6 hours = about 1.17 years. On massively parallel computers there is always some variability in how long a job takes to run, and on some machines you may need to leave as much as 20% buffer time to guarantee that jobs finish reliably before the time limit. For that reason we will set our model to run only one model year per job. Continuing to assume that the run is on bluefire, in $CASE.bluefire.run set


#BSUB -W 6:00

and xmlchange should be invoked as follows in $CASEROOT:


./xmlchange -file env_run.xml -id STOP_OPTION   -val nyears
./xmlchange -file env_run.xml -id STOP_N        -val 1 
./xmlchange -file env_run.xml -id REST_OPTION   -val nyears
./xmlchange -file env_run.xml -id REST_N        -val 1 
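The arithmetic above can be reproduced for other throughputs and queue limits with a small helper. This is only a sketch: job_estimate is a hypothetical function, not part of the CESM scripts, and it assumes that awk is available. It takes the throughput (simulated_years/day), the wallclock limit (hours), and the model years per job, and reports how many years would fit in the limit, the expected wallclock time, and the buffer left over.

```shell
# Hypothetical helper (not part of CESM): estimate how a job fits its queue limit.
# $1 = throughput in simulated_years/day, $2 = wallclock limit in hours,
# $3 = model years to run per job.
job_estimate() {
    awk -v tp="$1" -v limit="$2" -v years="$3" 'BEGIN {
        max_years = tp / 24 * limit              # years that fit with no buffer at all
        hours     = years * 24 / tp              # expected wallclock time for the job
        buffer    = (limit - hours) / limit * 100  # unused share of the limit, in percent
        printf "max_years=%.3f hours=%.2f buffer=%.0f%%\n", max_years, hours, buffer
    }'
}

# The bluefire example from the text: 4.7 years/day, 6-hour limit, 1 year per job.
job_estimate 4.7 6 1
# prints: max_years=1.175 hours=5.11 buffer=15%
```

With these numbers, one model year per job leaves roughly 15% of the 6-hour limit as buffer. On a machine that needs the full 20%, a shorter run length would be the safer choice.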

Submitting the run

Once you have configured and built the model, submit $CASE.$MACH.run to your machine's batch queue system. For example, on yellowstone (NCAR's IBM) or on titan:


> # for yellowstone
> bsub < $CASE.yellowstone.run
> # for titan
> qsub $CASE.titan.run

You can see a complete example of how to run a case in the basic example.

When executed, the run script, $CASE.$MACH.run, will:

NOTE: This script does NOT execute the build script, $CASE.$MACH.build. Building CESM is now done only via an interactive call to the build script.

If the job runs to completion, you should have "SUCCESSFUL TERMINATION OF CPL7-CCSM" near the end of your STDOUT file. New data should be in the subdirectories under $DOUT_S_ROOT, or if you have long-term archiving turned on, it should be automatically moved to subdirectories under $DOUT_L_MSROOT.
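The success message can also be checked from the command line. The sketch below assumes only that the job's STDOUT file is a plain text file; run_succeeded is a hypothetical helper name, and the file name in the usage comment is a placeholder because STDOUT file names are set in your batch submission lines.

```shell
# Hypothetical helper: succeed if the given STDOUT file reports a clean shutdown.
run_succeeded() {
    grep -q "SUCCESSFUL TERMINATION OF CPL7-CCSM" "$1"
}

# Usage (replace my_job.stdout with your own STDOUT file name):
#   run_succeeded my_job.stdout && echo "run completed" || echo "check the logs"
```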

If the job failed, there are several places where you should look for information. Start with the STDOUT and STDERR file(s) in $CASEROOT. If you don't find an obvious error message there, the $RUNDIR/$model.log.$datestamp files will probably give you a hint. First check cpl.log.$datestamp, because it will often tell you when the model failed. Then check the rest of the component log files. Please see troubleshooting runtime errors for more information.
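When several runs have left log files behind, it helps to pick out the most recent one for a component first. The following is a minimal sketch, assuming the logs follow the $model.log.$datestamp naming described above and that file modification times reflect the order in which they were written. latest_log is a hypothetical helper, not a CESM utility.

```shell
# Hypothetical helper: print the most recently modified log for one component.
# $1 = run directory (e.g. $RUNDIR), $2 = component name (e.g. cpl).
latest_log() {
    ls -t "$1"/"$2".log.* 2>/dev/null | head -n 1
}

# Usage: show the end of the newest coupler log.
#   tail -n 50 "$(latest_log "$RUNDIR" cpl)"
```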

REMINDER: Once you have a successful first run, you must set CONTINUE_RUN to TRUE in env_run.xml before resubmitting, otherwise the job will not progress. You may also need to modify the RESUBMIT, STOP_OPTION, STOP_N, STOP_DATE, REST_OPTION, REST_N and/or REST_DATE variables in env_run.xml before resubmitting.
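Following the xmlchange syntax shown earlier in this section, the CONTINUE_RUN setting can be changed from $CASEROOT as sketched below. Only CONTINUE_RUN is set here; whether the other variables listed in the reminder also need changing depends on your run.

```shell
# Run from $CASEROOT after the first successful run, before resubmitting.
./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
```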

Restarting a run

Restart files are written by each active component (and some data components) at intervals dictated by the driver via the setting of the env_run.xml variables, $REST_OPTION and $REST_N. Restart files allow the model to stop and then start again with bit-for-bit exact capability (i.e. the model output is exactly the same as if it had never been stopped). The driver coordinates the writing of restart files as well as the time evolution of the model. All components receive restart and stop information from the driver and write restarts or stop as specified by the driver.

It is important to note that runs initialized as branch or hybrid runs require restart/initial files from previous model runs (as specified by the env_conf.xml variables $RUN_REFCASE and $RUN_REFDATE). The user must prestage these files to the case $RUNDIR (normally $EXEROOT/run) before the model run starts. This is normally done by copying the contents of the relevant $RUN_REFCASE/rest/$RUN_REFDATE.00000 directory.

Whenever a component writes a restart file, it also writes a restart pointer file of the form rpointer.$component. The restart pointer file contains the name of the restart file that the component just wrote. Upon a restart, each component reads its restart pointer file to determine the filename(s) to read in order to continue the model run. As examples, the following pointer files will be created for a component set using fully active model components.

If short-term archiving is turned on, then the model archives the component restart datasets and pointer files into $DOUT_S_ROOT/rest/yyyy-mm-dd-sssss, where yyyy-mm-dd-sssss is the model date at the time of the restart (see below for more details). If long-term archiving is also turned on, these restarts are then archived in $DOUT_L_MSROOT/rest. $DOUT_S_ROOT and $DOUT_L_MSROOT are set in env_run.xml, and can be changed at any time during the run.

Backing up to a previous restart

If a run encounters problems and crashes, the user will normally have to back up to a previous restart. Assuming that short-term archiving is enabled, the user needs to find the latest $DOUT_S_ROOT/rest/yyyy-mm-dd-sssss/ directory that was created and copy the contents of that directory into the run directory ($RUNDIR). The user can then continue the run and these restarts will be used. It is important to make sure the new rpointer.* files overwrite the rpointer.* files that were in $RUNDIR, or the job may not restart in the correct place.
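The copy step described above can be sketched as a small shell function. restore_latest_restart is a hypothetical name, not a CESM script. The sketch assumes that the restart directories contain only regular files, and it relies on the fixed-width yyyy-mm-dd-sssss naming so that a lexical sort is also a date sort.

```shell
# Hypothetical helper: copy the newest archived restart set into the run directory.
# $1 = restart archive root (e.g. $DOUT_S_ROOT/rest), $2 = run directory (e.g. $RUNDIR).
restore_latest_restart() {
    rest_root=$1
    rundir=$2
    # Names are fixed-width yyyy-mm-dd-sssss, so the last entry after sorting is the latest.
    latest=$(ls -d "$rest_root"/*-*-*-*/ 2>/dev/null | sort | tail -n 1)
    if [ -z "$latest" ]; then
        echo "no restart directories found under $rest_root" >&2
        return 1
    fi
    echo "restoring from $latest"
    # -f overwrites existing files, including stale rpointer.* files in the run directory.
    cp -f "$latest"* "$rundir"/
}

# Usage:
#   restore_latest_restart "$DOUT_S_ROOT/rest" "$RUNDIR"
```

To back up to an earlier restart rather than the latest one, copy from the chosen yyyy-mm-dd-sssss directory directly instead.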

Occasionally, when a run has problems restarting, it is because the rpointer files are out of sync with the restart files. The rpointer files are text files and can easily be edited to match the correct dates of the restart and history files. All the restart files should have the same date.
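A quick way to compare the dates is to print what each pointer file currently references. The sketch below assumes that the first line of each rpointer.* file holds the restart filename; show_rpointers is a hypothetical helper name.

```shell
# Hypothetical helper: print the first line of every rpointer.* file in a directory.
# $1 = run directory (e.g. $RUNDIR).
show_rpointers() {
    for p in "$1"/rpointer.*; do
        [ -f "$p" ] || continue          # skip the unexpanded pattern if nothing matches
        printf '%s: %s\n' "$(basename "$p")" "$(head -n 1 "$p")"
    done
}

# Usage:
#   show_rpointers "$RUNDIR"
```

If the dates embedded in the listed filenames differ from one component to another, the pointer files are out of sync and should be edited to match a single restart date.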

Data flow during a model run

When the job completes successfully, all component log files are copied to the directory specified by the env_run.xml variable $LOGDIR, which by default is set to $CASEROOT/logs. If the job aborts, the log files will NOT be copied out of the $RUNDIR directory.

Once a model run has completed successfully, the output data flow will depend on whether or not short-term archiving is enabled (as set by the env_run.xml variable, $DOUT_S). By default, short-term archiving will be done.

No archiving

If no short-term archiving is performed, then all model output data will remain in the run directory, as specified by the env_run.xml variable, $RUNDIR. Furthermore, if short-term archiving is disabled, then long-term archiving will not be allowed.

Short-term archiving

If short-term archiving is enabled, the component output files will be moved to the short-term archiving area on local disk, as specified by $DOUT_S_ROOT. The directory $DOUT_S_ROOT is normally set to $EXEROOT/../archive/$CASE and will contain the following directory structure:


atm/
    hist/ logs/
cpl/ 
    hist/ logs/
glc/ 
    logs/
ice/ 
    hist/ logs/
lnd/ 
    hist/ logs/
ocn/ 
    hist/ logs/
rest/ 
    yyyy-mm-dd-sssss/
    ....
    yyyy-mm-dd-sssss/

hist/ contains component history output for the run.

logs/ contains component log files created during the run. In addition to $LOGDIR, log files are also copied to the short-term archiving directory and therefore are available for long-term archiving.

rest/ contains a set of directories that each contain a consistent set of restart files, initial files and rpointer files. Each sub-directory has a unique name corresponding to the model year, month, day and seconds into the day when the files were created (e.g. 1852-01-01-00000/). The contents of any restart directory can be used to create a branch run or a hybrid run, or to back up to a previous restart date.

Long-term archiving

For long production runs that generate many gigabytes of data, the user normally wants to move the output data from local disk to a long-term archival location. Long-term archiving can be activated by setting $DOUT_L_MS to TRUE in env_run.xml. By default, the value of this variable is FALSE, and long-term archiving is disabled. If the value is set to TRUE, then the following additional variables are used: $DOUT_L_MSROOT, $DOUT_S_ROOT, and $DOUT_S (see variables for output data management).

As was mentioned above, if long-term archiving is enabled, files will be moved out of $DOUT_S_ROOT to $DOUT_L_MSROOT by $CASE.$MACH.l_archive, which is run as a separate batch job after the successful completion of a model run.