Before you can run the job, make sure the batch queue variables are set correctly for the specific run being targeted. Currently this is done by manually editing $CASE.$MACH.run. Carefully check the batch queue submission lines and make sure that the account numbers, time limits, and stdout file names are appropriate. Searching the ccsm_timing.$CASE.$datestamp files for "Model Throughput" will turn up output like the following:
Overall Metrics:
  Model Cost: 327.14 pe-hrs/simulated_year (scale= 0.50)
  Model Throughput: 4.70 simulated_years/day
The model throughput is the estimated number of model years that you can run in a wallclock day. Based on this, you can maximize the queue time limit in $CASE.$MACH.run and set $STOP_OPTION and $STOP_N in env_run.xml to match. For example, say a model's throughput is 4.7 simulated_years/day. On bluefire, the maximum runtime limit is 6 hours, so at most 4.7 model years/24 hours * 6 hours = 1.17 model years fit in one job. On massively parallel computers, there is always some variability in how long a job takes to run. On some machines, you may need to leave as much as 20% buffer time to guarantee that jobs finish reliably before the time limit. For that reason, we will set the model to run only one model year per job, which takes about 5.1 hours at this throughput. Continuing to assume that the run is on bluefire, in $CASE.bluefire.run set
#BSUB -W 6:00
and xmlchange should be invoked as follows in $CASEROOT:
./xmlchange -file env_run.xml -id STOP_OPTION -val nyears
./xmlchange -file env_run.xml -id STOP_N -val 1
./xmlchange -file env_run.xml -id REST_OPTION -val nyears
./xmlchange -file env_run.xml -id REST_N -val 1
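The arithmetic behind the choice of one model year per job can be sketched in a few lines of shell. This is only an illustration: the variable names throughput, limit_hours and stop_n are placeholders chosen for this sketch, not CCSM settings, and the numbers are the ones used in the bluefire example above.

```shell
#!/bin/sh
# Inputs are the illustrative values from the text, not CCSM variables.
throughput=4.7    # simulated_years/day, from the ccsm_timing file
limit_hours=6     # wallclock limit of the batch queue
stop_n=1          # model years per job being considered

# Most model years that fit in the time limit with no safety margin.
max_years=$(awk -v t="$throughput" -v h="$limit_hours" \
    'BEGIN { printf "%.3f", t / 24 * h }')

# Wallclock hours needed for stop_n model years.
job_hours=$(awk -v t="$throughput" -v n="$stop_n" \
    'BEGIN { printf "%.2f", 24 / t * n }')

# Share of the time limit left over as buffer, in percent.
buffer_pct=$(awk -v t="$throughput" -v h="$limit_hours" -v n="$stop_n" \
    'BEGIN { printf "%.1f", (h - 24 / t * n) / h * 100 }')

echo "max model years per job: $max_years"
echo "hours needed for $stop_n year(s): $job_hours"
echo "buffer left in the time limit: $buffer_pct%"
```

With these values, one model year per job leaves roughly 15% of the 6-hour limit unused. Whether that margin is sufficient depends on the machine.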
Once you have configured and built the model, submit $CASE.$MACH.run to your machine's batch queue system. For example, on NCAR's IBM system, bluefire, or on a Cray such as jaguar:
> # for BLUEFIRE
> bsub < $CASE.bluefire.run
> # for CRAY
> qsub $CASE.jaguar.run
You can see a complete example of how to run a case in the basic example.
When executed, the run script, $CASE.$MACH.run, will:
Check that the env files are consistent with the configure and build scripts.
Verify that required input data is present on local disk (in $DIN_LOC_ROOT_CSMDATA) and run the buildnml script for each component.
Run the CCSM model. Put timing information in $LOGDIR/timing. If $LOGDIR is set, copy log files back to $LOGDIR.
If $DOUT_S is TRUE, component history, log, diagnostic, and restart files will be moved from $RUNDIR to the short-term archive directory, $DOUT_S_ROOT.
If $DOUT_L_MS is TRUE, the long-term archiver, $CASE.$MACH.l_archive, will be submitted to the batch queue upon successful completion of the run.
If $RESUBMIT > 0, resubmit $CASE.$MACH.run.
NOTE: This script does NOT execute the build script, $CASE.$MACH.build. Building CCSM is now done only via an interactive call to the build script.
If the job runs to completion, you should have "SUCCESSFUL TERMINATION OF CPL7-CCSM" near the end of your STDOUT file. New data should be in the subdirectories under $DOUT_S_ROOT, or if you have long-term archiving turned on, it should be automatically moved to subdirectories under $DOUT_L_MSROOT.
If the job failed, there are several places where you should look for information. Start with the STDOUT and STDERR file(s) in $CASEROOT. If you don't find an obvious error message there, the $RUNDIR/$model.log.$datestamp files will probably give you a hint. First check cpl.log.$datestamp, because it will often tell you when the model failed. Then check the rest of the component log files. Please see troubleshooting runtime errors for more information.
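The checks described above can be sketched as two small shell helpers. This is an illustration under stated assumptions, not part of the CCSM scripts: the function names run_succeeded, print_tail and show_log_tails are hypothetical, the number of trailing lines shown is arbitrary, and the name of the STDOUT file depends on your batch system.

```shell
#!/bin/sh
# Succeeds if the given STDOUT file contains the success message.
# Usage: run_succeeded <stdout-file>
run_succeeded() {
    grep -q "SUCCESSFUL TERMINATION OF CPL7-CCSM" "$1"
}

# Print a header and the last few lines of one log file, if it exists.
print_tail() {
    if [ -f "$1" ]; then
        echo "==> $1"
        tail -n 5 "$1"
    fi
}

# Show the end of the coupler log first, then the other component logs.
# Usage: show_log_tails <rundir>
show_log_tails() {
    for log in "$1"/cpl.log.*; do
        print_tail "$log"
    done
    for log in "$1"/*.log.*; do
        case "$log" in
            */cpl.log.*) ;;              # already shown above
            *) print_tail "$log" ;;
        esac
    done
}
```

A possible use, with a placeholder STDOUT file name, is run_succeeded "$CASEROOT/<stdout file>" || show_log_tails "$RUNDIR".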
REMINDER: Once you have a successful first run, you must set CONTINUE_RUN to TRUE in env_run.xml before resubmitting, otherwise the job will not progress. You may also need to modify the RESUBMIT, STOP_OPTION, STOP_N, STOP_DATE, REST_OPTION, REST_N and/or REST_DATE variables in env_run.xml before resubmitting.
Restart files are written by each active component (and some data components) at intervals dictated by the driver via the setting of the env_run.xml variables, $REST_OPTION and $REST_N. Restart files allow the model to stop and then start again with bit-for-bit exact capability (i.e. the model output is exactly the same as if it had never been stopped). The driver coordinates the writing of restart files as well as the time evolution of the model. All components receive restart and stop information from the driver and write restarts or stop as specified by the driver.
It is important to note that runs that are initialized as branch or hybrid runs will require restart/initial files from previous model runs (as specified by the env_conf.xml variables, $RUN_REFCASE and $RUN_REFDATE). These required files must be prestaged by the user to the case $RUNDIR (normally $EXEROOT/run) before the model run starts. This is normally done by just copying the contents of the relevant $RUN_REFCASE/rest/$RUN_REFDATE.00000 directory.
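The prestaging step amounts to a guarded copy, which could be sketched as follows. The function name prestage_refcase is hypothetical, and both arguments are paths you supply yourself; the sketch does not know where your reference case is stored.

```shell
#!/bin/sh
# Copy the contents of a reference case's restart directory into the
# run directory of the new case.
# Usage: prestage_refcase <restart-dir> <rundir>
prestage_refcase() {
    src=$1
    dest=$2
    if [ ! -d "$src" ]; then
        echo "restart directory not found: $src" >&2
        return 1
    fi
    # Create the run directory if needed, then copy every file,
    # preserving timestamps.
    mkdir -p "$dest" && cp -p "$src"/* "$dest"/
}
```

Checking for the source directory first gives a clear message when the reference date or case name is mistyped, rather than a failure later at model startup.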
Whenever a component writes a restart file, it also writes a restart pointer file of the form rpointer.$component. The restart pointer file contains the name of the restart file that was just written by the component. Upon a restart, each component reads its restart pointer file to determine the filename(s) to read in order to continue the model run. As an example, the following pointer files will be created for a component set using fully active model components.
rpointer.atm
rpointer.drv
rpointer.ice
rpointer.lnd
rpointer.ocn.ovf
rpointer.ocn.restart
If short-term archiving is turned on, then the model archives the component restart datasets and pointer files into $DOUT_S_ROOT/rest/yyyy-mm-dd-sssss, where yyyy-mm-dd-sssss is the model date at the time of the restart (see below for more details). If long-term archiving is enabled, these restarts are then archived in $DOUT_L_MSROOT/rest. $DOUT_S_ROOT and $DOUT_L_MSROOT are set in env_run.xml, and can be changed at any time during the run.
If a run encounters problems and crashes, the user will normally have to back up to a previous restart. Assuming that short-term archiving is enabled, the user needs to find the latest $DOUT_S_ROOT/rest/yyyy-mm-dd-sssss/ directory that was created and copy the contents of that directory into their run directory ($RUNDIR). The user can then continue the run and these restarts will be used. It is important to make sure the new rpointer.* files overwrite the rpointer.* files that were in $RUNDIR, or the job may not restart in the correct place.
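The recovery procedure above could be sketched as follows. The function names latest_restart_dir and restore_latest_restart are hypothetical. The sketch assumes that the restart directory names all use a four-digit year, so that sorting them as text also sorts them by model date.

```shell
#!/bin/sh
# Print the most recent restart directory under <dout_s_root>/rest.
# The shell expands the glob in sorted order, so the last match is
# the latest date for names of the form yyyy-mm-dd-sssss.
latest_restart_dir() {
    latest=""
    for d in "$1"/rest/*/; do
        [ -d "$d" ] && latest=${d%/}
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Copy the latest restart set into the run directory.
# Usage: restore_latest_restart <dout_s_root> <rundir>
restore_latest_restart() {
    src=$(latest_restart_dir "$1") || {
        echo "no restart directories under $1/rest" >&2
        return 1
    }
    if [ ! -d "$2" ]; then
        echo "run directory not found: $2" >&2
        return 1
    fi
    # cp overwrites existing files, including stale rpointer.* files.
    cp -p "$src"/* "$2"/ && echo "restored restart files from $src"
}
```

Because cp replaces files of the same name, the rpointer.* files in the run directory end up pointing at the restored restart set.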
Occasionally, when a run has problems restarting, it is because the rpointer files are out of sync with the restart files. The rpointer files are text files and can easily be edited to match the correct dates of the restart and history files. All the restart files should have the same date.
All component log files are copied to the directory specified by the env_run.xml variable $LOGDIR, which by default is set to $CASEROOT/logs. Log files are copied to this location when the job completes successfully. If the job aborts, the log files will NOT be copied out of the $RUNDIR directory.
Once a model run has completed successfully, the output data flow will depend on whether or not short-term archiving is enabled (as set by the env_run.xml variable, $DOUT_S). By default, short-term archiving will be done.
If no short-term archiving is performed, then all model output data will remain in the run directory, as specified by the env_run.xml variable, $RUNDIR. Furthermore, if short-term archiving is disabled, then long-term archiving will not be allowed.
If short-term archiving is enabled, the component output files will be moved to the short-term archiving area on local disk, as specified by $DOUT_S_ROOT. The directory $DOUT_S_ROOT is normally set to $EXEROOT/../archive/$CASE and will contain the following directory structure:
atm/
  hist/ logs/
cpl/
  hist/ logs/
glc/
  logs/
ice/
  hist/ logs/
lnd/
  hist/ logs/
ocn/
  hist/ logs/
rest/
  yyyy-mm-dd-sssss/
  ....
  yyyy-mm-dd-sssss/
hist/ contains component history output for the run.
logs/ contains component log files created during the run. In addition to $LOGDIR, log files are also copied to the short-term archiving directory and are therefore available for long-term archiving.
rest/ contains a set of subdirectories, each of which contains a consistent set of restart files, initial files and rpointer files. Each subdirectory has a unique name corresponding to the model year, month, day and seconds into the day at which the files were created (e.g. 1852-01-01-00000/). The contents of any restart directory can be used to create a branch run or a hybrid run, or to back up to a previous restart date.
For long production runs that generate many gigabytes of data, the user normally wants to move the output data from local disk to a long-term archival location.
Long-term archiving can be activated by setting $DOUT_L_MS to TRUE in env_run.xml. By default, the value of this variable is FALSE, and long-term archiving is disabled. If the value is set to TRUE, then the following additional variables are also relevant: $DOUT_L_MSROOT, $DOUT_S_ROOT and $DOUT_S (see variables for output data management).
As mentioned above, if long-term archiving is enabled, files will be moved out of $DOUT_S_ROOT to $DOUT_L_MSROOT by $CASE.$MACH.l_archive, which is run as a separate batch job after the successful completion of a model run.