Subsections

5.1 Start up
5.2 What is the ccsm_joe file?
5.3 How does auto-RESUBMIT work?
- 5.3.1 Runaway jobs
5.4 Batch queuing challenges
5.5 Modifying source code
5.6 CCSM Data Management
5.7 What is harvesting doing?
5.8 Monitoring the integration
5.9 Data processing
5.10 Comparing output to NCAR controls

5 Running the CCSM

This section describes some of the practical aspects of setting up a short test run or a longer production run. The best way to get scripts configured for either a test or production run is to use the CCSM GUI, described elsewhere. This section provides some guidance on how to set up a configuration manually.

5.1 Start up

There are a number of steps required to configure a CCSM run manually.

5.1.1 Running a simple test case of the fully coupled model on the NCAR IBM machine, blackforest

Assume that the new case name is mytest1 and the root directory where the CCSM model is located is /home/$LOGNAME/ccsm2.0/.

Get a copy of the CCSM source code tarfile and untar it in the directory /home/$LOGNAME. Then create a new case directory for your experiment (in this example, ``mytest1'') and copy the contents of the test.a1 directory into it as follows:
```
    cd /home/$LOGNAME/ccsm2.0/scripts 
    mkdir mytest1 
    cp test.a1/* mytest1/ 
    cd /home/$LOGNAME/ccsm2.0/scripts/mytest1
    mv test.a1.run mytest1.run
    mv test.a1.har mytest1.har
```

Modify the main run script by editing mytest1.run and making the following changes:

      change "job_name" to mytest1
      change "setenv CASE" to "mytest1"
      change "setenv CASESTR" to a useful string
      change "setenv CSMROOT" to /home/$LOGNAME/ccsm2.0
      change "setenv CSMDATA" to the local path to the inputdata directory
      change "setenv EXEROOT" to the local directory where the model will run
      change "setenv ARCROOT" to the local directory for archiving model output
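
The setenv edits above can also be scripted. The following is a minimal sketch, written in POSIX sh rather than the csh used by the CCSM scripts, purely for illustration. It assumes each setting appears in the run script as a line of the form "setenv NAME value", which is an assumption about the file layout. The helper name set_run_var and the demonstration file contents are hypothetical.

```shell
# set_run_var FILE NAME VALUE
# Rewrites the line "setenv NAME ..." in FILE so that it reads
# "setenv NAME VALUE". Assumes one setting per line (see lead-in).
set_run_var() {
    file=$1; name=$2; value=$3
    tmp="$file.tmp.$$"
    # '|' is the sed delimiter so path values containing '/' need no
    # escaping. VALUE must not contain '|', '&' or '\'.
    sed "s|^\([[:space:]]*setenv[[:space:]][[:space:]]*$name\)[[:space:]].*|\1 $value|" "$file" > "$tmp" &&
        mv "$tmp" "$file"
}

# Demonstration on a throwaway file, not on a real run script.
demo_dir=$(mktemp -d)
printf 'setenv CASE     test.a1\nsetenv CASESTR  "old"\nsetenv CSMROOT  /old/path\n' > "$demo_dir/mytest1.run"
set_run_var "$demo_dir/mytest1.run" CASE mytest1
set_run_var "$demo_dir/mytest1.run" CSMROOT '/home/$LOGNAME/ccsm2.0'
cat "$demo_dir/mytest1.run"
```

Because the pattern requires whitespace after the variable name, changing CASE leaves the CASESTR line untouched. The job_name change is a batch directive rather than a setenv line and is not covered by this sketch.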

Run the script by submitting /home/$LOGNAME/ccsm2.0/scripts/mytest1/mytest1.run to the batch queue with the following command:
```
    llsubmit mytest1.run
```

5.1.2 Changing the configuration

To modify the model configuration, several changes must be made to the main run script. Assume a model case name of mytest2 for this case. Go through the section above using mytest2 as the case name. Then in addition to the changes in the section above, modify the main script, mytest2.run, as follows:

Modify SETUPS:
- change "set SETUPS" to reflect the configuration you'd like.
- atm can be set to atm, datm, or latm
- lnd can be set to lnd or dlnd
- ice can be set to ice or dice
- ocn can be set to ocn or docn
Modify NTASKS/NTHRDS:
change "set NTASKS" and "set NTHRDS" to reflect the configuration used in SETUPS above. NTASKS represents the number of MPI tasks, and NTHRDS represents the number of OpenMP threads per MPI task.
The total number of processors used for each CCSM component is NTASKS*NTHRDS. The NTASKS and NTHRDS array elements are aligned consistently with the MODELS and SETUPS arrays.
- The data models datm, docn, dice, dlnd, and latm should have NTASKS=1, NTHRDS=1
- atm can run all MPI, OPENMP, or combined MPI/OPENMP
- lnd can run all MPI, OPENMP, or combined MPI/OPENMP
- ice runs MPI only, therefore NTHRDS must be set to 1
- ocn runs MPI only, therefore NTHRDS must be set to 1
- cpl runs OPENMP only, therefore NTASKS must be set to 1
Modify the batch queue information to be consistent with the configuration specified by NTASKS and NTHRDS:
- For the IBM, this is ``task_geometry''
- For the SGI, this is on the ``QSUB -l'' line
- For the CPQ, this is on the ``PBS nodes/ppn'' line.
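
The processor arithmetic can be checked before editing the batch queue information. The sketch below is POSIX sh for illustration only (the CCSM scripts are csh). The function name proc_count, the NAME:NTASKS:NTHRDS argument format, and the task and thread counts in the example call are all hypothetical. It applies the NTASKS*NTHRDS rule and warns on the per-component constraints listed above.

```shell
# proc_count NAME:NTASKS:NTHRDS ...
# Prints NTASKS*NTHRDS for each component and the overall total.
proc_count() {
    total=0
    for spec in "$@"; do
        name=${spec%%:*}          # text before the first ':'
        rest=${spec#*:}
        tasks=${rest%%:*}
        threads=${rest#*:}
        # Constraints taken from the list in the text.
        case $name in
            ice|ocn)
                [ "$threads" -eq 1 ] ||
                    echo "warning: $name is MPI only, NTHRDS must be 1" >&2 ;;
            cpl)
                [ "$tasks" -eq 1 ] ||
                    echo "warning: cpl is OpenMP only, NTASKS must be 1" >&2 ;;
            datm|docn|dice|dlnd|latm)
                [ "$tasks" -eq 1 ] && [ "$threads" -eq 1 ] ||
                    echo "warning: $name should have NTASKS=1, NTHRDS=1" >&2 ;;
        esac
        n=$(($tasks * $threads))
        echo "$name: $n"
        total=$(($total + $n))
    done
    echo "total: $total"
}

# Illustrative numbers only, not a recommended layout.
proc_count atm:8:4 lnd:3:4 ice:16:1 ocn:40:1 cpl:1:4
```

The total printed on the last line is the number of processors the batch request has to provide for that layout.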

As an aside, there is a standard shorthand naming convention for configuration setups. In particular,

A = datm,dlnd,docn,dice,cpl
B = atm,lnd,ocn,ice,cpl
C = datm,dlnd,ocn,dice,cpl
D = datm,dlnd,docn,ice,cpl
F = atm,lnd,docn,ice (prescribed ice mode),cpl
G = latm,dlnd,ocn,ice,cpl
H = atm,dlnd,docn,dice,cpl
I = datm,lnd,docn,dice,cpl
K = atm,lnd,docn,dice,cpl
M = latm,dlnd,docn,ice (mixed layer ocean mode), cpl

This case naming convention is used in the GUI, and the tested configurations are summarized in the section describing supported configurations.

5.1.3 Changing the RUNTYPE

RUNTYPE is set in the main run script and determines how the CCSM run is to be started. RUNTYPE can be "startup", "continue", "branch", or "hybrid".

startup represents a new case started from some model specific initial files or state.

continue is a continuation of a case and guarantees exact restart capability.

branch is like a continuation run, but the CASE name is changed. A set of restart files is used to start a branch run, and exact restart is guaranteed if the source code and model input have not changed. Typically, however, the purpose of a branch run is to evaluate the impact of a modification of the model. The exact restart guarantee ensures that any differences between the original run and the branch run are due to the modification introduced in the branch run.

hybrid is a startup from atmosphere and land initial condition files and ocean and ice restart files. The model is started as if it were a startup case with a 1 day lag in the start of the ocean model. An exact restart is not guaranteed, because the atmosphere and land models use ``initial'' files and because of the ocean time lag.

For startup and continue runs, no specific script changes are required. For continue runs, the appropriate restart files must be placed in the executable directories. The scripts attempt to do this for the case automatically by searching directories for restart files. For branch runs, the environment variables REFCASE and REFDATE must be set to the name of the previous case and the date in the REFCASE run from which the new branch run will start. Those restart files must be available to the new case. For hybrid runs, the environment variables REFCASE, REFDATE, and BASEDATE must be set in the main script. These represent the prior case, the date in that case, and the new starting date for this case. Hybrid runs allow a change to both case and starting date.

The RUNTYPES startup, branch, and hybrid are all used to start a new case. The RUNTYPE continue is used to continue any run, no matter what initial RUNTYPE was used.
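
The variable requirements for each RUNTYPE can be summarized as a small check. This is a sketch in POSIX sh for illustration (the CCSM scripts are csh and do not necessarily perform this check). The function names required_vars and check_runtype are hypothetical, and the REFDATE value in the example is a placeholder, not a statement about the real date format.

```shell
# required_vars RUNTYPE
# Prints the extra variables the text says must be set for RUNTYPE
# (nothing for startup and continue).
required_vars() {
    case $1 in
        startup|continue) echo "" ;;
        branch)           echo "REFCASE REFDATE" ;;
        hybrid)           echo "REFCASE REFDATE BASEDATE" ;;
        *) echo "unknown RUNTYPE: $1" >&2; return 1 ;;
    esac
}

# check_runtype RUNTYPE
# Reports any required variable that is unset or empty.
check_runtype() {
    vars=$(required_vars "$1") || return 1
    missing=""
    for v in $vars; do
        eval "val=\${$v:-}"       # indirect lookup of the variable named in $v
        [ -n "$val" ] || missing="$missing $v"
    done
    if [ -n "$missing" ]; then
        echo "RUNTYPE=$1 requires:$missing"
        return 1
    fi
    echo "RUNTYPE=$1 ok"
}

REFCASE=mytest1
REFDATE=placeholder               # placeholder value, format not specified here
check_runtype branch
```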

5.1.4 Running on a new machine

The CCSM release is targeted at the NCAR IBM SP. Due to the distinct nature of individual computer sites, many aspects of the CCSM scripts may need to be changed when running on a new machine. These include

- batch queue commands at the top of the main script
- values of environment variables OS, SITE, MACH, and ARCH in the main script
- the names of CCSM paths set by environment variables in the main script
- the check of the submission status of the harvester script near the end of the main script
- the harvester script
- interaction with the local mass storage system via the tools scripts ccsm_msread, ccsm_msmkdir, and ccsm_mswrite
- machine dependent modifications to the models/bld/Macros.* file

General guidance about specific aspects of the scripts that need to be changed in order to run on different machines can be found in the scripts/tools/test.a1.mods.* files. The * represents the hostname of various machines where the model has already been run. Each of the files contains the machine-specific modifications that are required.

5.1.5 Getting into production

There is a specific procedure that should be used to start a production run. The model defaults are set to run a short test case. The general procedures for starting a production run are as follows.

Set up a startup, branch, or hybrid case
For branch or hybrid
- mkdir $ARCROOT/restart
- place all required restart/initial files in $ARCROOT/restart
- make a backup copy of all these files, because this directory will be deleted at the end of the first run:
```
             mkdir $ARCROOT/restart.bu 
             cp $ARCROOT/restart/* $ARCROOT/restart.bu
```
Activate (uncomment) the following lines in the main script
- #$TOOLS/ccsm_getrestart
- #$SCRIPTS/ccsm_archive
Modify the namelist variables in the scripts/$CASE/cpl.setup.csh script to carry out a production length run. A good starting point is to set the following namelist variables to make a three month run.
```
    diag_n      = 10    
    stop_option = 'nmonths' 
    stop_n      = 3 
    info_bcheck = 0
```
Submit the first run
Verify the first run completed as expected, including archiving.
Once the first run has completed, set RUNTYPE to "continue"
Activate (uncomment) the following line in the main script to allow the submission of the harvester at the end of the run.
```
   #  if ($num < 1) llsubmit $CASE.har
```
Optionally re-comment the ccsm_getrestart line in the main script
```
    #$TOOLS/ccsm_getrestart
```
Submit the continue run
Verify the continue run completed as expected including archiving and harvesting.
Place a positive integer in the file scripts/$CASE/RESUBMIT. The file RESUBMIT controls automatic resubmission. The integer counter in this file is decremented each time the model is resubmitted until it reaches zero, at which point the model will not resubmit itself.
```
         echo 10 >! $SCRIPTS/RESUBMIT
```
Resubmit the continue run
Monitor the run including archiving and harvesting

5.2 What is the ccsm_joe file?

The ccsm_joe (Job Operating Environment) file is created by the main CCSM run script every time it executes. It can be thought of as a case specific resource file for CCSM. The ccsm_joe file contains a summary of some of the CCSM-specific environment variables. It is a useful debugging tool as it summarizes many of the important variables in the latest run.

It is also used by other scripts to determine the case specific variables. The CCSM harvester ($CASE.har), archiver (ccsm_archive), and a number of the scripts in the CCSM tools directory ($SCRIPTS/tools) use this file to set case specific variables.

5.3 How does auto-RESUBMIT work?

The CCSM model will automatically resubmit the main run script if the integer parameter in the file $SCRIPTS/RESUBMIT is greater than zero. In section (h) of the main run script, the script captures the value in the RESUBMIT file, decides whether to resubmit and if so, decrements the integer in the RESUBMIT file. When the resubmit parameter decrements to zero, auto resubmission will stop. This provides flexibility to the user to prevent runaway jobs. Initially, users should set the RESUBMIT integer to some moderate value, like 2. Once confidence has been established, the integer in RESUBMIT can be increased. The default value is zero, so the script will not RESUBMIT automatically by default.

When using RESUBMIT, RUNTYPE should usually be continue; otherwise the same initial period of the run will likely be run over and over.
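
The counter logic described above can be sketched as follows. This is POSIX sh for illustration, not the code from section (h) of the main run script (which is csh). The function name maybe_resubmit is hypothetical, and the demonstration uses a temporary file rather than a real RESUBMIT file.

```shell
# maybe_resubmit FILE
# Reads the integer in FILE. If it is greater than zero, decrements it,
# writes it back, and returns 0 (resubmit); otherwise returns 1.
maybe_resubmit() {
    count=$(cat "$1" 2>/dev/null) || count=0
    case $count in
        ''|*[!0-9]*) count=0 ;;   # empty or non-numeric: treat as zero
    esac
    if [ "$count" -gt 0 ]; then
        echo $(($count - 1)) > "$1"
        return 0
    fi
    return 1
}

demo_dir=$(mktemp -d)
echo 2 > "$demo_dir/RESUBMIT"
while maybe_resubmit "$demo_dir/RESUBMIT"; do
    # A real script would call its batch submit command here
    # (llsubmit on the NCAR IBM); this sketch only prints.
    echo "would resubmit; counter now $(cat "$demo_dir/RESUBMIT")"
done
echo "counter is zero, no resubmission"
```

Starting from 2, the loop body runs twice and then stops, which is the behavior that bounds a chain of resubmitted jobs.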

5.3.1 Runaway jobs

Occasionally, the model will stop prematurely (due to a hardware problem or a model problem). If this happens, often the scripts will continue to resubmit themselves and this will usually lead to a ``runaway'' situation. To stop runaway jobs, first set the resubmit parameter to zero in the RESUBMIT file, then try to kill currently active runaway jobs.

5.4 Batch queuing challenges

Generally, each machine at each site has a unique batch queueing environment. This is less true for IBM machines, which seem to use LoadLeveler nearly universally. Even with LoadLeveler, however, different sites have different configurations. In particular, users may need to change the network parameter and class. With any queueing system, users need to be aware of queue names, processor resources, and time limits.

CCSM has been run under LoadLeveler, NQS, LSF, and PBS on various machines. The $TOOLS directory contains files named test.a1.mods.*. These files provide guidance about both hardware and batch setups for specific machines.

The default values in the CCSM script provide batch commands for LoadLeveler, NQS, and PBS batch systems. The file $TOOLS/test.a1.mods.nirvana provides guidance for an LSF queueing system.

Users need to be careful to implement appropriate changes to the CCSM scripts for their particular environment.

NOTE: Both the main run script and the harvester script are set up to run in batch environments and both need to be modified when using non-NCAR batch queuing systems.

5.5 Modifying source code

Source code is provided with the CCSM release. It is not unreasonable to make code changes directly in the files within the CCSM models directories and then rebuild and run the model.

However, the recommended approach is to create directories named src.atm, src.lnd, src.ice, src.ocn, and src.cpl in the directory where the case-specific scripts reside, $SCRIPTS/$CASE. Users can then copy source code files from the main CCSM models directories into these component-specific directories and then modify those files directly. By default, the CCSM scripts and build environment will use files in the src.* directories before files in the models source directories. In other words, source code in src.* directories has higher priority than source code in the models directories.

There are a number of benefits to doing this. First, it preserves the release source code, so that the modified files can be compared against it later and users can backtrack to the generic release source if desired. Second, it allows users to configure multiple experiments, each with its own unique source code changes, without requiring copies of the entire source tree. For instance, a sensitivity study could be carried out on a parameter specified in source code by copying just one source file into the src.* directory for a number of cases and then modifying that file in each src.* directory.
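
The priority rule can be pictured as a simple lookup: use the copy in the case's src.* directory if one exists, otherwise fall back to the release source. The sketch below is POSIX sh for illustration and is not how the CCSM build environment implements the rule. The function name resolve_source, the directory layout, and the file names are hypothetical.

```shell
# resolve_source OVERRIDE_DIR RELEASE_DIR FILE
# Prints the path that would be used for FILE: the copy in OVERRIDE_DIR
# (for example a case's src.atm) if present, else the one in RELEASE_DIR.
resolve_source() {
    override_dir=$1; release_dir=$2; file=$3
    if [ -f "$override_dir/$file" ]; then
        echo "$override_dir/$file"
    elif [ -f "$release_dir/$file" ]; then
        echo "$release_dir/$file"
    else
        echo "not found: $file" >&2
        return 1
    fi
}

# Demonstration in a temporary tree with made-up file names.
demo=$(mktemp -d)
mkdir -p "$demo/models/atm" "$demo/mytest1/src.atm"
echo "release"  > "$demo/models/atm/physics.F90"
echo "release"  > "$demo/models/atm/dynamics.F90"
echo "modified" > "$demo/mytest1/src.atm/physics.F90"
resolve_source "$demo/mytest1/src.atm" "$demo/models/atm" physics.F90
resolve_source "$demo/mytest1/src.atm" "$demo/models/atm" dynamics.F90
```

Only physics.F90 resolves to the case directory, so the release tree stays untouched and a second case can carry a different modification of the same file.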

5.6 CCSM Data Management

This section briefly outlines the data flow for a production run. All binaries, model input, and model output exist at some point in specific directories in the executable area. The scripts generate the model binaries and get model input. The model then executes, and model output (restart files, history files, and log files) is written directly into the executable directories. Once the model has completed execution, the script ccsm_archive is run and the model output is moved into the archival directory. Subdirectories are created in the archive directory for each component, along with a restart and a restart.tars directory. The most recent set of restart files, including pointer files, is copied into the restart subdirectory of the archive directory. That directory is tarred up and copied to the restart.tars directory.

Once model output has been archived, the harvester executes, checking files in the archive directory. If a file has already been copied successfully to the mass store, the copy of the file in the archive directory is removed. If the file has not yet been copied to the mass store, the harvester performs that operation. In effect, the harvester must pass through the model output files twice in the archive area before removing them. On the first pass, the file is copied to the mass store, and on the second pass the existence of the mass store copy is verified and the local copy of the file is removed.

5.7 What is harvesting doing?

The CCSM harvester is a separate job that saves the model output to a separate file storage device, generally known as a mass storage system. This could be the NCAR mass store, an HPSS, or similar. The overall CCSM2.0 data flow is described separately. Part of that data flow is the movement of model output from the executable directories to a temporary archive directory by a script called ccsm_archive. The harvester is then used to move data from the archive directory to the mass storage system.

The harvester script can be found in the main $SCRIPTS/$CASE directory and is usually named $CASE.har. The harvester can be run interactively or in batch mode. The harvester provided should be considered a template for customizing the harvesting process. Each user and each site may have different needs for harvesting. The harvesting script largely takes advantage of scripts in the $SCRIPTS/tools directory.

Overall, the harvester works as follows:

- Loop through the archive directories for each component
- Loop through each file in the component archive directory
- Verify whether the file is already on the mass store and is bit-for-bit identical with the file sitting on disk. This is done by reading the file off the mass store into a local temporary file, checkmss, and then comparing the two files with the Unix compare command ``cmp''
- If the local file and the mass store file are identical, remove the local file
- If the local file and the mass store file are different, write the local file to the mass store
- For some subset of files, copy the files to another area or another machine. This is done only if the local variable copyfiles_to_othersite is true
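
The verify, remove, or write steps above can be sketched as follows. This is POSIX sh for illustration, not the $CASE.har script. A plain local directory stands in for the mass storage system so that the sketch runs anywhere, which means the read-back into checkmss is replaced by a direct cmp against the stored copy. The function name harvest_pass and the file names are hypothetical, and the optional copy to another site is omitted.

```shell
# harvest_pass ARCHIVE STORE
# ARCHIVE holds one subdirectory per component. STORE is a local
# directory standing in for the mass store (an assumption of this sketch).
harvest_pass() {
    archive=$1; store=$2
    for dir in "$archive"/*/; do
        [ -d "$dir" ] || continue
        comp=$(basename "$dir")
        mkdir -p "$store/$comp"
        for f in "$dir"*; do
            [ -f "$f" ] || continue
            name=$(basename "$f")
            if [ -f "$store/$comp/$name" ] && cmp -s "$f" "$store/$comp/$name"; then
                rm -f "$f"                     # identical copy verified: remove local file
                echo "removed $comp/$name"
            else
                cp "$f" "$store/$comp/$name"   # missing or different: write to the store
                echo "stored $comp/$name"
            fi
        done
    done
}

demo=$(mktemp -d)
mkdir -p "$demo/archive/atm" "$demo/store"
echo "history data" > "$demo/archive/atm/hist.dat"
harvest_pass "$demo/archive" "$demo/store"   # first pass copies the file
harvest_pass "$demo/archive" "$demo/store"   # second pass verifies and removes it
```

The two calls show why a file needs two harvester passes before it leaves the archive directory: the first writes it to the store and the second confirms the stored copy before deleting the local one.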

The harvester takes advantage of the local copy of ccsm_joe to determine many of the case specific variables. A number of harvester variables are set in the main run script including $LMSOUT, $MACOUT, and $RFSOUT.

csm@ucar.edu