NCAR CSM Flux Coupler, version 4.0 -- User's Guide

4   NQS Script

In this section we examine a CSM coupled model batch job shell script. The basic purpose of this script is simple: first the Coupler and the component model codes are prepared for execution, and then the Coupler and all component models are executed simultaneously as separate processes. Upon startup, the component models first establish a connection to the Coupler, and then data begins to flow between them. All component models continue to advance forward in time until they are signaled to stop by the Coupler; when the coupled system stops is determined solely by the Coupler.

The example Network Queuing System (NQS) shell script shown below is a complete script, suitable for actual use. This script first prepares each component for execution and then executes the components simultaneously. While the specifics shown below are a recommended method for running the coupled system on NCAR's Cray computers, variations on this method will also work, some of which may be more appropriate for different situations.

The coupled model batch job script below is a top-level NQS batch job script which in turn calls five separate subscripts called "setup scripts." These subscripts, "cpl.setup.csh," "atm.setup.csh," "ice.setup.csh," "lnd.setup.csh," and "ocn.setup.csh," are responsible for building their respective executable codes and gathering any required input data files (see also § 5). The setup scripts receive input data from the parent NQS script by means of several environment variables. After calling the setup subscripts, the parent NQS script executes the Coupler and all component models simultaneously as background processes. The parent script waits for these background processes to complete; when they do, the coupled run has finished, and the NQS script saves the stdout output and terminates. Following the example is a more detailed explanation of what is being done in the various parts of the NQS script.

    #=======================================================================
    #  This is a CSM coupled model NQS batch job script
    #=======================================================================

    #-----------------------------------------------------------------------
    # (a) Set NQS options
    #-----------------------------------------------------------------------
    # QSUB -q reg                           # select batch queue
    # QSUB -lT   5:10:00 -lt   5:00:00      # set CPU time limits
    # QSUB -lM 35Mw  -lm 20Mw               # set memory limits
    # QSUB -mb -me -eo                      # combine stderr & stdout
    # QSUB -s  /bin/csh                     # select shell script
    # QSUB                                  # no more QSUB options
    #-----------------------------------------------------------------------

    #-----------------------------------------------------------------------
    # (b) Set env variables available to model setup scripts (below)   
    #     CASE, CASESTR, RUNTYPE, ARCH  , MAXCPUS, MSGLIB , SSD,
    #     MSS , MSSDIR , MSSRPD , MSSPWD, RPTDIR , CSMROOT, CSMSHARE
    #-----------------------------------------------------------------------

    setenv CASE     test.00         # case name
    setenv CASESTR '(CSM test)'     # short descriptive text string
    setenv RUNTYPE  initial         # run type
    setenv ARCH     C90             # machine architecture
    setenv MAXCPUS  8               # max number of CPUs available
    setenv MSGLIB   MPI             # message passing library
    setenv SSD      TRUE            # SSD is available?
    setenv MSS      FALSE           # MSS is available?
    setenv MSSDIR   /DOE/csm/$CASE  # MSS directory path name
    setenv MSSRPD   365             # MSS file retention period
    setenv MSSPWD   'rosebud'       # MSS file password
    setenv RPTDIR   $HOME           # where restart pointer files are saved
    setenv CSMROOT  /fs/cgd/csm     # root directory for model codes
    setenv CSMSHARE $CSMROOT/share  # directory of "shared" code

    #-----------------------------------------------------------------------
    # (c) Specify input, output, and execution directories        
    #     o the component model setup.csh scripts must be in $NQSDIR   
    #     o stdout & stderr output is saved in $LOGDIR                 
    #-----------------------------------------------------------------------

    set EXEDIR = /tmp/doe/$CASE               # model runs here
    set NQSDIR = ~doe/$CASE                   # model setup scripts are here
    set LOGDIR = $NQSDIR                      # stdout output goes here

    #-----------------------------------------------------------------------
    # (d) Prepare component models for execution                       
    #     o create execution directories: $EXEDIR/[atm|cpl|lnd|ice|ocn]
    #     o execute the component model setup scripts located in $NQSDIR   
    #       (these scripts have access to env variables set above)
    #     o see the man page for details about the Cray assign function  
    #-----------------------------------------------------------------------

    setenv FILENV ./.assign    # allow separate .assign files for each model
    set LID = "`date +%y%m%d-%H%M%S`"  # create a unique log file ID

    mkdir -p $EXEDIR 
    cd       $EXEDIR
    foreach model (cpl atm ice lnd ocn)
      mkdir $EXEDIR/$model 
      cd    $EXEDIR/$model  
      $NQSDIR/$model.setup.csh >&! $model.log.$LID || exit 2
    end

    #-----------------------------------------------------------------------
    # (e) Execute models simultaneously (allocating CPUs)       
    #-----------------------------------------------------------------------

    rm $TMPDIR/*.$LOGNAME.*  # rm any old msg pipe files

    ja $TMPDIR/jacct         # start Cray job accounting
    cd $EXEDIR/cpl
    env NCPUS=$MAXCPUS cpl -l 0 -n 5 -t 600 < cpl.parm >>&! cpl.log.$LID &
    cd $EXEDIR/atm
    env NCPUS=$MAXCPUS atm -l 1      -t 600 < atm.parm >>&! atm.log.$LID &
    cd $EXEDIR/ocn
    env NCPUS=$MAXCPUS ocn -l 2      -t 600 < ocn.parm >>&! ocn.log.$LID &
    cd $EXEDIR/ice
    env NCPUS=2        ice -l 3      -t 600 < ice.parm >>&! ice.log.$LID &
    cd $EXEDIR/lnd
    env NCPUS=$MAXCPUS lnd -l 4      -t 600 < lnd.parm >>&! lnd.log.$LID &
    wait                     # wait for all background model processes to finish
    ja -tsclh $TMPDIR/jacct  # end   Cray job accounting

    #-----------------------------------------------------------------------
    # (f) save model output (stdout & stderr) to $LOGDIR               
    #-----------------------------------------------------------------------

    cd $EXEDIR
    gzip -v */*.log.$LID
    cp   -p */*.log.$LID* $LOGDIR

    #=======================================================================
    # End of nqs shell script                                          
    #=======================================================================

Items (a) through (f) in the above job script are now reviewed.

(a) Set NQS options

The Network Queuing System (NQS) is a special facility available under UNICOS (Cray's version of UNIX). NQS is a batch job facility: you submit your job to NQS and NQS runs your job. The QSUB options set here select the queue, the maximum memory required, the maximum time required, the combining of the NQS script's stdout and stderr, and the shell to interpret the NQS script. See the qsub man page on a UNICOS computer for more information.
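As a hypothetical illustration, submitting and monitoring the job from a UNICOS command line might look like the fragment below. The script file name "run.csm.nqs" is an assumption, and these commands exist only on a system running NQS, so this is a sketch rather than something that can be run elsewhere.

```shell
# Hypothetical NQS session; "run.csm.nqs" is an assumed file name.
qsub run.csm.nqs        # submit the batch job script to NQS
qstat -a                # check the status of queued and running requests
```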

(b) Set environment variables for use in the model setup scripts

While the execution of the Coupler and the component models is done explicitly in this NQS script, the building of executables from source code, the gathering of necessary input data files, and any other pre-execution preparation are deferred to the subscripts "cpl.setup.csh," "atm.setup.csh," "ice.setup.csh," "lnd.setup.csh," and "ocn.setup.csh." The 14 environment variables set in the NQS script may be used by the setup scripts to prepare the respective codes for execution. These environment variables are specifically intended to be used as input to the component model setup scripts; they are not intended to be accessed by the component model executables themselves. It is strongly suggested that component model binaries not contain a hard-coded dependence on these environment variables. The environment variables are:

    CASE      case name
    CASESTR   short descriptive text string
    RUNTYPE   run type
    ARCH      machine architecture
    MAXCPUS   max number of CPUs available
    MSGLIB    message passing library
    SSD       whether the SSD is available
    MSS       whether the MSS is available
    MSSDIR    MSS directory path name
    MSSRPD    MSS file retention period
    MSSPWD    MSS file password
    RPTDIR    where restart pointer files are saved
    CSMROOT   root directory for model codes
    CSMSHARE  directory of "shared" code
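As a hedged sketch of how a setup script might consume these variables, consider the hypothetical skeleton below. The echo lines and the branch on $RUNTYPE are illustrative only; the real setup scripts are described in § 5.

```shell
#!/bin/csh -f
# Hypothetical skeleton of a component setup script, e.g. atm.setup.csh.
# It reads the environment variables exported by the parent NQS script;
# the actions shown are placeholders (see section 5 for the real scripts).
echo "case $CASE $CASESTR : $RUNTYPE run on $ARCH using $MAXCPUS CPUs"
if ($RUNTYPE == initial) then
  echo "gather initial-condition input data files here"
else
  echo "gather restart data files, e.g. via the pointer files in $RPTDIR"
endif
# ...build the executable from source code under $CSMROOT and $CSMSHARE...
```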

(c) Specify input, output, and execution directories

Here we specify the directory where the model will run, the directory where the model setup scripts are found, and the directory where stdout and stderr output data will be saved when the simulation finishes.

(d) Prepare component models for execution

Here the execution directory and component model subdirectories are created, and the Coupler and component model setup scripts are invoked. The purpose of the setup scripts is to build their respective executable codes, document what source code and data files are being used, and gather or create any necessary input data files. It is recommended that each component model have its own, separate setup script. This natural decomposition of code allows the persons responsible for a given model to create an appropriate setup script for their model without being confused by the details of another model. Setting $FILENV to ./.assign allows each executable to create and use its own, independent assign file. Assign is a UNICOS-specific file I/O utility that may or may not be used by the various executables. See the assign man page on a UNICOS system for more details.
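For concreteness, a hypothetical use of assign inside one model's execution directory might look like the fragment below. This is UNICOS-only; the unit number and file name are made up, and the exact option syntax should be checked against the assign man page.

```shell
# Hypothetical, UNICOS-only fragment; unit number and file name are made up.
cd $EXEDIR/ocn
setenv FILENV ./.assign                 # keep this model's assigns local
assign -a $EXEDIR/ocn/ic.data u:15      # attach a data file to Fortran unit 15
assign -V                               # view the current assign environment
```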

(e) Execute component models simultaneously

In section (d), via the setup scripts, all necessary pre-execution preparations were taken care of; at this point, all models are ready to be run. In this section we execute the Coupler and all component models simultaneously as background processes. Command line environment variables allow one to specify different numbers of CPUs for the different component models. The "-l" command line options are used by the message passing system, MPI, to assign logical process numbers to the component models. The "-t 600" option tells MPI how many seconds to wait for a message to be received before assuming that an error has occurred and the message will never be sent. The ja command is a UNICOS job accounting utility which provides data on CPU time used, memory used, etc. See the ja man page on a UNICOS system for more details.
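The run-in-background-then-wait pattern used above can be sketched in miniature with placeholder commands standing in for the real model executables. POSIX sh is used here for portability, and the file names are illustrative.

```shell
#!/bin/sh
# Minimal sketch of the run-in-background-then-wait pattern from step (e):
# placeholder commands stand in for the real model executables.
for model in cpl atm ice lnd ocn; do
  ( echo "$model running"; sleep 1 ) > demo_$model.out &  # one process per model
done
wait                       # block until every background "model" has exited
cat demo_cpl.out           # prints: cpl running
```

Because all five processes run concurrently, the whole sketch takes about one second rather than five; the real script relies on the same property to advance all model components in parallel.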

(f) Save model output (stdout & stderr)

The stdout output file from each component model (with the stderr output combined into it) is compressed and saved to the directory $LOGDIR.
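The log naming and collection scheme from steps (d) and (f) can be sketched with throwaway files in place of real model output. POSIX sh is used here for portability, and the directory names are stand-ins for the ones set in section (c).

```shell
#!/bin/sh
# Sketch of the log naming and collection scheme from steps (d) and (f),
# using throwaway files in place of real model output.
EXEDIR=$(mktemp -d)                    # stands in for the run directory
LOGDIR=$(mktemp -d)                    # stands in for the log directory
LID=$(date +%y%m%d-%H%M%S)             # unique log-file ID, as in step (d)
for model in cpl atm ice lnd ocn; do
  mkdir -p $EXEDIR/$model
  echo "$model stdout+stderr" > $EXEDIR/$model/$model.log.$LID
done
cd $EXEDIR
gzip */*.log.$LID                      # step (f): compress each model's log
cp -p */*.log.$LID* $LOGDIR            # the trailing * also matches .gz names
ls $LOGDIR | wc -l                     # prints: 5
```

Note that the trailing wildcard in the cp glob is what lets the same pattern pick up the files after gzip has renamed them with a .gz suffix.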


Fri 07 Aug 1998, 12:00:00