NCAR CSM Flux Coupler, version 4.0 -- User's Guide

4   NQS Script

In this section we examine a CSM coupled model batch job shell script. The basic purpose of this script is simple: first the Coupler and the component model codes are prepared for execution, and then the Coupler and all component models are executed simultaneously as separate processes. Upon startup, the component models first establish a connection to the Coupler, and then data begins to flow between them. All component models continue to advance forward in time until they are signaled to stop by the Coupler; when the coupled system stops is determined solely by the Coupler.

The example Network Queuing System (NQS) shell script shown below is a complete script, suitable for actual use. This script first prepares each component for execution and then executes the components simultaneously. While the specifics shown below are a recommended method for running the coupled system on NCAR's Cray computers, variations on this method will also work, some of which may be more appropriate for different situations.

The coupled model batch job script below is a top-level NQS batch job script which in turn calls five separate subscripts called "setup scripts." These subscripts, "cpl.setup.csh," "atm.setup.csh," "ice.setup.csh," "lnd.setup.csh," and "ocn.setup.csh," are responsible for building their respective executable codes and gathering any required input data files (see also § 5). The setup scripts receive input data from the parent NQS script by means of several environment variables. After calling the setup subscripts, the parent NQS script executes the Coupler and all component models simultaneously as background processes. The parent script waits for these background processes to complete; when they do, the coupled run has finished, and the NQS script saves the stdout output and terminates. Following the example is a more detailed explanation of what is being done in the various parts of the NQS script.

    #=======================================================================
    #  This is a CSM coupled model NQS batch job script
    #=======================================================================

    #-----------------------------------------------------------------------
    # (a) Set NQS options
    #-----------------------------------------------------------------------
    # QSUB -q reg                           # select batch queue
    # QSUB -lT   5:10:00 -lt   5:00:00      # set CPU time limits
    # QSUB -lM 35Mw  -lm 20Mw               # set memory limits
    # QSUB -mb -me -eo                      # combine stderr & stdout
    # QSUB -s  /bin/csh                     # select shell script
    # QSUB                                  # no more QSUB options
    #-----------------------------------------------------------------------

    #-----------------------------------------------------------------------
    # (b) Set env variables available to model setup scripts (below)   
    #     CASE, CASESTR, RUNTYPE, ARCH  , MAXCPUS, MSGLIB , SSD,
    #     MSS , MSSDIR , MSSRPD , MSSPWD, RPTDIR , CSMROOT, CSMSHARE
    #-----------------------------------------------------------------------

    setenv CASE     test.00         # case name
    setenv CASESTR '(CSM test)'     # short descriptive text string
    setenv RUNTYPE  initial         # run type
    setenv ARCH     C90             # machine architecture
    setenv MAXCPUS  8               # max number of CPUs available
    setenv MSGLIB   MPI             # message passing library
    setenv SSD      TRUE            # SSD is available?
    setenv MSS      FALSE           # MSS is available?
    setenv MSSDIR   /DOE/csm/$CASE  # MSS directory path name
    setenv MSSRPD   365             # MSS file retention period
    setenv MSSPWD   'rosebud'       # MSS file password
    setenv RPTDIR   $HOME           # where restart pointer files are saved
    setenv CSMROOT  /fs/cgd/csm     # root directory for model codes
    setenv CSMSHARE $CSMROOT/share  # directory of "shared" code

    #-----------------------------------------------------------------------
    # (c) Specify input, output, and execution directories        
    #     o the component model setup.csh scripts must be in $NQSDIR   
    #     o stdout & stderr output is saved in $LOGDIR                 
    #-----------------------------------------------------------------------

    set EXEDIR = /tmp/doe/$CASE               # model runs here
    set NQSDIR = ~doe/$CASE                   # model setup scripts are here
    set LOGDIR = $NQSDIR                      # stdout output goes here

    #-----------------------------------------------------------------------
    # (d) Prepare component models for execution                       
    #     o create execution directories: $EXEDIR/[atm|cpl|lnd|ice|ocn]
    #     o execute the component model setup scripts located in $NQSDIR   
    #       (these scripts have access to env variables set above)
    #     o see the man page for details about the Cray assign function  
    #-----------------------------------------------------------------------

    setenv FILENV ./.assign    # allow separate .assign files for each model
    set LID = "`date +%y%m%d-%H%M%S`"  # create a unique log file ID

    mkdir -p $EXEDIR 
    cd       $EXEDIR
    foreach model (cpl atm ice lnd ocn)
      mkdir $EXEDIR/$model 
      cd    $EXEDIR/$model  
      $NQSDIR/$model.setup.csh >&! $model.log.$LID || exit 2
    end

    #-----------------------------------------------------------------------
    # (e) Execute models simultaneously (allocating CPUs)       
    #-----------------------------------------------------------------------

    rm $TMPDIR/*.$LOGNAME.*  # rm any old msg pipe files

    ja $TMPDIR/jacct         # start Cray job accounting
    cd $EXEDIR/cpl
    env NCPUS=$MAXCPUS cpl -l 0 -n 5 -t 600 < cpl.parm >>&! cpl.log.$LID &
    cd $EXEDIR/atm
    env NCPUS=$MAXCPUS atm -l 1      -t 600 < atm.parm >>&! atm.log.$LID &
    cd $EXEDIR/ocn
    env NCPUS=$MAXCPUS ocn -l 2      -t 600 < ocn.parm >>&! ocn.log.$LID &
    cd $EXEDIR/ice
    env NCPUS=2        ice -l 3      -t 600 < ice.parm >>&! ice.log.$LID &
    cd $EXEDIR/lnd
    env NCPUS=$MAXCPUS lnd -l 4      -t 600 < lnd.parm >>&! lnd.log.$LID &
    wait                     # wait for all background model processes to finish
    ja -tsclh $TMPDIR/jacct  # end   Cray job accounting

    #-----------------------------------------------------------------------
    # (f) save model output (stdout & stderr) to $LOGDIR               
    #-----------------------------------------------------------------------

    cd $EXEDIR
    gzip -v */*.log.$LID
    cp   -p */*.log.$LID* $LOGDIR

    #=======================================================================
    # End of nqs shell script                                          
    #=======================================================================

Items (a) through (f) in the above job script are now reviewed.

(a) Set NQS options

The Network Queuing System (NQS) is a special facility available under UNICOS (Cray's version of UNIX). NQS is a batch job facility: you submit your job to NQS and NQS runs your job. The QSUB options set here select the queue, the maximum memory required, the maximum time required, the combining of the NQS script's stdout and stderr, and the shell to interpret the NQS script. See the qsub man page on a UNICOS computer for more information.
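As a hypothetical illustration, submitting and monitoring the job from a UNICOS command line might look like the fragment below. The script file name "run.csm.nqs" is an assumption, and these commands exist only on a system running NQS, so this is a sketch rather than something that can be run elsewhere.

```shell
# Hypothetical NQS session; "run.csm.nqs" is an assumed file name.
qsub run.csm.nqs        # submit the batch job script to NQS
qstat -a                # check the status of queued and running requests
```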

(b) Set environment variables for use in the model setup scripts

While the execution of the Coupler and the component models is done explicitly in this NQS script, the building of executables from source code, the gathering of necessary input data files, and any other pre-execution preparation are deferred to the subscripts "cpl.setup.csh," "atm.setup.csh," "ice.setup.csh," "lnd.setup.csh," and "ocn.setup.csh." The 14 environment variables set in the NQS script may be used by the setup scripts to prepare the respective codes for execution. These environment variables are specifically intended to be used as input to the component model setup scripts; they are not intended to be accessed by the component model executables themselves. It is strongly suggested that component model binaries not contain a hard-coded dependence on these environment variables. The environment variables are:

    CASE      case name
    CASESTR   short descriptive text string
    RUNTYPE   run type
    ARCH      machine architecture
    MAXCPUS   max number of CPUs available
    MSGLIB    message passing library
    SSD       whether the SSD is available
    MSS       whether the MSS is available
    MSSDIR    MSS directory path name
    MSSRPD    MSS file retention period
    MSSPWD    MSS file password
    RPTDIR    where restart pointer files are saved
    CSMROOT   root directory for model codes
    CSMSHARE  directory of "shared" code
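As a hedged sketch of how a setup script might consume these variables, consider the hypothetical skeleton below. The echo lines and the branch on $RUNTYPE are illustrative only; the real setup scripts are described in § 5.

```shell
#!/bin/csh -f
# Hypothetical skeleton of a component setup script, e.g. atm.setup.csh.
# It reads the environment variables exported by the parent NQS script;
# the actions shown are placeholders (see section 5 for the real scripts).
echo "case $CASE $CASESTR : $RUNTYPE run on $ARCH using $MAXCPUS CPUs"
if ($RUNTYPE == initial) then
  echo "gather initial-condition input data files here"
else
  echo "gather restart data files, e.g. via the pointer files in $RPTDIR"
endif
# ...build the executable from source code under $CSMROOT and $CSMSHARE...
```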

(c) Specify input, output, and execution directories

Here we specify the directory where the model will run, the directory where the model setup scripts are found, and the directory where stdout and stderr output data will be saved when the simulation finishes.

(d) Prepare component models for execution

Here the execution directory and component model subdirectories are created, and the Coupler and component model setup scripts are invoked. The purpose of the setup scripts is to build their respective executable codes, document what source code and data files are being used, and gather or create any necessary input data files. It is recommended that each component model have its own, separate setup script. This natural decomposition of code allows the persons responsible for a given model to create an appropriate setup script for their model without being confused by the details of another model. Setting $FILENV to ./.assign allows each executable to create and use its own, independent assign file. Assign is a UNICOS-specific file I/O utility that may or may not be used by the various executables. See the assign man page on a UNICOS system for more details.
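For concreteness, a hypothetical use of assign inside one model's execution directory might look like the fragment below. This is UNICOS-only; the unit number and file name are made up, and the exact option syntax should be checked against the assign man page.

```shell
# Hypothetical, UNICOS-only fragment; unit number and file name are made up.
cd $EXEDIR/ocn
setenv FILENV ./.assign                 # keep this model's assigns local
assign -a $EXEDIR/ocn/ic.data u:15      # attach a data file to Fortran unit 15
assign -V                               # view the current assign environment
```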

(e) Execute component models simultaneously

In section (d), via the setup scripts, all necessary pre-execution preparations were taken care of; at this point, all models are ready to be run. In this section we execute the Coupler and all component models simultaneously as background processes. Command line environment variables allow one to specify different numbers of CPUs for the different component models. The "-l" command line options are used by the message passing system, MPI, to assign logical process numbers to the component models. The "-t 600" option tells MPI how many seconds to wait for a message to be received before assuming that an error has occurred and the message will never be sent. The ja command is a UNICOS job accounting utility which provides data on CPU time used, memory used, etc. See the ja man page on a UNICOS system for more details.
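The run-in-background-then-wait pattern used above can be sketched in miniature with placeholder commands standing in for the real model executables. POSIX sh is used here for portability, and the file names are illustrative.

```shell
#!/bin/sh
# Minimal sketch of the run-in-background-then-wait pattern from step (e):
# placeholder commands stand in for the real model executables.
for model in cpl atm ice lnd ocn; do
  ( echo "$model running"; sleep 1 ) > demo_$model.out &  # one process per model
done
wait                       # block until every background "model" has exited
cat demo_cpl.out           # prints: cpl running
```

Because all five processes run concurrently, the whole sketch takes about one second rather than five; the real script relies on the same property to advance all model components in parallel.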

(f) Save model output (stdout & stderr)

The stdout output file from each component model (with the stderr output combined into it) is compressed and saved to the directory $LOGDIR.
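The log naming and collection scheme from steps (d) and (f) can be sketched with throwaway files in place of real model output. POSIX sh is used here for portability, and the directory names are stand-ins for the ones set in section (c).

```shell
#!/bin/sh
# Sketch of the log naming and collection scheme from steps (d) and (f),
# using throwaway files in place of real model output.
EXEDIR=$(mktemp -d)                    # stands in for the run directory
LOGDIR=$(mktemp -d)                    # stands in for the log directory
LID=$(date +%y%m%d-%H%M%S)             # unique log-file ID, as in step (d)
for model in cpl atm ice lnd ocn; do
  mkdir -p $EXEDIR/$model
  echo "$model stdout+stderr" > $EXEDIR/$model/$model.log.$LID
done
cd $EXEDIR
gzip */*.log.$LID                      # step (f): compress each model's log
cp -p */*.log.$LID* $LOGDIR            # the trailing * also matches .gz names
ls $LOGDIR | wc -l                     # prints: 5
```

Note that the trailing wildcard in the cp glob is what lets the same pattern pick up the files after gzip has renamed them with a .gz suffix.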


Fri 07 Aug 1998, 12:00:00