Introduction

Overview

The CCSM4 data models perform the basic function of reading external data files, modifying that data, and then sending it to the coupler via standard CCSM coupling interfaces. The coupler and other models have no fundamental knowledge of whether another component is fully active or just a data model. In some cases, data models are prognostic, that is, they also receive and use data sent by the coupler to the data model. But in most cases, the data models are not running prognostically and have no need to receive any data from the coupler.

The CCSM4 data models have been parallelized and share significant amounts of source code. Methods for reading and interpolating data have been established and can easily be reused. There is a natural hierarchy in the system. The data model calls strdata ("stream data") methods which then call stream methods. There are inputs associated with the data model, strdata, and streams to configure the setup. The stream methods are responsible for managing lists of input data files and their time axis. The information is then passed up to the strdata methods where the data is read and interpolated in space and time. The interpolated data is passed up to the data model where final fields are derived, packed, and returned to the coupler.

Design

The strdata implementation is hardwired to execute a set of specific operations associated with reading and interpolating data in space and time. The text box below shows the sequencing of the computation of model fields using the strdata methods.


STRDATA Implementation:
  for the current model time
  determine nearest lower and upper bound data from the input dataset
  if that is new data then
     read lower and upper bound data
     fill lower and upper bound data
     spatially map lower and upper bound data to model grid
  endif
  time interpolate lower and upper bound data to model time
  return fields to data model

IO Through Data Models

The two timestamps of input data that bracket the present model time are read first. These are called the lower and upper bounds of data and will change as the model advances. Those two sets of inputdata are first filled based on the method and masking set by the user. That operation occurs on the input data grid. The lower and upper bound data are then spatially mapped to the model grid based upon map methods and masking set by the user in namelist. Spatial interpolation only occurs if the input data grid and model grid are not the identical, and this is determined in the strdata module automatically. Time interpolation is the final step and is done using a time interpolation method specified by the user in namelist. A final set of fields is then available to the data model on the model grid and for the current model time.

There are two primary costs associated with strdata, reading data and spatially mapping data. Time interpolation is relatively cheap in the current implementation. As much as possible, redundant operations are minimized. Fill and mapping weights are generated at initialization and saved. The upper and lower bound mapped input data is saved between time steps to reduce mapping costs in cases where data is time interpolated more often than new data is read. If the input data timestep is relatively small (for example, hourly data as opposed to daily or monthly data) the cost of reading input data can be quite large. Also, there can be significant variation in cost of the data model over the coarse of the run, for instance, when new inputdata must be read and interpolated, although it's relatively predictable. The present implementation doesn't support changing the order of operations, for instance, time interpolating the data before spatial mapping. Although because the present computations are always linear, changing the order of operations wouldn't fundamentally change the results. The present order of operations generally minimizes the mapping cost for typical data model use cases.

There are several limitations in both options and usage within the data models at the present time. Spatial interpolation can only be performed from a two-dimensional latitude-longitude input grid. The target grid can be arbitrary but the source grid must be able to be described by simple one-dimensional lists of longitudes and latitudes, although they don't have to have equally spaced.

At the present time, data models can only read netcdf data, and IO is handled through either standard netcdf interfaces or through the pio library using either netcdf or pnetcdf. If standard netcdf is used, global fields are read and then scattered one field at a time. If pio is used, then data will be read either serially or in parallel in chunks that are approximately the global field size divided by the number of io tasks. If pnetcdf is used through pio, then the pnetcdf library must be included during the build of CCSM. The pnetcdf path and option is hardwired into the Macros file for the specific machine. To turn on pnetcdf in the build, make sure the Macros variables PNETCDF_PATH, INC_PNETCDF, and LIB_PNETCDF are set and that the pio CONFIG_ARGS sets the PNETCDF_PATH argument. See the CCSM users guide for more information.

Beyond just the option of selecting IO with pio, several namelist are available to help optimize pio IO performance. Those are io_type, num_iotasks, io_root, and io_stride. The number of total mpi tasks that can be used for IO is limited to the total number of tasks used by the data model. Often though, fewer io tasks result in improved performance. In general, [io_root + (num_iotasks-1)*io_stride + 1] has to be less than the total number of data model tasks. In practice, pio seems to perform optimally somewhere between the extremes of 1 task and all tasks, and is highly machine and problem dependent.

Restart Files

Restart files are generated automatically by the data models based upon a flag sent from the coupler. The restart files must meet the naming convention specified by the CCSM project and an rpointer file is generated at the same time. An rpointer file is a restart pointer file which contains the name of the most recently created restart file. Normally, if restart files are read, the restart filenames are specified in the rpointer file. Optionally though, there are namelist variables such as restfilm to specify the restart filenames via namelist. If those namelist are set, the rpointer file will be ignored. The standard procedure in CCSM is to use the rpointer files to specify the restart filenames. In many cases, no model restart is required for the data models to restart exactly. This is because there is no memory between timesteps in many of the data model science modes. If a model restart is required, it will be written automatically and then must be used to continue the previous run.

There are separate stream restart files that only exist for performance reasons. A stream restart file contains information about the time axis of the input streams. This information helps reduce the start costs associated with reading the input dataset time axis information. If a stream restart file is missing, the code will restart without it but may need to reread data from the input data files that would have been stored in the stream restart file. This will take extra time but will not impact the results.

Hierarchy

The hierarchy of data models, strdata, and streams also compartmentalize grids and fields. Some data models communicate with the coupler with fields on multiple model grids. For instance, the data land model communicates with the coupler on two different grids, a land grid and a runoff grid. In those cases, there are multiple model grids and so there will be multiple strdata namelist, one for each unique data model grid. For each strdata namelist, data is interpolated to a single model grid. But each strdata namelist input can have multiple stream description files and each stream input file can contain data on a different grid. The strdata module will gracefully read the different streams of input data and interpolate both spatially and temporally to the appropriate final model grid and model time. The text box below provides a schematic of the hierarchy


  driver     :                 call data land model
  data model :                   data land model
  data model :           land_data               runoff_data
  data model :            grid1                    grid2
  data model :           strdata1                 strdata2
  strdata    :    interp1a interp1b interp1c      interp2a
  strdata    :    stream1a stream1b stream1c      stream2a  
  stream     :     grid1a   grid1b   grid1c        grid2a
  stream     :   file1a_01 file1b_01 file1c_01    file2a_01
  stream     :     ...                 ...           ...
  stream     :   file1a_04           file1c_96    file2a_12

Users will primarily setup different data model configurations through existing namelist settings. The strdata and stream input options and format are identical for all data models. The data model specific namelist has significant overlap between data models, but each data model has a slightly different set of input namelist variables and each model reads that namelist from a unique filename. The detailed namelist options for each data model will be described later, but each model will specify a filename or filenames for strdata namelist input and each strdata namelist will specify a set of stream input files.

To continue with the above example, the following inputs would be consistent with the above figure. This is a data land model that has data on two different model grids, land_data on grid1 and runoff_data on grid2. Therefore two different strdata namelist will be needed. The data model namelist input file is hardwired to "dlnd_in" and in this case, the namelist would look something like


file="dlnd_in":
&dlnd_nml
  lnd_in = 'dlnd_lnd_in'
  rof_in = 'dlnd_rof_in'
  decomp = '1d'
/

The lnd_in and rof_in specify the filenames associated with the strdata namelist input for the land and runoff data separately. The land and runoff strdata namelist would then look like


file="dlnd_lnd_in":
&shr_strdata_nml
  dataMode   = 'CPLHIST'
  domainFile = 'grid1.nc'
  streams    = 'stream1a',
               'stream1b',
               'stream1c'
  mapalgo    = 'interp1a',
               'interp1b',
               'interp1c'
/

file="dlnd_rof_in":
&shr_strdata_nml
  dataMode   = 'RX1'
  domainFile = 'grid2.nc'
  streams    = 'stream2a'
  mapalgo    = 'interp2a'
/

Four stream description files are then expected to be available, stream1a, stream1b, stream1c, and stream2a. Those files specify the input data filenames, input data grids, and input fields that are expected among other things. For instance, one of the stream description files might look like


<stream>
      <comment>
         NCEP "normal year" data
      </comment>
      <dataSource>
         NCEP
      </dataSource>
      <fieldInfo>
         <variableNames>
            dn10  dens
            slp_  pslv
            q_10  shum
            t_10  tbot
            u_10  u
            v_10  v
         </variableNames>
         <filePath>
            /fis/cgd/cseg/csm/inputdata/atm/datm7/NYF
         </filePath>
         <offset>
            0
         </offset>
         <fileNames>
            nyf.ncep.T62.050923.nc
         </fileNames>
      </fieldInfo>
      <domainInfo>
         <variableNames>
            time   time
            lon    lon
            lat    lat
            area   area
            mask   mask
         </variableNames>
         <filePath>
            /fis/cgd/cseg/csm/inputdata/atm/datm7/NYF
         </filePath>
         <fileNames>
            nyf.ncep.T62.050923.nc
         </fileNames>
      </domainInfo>
</stream>

The stream files are not Fortran namelist format. Their format and options will be described later. In general, these examples of input files are not complete, but they do show the general hierarchy and feel of the data model input.

Summary

In summary, for each data model a top level namelist will be set that will point to a file that contains the strdata namelist. That namelist will specify the data model mode, stream description text files, and interpolation options. The stream description files will be provided as separate input files and contain the files and fields that need to be read.

From a user perspective, for any data model, it's important to know what modes are supported and the internal field names in the data model. That information will be used in the strdata namelist and stream input files.

Next Sections

In the next sections, more details will be presented including a full description of the science modes and namelist settings for the data atmosphere, data land, data ocean, and data ice models; namelist settings for the strdata namelist input; a description of the format and options for the stream description input files; and a list of internal field names for each of the data components. The internal data model field names are important because they are used to setup the stream description files and to map the input data fields to the internal data model field names.