next up previous contents index
Next: 9 Troubleshooting (Log Files, Up: UsersGuide Previous: 7 Testing   Contents   Index

8 Performance

This section provides a brief overview of CCSM performance issues.

8.1 Compiler Optimization

As discussed in section 5.2, compiler flags are set in the $CCSMROOT/models/bld/Macros.* files. The default flags were chosen carefully to produce exact restart and valid climate results across various combinations of processors, parallelization options, and components, while balancing performance, error-trapping characteristics, and overall confidence in the results. Many tradeoffs were made among these goals. Users are free to modify the compiler flags in the Macros files. However, the CCSM developers strongly suggest carrying out a suite of exact restart, parallelization, performance, and science validation tests before using a modified Macros file for science runs.

Patches will be provided as machines and compilers evolve. There is no guarantee of either the scientific or software validity of the compiler flags on machines outside CCSM's control. The community should be aware that machines and compilers will evolve, change, and occasionally break. Users should validate any CCSM run on a platform before documenting the resulting science (see section 7).

8.2 Model parallelization types

Model parallelization types and the accompanying capabilities critical for optimizing the CCSM load balance are summarized in the following table. For each component, the table lists the component's parallelization options and whether answers are bit-for-bit when that component is run with different processor counts.

Table 6: CCSM3 Components Parallelization
Component  Version        Type    MPI       OpenMP    Hybrid    PEs
Name                              Parallel  Parallel  Parallel  bit-for-bit
---------------------------------------------------------------------------
cam        cam3           active  YES       YES       YES       NO
datm       datm6          data    NO        NO        NO        -
latm       latm6          data    NO        NO        NO        -
xatm       dead           dead    YES       NO        NO        YES
clm        clm3           active  YES       YES       YES       YES
dlnd       dlnd6          data    NO        NO        NO        -
xlnd       dead           dead    YES       NO        NO        YES
pop        ccsm_pop_1_4   active  YES       NO        NO        NO
docn       docn6          data    NO        NO        NO        -
xocn       dead           dead    YES       NO        NO        YES
csim       csim5          active  YES       NO        NO        NO
dice       dice6          data    NO        NO        NO        -
xice       dead           dead    YES       NO        NO        YES
cpl        cpl6           active  YES       NO        NO        YES

The following limitation applies to component parallelization:

Some architectures and compilers do not support OpenMP threads. Even when a component supports threading, OpenMP cannot be used unless both the architecture and the compiler support it.

8.3 Load Balancing CCSM3

CCSM load balance refers to the allocation of processors to the different components such that resources are used efficiently for a given model case and the resulting throughput is, in some sense, optimized. Because of the constraints on how processors can be allocated efficiently to different components, there are usually only a handful of ``sweet spots'' for processor usage for any given component set, resolution, and machine.

CCSM components run independently and are tied together only through MPI communication with the coupler. For example, data sent by the atm component to the land component is sent first to the coupler component which then sends the appropriate data to each land component process. The coupler component communicates with the atm, land, and ice components once per hour and with the ocean only once a day. The overall coupler calling sequence currently looks like

Coupler
-------
do i=1,ndays   ! days to run
  do j=1,24      ! hours
    if (j.eq.1)  call ocn_send()
    call lnd_send()
    call ice_send()
    call ice_recv()
    call lnd_recv()
    call atm_send()
    if (j.eq.24) call ocn_recv()
    call atm_recv()
  enddo
enddo
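
The daily sequence above can be traced with a short Python sketch. This is only a re-encoding of the pseudocode (coupler_trace is an illustrative name, not CCSM code) that records the call order and confirms the once-per-day ocean coupling:

```python
def coupler_trace(ndays: int) -> list:
    """Record the coupler's per-phase call order for ndays simulated days."""
    trace = []
    for day in range(ndays):
        for hour in range(1, 25):           # 24 coupling intervals per day
            if hour == 1:
                trace.append("ocn_send")    # ocean coupled only once per day
            trace.append("lnd_send")
            trace.append("ice_send")
            trace.append("ice_recv")
            trace.append("lnd_recv")
            trace.append("atm_send")
            if hour == 24:
                trace.append("ocn_recv")
            trace.append("atm_recv")
    return trace

trace = coupler_trace(ndays=1)
# For each daily ocean exchange there are 24 atm, lnd, and ice exchanges.
print(trace.count("ocn_send"), trace.count("atm_send"))  # prints: 1 24
```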

For scientific reasons, the coupler receives hourly data from the land and ice models before receiving hourly data from the atmosphere. Because of this execution sequence, it is important to allocate processors in a way that ensures atm processing is not held up waiting for land or ice data. A naive processor allocation can easily waste time blocking on communication and leave processors idle.

While the coupler is largely responsible for inter-component communication, it also carries out some computations such as flux calculations and grid interpolations. These are not indicated in the above pseudo-code.

Since all MPI ``sends'' and ``receives'' are blocking, the components may wait during the send and/or receive communication phases. Between the communication phases, each component carries out its internal computations. In general, a component's time loop looks like:

General Physical Component
--------------------------
do i=1,ndays
  do j=1,24
    call compute_stuff_1()
    call cpl_recv()
    call compute_stuff_2()
    call cpl_send()
    call compute_stuff_3()
  enddo
enddo

So compute_stuff_2 is carried out between the receive and the send, while compute_stuff_1 and compute_stuff_3 are carried out between one iteration's send and the next iteration's receive. Note that for each ocean communication there are 24 ice, land, and atm communications; aggregated over a day, however, the communication pattern can be represented schematically as below, and it serves as a template for load balancing CCSM3.

ocn   r---------------------------s
      ^                           |
ice   ^    r   s                  |
      ^    ^   |                  |
lnd   ^ r  ^   |   s              |
      ^ ^  ^   |   |              |
atm   ^ ^  ^   |   |   r        s |
      ^ ^  ^   v   v   ^        v v
cpl   s-s--s---r---r---s--------r-r
          time->

    s = send
    r = recv

CCSM3 runtime statistics can be found in the coupler, csim, and pop log files, whereas cam and clm write timing statistics to files of the form timing.<processID>. As an example, near the end of the coupler log file, the line

  (shr_timer_print) timer  2:    1 calls,   355.220s, id: t00 - main integration

indicates that 355 seconds were spent in the main time-integration loop. This time is also referred to as the ``stepon'' time. Simply put, load balancing involves reassigning processors to components so as to minimize this statistic for a given total number of processors. Due to the CCSM processing sequence, it is impossible to keep all processors 100% busy. Generally, a well balanced configuration will show that the atm and ocean processors are well utilized, whereas the ice and land processors may show considerable idle time. Keeping the atm and ocean processors busy matters more because the number of processors assigned to atm and ocean is much larger than the number assigned to ice and land.
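
The elapsed-seconds field can be pulled out of such a line with a regular expression. This sketch assumes only the field layout shown in the example line above; parse_timer_line and TIMER_RE are illustrative names:

```python
import re

# Matches the tail of a shr_timer_print line, e.g.
# "(shr_timer_print) timer  2:    1 calls,   355.220s, id: t00 - main integration"
TIMER_RE = re.compile(r"(\d+) calls,\s+([\d.]+)s, id: (\S+) - (.+)")

def parse_timer_line(line):
    """Return (calls, seconds, id, label), or None if the line doesn't match."""
    m = TIMER_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)), m.group(3), m.group(4)

line = "  (shr_timer_print) timer  2:    1 calls,   355.220s, id: t00 - main integration"
print(parse_timer_line(line))  # (1, 355.22, 't00', 'main integration')
```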

The script getTiming.csh, in the directory $CCSMROOT/scripts/ccsm_utils/Tools/getTiming, can be used to aid in the collection of run time statistics needed to examine the load balance efficiency.

The following examples illustrate some of the issues involved in load balancing a T42_gx1v3 CCSM3 run on bluesky.

   Case                    LB1      LB2
   =====================================
   OCN cpus                40       48
   ATM cpus                32       40
   ICE cpus                16       20
   LND cpus                8        12
   CPL cpus                8        8
   total CPUs              104      128
   stepon (s)              336      280
   node seconds            34944    35840
   simulated yrs/day       7.05     8.45
   simulated yrs/day/cpu   0.067    0.066

In the above example, adding more processors in the correct balance produced a configuration that was faster (more simulated years per wall-clock day) while remaining essentially as processor-efficient (simulated years per day per cpu). The example below shows that assigning more processors to a given run may speed it up (more simulated years per day) but at a lower processor efficiency.

   Case                    LB3      LB4
   =====================================
   OCN cpus                32       48
   ATM cpus                16       40
   ICE cpus                8        20
   LND cpus                4        12
   CPL cpus                4        8
   total CPUs              64       128
   stepon (s)              471      280
   node seconds            30144    35840
   simulated yrs/day       5.03     8.45
   simulated yrs/day/cpu   0.078    0.066
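
The derived rows in these tables follow mechanically from the stepon time and the processor counts. The sketch below reproduces them under one assumption not stated in the text: the stepon times appear to come from 10-simulated-day test runs (this run length is inferred from the numbers, not from the log files), and the per-cpu column appears to be truncated rather than rounded:

```python
SIM_DAYS = 10  # assumed length of each timing run (inferred from the tables)

def load_balance_stats(stepon_s, total_cpus):
    """Reproduce the derived rows of the load-balance tables above."""
    sim_years = SIM_DAYS / 365.0
    # simulated years per wall-clock day: 86400 wall seconds per day
    yrs_per_day = 86400.0 * sim_years / stepon_s
    return {
        "node seconds": stepon_s * total_cpus,
        "simulated yrs/day": round(yrs_per_day, 2),
        # the tables appear to truncate (not round) this value to 3 digits
        "simulated yrs/day/cpu": int(yrs_per_day / total_cpus * 1000) / 1000,
    }

for case, stepon, cpus in [("LB1", 336, 104), ("LB2", 280, 128),
                           ("LB3", 471, 64)]:
    print(case, load_balance_stats(stepon, cpus))
```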

Learning how to analyze run time statistics and properly assign processors to components takes considerable time and is beyond the scope of this document.

