Changing PE layout

This example changes the PE layout of our original run, b40.B2000. The model now targets the jaguar supercomputer, and the new PE layout uses a common load-balance configuration for CCSM on large Cray XT5 machines.

In our original example, b40.B2000, we used 128 pes with each component running sequentially over the entire set of processors.

[Figure: 128-pes/128-tasks layout]

Now we change the layout to use 1728 processors. The ice, lnd, and cpl models run concurrently with one another on disjoint subsets of the processors used by the atm model, while the ocn model runs on its own set of processors. The atm model runs on 1664 pes, using 832 MPI tasks threaded 2 ways each and starting at global MPI task 0. The ice model uses 320 unthreaded MPI tasks, also starting at global MPI task 0. The lnd model runs on 384 processors, using 192 MPI tasks threaded 2 ways each and starting at global MPI task 320. The coupler runs on 320 processors, using 320 unthreaded MPI tasks starting at global MPI task 512. The ocn model uses 64 MPI tasks starting at global MPI task 832.
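The pe and task totals quoted above can be checked with a few lines of Python. This is a minimal sketch and not part of CCSM: the LAYOUT table only restates the values from the text, the function names are hypothetical, and the pe accounting (each global MPI task needs as many pes as the most heavily threaded component placed on it) is an assumption chosen because it reproduces the 1728-pes/896-tasks figure.

```python
# (NTASKS, NTHRDS, ROOTPE) per component, copied from the layout described above.
LAYOUT = {
    "ATM": (832, 2, 0),
    "ICE": (320, 1, 0),
    "LND": (192, 2, 320),
    "CPL": (320, 1, 512),
    "OCN": (64, 1, 832),
}


def task_range(layout, comp):
    """First and last global MPI task used by a component (inclusive)."""
    ntasks, _, root = layout[comp]
    return root, root + ntasks - 1


def component_pes(layout, comp):
    """Pes used by one component: MPI tasks times threads per task."""
    ntasks, nthrds, _ = layout[comp]
    return ntasks * nthrds


def total_tasks(layout):
    """Global MPI task count: one past the highest task index in use."""
    return max(root + ntasks for ntasks, _, root in layout.values())


def total_pes(layout):
    """Assumed accounting: a task needs as many pes as the most heavily
    threaded component that runs on it."""
    threads = [0] * total_tasks(layout)
    for ntasks, nthrds, root in layout.values():
        for task in range(root, root + ntasks):
            threads[task] = max(threads[task], nthrds)
    return sum(threads)


if __name__ == "__main__":
    for comp in LAYOUT:
        first, last = task_range(LAYOUT, comp)
        print(comp, "tasks", first, "to", last, "pes", component_pes(LAYOUT, comp))
    print("total tasks:", total_tasks(LAYOUT), "total pes:", total_pes(LAYOUT))
```

The task ranges make the overlap explicit: ice, lnd, and cpl occupy tasks 0-319, 320-511, and 512-831, which together cover exactly the atm range, while ocn sits alone on tasks 832-895.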

[Figure: 1728-pes/896-tasks layout]

Because env_mach_pes.xml is being modified after configure has already been invoked, the machine configuration must first be cleaned, then the new settings applied, and finally configure run again:


> configure -cleanmach
> xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 832
> xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 2
> xmlchange -file env_mach_pes.xml -id ROOTPE_ATM -val 0
> xmlchange -file env_mach_pes.xml -id NTASKS_ICE -val 320
> xmlchange -file env_mach_pes.xml -id NTHRDS_ICE -val 1
> xmlchange -file env_mach_pes.xml -id ROOTPE_ICE -val 0
> xmlchange -file env_mach_pes.xml -id NTASKS_LND -val 192
> xmlchange -file env_mach_pes.xml -id NTHRDS_LND -val 2
> xmlchange -file env_mach_pes.xml -id ROOTPE_LND -val 320
> xmlchange -file env_mach_pes.xml -id NTASKS_CPL -val 320
> xmlchange -file env_mach_pes.xml -id NTHRDS_CPL -val 1
> xmlchange -file env_mach_pes.xml -id ROOTPE_CPL -val 512
> xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 64
> xmlchange -file env_mach_pes.xml -id NTHRDS_OCN -val 1
> xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 832
> configure -mach

Note that since env_mach_pes.xml has changed, the model has to be reconfigured and rebuilt.
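The fifteen xmlchange calls above follow one pattern (NTASKS, NTHRDS, and ROOTPE for each component), so they can be generated from a table instead of typed by hand. The sketch below is only an illustration: the helper name and the LAYOUT dictionary are hypothetical, it uses no xmlchange options beyond the -file, -id, and -val flags shown above, and it only prints the command lines rather than running them.

```python
# Per-component settings, copied from the xmlchange commands above.
LAYOUT = {
    "ATM": {"NTASKS": 832, "NTHRDS": 2, "ROOTPE": 0},
    "ICE": {"NTASKS": 320, "NTHRDS": 1, "ROOTPE": 0},
    "LND": {"NTASKS": 192, "NTHRDS": 2, "ROOTPE": 320},
    "CPL": {"NTASKS": 320, "NTHRDS": 1, "ROOTPE": 512},
    "OCN": {"NTASKS": 64, "NTHRDS": 1, "ROOTPE": 832},
}


def xmlchange_commands(layout, xml_file="env_mach_pes.xml"):
    """Build one xmlchange command line per setting (hypothetical helper)."""
    commands = []
    for comp, settings in layout.items():
        for key in ("NTASKS", "NTHRDS", "ROOTPE"):
            commands.append(
                f"xmlchange -file {xml_file} -id {key}_{comp} -val {settings[key]}"
            )
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run
    # between "configure -cleanmach" and "configure -mach".
    for line in xmlchange_commands(LAYOUT):
        print(line)
```

Keeping the layout in one table also makes it easy to review the numbers in a single place before the case is reconfigured.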

It is interesting to compare the timings from the 128- and 1728-processor runs. The timing output below shows that the original run on 128 pes cost 851 pe-hours per simulated year. On 1728 pes, the model costs more than 5 times as much (4439 pe-hours per simulated year) but runs more than two and a half times faster (9.34 versus 3.61 simulated years per day).


128-processor case:
Overall Metrics:
Model Cost: 851.05 pe-hrs/simulated_year (scale= 1.00)
Model Throughput: 3.61 simulated_years/day

1728-processor case:
Overall Metrics:
Model Cost: 4439.16 pe-hrs/simulated_year (scale= 1.00)
Model Throughput: 9.34 simulated_years/day
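The comparison can be reproduced from the four numbers in the timing output. The following sketch computes the cost and throughput ratios and also checks the reported Model Cost against a simple estimate. The relation used for that estimate (pes times 24 hours, divided by simulated years per day) is an assumption about how the metric is derived, not something taken from the timing file itself, and the names are hypothetical.

```python
HOURS_PER_DAY = 24.0

# Values copied from the "Overall Metrics" output above, keyed by pe count.
RUNS = {
    128: {"cost": 851.05, "throughput": 3.61},
    1728: {"cost": 4439.16, "throughput": 9.34},
}


def estimated_cost(pes, throughput):
    """Assumed relation: pe-hours per simulated year when every pe is
    charged for the full wall-clock time of the run."""
    return pes * HOURS_PER_DAY / throughput


# How much more the large run costs, and how much faster it is.
cost_ratio = RUNS[1728]["cost"] / RUNS[128]["cost"]
speedup = RUNS[1728]["throughput"] / RUNS[128]["throughput"]

if __name__ == "__main__":
    print(f"cost ratio: {cost_ratio:.2f}")
    print(f"throughput ratio: {speedup:.2f}")
    for pes, metrics in RUNS.items():
        estimate = estimated_cost(pes, metrics["throughput"])
        print(f"{pes} pes: reported {metrics['cost']:.2f}, estimated {estimate:.2f}")
```

The ratios come out at roughly 5.2 for cost and 2.6 for throughput, matching the statement above. The estimate agrees with the reported cost to within a small fraction of a percent for both runs, which is consistent with the throughput being printed to only two decimal places.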

See "Understanding load balancing CCSM" for detailed information on reading timing files.