This example modifies the PE layout for our original run, EXAMPLE_CASE. We now target the model to run on the jaguar supercomputer and modify our PE layout to use a common load balance configuration for CESM on large CRAY XT5 machines. Also see the Section called Changing the PE layout in Chapter 2.
In our original example, EXAMPLE_CASE, we used 128 pes with each component running sequentially over the entire set of processors.
Now we change the layout to use 1728 processors. The ice, lnd, and cpl models run concurrently on the same processors as the atm model, while the ocn model runs on its own set of processors. The atm model runs on 1664 pes using 832 MPI tasks, each threaded 2 ways, starting on global MPI task 0. The ice model runs on 320 MPI tasks starting on global MPI task 0, and is not threaded. The lnd model runs on 384 processors using 192 MPI tasks, each threaded 2 ways, starting at global MPI task 320, and the coupler runs on 320 processors using 320 MPI tasks starting at global MPI task 512. The ocn model uses 64 MPI tasks starting at global MPI task 832. The commands below also give the rof model the same layout as the lnd model.
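The arithmetic behind this layout can be checked with a short script. The following is only a sketch: the dictionary and function names are hypothetical and not part of CESM, the rof values are taken from the xmlchange commands in this example, and it assumes that each global MPI task is charged the largest thread count of any component placed on it (with a task stride of 1). Under that assumption the totals match the 1728 pes quoted above.

```python
# Hypothetical helper, not part of CESM: checks the PE layout arithmetic.
# Each entry is (NTASKS, NTHRDS, ROOTPE) as given in this example.
LAYOUT = {
    "atm": (832, 2, 0),
    "ice": (320, 1, 0),
    "lnd": (192, 2, 320),
    "rof": (192, 2, 320),  # from the xmlchange commands, same as lnd
    "cpl": (320, 1, 512),
    "ocn": (64, 1, 832),
}


def component_pes(ntasks, nthrds):
    """Processors used by one component: MPI tasks times threads per task."""
    return ntasks * nthrds


def total_mpi_tasks(layout):
    """Number of global MPI tasks: highest (ROOTPE + NTASKS) over all components."""
    return max(rootpe + ntasks for ntasks, _, rootpe in layout.values())


def total_pes(layout):
    """Total processors, assuming each global MPI task needs as many
    processors as the most heavily threaded component running on it."""
    threads_per_task = [0] * total_mpi_tasks(layout)
    for ntasks, nthrds, rootpe in layout.values():
        for task in range(rootpe, rootpe + ntasks):
            threads_per_task[task] = max(threads_per_task[task], nthrds)
    return sum(threads_per_task)


if __name__ == "__main__":
    for name, (ntasks, nthrds, rootpe) in LAYOUT.items():
        last_task = rootpe + ntasks - 1
        print(f"{name}: tasks {rootpe}-{last_task}, "
              f"{component_pes(ntasks, nthrds)} pes")
    print("total MPI tasks:", total_mpi_tasks(LAYOUT))
    print("total pes:", total_pes(LAYOUT))
```

A check like this is useful before editing env_mach_pes.xml, because a ROOTPE value that is off by one block silently changes which components share processors.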
Since we will be modifying env_mach_pes.xml after cesm_setup was called, the following needs to be invoked:
> ./cesm_setup -clean
> xmlchange -id NTASKS_ATM -val 832
> xmlchange -id NTHRDS_ATM -val 2
> xmlchange -id ROOTPE_ATM -val 0
> xmlchange -id NTASKS_ICE -val 320
> xmlchange -id NTHRDS_ICE -val 1
> xmlchange -id ROOTPE_ICE -val 0
> xmlchange -id NTASKS_LND -val 192
> xmlchange -id NTHRDS_LND -val 2
> xmlchange -id ROOTPE_LND -val 320
> xmlchange -id NTASKS_ROF -val 192
> xmlchange -id NTHRDS_ROF -val 2
> xmlchange -id ROOTPE_ROF -val 320
> xmlchange -id NTASKS_CPL -val 320
> xmlchange -id NTHRDS_CPL -val 1
> xmlchange -id ROOTPE_CPL -val 512
> xmlchange -id NTASKS_OCN -val 64
> xmlchange -id NTHRDS_OCN -val 1
> xmlchange -id ROOTPE_OCN -val 832
> ./cesm_setup
It is interesting to compare the timings from the 128- and 1728-processor runs. The timing output below shows that the original model run on 128 pes cost 851 pe-hours/simulated_year. Running on 1728 pes, the model cost more than 5 times as much, but it runs more than two and a half times faster.
128-processor case:
  Overall Metrics:
    Model Cost: 851.05 pe-hrs/simulated_year (scale= 1.00)
    Model Throughput: 3.61 simulated_years/day

1728-processor case:
  Overall Metrics:
    Model Cost: 4439.16 pe-hrs/simulated_year (scale= 1.00)
    Model Throughput: 9.34 simulated_years/day
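The comparison between the two runs can be reproduced from the four numbers in the timing output. This is a minimal sketch with hypothetical variable and function names. It also assumes that model cost is approximately the processor count times 24 hours divided by throughput; that relation is not stated in this section, but it agrees with both reported costs to within rounding.

```python
# Values copied from the timing output of the two runs.
PES_SMALL, COST_SMALL, THROUGHPUT_SMALL = 128, 851.05, 3.61
PES_LARGE, COST_LARGE, THROUGHPUT_LARGE = 1728, 4439.16, 9.34


def estimated_cost(pes, throughput):
    """Assumed relation: pe-hrs per simulated year = pes * 24 / (years per day)."""
    return pes * 24.0 / throughput


# How much more expensive and how much faster the large run is.
cost_ratio = COST_LARGE / COST_SMALL
speedup = THROUGHPUT_LARGE / THROUGHPUT_SMALL

# Fraction of ideal scaling achieved when going from 128 to 1728 pes.
relative_efficiency = speedup / (PES_LARGE / PES_SMALL)

if __name__ == "__main__":
    print(f"cost ratio: {cost_ratio:.2f}")
    print(f"speedup: {speedup:.2f}")
    print(f"relative efficiency: {relative_efficiency:.2f}")
    print(f"estimated cost, 128 pes: {estimated_cost(PES_SMALL, THROUGHPUT_SMALL):.1f}")
    print(f"estimated cost, 1728 pes: {estimated_cost(PES_LARGE, THROUGHPUT_LARGE):.1f}")
```

The cost ratio comes out a little above 5 and the speedup a little above 2.5, matching the statement that the larger run costs more than 5 times as much while running more than two and a half times faster.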
See understanding load balancing CESM for detailed information on interpreting timing files.