To understand some aspects of compiling and running POP, it helps to know how POP breaks up a problem to run on different threads and processors. Note that even the serial versions decompose the domain, in order to achieve better performance on cache-based microprocessors.
In POP, the full horizontal domain of size (nx_global,ny_global) is broken up into subdomains called blocks. The block size can be chosen to achieve better performance, as described below. Any block size can be chosen, but to avoid padding the domain with extra points, the block size in each direction should divide the global domain size in that direction evenly.
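The divisibility rule above can be checked with a few lines of arithmetic. The following is a minimal sketch, not POP code: the helper name block_layout and the grid and block sizes are hypothetical values chosen for illustration.

```python
def block_layout(n_global, block_size):
    """Return (number of blocks, padding points) along one direction."""
    # Ceiling division: a partially filled last block still counts as a block.
    n_blocks = -(-n_global // block_size)
    # Points added beyond the global domain to fill the last block.
    padding = n_blocks * block_size - n_global
    return n_blocks, padding


# Hypothetical global grid size, for illustration only.
nx_global, ny_global = 320, 384

for bx, by in [(40, 48), (50, 50)]:
    nbx, pad_x = block_layout(nx_global, bx)
    nby, pad_y = block_layout(ny_global, by)
    print(f"block {bx}x{by}: {nbx}x{nby} blocks, padding x={pad_x}, y={pad_y}")
```

In this example a 40x48 block divides the grid evenly and needs no padding, while a 50x50 block leaves the last block in each direction partly filled with extra points.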
Once the domain has been decomposed into blocks, the blocks are distributed among the processors or nodes; blocks that contain only land points are ignored. The distribution can be either load-balanced, which tries to give all processors an equal amount of work, or Cartesian, which ensures that each block's north, south, east and west neighbors remain nearest neighbors. A load-balanced distribution is generally better for the baroclinic section of the code; a Cartesian distribution is better for the barotropic solver. Different distributions can be specified for the baroclinic and barotropic parts of the code.
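The difference between the two distributions can be seen on a toy grid. The sketch below is an illustration under stated assumptions, not POP's actual algorithm: the land/ocean mask is synthetic, the function names are hypothetical, the load-balanced case uses a simple greedy heuristic as a stand-in, and the Cartesian case maps blocks onto a px by py processor grid by position.

```python
def ocean_blocks(mask, bx, by):
    """Map (ib, jb) -> ocean point count, skipping blocks that are all land.

    mask[j][i] is True for an ocean point (synthetic data in this sketch).
    """
    ny, nx = len(mask), len(mask[0])
    work = {}
    for jb in range(-(-ny // by)):
        for ib in range(-(-nx // bx)):
            count = sum(
                mask[j][i]
                for j in range(jb * by, min((jb + 1) * by, ny))
                for i in range(ib * bx, min((ib + 1) * bx, nx))
            )
            if count > 0:  # land-only blocks are dropped
                work[(ib, jb)] = count
    return work


def load_balanced(work, nprocs):
    """Greedy stand-in: give the heaviest remaining block to the idlest processor."""
    loads = [0] * nprocs
    owner = {}
    for block, w in sorted(work.items(), key=lambda kv: -kv[1]):
        p = loads.index(min(loads))
        owner[block] = p
        loads[p] += w
    return owner, loads


def cartesian(work, nbx, nby, px, py):
    """Assign blocks by position in a px-by-py processor grid.

    Neighboring blocks end up on the same or an adjacent processor.
    """
    owner = {}
    for ib, jb in work:
        owner[(ib, jb)] = (jb * py // nby) * px + (ib * px // nbx)
    return owner


# Synthetic 8x4 grid: the three westernmost columns are land.
mask = [[i >= 3 for i in range(8)] for _ in range(4)]
work = ocean_blocks(mask, bx=2, by=2)  # 4x2 blocks, of which 2 are all land

_, balanced_loads = load_balanced(work, nprocs=2)

cart_owner = cartesian(work, nbx=4, nby=2, px=2, py=1)
cart_loads = [0, 0]
for block, p in cart_owner.items():
    cart_loads[p] += work[block]

print("ocean blocks:", len(work))
print("load-balanced ocean points per processor:", balanced_loads)
print("Cartesian ocean points per processor:", cart_loads)
```

On this mask the greedy assignment gives each of the two processors the same number of ocean points, while the Cartesian assignment keeps neighboring blocks together at the cost of leaving the western processor with far less work.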
Such a domain decomposition allows some flexibility in tuning the model for the best performance. Generally, a smaller block size improves processor performance on cache-based microprocessors and should give a better load balance and better land point elimination. However, smaller block sizes add complexity to the communication routines (boundary updates, global reductions) and result in a performance penalty for the barotropic solver. The user will need to experiment with a few combinations to find the best configuration for the simulation being run.
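The land point elimination side of this trade-off can be illustrated on a synthetic mask. The sketch below is not POP code: the 16x16 grid, the triangular ocean region and the helper name retained are assumptions made for the example. It counts how many blocks survive land elimination and how many grid points those blocks carry for several block sizes.

```python
def retained(mask, b):
    """Return (blocks kept, grid points in kept blocks) for square blocks of size b.

    A block is kept if it contains at least one ocean point.
    Assumes b divides the grid size evenly, so no padding is needed.
    """
    ny, nx = len(mask), len(mask[0])
    kept = 0
    for j0 in range(0, ny, b):
        for i0 in range(0, nx, b):
            if any(
                mask[j][i]
                for j in range(j0, j0 + b)
                for i in range(i0, i0 + b)
            ):
                kept += 1
    return kept, kept * b * b


# Synthetic 16x16 mask: ocean fills the triangle where i + j < 16.
n = 16
mask = [[i + j < n for i in range(n)] for j in range(n)]
ocean_points = sum(sum(row) for row in mask)

results = {b: retained(mask, b) for b in (16, 8, 4, 2)}
for b, (blocks, points) in results.items():
    print(f"block size {b:2d}: {blocks:2d} blocks kept, "
          f"{points} points computed for {ocean_points} ocean points")
```

As the block size shrinks, the number of points carried in retained blocks approaches the true number of ocean points, but the number of blocks, and with it the number of block boundaries to update, grows.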
2010-01-26