IO: What is PIO?

The parallel IO (PIO) library is included with CESM and is built automatically as part of the CESM build. CESM components use the PIO library to read and write data. The PIO library is a set of interfaces that support serial netcdf, parallel netcdf (pnetcdf), or binary IO transparently. The implementation lets users change the PIO setup on the fly, including the method (serial netcdf, parallel netcdf, or binary data) and various other PIO parameters, to optimize IO performance.

CESM prefers that data be written in CF-compliant netcdf format to a single file that is independent of all parallel decomposition information. Historically, data was written by gathering global arrays on a root processor and then writing them from the root processor to an external file using serial netcdf. Reading used the reverse process (read and scatter). This method is relatively robust, but because every global array must pass through a single processor, it scales poorly in both memory and performance and leaves little room to tune IO performance.

PIO works as follows. When the PIO library is initialized, it is given the method (serial netcdf, parallel netcdf, or binary data) and the number of desired IO processors and their layout. These IO parameters define the set of processors involved in the IO, which can be as few as one or as many as all available processors. The data, the data name, and the data decomposition are also provided to PIO. Data is written through the PIO interface in the model-specific decomposition. Inside PIO, the data is rearranged into a block decomposition on the IO processors and is then written serially using netcdf or in parallel using pnetcdf. Several namelist options control PIO functionality. Refer to the Parallel I/O (PIO) control variables in the env_run namelist documentation for details.
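The rearrangement step described above can be illustrated with a small pure-Python sketch. It does not call PIO and should not be read as PIO's actual algorithm. The function names (block_bounds, rearrange_to_io_tasks, write_serially), the round-robin model decomposition, and the task counts are all assumptions made for illustration. The sketch moves values from a model-specific decomposition into contiguous blocks on a smaller set of IO tasks, then writes the blocks one after another, as in the serial netcdf case.

```python
def block_bounds(n_global, n_io_tasks):
    """Split n_global indices into contiguous (start, stop) blocks, one per IO task."""
    base, extra = divmod(n_global, n_io_tasks)
    bounds, start = [], 0
    for task in range(n_io_tasks):
        size = base + (1 if task < extra else 0)  # first `extra` tasks hold one more value
        bounds.append((start, start + size))
        start += size
    return bounds


def rearrange_to_io_tasks(compute_pieces, n_global, n_io_tasks):
    """Move values from the model decomposition into a block decomposition.

    compute_pieces holds one dict per compute task, mapping a global index
    to the value that task owns (a stand-in for the data decomposition).
    """
    bounds = block_bounds(n_global, n_io_tasks)
    io_blocks = [[None] * (stop - start) for start, stop in bounds]
    for piece in compute_pieces:
        for gidx, value in piece.items():
            for task, (start, stop) in enumerate(bounds):
                if start <= gidx < stop:
                    io_blocks[task][gidx - start] = value
                    break
    return io_blocks


def write_serially(io_blocks):
    """Append each IO task's block in turn, standing in for a serial write."""
    output = []
    for block in io_blocks:
        output.extend(block)
    return output


# Hypothetical setup: 10 values spread round-robin over 4 compute tasks, 2 IO tasks.
n_global, n_compute, n_io = 10, 4, 2
compute_pieces = [
    {g: 10 * g for g in range(rank, n_global, n_compute)}
    for rank in range(n_compute)
]
io_blocks = rearrange_to_io_tasks(compute_pieces, n_global, n_io)
print(io_blocks)
print(write_serially(io_blocks))
```

The file content produced by write_serially is in global index order whatever the model decomposition was, which is the sense in which the output is independent of the parallel decomposition.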

There are several benefits to using PIO. First, even with serial netcdf, memory use can be significantly decreased because the global arrays are decomposed across the IO processors and written serially in chunks. This is critical when CESM runs at higher resolutions, where the memory taken by global arrays must be minimized. Second, pnetcdf can be turned on transparently, potentially improving IO performance. Third, PIO parameters such as the number of IO tasks and their layout can be tuned to reduce memory use and optimize performance on a machine-by-machine basis. Fourth, the standard global gather and write, or read and global scatter, can be recovered by setting the number of IO tasks to 1 and using serial netcdf.
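The memory argument can be made concrete with a little arithmetic. The sketch below is a simplified estimate, not a measurement of PIO: it assumes an even block split across IO tasks, 8-byte values, and a hypothetical 3600 x 2400 global array, and it ignores any extra buffers used during rearrangement. The helper names are illustrative.

```python
def values_per_io_task(n_global, n_io_tasks):
    """Largest number of values any one IO task holds under an even block split."""
    return (n_global + n_io_tasks - 1) // n_io_tasks  # integer ceiling division


def bytes_per_io_task(n_global, n_io_tasks, bytes_per_value=8):
    """Estimated size in bytes of the largest block held by a single IO task."""
    return values_per_io_task(n_global, n_io_tasks) * bytes_per_value


# Hypothetical global array: 3600 x 2400 double-precision values.
n_global = 3600 * 2400
for n_io in (1, 4, 16):
    mb = bytes_per_io_task(n_global, n_io) / 1e6
    print(f"{n_io:3d} IO task(s): {mb:8.2f} MB held per IO task")
```

With one IO task the estimate equals the full global array, which corresponds to the traditional gather-and-write case. Increasing the number of IO tasks shrinks the block each task must hold.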

By default, CESM uses the serial netcdf implementation of PIO, and pnetcdf is turned off in PIO. Several components provide namelist inputs that allow use of pnetcdf in PIO. To use pnetcdf, a pnetcdf library must be available on the local machine (just as a netcdf library must be), and pnetcdf support must be turned on when PIO is built. This is done as follows:

  1. Locate the local copy of pnetcdf. We recommend version 1.3.1 (1.2.0 or newer is required).

  2. Set PNETCDF_PATH in the Macros file to the directory of the pnetcdf install (e.g., /contrib/pnetcdf1.3.1/).

  3. Run the clean_build script if the model has already been built.

  4. Run the build script to rebuild PIO and the full CESM system.

  5. Change the component IO namelist settings to pnetcdf and set an appropriate number of IO tasks and layout.
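Before rebuilding, it can help to confirm that the local pnetcdf meets the version requirement in step 1. The sketch below is a hypothetical helper, not part of CESM or pnetcdf. It assumes the version string of the local install is already known, and it compares versions numerically so that, for example, 1.10.0 is correctly treated as newer than 1.3.1.

```python
MINIMUM_PNETCDF = (1, 2, 0)      # "1.2.0 or newer is required"
RECOMMENDED_PNETCDF = (1, 3, 1)  # "We recommend version 1.3.1"


def parse_version(text):
    """Turn a dotted version string such as '1.3.1' into a 3-tuple of ints."""
    parts = [int(part) for part in text.strip().split(".")]
    while len(parts) < 3:  # pad short strings such as '1.2' with zeros
        parts.append(0)
    return tuple(parts)


def pnetcdf_version_status(text):
    """Classify a pnetcdf version string against the stated requirements."""
    version = parse_version(text)
    if version < MINIMUM_PNETCDF:
        return "too old"
    if version < RECOMMENDED_PNETCDF:
        return "usable"
    return "recommended or newer"


for candidate in ("1.1.9", "1.2.0", "1.3.1", "1.10.0"):
    print(candidate, "->", pnetcdf_version_status(candidate))
```

Comparing tuples of integers rather than raw strings avoids the ordering mistake a plain string comparison would make once a version component reaches two digits.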

There is an ongoing effort among CESM, PIO developers, pnetcdf developers, and hardware vendors to understand and improve IO performance in the various library layers. To learn more about PIO, see the PIO documentation.