The model code requires that certain environment variables be available. When running in SPMD mode, each processing task must obtain these values, and on some machines the slave processes fail to do so. For example, using the UNIX shell "bash" on a Linux platform with the Lahey compiler, the model fails because the slave processes are unable to obtain the environment variables. The problem can depend on the UNIX shell in use: running on Linux with the Lahey compiler under the "tcsh" shell does not produce the failure that "bash" does. Users are cautioned that other combinations of UNIX shells and platforms may result in this same problem.
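Whether a variable reaches child processes depends on how it is set in the shell: tcsh's "setenv" places the variable in the environment in one step, while under bash a plain assignment stays local until it is exported. A minimal sketch of the check (the variable name here is hypothetical, for illustration only):

```shell
# bash: a plain assignment is local to the shell; "export" is what makes
# the variable visible to child processes (tcsh's "setenv" does both at once).
CAM_TEST_VAR=/tmp/inputdata   # hypothetical variable, for illustration
export CAM_TEST_VAR
# A spawned child process (like an SPMD slave task) can now see the value.
sh -c 'echo "child sees: $CAM_TEST_VAR"'
```

If the child prints an empty value, the variable was never exported, which mimics the failure described above.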
The model has undergone extensive testing on IBM platforms, both at NCAR and at several other institutions. One known issue is a failure compiling the "history.F90" module when the more extensive "DEBUG mode" compiler options are used (this mode is enabled by setting the environment variable DEBUG to "TRUE") together with the SPMD distributed-memory configuration. The compiler sometimes aborts with the error "INTERNAL COMPILER ERROR". This is a defect in the IBM FORTRAN-90 compiler; it has been reported to IBM and is fixed in newer versions of the compiler (or in older compilers with the appropriate "E-fixes" applied). The failure is intermittent, so recompiling will sometimes (not always) succeed. As a workaround, the user can also compile "history.F90" with the DEBUG option turned off.
Certain versions of gmake mishandle file date-stamps, and the Earth System Modeling Framework (ESMF) library will not build with them. The bug was introduced somewhere between gmake 3.78.1 and 3.79 and is fixed in version 3.79.1, so updating to 3.79.1 or newer resolves the problem.
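To find out whether the installed GNU make is in the affected range, print its version string (on many systems GNU make is installed as "make" rather than "gmake"; adjust the command name for your system):

```shell
# Print the GNU make version; a version newer than 3.78.1 but older than
# 3.79.1 may carry the date-stamp bug described above.
MAKE=${MAKE:-make}          # use "gmake" here if that is the installed name
"$MAKE" --version | head -n 1
```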
Other times CAM may fail for no obvious reason, or the error message returned may be cryptic or misleading. In our experience the majority of these symptoms can be attributed to an incorrect allocation of hardware and/or software resources (e.g. the user sets $OMP_NUM_THREADS to a value inconsistent with the number of physical CPUs per node). Most often an incorrect per-thread stack size will cause the model to fail with a segmentation fault, allocation error, or stack-pointer error. The default setting for this resource is usually too low and must be raised by setting the appropriate environment variables; values in the range of 40-70 Mbytes work well on most architectures. As a simple troubleshooting step, the user may try adjusting this resource, or the process stack size, for their particular application. Here is a list of suggested runtime resource settings affecting the process and/or thread stack sizes:
limit stacksize unlimited; setenv MP_STACK_SIZE 17000000
limit stacksize unlimited; setenv XLSMPOPTS "stack=40000000"
limit stacksize unlimited; setenv MP_SLAVE_STACKSIZE 40000000
limit stacksize unlimited
limit stacksize unlimited; setenv MPSTKZ 40000000
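The "limit" and "setenv" commands above are csh/tcsh syntax; under bash the corresponding builtins are "ulimit" and "export". A rough bash sketch of one of these settings (the MPSTKZ value is taken from the list above):

```shell
# bash equivalent of "limit stacksize unlimited" plus one of the
# thread-stack environment variables from the list above.
echo "stack limit before: $(ulimit -s)"
ulimit -s unlimited 2>/dev/null || true   # may be refused if the hard limit is lower
export MPSTKZ=40000000                    # per-thread stack size, as suggested above
echo "MPSTKZ=$MPSTKZ"
```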
At this point the model should begin compiling and executing, and the appropriate log files will be generated in the /ptmp/$LOGNAME/$CASE directory. After a successful run of the model, the user may edit the namelist variables in run-ibm.csh to better suit their particular needs. After the model compiles successfully, subsequent invocations of the run script recompile only when the user has changed model code; once gmake verifies that no code has changed, the model begins execution very quickly.
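The skip-recompile behavior comes from gmake's ordinary timestamp rule: a target is rebuilt only when one of its prerequisites is newer than it. A toy illustration of that rule (the file names here are made up, not CAM's):

```shell
# Build a target twice: the recipe runs the first time, and is skipped the
# second time because the prerequisite's timestamp has not changed.
workdir=$(mktemp -d) && cd "$workdir"
printf 'out.o: src.f90\n\t@echo compiling src.f90; touch out.o\n' > Makefile
echo '! dummy source file' > src.f90
make        # first invocation: the recipe runs
make        # second invocation: reports that out.o is up to date
```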
In addition to properly configuring machine resources, we've identified the following problems often encountered when building and running CAM on the machines here at NCAR.