
2.6 Troubleshooting Guide

This section presents information that should help with some common problems users encounter when running CAM.

2.6.0.1 Supported Platforms for CAM 3.0

CAM 3.0 is ported to and supported on the following platforms:

2.6.0.2 Known problems

2.6.0.3 General

The first step in troubleshooting a failed model run is to check the basics. Look at the logs for error messages. Make sure the model executable is up to date with any source code changes. Rebuild the model cleanly (i.e. issue a "gmake clean" before rerunning the script) if you are unsure of the state of any code. Ask yourself what has changed since the last successful run.
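The log check can be partly automated. The following is a minimal sketch, assuming a POSIX sh (or bash) shell; the function name scan_log and the keyword list are hypothetical choices for illustration, not a catalog of messages that CAM actually emits.

```shell
#!/bin/sh
# scan_log FILE
#   Print, with line numbers, every line of FILE that mentions a common
#   failure keyword (case-insensitive).
#   The keyword list is an illustrative assumption.
scan_log() {
    grep -n -i -E 'error|abort|segmentation|signal' "$1"
}
```

Called as scan_log followed by the path to a log file, it prints the matching lines and returns a nonzero status when nothing matches, so it can also be used in a script condition.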

At other times CAM may fail for no obvious reason, or the error message returned may be cryptic or misleading. In our experience, most failures of this kind can be attributed to an incorrect allocation of hardware and/or software resources (e.g., the user sets $OMP_NUM_THREADS to a value inconsistent with the number of physical CPUs per node). Most often, an incorrect setting for the per-thread stack size causes the model to fail with a segmentation fault, allocation error, or stack pointer error. The default setting for this resource is usually too low and must be raised by setting the appropriate environment variables. Values in the range of 40-70 Mbytes seem to work well on most architectures. As a simple troubleshooting step, the user may try adjusting this resource, or the process stack size, for their particular application. Here is a list of suggested runtime resource settings affecting the process and/or thread stack sizes.

2.6.0.4 How to increase the stacksize on different platforms
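
As a generic sketch of this kind of adjustment, the following assumes a POSIX sh (or bash) shell and an OpenMP runtime that honors the standard OMP_STACKSIZE variable (OpenMP 3.0 and later). The 64 Mbyte value is an assumption picked from the 40-70 Mbyte range mentioned above. Vendor runtimes often use their own variable names for the per-thread stack, so consult the compiler documentation for the platform in use.

```shell
#!/bin/sh
# Sketch: raise stack limits before launching the model.
# 64 MB is an assumed value taken from the 40-70 Mbyte range.
STACK_MB=64

# Process stack size: ulimit -s takes kilobytes. The call fails when the
# hard limit is lower, so report that instead of aborting the script.
if ulimit -s $((STACK_MB * 1024)) 2>/dev/null; then
    echo "process stack limit: $(ulimit -s) KB"
else
    echo "could not change process stack limit (current: $(ulimit -s))" >&2
fi

# Per-thread stack size: OMP_STACKSIZE is the standard OpenMP variable.
# Older or vendor-specific runtimes may ignore it and use another name.
OMP_STACKSIZE="${STACK_MB}M"
export OMP_STACKSIZE
echo "OMP_STACKSIZE=$OMP_STACKSIZE"
```

In a csh script the same idea is expressed with the limit stacksize builtin and setenv rather than ulimit and export.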

2.6.0.5 General problems on different platforms

Most distributed-memory platforms also provide runtime settings that let a user override the multiprocessing defaults and customize the machine parallelism for a particular application. CAM performance can be adversely affected by an incorrect configuration of the machine parallelism. The run scripts provided in the distribution create an executable that will run in a hybrid mode on distributed architectures, using MPI for communication between nodes and OpenMP directives on processes within a node. When running in hybrid mode, the user should set the number of MPI tasks per node to 1; thread-based OpenMP multitasking will then utilize all processors on the node. If the user makes the appropriate changes to the Makefile to disable OpenMP and use only MPI, the number of MPI tasks per node should be set equal to the number of physical processors per node.
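
The two configurations can be summarized in a short sketch, assuming a POSIX sh (or bash) shell. The function name layout and the figure of 8 processors per node are hypothetical, chosen only to illustrate the rule; how the resulting values are passed to the batch system or MPI launcher is platform specific.

```shell
#!/bin/sh
# layout MODE CPUS_PER_NODE
#   MODE is "hybrid" (MPI between nodes, OpenMP within a node)
#   or "mpi" (OpenMP disabled, MPI only).
#   Prints the MPI tasks per node and the OpenMP threads per task.
layout() {
    mode=$1
    cpus=$2
    case $mode in
        hybrid) echo "tasks_per_node=1 threads_per_task=$cpus" ;;
        mpi)    echo "tasks_per_node=$cpus threads_per_task=1" ;;
        *)      echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# Example with an assumed 8 physical processors per node.
layout hybrid 8   # prints: tasks_per_node=1 threads_per_task=8
layout mpi 8      # prints: tasks_per_node=8 threads_per_task=1
```

In the hybrid case, the threads-per-task value is what $OMP_NUM_THREADS should match, so that it stays consistent with the number of physical CPUs per node.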

At this point the model should begin compiling and executing. Appropriate log files will be generated in the /ptmp/$LOGNAME/$CASE directory. After a successful run of the model, the user may edit the namelist variables in run-ibm.csh to better suit their particular needs. Once the model has compiled successfully, subsequent invocations of the run script will recompile only when the user has changed model code. The model should begin execution very quickly after gmake verifies that no code has been changed.

In addition to properly configuring machine resources, we've identified the following problems often encountered when building and running CAM on the machines here at NCAR.


Jim McCaa 2004-10-22