This section describes steps for identifying why a test failed. The primary information on reviewing and debugging a run is in the Section called Troubleshooting runtime problems in Chapter 10.
First, verify that the test case is no longer in the batch queue. Once it has left the queue, review the possible test results and compare them to the result in the TestStatus file. Next, review the TestStatus.out file to see if there is any additional information about what the test did. Finally, go to the troubleshooting section and work through the various log files.
A couple of other points are worth mentioning. If the TestStatus file contains "RUN" but the job is no longer in the queue, the job may have timed out because the wall-clock limit on the batch submission was too short, or it may have hung due to a run-time error. Check the batch log files to see whether the job was killed due to a time limit. If it was, increase the time limit and either resubmit the job or generate a new test case and update the time limit before submitting it.
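The triage order described above can be sketched in Python. This is only an illustration and not part of the CCSM scripts: the function name triage, its arguments, and the assumption that the status appears as a whitespace-separated token in the TestStatus text are all hypothetical.

```python
def triage(test_status_text: str, job_in_queue: bool) -> str:
    """Suggest a next debugging step (hypothetical helper, not a CCSM tool).

    test_status_text: contents of the TestStatus file, read by the caller.
    job_in_queue: whether the batch system still lists the job.
    """
    if job_in_queue:
        # The test has not finished; its status is not final yet.
        return "wait: job is still in the batch queue"
    # Assumption: the status is a whitespace-separated token in the file.
    if "RUN" in test_status_text.split():
        # Status never advanced past RUN, so the job timed out or hung.
        return "check batch logs: job may have hit the time limit or hung"
    # Otherwise the test reported a final result that can be interpreted.
    return "compare result with documented test results, then read TestStatus.out"
```

For example, triage("RUN some_test_name", False) points to the batch logs, while the same text with job_in_queue set to True simply advises waiting.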
Also, a test case can fail either because the job didn't run properly or because the test conditions (e.g. exact restart) weren't met. Try to determine which of the two happened. If a test is failing early in a run, it's usually best to set up a standalone case with the same configuration in order to debug the problem. If the test is running fine but the test conditions are not being met (e.g. exact restart), the model must be debugged in the context of those test conditions.
Not all tests will pass for all model configurations. Some of the issues we are aware of are:
All models are bit-for-bit reproducible with different processor counts EXCEPT pop. The BFBFLAG must be set to TRUE in the env_run.xml file if the coupler is to meet this condition. There will be a performance penalty when this flag is set.
All models can be run with mixed mpi/openmp parallelism except pop, although the performance of the openmp implementation varies widely. Efforts are underway now to update pop to support openmp usage.
Some of the active components cannot run with the mpi serial library. This library takes the place of mpi calls when the model is running on one processor and MPI is not available or not desirable. The mpi serial library is part of the CCSM release and is invoked by setting the USE_MPISERIAL variable in env_build.xml to TRUE. An effort is underway to extend the mpi serial library so that this standalone implementation supports every component's use of the mpi library.
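When a failure may be tied to one of these known issues, it can help to confirm how the relevant variable (BFBFLAG or USE_MPISERIAL) is currently set. The Python sketch below looks up a setting by id in XML text. It assumes settings are stored as entry elements with id and value attributes. The helper name get_entry, the sample snippet, and its root element name are hypothetical, so check the layout of your own env_run.xml or env_build.xml before relying on it.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def get_entry(xml_text: str, entry_id: str) -> Optional[str]:
    """Return the value attribute of the entry with the given id, or None.

    Assumption: settings look like <entry id="NAME" value="VALUE" />.
    """
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        if entry.get("id") == entry_id:
            return entry.get("value")
    return None


# Hypothetical sample standing in for the contents of an env file.
SAMPLE = '<config><entry id="BFBFLAG" value="FALSE" /></config>'

if __name__ == "__main__":
    print("BFBFLAG =", get_entry(SAMPLE, "BFBFLAG"))
```

In practice the text would come from reading the case's own file, for example open("env_run.xml").read(), rather than from an inline sample.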