This section describes the steps that can be taken to identify why a test failed. The primary information for reviewing and debugging a run can be found in the Section called Troubleshooting runtime problems in Chapter 10.
First, verify that the test case is no longer in the batch queue. Once the job has finished, review the possible test results and compare them to the result recorded in the TestStatus file. Next, review the TestStatus.out file for any additional information about what the test did. Finally, go to the troubleshooting section and work through the various log files.
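As a minimal sketch, assuming a PBS batch system and an illustrative test case directory name (the queue command and the test name will differ on your machine), the queue and the test results could be inspected as follows:

    > qstat -u $USER                                        # is the test job still queued or running?
    > cat ERS.f19_g16.B.yellowstone_intel/TestStatus        # overall test result
    > less ERS.f19_g16.B.yellowstone_intel/TestStatus.out   # additional detail about what the test did

On an LSF machine the equivalent queue query would be bjobs.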
There are a couple of other things worth mentioning. If the TestStatus file contains "RUN" but the job is no longer in the queue, the job either timed out because the wall clock limit on the batch submission was too short, or it hung due to a run-time error. Check the batch log files to see whether the job was killed because it hit the time limit; if it was, increase the time limit and either resubmit the job or generate a new test case and update the time limit before submitting it.
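For example, if the batch log shows the job was killed for exceeding its time limit, the wall clock request in the test's run script could be increased before resubmitting. The directive syntax depends on the batch system on your machine; the limits below are purely illustrative:

    #BSUB -W 4:00                    (LSF)
    #PBS -l walltime=04:00:00        (PBS)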
Also, a test case can fail either because the job did not run properly or because the test conditions (e.g. exact restart) were not met. Try to determine which of the two occurred. If a test is failing early in the run, it is usually best to set up a standalone case with the same configuration in order to debug the problem. If the run itself completes but the test conditions are not met (e.g. restart does not reproduce exactly), then the model needs to be debugged in the context of the test conditions.
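As an illustration, a standalone case matching a failing test's configuration could be created with create_newcase; the case name, resolution, compset, and machine below are hypothetical, and the subsequent setup, build, and submit steps follow the normal workflow for your CESM version:

    > create_newcase -case debug_case -res f19_g16 -compset B -mach yellowstone
    > cd debug_case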
Not all tests will pass for all model configurations. Some of the issues we are aware of are:
All models are bit-for-bit reproducible on different processor counts EXCEPT for POP2 and CICE diagnostics. The coupler is not bit-for-bit reproducible across processor counts out of the box; if you have a configuration where you expect bit-for-bit reproducibility when changing the processor count and you want to validate this, the BFBFLAG must be set to TRUE in the env_run.xml file. The main purpose of the BFBFLAG is to enforce a specific order of operations in the mapping implementation. This constraint can degrade mapping performance, so it is recommended that the BFBFLAG be set to FALSE in production. Also note that the CESM system is fully bit-for-bit reproducible when rerunning the same configuration on the same processor count; the BFBFLAG is only required when trying to reproduce answers across different processor counts.
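For example, to validate reproducibility across processor counts, the flag could be toggled with xmlchange from the case directory (a sketch assuming the standard CESM1 case scripts):

    > ./xmlchange -file env_run.xml -id BFBFLAG -val TRUE     # enable for the validation runs
    > ./xmlchange -file env_run.xml -id BFBFLAG -val FALSE    # restore for production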
Some of the active components cannot run with the mpi serial library. This library takes the place of MPI calls when the model is running on one processor and MPI is not available or not desirable. The mpi serial library is part of the CESM release and is invoked by setting the USE_MPISERIAL variable in env_build.xml to TRUE. An effort is underway to extend the mpi serial library to support all components' usage of the MPI library with this standalone implementation. Also, not all machines/platforms are set up to allow USE_MPISERIAL to be set to TRUE; for these machines the env variable MPISERIAL_SUPPORT is set to FALSE. In order to set USE_MPISERIAL to TRUE, you also need to make changes in the Macros and env_machopts files for that machine. The best way to do this is to use a machine where MPISERIAL_SUPPORT is TRUE and look at the type of changes needed to make it work; those same changes will need to be introduced for your machine. For the Macros file this includes the name of the compiler, possibly options to the compiler, and the settings of the MPI library and include path. For the env_machopts file you may want or need to modify the setting of MPICH_PATH. There may also be many settings of MPI-specific environment variables that don't matter when USE_MPISERIAL is TRUE.
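For example, on a machine where MPISERIAL_SUPPORT is TRUE, the serial MPI substitute could be enabled from the case directory before building (again a sketch assuming the standard CESM1 case scripts):

    > ./xmlchange -file env_build.xml -id USE_MPISERIAL -val TRUE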