7 Common Aborts, Errors, Debugging, and Performance Issues

This section provides suggestions for overcoming common problems that arise when building and running the CCSM, along with a short discussion of performance issues.

7.1 Common aborts and errors

Occasionally problems crop up when trying to build or run the CCSM system. This section will help users solve some of these problems. The most obvious places to look for error messages are in the model log files or in the batch standard out or standard error files. On some systems, these messages will be mailed to the user.

Many of the fixes require changes to parameter settings. All changes to CCSM settings made during a run should be documented to ensure the scientific reproducibility of your results.

7.1.1 Component model has trouble building

Try making a clean start. Each component has a build directory called $OBJROOT/$model/obj (normally equivalent to the $EXEROOT/$model/obj directory). To make a clean start, use the command: rm -r -f $OBJROOT/*/obj . This removes the model object files and forces a complete rebuild of every component the next time the model is built. It is usually sufficient to remove only the obj directory of the component that seems to be causing problems.
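
For example, if only the ocean build appears to be failing, a minimal sketch of a targeted clean start looks like the following; the component directory name ocn is illustrative, so substitute the directory of the failing component:

   # remove only the ocean component's object files and force it to rebuild
   rm -r -f $OBJROOT/ocn/obj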

When rebuilding the model, make sure that $SETBLD is set to true in the main run script if $RUNTYPE is set to continue. $SETBLD must be true or auto if $RUNTYPE is startup, hybrid, or branch. Otherwise, the model will not build when the job is submitted.
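
As a minimal sketch, assuming these variables are set with csh setenv statements near the top of the main run script (the exact names and placement follow the script generated for your case):

   # build/run control in the main run script (illustrative values)
   setenv RUNTYPE continue   # startup, hybrid, branch, or continue
   setenv SETBLD  true       # must be true to rebuild a continue run;
                             # true or auto for startup, hybrid, or branch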

The main script can be run interactively on most platforms for build purposes. It may be necessary to kill the interactive script before the model starts running, or to activate the exit line that stops the script before the model begins executing.

If there is a source code problem, the compiler error will show up in the model log files. The main build script typically exits and indicates which model log file to interrogate.

7.1.2 Model won't continue due to restart problem

If a restart file is corrupted, the coupled model will likely not run. First, verify that user quotas have not been exceeded and that the disk is not full. Then check whether the size of the restart file in $EXEDIR and in the $ARCROOT/restart directory differs from that of previous restart files written by that component. If either of these is a problem, it must be corrected before the model can be restarted.

If there is a corrupt restart file, remove the copy of the file in the executable directory. If the copy of the restart file in $ARCROOT/restart is also corrupt, remove that copy as well. Check whether there is a valid copy of the file in $ARCROOT/$model or on the local mass store. If so, copy that file into $ARCROOT/restart and try resubmitting the model. If there are no valid copies of a restart file, back up to the last valid restart set by removing all files in the $ARCROOT/restart directory and untarring the last valid restart dataset residing in $ARCROOT/restart.tars. Alternatively, delete all files in $ARCROOT/restart; manually gather a set of restart files from either $ARCROOT/$model or the local mass store into $ARCROOT/restart; copy the rpointer files from $SCRIPTS to $ARCROOT/restart; modify the rpointer files in $ARCROOT/restart as needed; and resubmit the run.
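
For example, backing up to the last valid restart set might look roughly like the following; the tar file name is purely illustrative, and the actual name depends on the case and the date of the last valid set:

   # replace the corrupt restart set with the last valid archived set
   cd $ARCROOT/restart
   rm -f *
   tar -xf ../restart.tars/restarts.0010-01-01.tar   # illustrative file name
   # then resubmit the run from the $SCRIPTS directory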

7.1.3 Ocean model stops due to ocean non-convergence or time-stepping problem

If the ocean log file reports non-convergence and the integration stops, the ocean model was unable to converge to a solution in its barotropic solver. The best remedy is to reduce the ocean model timestep. To do this, edit scripts/$CASE/ocn.setup.csh, increase DT_COUNT by approximately 10% (increasing DT_COUNT shortens the timestep), and restart the job. If non-convergence occurs on the first ocean timestep, other circumstances may be affecting the ocean-model forcing; check that the proper component forcing files are in place.
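
A minimal sketch of the change, assuming DT_COUNT is set as a shell variable in ocn.setup.csh and currently has the illustrative value 96:

   # in scripts/$CASE/ocn.setup.csh: more steps means a shorter ocean timestep
   set DT_COUNT = 106   # was 96 (illustrative); roughly a 10% increase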

It may be necessary to reset the restart files. This is accomplished by copying the appropriate set of restart and rpointer files into the $ARCROOT/restart directory. An appropriate set is often available as a tar file in the $ARCROOT/restart.tars directory.

The coupler adapts to the change in ocean timestep automatically. Changing the ocean timestep will change the answers and may degrade the performance of the coupled model. Normally, the ocean timestep is decreased for a short time and then set back to its original value. If the ocean model stops often, consider changing the ocean timestep permanently for the model run. Large changes in the ocean timestep may require changes in the model's load balance ($NTHRDS and $NTASKS in test.a1.run) for optimal performance.
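
A hedged sketch of those load-balance settings, assuming $NTASKS and $NTHRDS are csh arrays in test.a1.run with one entry per component; the ordering, syntax, and values shown are purely illustrative and are machine- and resolution-dependent:

   # tasks and threads per component in test.a1.run (illustrative)
   set NTASKS = (  8  16  40   8   1 )   # MPI tasks for each component
   set NTHRDS = (  1   1   1   1   1 )   # OpenMP threads for each component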

Occasionally, the ocean model will stop due to a CFL instability. The solution to this problem is the same: reduce the ocean model timestep as described above.

7.1.4 Ice model stops due to ice mpdata transport instability

If the ice log file reports an mpdata transport instability and stops, this indicates an instability in the advection scheme. To solve this problem, increase the value of ndte in the scripts/$CASE/ice.setup.csh file and resubmit the job. Normally, ndte is doubled for a short period (a few months or a year) and then reset to the original value. Increasing ndte increases the amount of subcycling in the advection scheme. If unstable ice transport is a regular problem, it might be worth increasing ndte and leaving it at the larger value for the duration of the model run. The coupler will not need to adapt to an increase in ice subcycling. However, changing the ice subcycling will change the answers and may result in a performance degradation in the coupled model.
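
A minimal sketch of the temporary change, assuming ndte is set as a shell variable in ice.setup.csh and currently has the illustrative value 120:

   # in scripts/$CASE/ice.setup.csh: more subcycling in the advection scheme
   set ndte = 240   # was 120 (illustrative); doubled for a few months, then set back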

If it is necessary to reset the restart files, copy the appropriate set of restart and rpointer files into the $ARCROOT/restart directory. A full set of restart and rpointer files is often available as a tar file in the $ARCROOT/restart.tars directory.

7.2 Debugging

Currently, no debuggers work well with the CCSM model, so print statements are the primary debugging tool. Calls to the shr_sys_flush routine (found in $CSMROOT/models/csm_share/shr_sys_mod.F90) are recommended to ensure that the contents of the print buffers are flushed to standard out before the model halts.

7.3 Performance issues

CCSM performance depends on a number of factors, including component model configuration, resolution, model timestep, and the relative speed of floating-point operations, memory access, and interprocessor communication on the hardware. These factors affect not only the performance of individual components but also the load balance of the overall configuration.

Each component in CCSM is run as a separate executable and the components communicate with the coupler at regular intervals. These communication points represent intermodel synchronization points. In order to load balance the overall model and allocate appropriate resources for individual components, timings of components between the synchronization points are needed. Most components print out timing information at the end of the run that can be analyzed for load balance and overall performance.

This section will be expanded in the future. For help with performance questions, contact NCAR by e-mailing your questions to csm@ucar.edu.

