Using mksurfdata to create surface datasets from grid datasets

mksurfdata is used to create surface datasets from grid datasets and raw data files at half-degree resolution, producing files that describe the surface characteristics needed by CLM (the fraction of each grid cell covered by different land-unit types and different vegetation types, as well as soil color, soil texture, etc.). To run mksurfdata you can either use the script, which will create namelists for you using the build-namelist XML database, or you can run it by hand using a namelist that you provide (possibly modeled after an example provided in the models/lnd/clm/tools/mksurfdata directory). The namelist for mksurfdata is sufficiently complex that we recommend using the script to build it. In the next section we describe how to use the script, and the following section gives more details on running mksurfdata by hand and its various namelist input variables.


The script can be used to run the mksurfdata program for several configurations, resolutions, simulation years, and simulation-year ranges. It will create the needed namelists for you and move the resulting files to your inputdata directory location (it also creates a list of the files created; for developers this list is also a script to import the files into the svn inputdata repository). It uses the build-namelist XML database to determine the correct input files, and for transient cases it creates the appropriate mksrf_fdynuse file with the list of files needed for each year of the case. For urban single-point datasets (where surface datasets are actually input into mksurfdata) it does the additional processing required so that the output dataset can be used once again by mksurfdata. Because it figures out the namelist and input files for you, we recommend that you use this script to create standard surface datasets. If you need to create surface datasets for customized cases, you might need to run mksurfdata on its own, but you can use the script with the "-debug" option to give you a namelist to start from. For help you can use the "-help" option as follows:

> cd models/lnd/clm/tools/mksurfdata
> mksurfdata.pl -help
The output of the above command is:

SYNOPSIS
     mksurfdata.pl [options]
     -crop                         Add in crop datasets
     -dinlc [or -l]                Enter the directory location for inputdata 
                                   (default /fs/cgd/csm/inputdata)
     -debug [or -d]                Don't actually run -- just print out what 
                                   would happen if run.
     -dynpft "filename"            Dynamic PFT/harvesting file to use 
                                   (rather than create it on the fly) 
                                   (must be consistent with first year)
     -exedir "directory"           Directory where mksurfdata program is
                                   (by default assume it's in the current directory)
     -glc_nec "number"             Number of glacier elevation classes to use (by default 0)
     -irrig                        If you want to include irrigated crop in the output file.
     -years [or -y]                Simulation year(s) to run over (by default 1850,2000) 
                                   (can also be a simulation year range: i.e. 1850-2000)
     -help  [or -h]                Display this help.
     -nomv                         Don't move the files to inputdata after completion.
     -res   [or -r] "resolution"   Resolution(s) to use for files (by default all).
     -rcp   [or -c] "rep-con-path" Representative concentration pathway(s) to use for 
                                   future scenarios 
                                   (by default -999.9, where -999.9 means historical).
     -usrname "clm_usrdat_name"    CLM user data name to find grid file with.

NOTE: years, res, and rcp can be comma delimited lists.

OPTIONS to override the mapping of the input gridded data with hardcoded input

     -pft_frc "list of fractions"  Comma delimited list of percentages for veg types
     -pft_idx "list of veg index"  Comma delimited veg index for each fraction
     -soil_cly "% of clay"         % of soil that is clay
     -soil_col "soil color"        Soil color (1 [light] to 20 [dark])
     -soil_fmx "soil fmax"         Soil maximum saturated fraction (0-1)
     -soil_snd "% of sand"         % of soil that is sand
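As an illustration, these override options might be combined for a single-point site where the vegetation and soil are known. This is a hypothetical sketch: the resolution, PFT index, and soil percentages below are placeholder values (remember that the pft_frc fractions must sum to 100):

```
> mksurfdata.pl -r 1x1_brazil -y 2000 \
  -pft_idx 6 -pft_frc 100 \
  -soil_cly 20 -soil_snd 50 -soil_col 10 -soil_fmx 0.5
```

Here a single PFT (index 6) covers 100% of the vegetated area, and the soil texture, color, and maximum saturated fraction are overridden everywhere.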

To run the script with an optimized mksurfdata for a 4x5 degree grid with 1850 conditions on bluefire, you would do the following:

Example 2-6. Example of running to create a 4x5 resolution fsurdat for a 1850 simulation year

> cd models/lnd/clm/tools/mksurfdata
> gmake
> mksurfdata.pl -y 1850 -r 4x5

Running mksurfdata by Hand

In the above section we showed how to run mksurfdata through the script using input datasets that are in the build-namelist XML database. When you are running with input datasets that are NOT available in the XML database you either need to add them as outlined in Chapter 3, or you need to run mksurfdata by hand, as we outline here.

Preparing your mksurfdata namelist

When running mksurfdata by hand you will need to prepare your own input namelist. There are sample namelists that are set up for running on the NCAR machine bluefire; you will need to change the filepaths to run on a different machine. The sample namelists include:

mksurfdata.namelist -- standard sample namelist.
mksurfdata.regional -- sample namelist to build for a regional grid dataset (5x5_amazon)
mksurfdata.singlept -- sample namelist to build for a single point grid dataset (1x1_brazil)

Note that one of the inputs, mksrf_fdynuse, names a file that itself lists the filepaths to other files. The filepaths in this file will have to be changed as well. You also need to make sure that the line lengths remain the same, because the file is read with a formatted read: the placement of the year on each line must remain the same, even with the new filenames. One advantage of the script is that it will create the mksrf_fdynuse file for you.
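To illustrate the layout, a mksrf_fdynuse file might look like the following. The filepaths here are hypothetical placeholders; what matters is that each line gives a filepath followed by the year it represents, with the year kept in the same column position on every line because of the formatted read:

```
/inputdata/lnd/clm2/rawdata/mksrf_landuse_1850.nc        1850
/inputdata/lnd/clm2/rawdata/mksrf_landuse_1851.nc        1851
/inputdata/lnd/clm2/rawdata/mksrf_landuse_1852.nc        1852
```

If you substitute longer or shorter filenames, pad with spaces so the years stay aligned.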

We list the namelist items below. Most of the namelist items are filepaths for the input half-degree resolution datasets that mksurfdata will scale to the resolution of your grid dataset. You must first specify the input grid dataset for the output resolution:

  1. mksrf_fgrid Grid dataset

Then you must specify the input high-resolution datafiles:

  1. mksrf_ffrac land fraction and land mask dataset

  2. mksrf_fglacier Glacier dataset

  3. mksrf_flai Leaf Area Index dataset

  4. mksrf_flanwat Land water dataset

  5. mksrf_forganic Organic soil carbon dataset

  6. mksrf_fmax Max fractional saturated area dataset

  7. mksrf_fsoicol Soil color dataset

  8. mksrf_fsoitex Soil texture dataset

  9. mksrf_ftopo Topography dataset (this is used to limit the extent of urban regions and is used for glacier multiple elevation classes)

  10. mksrf_furban Urban dataset

  11. mksrf_fvegtyp PFT vegetation type dataset

  12. mksrf_fvocef Volatile Organic Compound Emission Factor dataset

Next you specify the ASCII text file that lists the land-use files:

  1. mksrf_fdynuse "dynamic land use" file for transient land-use/land-cover changes. This is an ASCII text file that lists, for each year, the filepath of the file to use and then the year it represents (note: you MUST change the filepaths inside the file when running on a machine NOT at NCAR). We always use this file, even when creating datasets for a fixed year. Also note that when using the "pft_" settings this file will be an XML-like file with settings for PFTs rather than filepaths (see the Section called Experimental options to mksurfdata below).

And optionally you can specify settings for:

  1. all_urban Set if the entire area is urban (typically used for single-point urban datasets that you want to be exclusively urban)

  2. mksrf_firrig Irrigation dataset, if you want to activate the irrigation model over generic cropland (experimental mode, normally NOT used)

  3. mksrf_gridnm Name of output grid resolution (if not set the files will be named according to the number of longitudes by latitudes)

  4. mksrf_gridtype Type of grid (default is 'global')

  5. nglcec Number of glacier multiple elevation classes. Can be 0, 1, 3, 5, or 10. When using the resulting dataset with CLM you can then run with glc_nec of either 0 or this value. (experimental: normally use the default of 0; when running with the land-ice model, in practice only 10 has been used)

  6. numpft Number of Plant Functional Types (PFTs) in the input mksrf_fvegtyp vegetation dataset. Change this to 20 if you want to create a dataset with prognostic crop activated; the vegetation dataset also needs to have prognostic crop types on it. (experimental: normally not changed from the default of 16)

  7. outnc_large_files If output should be in NetCDF large file format

  8. outnc_double If output should be in double precision (normally we turn this on)

  9. pft_frc array of fractions to override PFT data with for all gridpoints (experimental mode, normally NOT used).

  10. pft_idx array of PFT indices to override PFT data with for all gridpoints (experimental mode, normally NOT used).

  11. soil_clay percent clay soil to override all gridpoints with (experimental mode, normally NOT used).

  12. soil_color Soil color to override all gridpoints with (experimental mode, normally NOT used).

  13. soil_fmax Soil maximum fraction to override all gridpoints with (experimental mode, normally NOT used).

  14. soil_sand percent sandy soil to override all gridpoints with (experimental mode, normally NOT used).
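Putting the items above together, a namelist might be sketched as follows. This is only an illustration: the namelist group name clmexp is an assumption, and every filepath is a hypothetical placeholder -- copy the actual group name and file locations from one of the provided samples such as mksurfdata.namelist:

```
&clmexp
 mksrf_fgrid    = '/inputdata/griddata_4x5.nc'    ! output grid dataset
 mksrf_ffrac    = '/inputdata/fracdata_4x5.nc'    ! land fraction and mask
 mksrf_fglacier = '/inputdata/mksrf_glacier.nc'
 mksrf_flai     = '/inputdata/mksrf_lai.nc'
 mksrf_flanwat  = '/inputdata/mksrf_lanwat.nc'
 mksrf_forganic = '/inputdata/mksrf_organic.nc'
 mksrf_fmax     = '/inputdata/mksrf_fmax.nc'
 mksrf_fsoicol  = '/inputdata/mksrf_soicol.nc'
 mksrf_fsoitex  = '/inputdata/mksrf_soitex.nc'
 mksrf_ftopo    = '/inputdata/mksrf_topo.nc'
 mksrf_furban   = '/inputdata/mksrf_urban.nc'
 mksrf_fvegtyp  = '/inputdata/mksrf_pft.nc'
 mksrf_fvocef   = '/inputdata/mksrf_vocef.nc'
 mksrf_fdynuse  = 'pftdyn_hist_simyr2000.txt'
 outnc_double   = .true.
/
```

Note that mksrf_fdynuse is set even for a fixed-year case, in keeping with the standard practices described later in this section.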

After creating your namelist, when running on a non-NCAR machine you will need to get the input files from the inputdata repository. You can do this by running the check_input_data script on your namelist; the script finds the needed files and can also export them to your local disk.

Example 2-7. Getting the raw datasets for mksurfdata to your local machine using the check_input_data script

> cd models/lnd/clm/tools/mksurfdata
# First remove any quotes and copy into a filename that can be read by the
# check_input_data script
> sed "s/'//g" namelist > clm.input_data_list
# Run the script with -export and give the location of your inputdata with $CSMDATA
> ../../../../../scripts/ccsm_utils/Tools/check_input_data -datalistdir . \
-inputdata $CSMDATA -check -export
# You must then do the same with the fdynuse file referred to in the namelist
# in this case we add a file = to the beginning of each line
> awk '{print "file = "$1}' pftdyn_hist_simyr2000-2000.txt > clm.input_data_list
# Run the script with -export and give the location of your inputdata with $CSMDATA
> ../../../../../scripts/ccsm_utils/Tools/check_input_data -datalistdir . \
-inputdata $CSMDATA -check -export

Experimental options to mksurfdata

The options pft_frc, pft_idx, soil_clay, soil_color, soil_fmax, and soil_sand are new and considered experimental. They provide a way to override the PFT and soil values for all grid points with the values that you set. This is useful for running single-point tower sites where the soil type and vegetation are known. Note that when you use pft_frc, all other landunits will be zeroed out, and the sum of your pft_frc array MUST equal 100.0. Also note that when using the "pft_" options the mksrf_fdynuse file, instead of containing filepaths, will be an XML-like file with PFT settings. Unlike the file of filepaths, you will have to create this file by hand; the script will NOT be able to create it for you (other than the first year, which will be set to the values entered on the command line). Note that when PTCLM is run, it CAN create these files for you from a simpler format (see the Section called Dynamic Land-Use Change Files for use by PTCLM in Chapter 6). Instead of a filepath, each line has a list of XML elements that give information on the PFTs and harvesting, for example:
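A hypothetical sketch of one entry in such an XML-like mksrf_fdynuse file is shown below. The exact tag layout and the values are placeholders patterned on the tags this guide describes (PFT fraction, PFT index, a five-element harvest array, a single grazing value, and the year); consult a PTCLM-generated file for the true format:

```
<pft_f>100</pft_f><pft_i>6</pft_i><harv>0,0,0,0,0</harv><graz>0</graz>        2000
```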

The <pft_f> tags give the PFT fractions and the <pft_i> tags give the PFT index for each fraction. Harvest is an array of five elements, and grazing is a single value. As in the usual file, each list of XML elements goes with a year, and there is a limit on the number of characters that can be used on each line.

Standard Practices when using mksurfdata

In this section we give recommendations on how to use mksurfdata so that your results are similar to the standard files that we created with it.

If you look at the standard surface datasets that we have created and provided for use, there are three practices that we have consistently followed in each (you will also see these in the sample namelists and in the script). The first is that we always output data in double precision (hence outnc_double is set to .true.). The next is that we always use the procedure for creating transient datasets (using mksrf_fdynuse), even when creating datasets for a fixed simulation year. This ensures that the fixed-year datasets will be consistent with the transient datasets. When this is done a "surfdata.pftdyn" dataset will be created -- but it will NOT be used in CLM. If you look at the sample namelist mksurfdata.namelist you will note that it sets mksrf_fdynuse to the file pftdyn_hist_simyr2000.txt, where the single file entered is the same PFT file used in the rest of the namelist (as mksrf_fvegtyp). The last practice is to always set mksrf_ftopo, even if glacier elevation classes are NOT active. This is important for limiting urban areas based on topographic height, and hence should be used all the time. The glacier multiple elevation classes will also be used if you are running a compset with the active glacier model.

There are two other important practices for creating urban single-point datasets. The first is that you will often want to set all_urban to .true. so that the dataset will have 100% of the gridcell output as urban, rather than some mix of urban, vegetation types, and other landunits. The next practice is that most of our specialized urban datasets have custom values for the urban parameters, hence we do NOT want to use the global urban dataset for the urban parameters -- we use a previous version of the surface dataset for them. However, in order to do this, we need to append onto the previous surface dataset the grid and land mask/land fraction information from the grid and fraction datasets. This is done using the NCO program ncks. An example of doing this for the Mexico City, Mexico urban surface dataset is as follows:

> ncks -A $CSMDATA/lnd/clm2/griddata/ \
> ncks -A $CSMDATA/lnd/clm2/griddata/ \
Note that if you look at the current single-point urban surface datasets, you will see that the above has already been done.

The final issue is how to build mksurfdata. When NOT optimized, mksurfdata is very slow and can take many hours or even days to run at medium resolutions such as one or two degrees. So usually you will want to run it optimized. You may also want to use shared-memory parallelism (OpenMP) with the SMP option. The problem with running optimized is that, for most compilers, answers will differ between optimized and non-optimized builds. So if you want answers to match a previous surface dataset, you will need to run on the same platform at the same optimization level. Likewise, running with or without OpenMP may also change answers (for most compilers it will NOT, but it does for the IBM compiler); however, answers should be the same regardless of the number of threads used when OpenMP is enabled. Note that the output surface datasets have attributes recording whether the file was written out optimized or not, and with threading or not and the number of threads used, so that you can more easily match datasets created previously. For more information on the different compiler options for the CLM4 tools see the Section called Common environment variables and options used in building the FORTRAN tools.