Creating your own single-point/regional surface datasets

The file Quickstart.userdatasets in the models/lnd/clm/doc directory gives guidelines on how to create and run with your own single-point or regional datasets. We reprint that guide below.


          Quick-Start to using your own datasets in clm4
          ===============================================

Assumptions: You are already familiar with the use of the cpl7 scripts
             for creating cases to run with "standalone" clm. See the
             Quickstart.GUIDE and the README files and documentation in
             the scripts directory for more information on this process.
             We also assume that the env variable $CSMDATA points to the
             location of the standard datasets for your machine
             (/fis/cgd/cseg/csm/inputdata on bluefire). We also assume that
             the following variables point to the values you want to use
             for your case. The mask is included as part of the resolution
             for your case, and SIM_YEAR and SIM_YEAR_RANGE will be set
             appropriately for the particular use case that you choose for
             your compset (e.g. 1850_control, 20thC_transient, etc.).

                 SIM_YEAR -------- Simulation year       (e.g. 1850, or 2000)
                 SIM_YEAR_RANGE -- Simulation year range (e.g. constant, or 1850-2000)
                 MASK ------------ Land mask             (e.g. navy, USGS, or gx1v6)
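
                 For example, for an 1850 control case with the gx1v6 mask,
                 the settings might look like the following sketch (pick the
                 values that match your own compset and resolution):

                     setenv SIM_YEAR       1850
                     setenv SIM_YEAR_RANGE constant
                     setenv MASK           gx1v6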

Process:

       0.) Why do this?

       An alternative to the steps below is to create your case and hand-edit
       the relevant namelists with your own datasets. One reason for the
       process below is that it lets us do automated testing on dataset
       inclusion. But it also provides the following functionality to the user:
           a.) New cases with the same datasets only require a small change to
               env_conf.xml and env_run.xml (steps 5, 6, and 8)
           b.) You can clone new cases based on a working case, without having to 
               hand-edit all of the namelists for the new case in the same way.
           c.) The process will check for the existence of files when cases are
               configured so you can have the scripts check that datasets exist
               rather than finding out at run-time after submitted to batch.
           d.) The process checks for valid namelists, and makes it less likely 
               for you to put an error or typo in the namelists.
           e.) The *.input_data_list files will be accurate for your case,
               so you can use the check_input_data script to do queries on
               the files (see the sketch after this list).
           f.) Your dataset names will be closer to standard names, and easier
               for inclusion in standard clm (with the exception of creation dates).
           g.) The regional extraction script (see 3.b below) will automatically create
               files with names following this convention.
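
           For example, after your case is configured you could query the
           datasets it needs with the check_input_data script (a sketch,
           assuming your scripts version supports the -inputdata and -check
           options; run it from the case directory):

               ./check_input_data -inputdata $MYCSMDATA -check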

       1.) Create your own dataset area -- link it to standard dataset location

       Create a directory to put your own datasets (such as /ptmp/$USER/my_inputdata).
       Use the script link_dirtree to link the standard datasets into this location.
       If you already have complete control over the datasets in $CSMDATA -- you
       can skip this step.

       		setenv MYCSMDATA /ptmp/$USER/my_inputdata
       		scripts/link_dirtree $CSMDATA $MYCSMDATA

       If you do this you can find the files you've added with...

           find $MYCSMDATA -type f -print

       and you can find the files that are linked to the standard location with...

           find $MYCSMDATA -type l -print

       2.) Establish a "user dataset identifier name" string

       You need a unique identifier for your datasets for a given resolution,
       mask, area, simulation-year, and simulation year-range. The identifier
       can be any string you want -- but we have the following suggestions:

       Suggestions for global grids:

       		setenv MYDATAID ${degLat}x${degLon}

       Suggestions for regional grids: either give the number of points in the grid

       		setenv MYDATAID nxmpt_citySTATE
       		setenv MYDATAID nxmpt_cityCOUNTRY
       		setenv MYDATAID nxmpt_regionCOUNTRY
       		setenv MYDATAID nxmpt_region

                       or give the total size of the grid in degrees

       		setenv MYDATAID nxmdeg_citySTATE
       		setenv MYDATAID nxmdeg_cityCOUNTRY

        for example: setenv MYDATAID 10x15 -- global 10x15 grid
                     setenv MYDATAID 1x1pt_boulderCO -- single-point for Boulder CO
                     setenv MYDATAID 5x5pt_boulderCO -- 5x5 region around Boulder CO
                     setenv MYDATAID 1x1deg_boulderCO -- 1x1 degree region around Boulder CO
                     setenv MYDATAID 13x12pt_f19_alaskaUSA -- 13x12 gridcells from the f19
                                                 (1.9x2.5) global resolution over Alaska

       3.) Add your own datasets in the standard locations in that area

       3.a) Create datasets using the standard tools valid for any specific points

       Use the tools in models/lnd/clm/tools to create new datasets: tools
       such as mkgriddata, mksurfdata, and mkdatadomain, and the regridding
       tools in ncl_scripts.

       (See models/lnd/clm/bld/namelist_files/namelist_defaults_usr_files.xml
        for the exact syntax for all files.)

       surfdata:    copy files into: 
             $MYCSMDATA/lnd/clm2/surfdata/surfdata_${MYDATAID}_simyr${SIM_YEAR}.nc
       fatmgrid:    copy files into:
             $MYCSMDATA/lnd/clm2/griddata/griddata_${MYDATAID}.nc
       fatmlndfrc:  copy files into:
             $MYCSMDATA/lnd/clm2/griddata/fracdata_${MYDATAID}_${MASK}.nc
       domainfile:  copy files into:
             $MYCSMDATA/atm/datm7/domain.clm/domain.lnd.${MYDATAID}_${MASK}.nc
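
       For example, to put a new surface dataset into the standard location
       (a sketch: my_surfdata.nc is a hypothetical file you built with
       mksurfdata, and MYDATAID and SIM_YEAR are set as above):

               # copy the hypothetical my_surfdata.nc to the name clm will look for
               cp my_surfdata.nc \
                  $MYCSMDATA/lnd/clm2/surfdata/surfdata_${MYDATAID}_simyr${SIM_YEAR}.nc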

       3.b) Use the regional extraction script to get regional datasets from the global ones

       Use the getregional_datasets.pl script to extract regional datasets of
       interest. Note that the script works on all files other than the
       "finidat" file, because that file is a 1D vector file.

       For example, run the extraction for data from 52-73 degrees North
       latitude and 190-220 degrees longitude, which creates a 13x12 gridcell
       region from the f19 (1.9x2.5) global resolution over Alaska:

                cd models/lnd/clm/tools/ncl_scripts
                ./getregional_datasets.pl -sw 52,190 -ne 73,220 -id $MYDATAID \
                -mycsmdata $MYCSMDATA

       Repeat this process if you need files for multiple sim_year and sim_year_range values.
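
       For example, to also extract the transient 1850-2000 datasets for the
       same region, a sketch using the -sim_year and -sim_yr_rng options
       (documented in the script's -help output) would be:

                ./getregional_datasets.pl -sw 52,190 -ne 73,220 -id $MYDATAID \
                -mycsmdata $MYCSMDATA -sim_year 1850 -sim_yr_rng 1850-2000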

       4.) Setup your case

       Follow the standard steps for executing "scripts/create_newcase" and customize
       your case as appropriate.

       For example:

       		./create_newcase -case my_userdataset_test -res pt1_pt1 -compset I1850 \
                -mach bluefire

       The above example implies that: MASK=gx1v6, SIM_YEAR=1850, and SIM_YEAR_RANGE=constant.

       5.) Edit the env_run.xml in the case to point to your new dataset area

       Edit DIN_LOC_ROOT_CSMDATA in env_run.xml to point to $MYCSMDATA

       		./xmlchange -file env_run.xml -id DIN_LOC_ROOT_CSMDATA -val $MYCSMDATA

       6.) Edit the env_conf.xml in the case to point to your user dataset
       identifier name.

       Edit CLM_USRDAT_NAME and CLM_PT1_NAME to point to $MYDATAID:

       		./xmlchange -file env_conf.xml -id CLM_USRDAT_NAME -val $MYDATAID
       		./xmlchange -file env_conf.xml -id CLM_PT1_NAME    -val $MYDATAID

       7.) Configure the case as normal

       		./configure -case

       8.) Run your case as normal

Using getregional_datasets.pl to get a complete suite of single-point/regional surface datasets from global ones

Use the getregional_datasets.pl script to extract regional datasets of interest from the global ones. Note that the script works on all files other than the "finidat" file, because that file is a 1D vector file. The script will extract a block of gridpoints from all the input global datasets, and create the full suite of input datasets to run over that block. The output datasets will be named according to the input "id" you give them, and that id can then be used as input to CLM_USRDAT_NAME to create a case that uses them. See the section on CLM Script Configuration Items for more information on setting CLM_USRDAT_NAME (in Chapter 1). The files extracted, by the namelist names used for them, are: fatmgrid, fatmlndfrc, fsurdat, fpftdyn, flndtopo, stream_fldfilename_ndep, and the DATM files domainfile and faerdep. For more information on these files see the Table on required files.
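
Since all of the extracted files share the "id" in their filenames, you can list the full suite created for a region with find. A sketch, assuming the 13x12pt_f19_alaskaUSA id used in the examples below:

> find $MYCSMDATA -name "*13x12pt_f19_alaskaUSA*" -print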

The alternatives to using this script are to use PTS_MODE (discussed earlier), to use PTCLM (discussed in the next chapter), or to create the files individually using the different file creation tools (given in the Tools Chapter). Creating all the files individually takes quite a bit of effort and time. PTS_MODE has some limitations, as discussed earlier, and because it uses global files it is also a bit slower to run simulations with than files that contain just the set of points you want to run over. Another advantage of this script is that once you've created the files you can customize them, replacing the extracted data with your own data for that specific location.
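
For example, before customizing an extracted surface dataset you could inspect its header with the standard netCDF ncdump tool. A sketch, assuming the id and simulation year used in the examples below:

> ncdump -h $MYCSMDATA/lnd/clm2/surfdata/surfdata_13x12pt_f19_alaskaUSA_simyr2000.nc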

The script requires the use of both "Perl" and "NCL". See the NCL Script section in the Tools Chapter on getting and using NCL and NCL scripts. The main script is a Perl script, which in turn calls the NCL script that actually creates the output files. The NCL script gets its settings from environment variables set by the Perl script. To get help with the script use "-help" as follows:


> cd models/lnd/clm/tools/ncl_scripts
> ./getregional_datasets.pl -help
The output of the above is:

SYNOPSIS
     getregional_datasets.pl [options]    Extracts out files for a single box region from the
                                          global grid for the region of interest. Choose a box
                                          determined by the NorthEast and SouthWest corners.
OPTIONS
     -debug [or -d]                Just debug by printing out what the script would do.
                                   This can be useful to find the size of the output area.
     -help [or -h]                 Print usage to STDOUT.
     -mask "landmask"              Type of land-mask (i.e. navy, gx3v7, gx1v6 etc.) (default gx1v6)
     -mycsmdata "dir"              Root directory of where to put your csmdata.
                                   (default /home/erik/inputdata or value of CSMDATA env variable)
     -mydataid "name" [or -id]     Your name for the region that will be extracted. (REQUIRED)
                                   Recommended name: grid-size_global-resolution_location
                                   (?x?pt_f??_????)
                                   (i.e. 12x13pt_f19_alaskaUSA for 12x13 grid cells from the f19
                                   global resolution over Alaska)
     -NE_corner "lat,lon" [or -ne] North East corner latitude and longitude (REQUIRED)
     -nomv                         Do NOT move datasets to final location, just leave them in
                                   current directory
     -res "resolution"             Global horizontal resolution to extract data from (default 1.9x2.5).
     -rcp "pathway"                Representative concentration pathway for future scenarios.
                                   Only used when the simulation year range ends in a future
                                   year, such as 2100. (default -999.9)
     -sim_year "year"              Year to simulate for input datasets (i.e. 1850, 2000) (default 2000)
     -sim_yr_rng "year-range"      Range of years for transient simulations
                                   (i.e. 1850-2000, 1850-2100, or constant) (default constant)
     -SW_corner "lat,lon" [or -sw] South West corner latitude and longitude (REQUIRED)
     -verbose [or -v]              Make output more verbose.

The required options are -id, -ne, and -sw: the output identifier name to use in the filenames, the latitude and longitude of the NorthEast corner, and the latitude and longitude of the SouthWest corner (in degrees). Options that specify which files will be used are -mask, -res, -rcp, -sim_year, and -sim_yr_rng, for the land-mask to use, the global resolution name, the representative concentration pathway for future scenarios, the simulation year, and the simulation year range. The location of the input and output files is determined by the option -mycsmdata (which can also be set by using the environment variable $CSMDATA). If you are running on a machine, such as those at NCAR, where you do NOT have write permission to the CESM inputdata files, you should use the scripts/link_dirtree script to create soft-links of the original files in a location that you can write to. This way you can use both the new files you created and the original files from the same location.
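
For example, a run that spells out the file-selection options explicitly might look like the following sketch (these are just the defaults from the -help output above, combined with the required corner and id options):

> ./getregional_datasets.pl -sw 52,190 -ne 73,220 -id 13x12pt_f19_alaskaUSA \
-mask gx1v6 -res 1.9x2.5 -sim_year 2000 -sim_yr_rng constant -mycsmdata $MYCSMDATA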

The remaining options to the script are -debug and -verbose. -debug shows what would happen if the script were run, without creating the actual files. -verbose adds extra log output while creating the files, so you can more easily see what the script is doing.
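
For example, to preview what an extraction would do without creating any files, a sketch would be:

> ./getregional_datasets.pl -debug -verbose -sw 52,190 -ne 73,220 \
-id 13x12pt_f19_alaskaUSA -mycsmdata $MYCSMDATA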

For example, run the extraction for data from 52-73 degrees North latitude and 190-220 degrees longitude, which creates a 13x12 gridcell region from the f19 (1.9x2.5) global resolution over Alaska.

Example 5-4. Example of running getregional_datasets.pl to get datasets for a specific region over Alaska


> cd scripts
# First make sure you have an inputdata location that you can write to.
# You only need to do this step once, so you won't need to do it in the future.
> setenv MYCSMDATA $HOME/inputdata         # Set env var for the directory for input data
> ./link_dirtree $CSMDATA $MYCSMDATA
> cd ../models/lnd/clm/tools/ncl_scripts
> ./getregional_datasets.pl -sw 52,190 -ne 73,220 -id 13x12pt_f19_alaskaUSA -mycsmdata $MYCSMDATA

Repeat this process if you need files for multiple sim_year, sim_year_range, resolution, and land-mask values.
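
For example, a loop over two simulation years might look like the following sketch (csh syntax, matching the shell used in the examples above):

# extract both the 1850 and 2000 datasets for the same region
foreach YEAR ( 1850 2000 )
   ./getregional_datasets.pl -sw 52,190 -ne 73,220 -id 13x12pt_f19_alaskaUSA \
   -mycsmdata $MYCSMDATA -sim_year $YEAR
end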

Warning

See the Section called Warning about Running with a Single-Processor on a Batch Machine for a warning about running single-point jobs on batch machines.

Note: See the Section called Managing Your Own Data-files in Chapter 3 for notes about managing your data when using link_dirtree.

Now, to run a simulation with the datasets created above, you create a single-point case and set CLM_USRDAT_NAME to the identifier used above. Note that in the example below we set the number of processors to one (-pecount 1). For a single point you should only use a single processor, but for a regional grid, such as in the example below, you could use up to the number of grid points (12x13 = 156 processors).

Example 5-5. Example of using CLM_USRDAT_NAME to run a simulation using user datasets for a specific region over Alaska


> cd scripts
# Create the case and set it to only use one processor
> ./create_newcase -case my_userdataset_test -res pt1_pt1 -compset I1850 \
-mach bluefire -pecount 1
> cd my_userdataset_test/
> ./xmlchange -file env_run.xml -id DIN_LOC_ROOT_CSMDATA -val $MYCSMDATA
> ./xmlchange -file env_conf.xml -id CLM_USRDAT_NAME -val 13x12pt_f19_alaskaUSA
> ./xmlchange -file env_conf.xml -id CLM_BLDNML_OPTS -val '-mask gx1v6'
> ./xmlchange -file env_conf.xml -id CLM_PT1_NAME -val 13x12pt_f19_alaskaUSA
> ./configure -case
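
With the case configured you build and submit it as you would any other case. A sketch for bluefire (the build and run script names include your case and machine names, and the batch submission command depends on your machine):

# build the case, then submit the run script to the batch system (bsub on bluefire)
> ./my_userdataset_test.bluefire.build
> bsub < my_userdataset_test.bluefire.run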