Blind Evaluation of Lossy Data Compression in LENS

Co-Author: A. Baker
The objective of this project is to determine whether the effects of lossy data compression are statistically distinguishable from the natural variability of the climate system. In particular, we aim to demonstrate that applying lossy compression to climate data does not negatively impact science results. To that end, we are giving climate scientists direct experience with data that has undergone lossy compression. The CESM-CAM5 Large Ensemble (LENS) community project is an ideal venue for this evaluation because it relies on climate ensembles, faces storage limitations, and is available to the broader climate community.

We are currently conducting a blind experiment to evaluate the impact of data compression on the LENS community project data. We have contributed three additional runs to the LENS project and have compressed and reconstructed the CAM output of one or two of these new runs. The challenge for climate scientists is to identify which of the additional ensemble members (31-33) have had their atmospheric data compressed and reconstructed. We would like feedback from the climate community on which ensemble member(s) you believe have been compressed and, more importantly, why. We are especially interested in the details of the analysis that led to your conclusion.
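One way a participant might approach the challenge is to ask whether a candidate member differs from the ensemble by more than the original members differ from each other. The sketch below is an illustrative assumption, not necessarily the analysis used in [1]: it computes a root-mean-square z-score (RMSZ) for each member with a leave-one-out comparison, using synthetic numbers that stand in for a single field. The function names (`rmsz`, `rmsz_range`, `looks_like_ensemble`) and the data are hypothetical.

```python
import math
import random
import statistics


def rmsz(member, others):
    """Root-mean-square z-score of `member` relative to `others`, point by point.

    Assumes every grid point has nonzero spread across `others`.
    """
    total = 0.0
    for i, value in enumerate(member):
        column = [m[i] for m in others]
        mean = statistics.fmean(column)
        std = statistics.stdev(column)  # sample standard deviation
        total += ((value - mean) / std) ** 2
    return math.sqrt(total / len(member))


def rmsz_range(ensemble):
    """Leave-one-out RMSZ of every ensemble member; returns (min, max)."""
    scores = [
        rmsz(member, ensemble[:k] + ensemble[k + 1:])
        for k, member in enumerate(ensemble)
    ]
    return min(scores), max(scores)


def looks_like_ensemble(candidate, ensemble):
    """True if the candidate's RMSZ lies within the range spanned by the ensemble.

    Note: the candidate is scored against all members, while each member is
    scored against one fewer; this is a simplification for illustration.
    """
    low, high = rmsz_range(ensemble)
    return low <= rmsz(candidate, ensemble) <= high


# Synthetic stand-in data (NOT LENS output): 30 members, 200 grid points.
rng = random.Random(0)
ensemble = [[rng.gauss(288.0, 1.0) for _ in range(200)] for _ in range(30)]
new_member = [rng.gauss(288.0, 1.0) for _ in range(200)]
biased = [v + 5.0 for v in new_member]  # exaggerated, artificial distortion

print("new member RMSZ:", rmsz(new_member, ensemble))
print("biased member RMSZ:", rmsz(biased, ensemble))
print("biased member within ensemble range:", looks_like_ensemble(biased, ensemble))
```

A real analysis would read the LENS fields from the netCDF files and would likely examine many variables and time scales; the exaggerated bias here only shows how a distorted member would stand out.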

For more information, see our preliminary work in [1], which proposes a number of quality metrics for determining whether it is acceptable to compress particular variables from the Community Earth System Model (CESM). Results in [1] indicate that a compression ratio of 5 to 1 is achievable with the fpzip [2, 3] compression algorithm without introducing statistically distinguishable changes to the output fields. We have also used fpzip for the LENS project data and obtained the following average compression ratios for the monthly, daily, and 6-hourly LENS data:

COMPARISON OF COMPRESSION METHODS ON LENS DATA

Compression ratio (compressed/original)   Monthly   Daily   6-hourly
netcdf-4                                  0.51      0.70    0.63
fpzip                                     0.15      0.22    0.18

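Because the table reports compressed size divided by original size, smaller values mean stronger compression. As a quick arithmetic check, the sketch below converts each ratio into an "N to 1" reduction factor and a percentage of space saved; the helper names (`reduction_factor`, `space_saved_percent`) are hypothetical, and the numbers are copied from the table.

```python
# Ratios copied from the table: compressed size / original size.
ratios = {
    "netcdf-4": {"monthly": 0.51, "daily": 0.70, "6-hourly": 0.63},
    "fpzip": {"monthly": 0.15, "daily": 0.22, "6-hourly": 0.18},
}


def reduction_factor(ratio):
    """Convert compressed/original into an 'N to 1' reduction factor."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 / ratio


def space_saved_percent(ratio):
    """Percentage of the original storage that compression frees up."""
    return 100.0 * (1.0 - ratio)


for method, by_frequency in ratios.items():
    for frequency, ratio in by_frequency.items():
        print(
            f"{method:9s} {frequency:9s} "
            f"{reduction_factor(ratio):4.1f} to 1, "
            f"{space_saved_percent(ratio):3.0f}% smaller"
        )
```

Under this conversion the fpzip ratios correspond to roughly 4.5 to 1 (daily) through 6.7 to 1 (monthly), which brackets the 5 to 1 figure reported in [1].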
Data sets are available from the Earth System Grid via the following link: https://www.earthsystemgrid.org/dataset/ucar.cgd.ccsm4.CESM_CAM5_BGC_LE.html

Please direct questions or feedback to Allison Baker (abaker at ucar.edu).

[1] A.H. Baker, H. Xu, J.M. Dennis, M.N. Levy, D. Nychka, S.A. Mickelson, J. Edwards, M. Vertenstein, and A. Wegener, "A Methodology for Evaluating the Impact of Data Compression on Climate Simulation Data," Proc. of the 23rd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC14), Vancouver, Canada, 2014, pp. 203-214.

[2] P. Lindstrom and M. Isenburg, "Fast and Efficient Compression of Floating-Point Data," IEEE Transactions on Visualization and Computer Graphics, 12(5):1245-1250, September-October 2006.

[3] http://computation.llnl.gov/casc/fpzip/