The simgen software package: User guide and notes
Arthur M. Greene
International Research Institute for Climate and Society
The Earth Institute, Columbia University
New York, NY

June 28, 2012

Abstract

The software simgen is provided in conjunction with the technical report, “A framework for the simulation of regional decadal variability for agricultural and other applications,” prepared at the request of the Climate Change, Agriculture and Food Security (CCAFS) program of the Consultative Group on International Agricultural Research (CGIAR). (See acknowledgments in that report for detailed attribution.) The report describes a generalized approach to statistical climate simulation, as applied to “near-term climate change” time horizons. Section 5 in the report describes a case study, in which the general approach is realized in a particularized climatological and applications-oriented setting, in the Western Cape region of South Africa. (This work was reported in a separate publication.) The simgen software comprises the various code elements used to produce the simulations described in that case study. Here, we describe both the code and its manner of deployment. As pointed out in the technical report, simulation in a particular regional setting will require a corresponding elaboration of the simulation framework, which effectively constitutes a template for this purpose. We discuss here how this plays out in the specific context of the case study, in the process identifying the degree of generality applying to specific elements of the code. One component of the detailed scheme was implemented using the statistical software “R.” This is also described, and the necessary references provided. simgen itself is written in the Python programming language, and makes use of additional external Python packages.
No fees or other charges are required for use of any of the code or external packages employed in this project, all of which are available under various open source licenses.

1 Introduction

The technical report “A framework for the simulation of regional decadal variability for agricultural and other applications” [Greene et al., 2011, referred to hereinafter as GHG] includes discussion of a case study. This study, described in detail in Greene et al. [2012], involved the statistical generation of an ensemble of climate simulations for the Berg and Breede Water Management Areas in the Western Cape province of South Africa. The simulations, comprising precipitation as well as minimum and maximum daily temperatures, were designed to drive the ACRU hydrology model [Schulze, 1995], developed at the University of KwaZulu-Natal, South Africa, and were part of an ambitious effort, involving multiple institutional participants, to characterize future climate in the region of the Western Cape. The present document describes the code developed to produce those simulations and is provided, along with that code, as an adjunct to GHG.

As discussed in GHG, simulation of regional decadal variability will be conditioned by a number of factors, including characteristics of the regional climate, available observational records and follow-on modeling requirements. The code, whose main routine is named simgen, has thus been designed with ACRU-based studies in mind. This means, inter alia, that input routines are designed to read, and output routines to write, ACRU-formatted files. (This format is described in Appendix A.) It is expected that programmers adapting simgen for other settings will modify these routines, and indeed other aspects of the code as well, to suit the requirements of the particular applications for which the simulations will be used.

Because the various software components derive from different sources, licensing details vary among them.
However, none of the components or programs utilized are “commercial” software, in the sense of involving acquisition costs or licensing fees, in particular for noncommercial use. References to the various licenses and terms are provided in Sec. 2.3. (Disclaimer: The author is not an attorney; nothing in this document should be construed as legal advice.)

The sections of this guide describe the various code components (Sec. 2), setup of the computing environment (Sec. 3), individual function calls (Sec. 4) and the sequence of operations in simgen (Sec. 5). Final remarks are provided in Sec. 6.

2 Components

The simulation process, as realized in the case study, utilizes several software components, most, but not all, of which reside in the simgen code itself. The R programming language (http://www.r-project.org) was also employed for certain tasks. Some functions are executed only once, in setting up the simulation environment, while others are repeated, typically in looping over individual locations in the modeled network. Tasks accomplished with R belong to the former group, and were logically performed “offline,” i.e., outside the scope of the simgen code.

The simgen code itself is written in Python (http://www.python.org), an object-oriented programming language that has found wide application in many scientific and technical fields. There are a number of Python distributions, each typically including a set of “modules,” or packages, keyed to a defined range of tasks; we utilize one of these, described below, but also suggest alternatives, so as to facilitate the deployment of simgen.

2.1 Python

The main simgen code, as stated, is written in Python. The code also invokes functions from a number of Python modules that are not part of the core Python language. Important among these are numpy, which provides key mathematical functions, as well as numerical arrays and random variables, and scipy, which supplies a linear regression function.
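To illustrate the kind of scipy regression call referred to above, the following is a minimal sketch of a linear detrender built on scipy.stats.linregress. It is offered under stated assumptions and is not the simgen code: the function name dt is borrowed from the description of the detrend2 script in Sec. 4.3, but its signature, and the choice to remove the intercept together with the slope, are guesses.

```python
import numpy as np
from scipy.stats import linregress


def dt(y):
    """Remove the least-squares linear trend from a one-dimensional series.

    Hypothetical stand-in for the detrender described in Sec. 4.3; the
    signature and return value of the real routine may differ.
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))                 # time index 0, 1, 2, ...
    fit = linregress(t, y)                # univariate linear regression
    return y - (fit.slope * t + fit.intercept)
```

Applied to a short segment, dt returns residuals whose fitted slope is zero to within rounding error.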
A module necessary for the present version of simgen, but that may possibly be dispensed with in certain circumstances, is cdms (Climate Data Management System), which is part of the CDAT (Climate Data Analysis Tools) Python distribution. In fact, CDAT provides both numpy and scipy, so if the modeler chooses to install CDAT (available at http://www2-pcmdi.llnl.gov/cdat), all the necessary tools will be available from the start. Version 5.2 was utilized for the simulations described herein. CDAT is available for both the Linux operating system and Mac OS X. The author has run it only under Linux, but the developers of CDAT are known to use OS X as well, so it is likely that this is a viable option. It may also be possible to run CDAT using a virtual environment on Windows computers, but we lack the experience to offer guidance here.

cdms is used only internally, in order to facilitate certain data manipulations. Input and output are both in the form of ASCII files, and most of the computational work is performed using numpy arrays (rather than cdms “transient variables”). Dispensing with cdms would require rewriting some of the code, however.

We also note the existence of cdat-lite, a Python package that includes cdms and other core CDAT libraries. Available at http://pypi.python.org/pypi/cdat-lite, it provides the necessary toolkit while avoiding the necessity of installing the (much larger) full CDAT distribution. We have successfully run simgen using cdat-lite version 6.0, in a non-CDAT environment.

Note that unless modifications are made to the simgen code as it now stands, it will be necessary for the user to obtain and install CDAT (or cdat-lite) in order to run simgen; we do not distribute either CDAT or cdat-lite.
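Before starting a session it can be helpful to confirm that the modules named above are visible to the interpreter. The following is a minimal sketch, not part of simgen. The import names are assumptions (in particular, "cdms2" is assumed to be the name under which the installed CDAT or cdat-lite exposes cdms), and the sketch is written for a current Python 3 interpreter, so it may need adapting for the older Python bundled with CDAT.

```python
import importlib.util

# Assumed import names; adjust to match the local installation.
REQUIRED = ("numpy", "scipy", "cdms2")


def check_modules(names=REQUIRED):
    """Return a dict mapping each top-level module name to True if it is found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print("%-8s %s" % (name, "ok" if found else "MISSING"))
```

A module reported as missing would need to be installed, or the corresponding simgen code rewritten, before gen can be called.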
The simulation code is designed to be run from within an interactive Python session: the initial call (issued from a terminal) can be to cdat itself; we happen to prefer the ipython shell (see http://ipython.org), but this is strictly a user preference. Once Python is started (and all of the ancillary files are in place), simulations are generated by first importing simgen, then issuing a call to the gen routine, using appropriate arguments.

To facilitate comprehension and usability, simgen has been extensively commented, with an initial “docstring” (set off by triple quotes) at the head of the module, additional docstrings placed strategically throughout the code and many individual comment lines (lines beginning with the hash symbol #). The use of docstrings permits access to the included information via the Python help function, as well as other docstring functions, from within the interactive shell, while the individual comment lines provide a more granular description of the various code sequences.

2.2 R

A key feature of the simulations described in Greene et al. [2012] is the generation of stochastic sequences on the annual time step. The statistical structure used for this step in the case study is identified as a vector autoregressive (VAR) model of order unity. This model is fit to the regional observational data, in the case study a trivariate annualized series of length 50, using the “Dynamic Systems Estimation” (dse) time series package for the R programming language, i.e., external to the main simgen code. The call in R was to the routine estVARXls, which implements a least-squares estimation of VAR parameters. A call to simulate in the dse package was used to generate the long simulated sequence referred to in Greene et al. [2012].
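For readers who wish to see the structure of this step without R, the following is a minimal numpy sketch of a least-squares VAR(1) fit followed by a long simulation. It is a stand-in written under stated assumptions, not the dse routines used in the case study: the function names fit_var1 and simulate_var1, the Gaussian innovations, the burn-in length, the placeholder input data and the output file name are all hypothetical.

```python
import numpy as np


def fit_var1(x):
    """Least-squares fit of x[t] = c + A x[t-1] + e[t]; x has shape (T, k)."""
    past = np.column_stack([np.ones(len(x) - 1), x[:-1]])  # regressors [1, x[t-1]]
    coef, _, _, _ = np.linalg.lstsq(past, x[1:], rcond=None)
    c, A = coef[0], coef[1:].T
    resid = x[1:] - past @ coef
    cov = np.cov(resid, rowvar=False)                       # innovation covariance
    return c, A, cov


def simulate_var1(c, A, cov, n, rng, burn=200):
    """Generate n steps from the model, discarding an initial burn-in period."""
    k = len(c)
    noise = rng.multivariate_normal(np.zeros(k), cov, size=n + burn)
    out = np.empty((n + burn, k))
    state = np.linalg.solve(np.eye(k) - A, c)   # process mean (assumes stationarity)
    for t in range(n + burn):
        state = c + A @ state + noise[t]
        out[t] = state
    return out[burn:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder for the detrended regional series (50 years x 3 variables).
    obs = rng.standard_normal((50, 3))
    c, A, cov = fit_var1(obs)
    long_sim = simulate_var1(c, A, cov, 1000, rng)
    np.savetxt("sim_example.dat", long_sim)     # hypothetical output file name
```

Because the long sequence is simply an array with one row per year and one column per variable, any software able to write such an ASCII file could take the place of this step.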
Other routines were used for model checking and testing various aspects of the inferred model, but details of these procedures will depend on the specifics of any simulation setting, as well as the form of time series model employed. The modularity of the simulation code permits the fitting of a wide variety of statistical models, in R or in other software of the modeler’s choice, and the generation of a long simulation sequence outside the main simgen code. When simgen is run it reads this long sequence and slices it to produce the detailed downscaled simulations.

2.3 Licenses

We summarize below, to the best of our knowledge, licensing information for the software elements included in or utilized by simgen. Users are responsible for observing the terms of these licenses.

1. Python: Open source software, compatible with the GNU General Public License (GPL). See http://docs.python.org/license.html.
2. CDAT: Full license terms are included within the source code distribution. Commercialization of CDAT requires notification of either the United States Department of Energy or Lawrence Livermore National Laboratory. Certain third-party components are distributed subject to additional licensing terms.
3. R: The R language is licensed under the GNU General Public License version 2. Some files may be covered by the GNU General Public License version 3. See http://www.r-project.org/Licenses for details.
4. simgen: simgen is provided under the “Attribution-NonCommercial-ShareAlike 3.0 Unported” (CC BY-NC-SA 3.0) Creative Commons license. Under this form (see http://www.creativecommons.org/licenses) commercial use (i.e., the redistribution of simgen or a derivative for profit) is restricted, and requires a written license agreement. Noncommercial distribution must be on the same terms under which simgen and its ancillary files are originally provided, and must include proper attribution. This attribution is defined in the leading docstring of the simgen code.
3 Preliminary setup

Some files must exist, or must be created, prior to executing the main simgen code. These include (a) observational data, both in the form of disaggregated station-level daily files and as a three-component regional mean series at annual time resolution, (b) a global-mean, multimodel-mean temperature record and (c) the long stochastic sequence from which particular simulation “instances” are drawn. As explained in Section 2.2, the R model that generates the stochastic low-frequency realizations on which the detailed station-level simulations are based is fit to the observational data outside the simgen code structure. For illustrative purposes, examples of those files which must exist a priori are provided along with the simgen code.

3.1 Observational data

Observational records may be utilized in three ways. First, when taken to represent the regional climatology, they are used for fitting the statistical model with which annual-to-decadal simulations are generated. Catchment averages for the three variables, reduced to annual time resolution, are used for this purpose. Second, when a simulation “instance” is downscaled, both the spatial and temporal disaggregation steps are conditioned by the relationship between the regional signal and the fine-scale observational data at individual stations. Third, a k-nearest-neighbor (k-NN) resampling scheme is utilized as part of the downscaling process. It is the daily observational record that is resampled, in one-year blocks, in this step.

3.2 The regional record

For the case study the regional signal, representing the behavior of the study region as a whole, consists of an average over the 171 records representing quinary-level catchment values. This signal is multivariate, the three components being precipitation and minimum and maximum daily temperatures.
This record is used by the R routine to generate the long simulation sequence, and directly by simgen, in conjunction with the individual catchment values, for “broadcasting” of simulated annual-to-decadal values to individual catchments.

3.3 Multimodel mean temperature signal

For detrending, as well as for the projection of future trends, a global-mean, multimodel ensemble mean temperature record is utilized. The ensemble is composed of a set of global climate models (GCMs) from the Coupled Model Intercomparison Project (CMIP5) [Taylor et al., 2011]. Global temperature records from the individual GCMs are smoothed and a multimodel average is computed. For the purposes of consistent projection, the multimodel mean signal must extend from the start time of the observational record through the end of the simulation period, several decades into the future, requiring a choice of scenario for the future. The case study utilizes the 4.5 W m⁻² “Representative Concentration Pathway” (RCP4.5) experiment [Taylor et al., 2011]. Like the regional mean record described above, the multimodel mean signal is generated offline, and stored prior to runtime.

3.4 Systematic and random components

The regional mean series (Section 3.2) are detrended by regression on the multimodel mean signal (Section 3.3), the residual being a multivariate “target” that the simulations are designed to emulate. The target, having annual time resolution, represents annual-to-decadal variability, and must initially be screened for the identification of possible “systematic” elements, which are operationally defined as signal components that differ significantly from AR(1) “red” noise. As discussed in GHG, if such elements are identified they must be modeled independently, the specific form of model depending on data characteristics and simulation requirements. Such models might assume forms as diverse as the WARM models described by Kwon et al.
[2007, 2009] or the nonhomogeneous hidden Markov models that were applied to paleorecords of Lee’s Ferry streamflow on the Colorado River in the American Southwest by Prairie et al. [2008]. The range of models that might be deployed in this important step is limited only by the range of observed regional behavior.

Deterministic elements were not identified in the regional series associated with the case study [see Sec. 4.2.1 and Fig. 6 in Greene et al., 2012]. Since the series were tested against an AR(1) null hypothesis, the annual-to-decadal component is thus modeled as a multivariate stochastic (red noise) process. For this purpose a first-order vector autoregressive (VAR) model was fit, using the R dse package. After suitable verification, this model was used to generate a single, extended simulation sequence. This sequence is stored using the savetxt function in numpy, in the form of an ASCII file.

3.5 File locations and names

A directory containing one possible arrangement of files is provided at http://iri.columbia.edu/~amg/CCAFS/simgen/. Included subdirectories and the files they contain are listed here. The programmer may of course organize files and directories as desired, modifying pathnames in the simgen code accordingly.

• dat: This directory holds the regional time series, in “obsav.dat.”
• input_sim: This holds the single long simulated sequence. The example file, “sim_100kyr.dat,” includes 100 kyr of simulated data.
• obs: The obs directory holds the individual (i.e., station-level) observational data, here a single example file, named for historical reasons “obshis_2626.txt.” In a full region-scale experiment there would be a possibly large set of such files. The simgen code is set up to read files from the obs directory by specifying just the four-digit identifiers before the “.txt” suffix (in the form of a Python list), but this default can easily be modified.
• output_sim: The directory into which the output simulations are written.
Here it holds a single sample file, named “sim_100k_obshis_2626_001000.txt.” The first part of the filename identifies the regional file from which the simulation is derived. The “2626” label corresponds to the station whose variables are being simulated. Finally, the “001000” identifier refers to the index into the long simulation file at which the simulated sequence begins. That is, the long file consists of 100 kyr of data; the “001000” signifies that the simulated sequence uses values beginning with index 1000 into the file.
• pickled: This directory contains “tasav_cmip5_comb_sm_1901-2095.p,” the smoothed global multimodel mean temperature series. The year range is indicated in the file name.
• python: The python directory contains three Python scripts: “simgen9s.py”, the main simgen code; “readqc.py”, a routine for reading ACRU-formatted files such as “obshis_2626.txt”; and “detrend2.py”, a simple linear detrender used by simgen. The simgen version, as of this writing, is 9s, the 9 signifying the version number and the s that this is a “special” edit, in which file and directory names have been modified to reflect the demonstration environment.

4 Functions

There are a number of function definitions within the simgen module and also within the scripts called by simgen; they are described here, first the functions in simgen, in the order in which they appear, and then the two external scripts imported by simgen. Comments within the code (either lines beginning with the hash symbol # or text enclosed by triple quotes """) provide further details.

4.1 The simgen module

• gen(obsix, simix, trendq, fname='sim_100kyr.dat',\
  write=1, simlen=66, locate=2041, xval=0, M=1):
  This is the main routine in simgen, calling other functions as needed to produce the simulations. Note that the “\” character above represents a line break and has no effect on execution. The arguments are as follows:
The arguments are as follows: – obsix: A Python list of four-digit index numbers (integers or strings) corresponding to the observational files for the locations to which the regional simulation will be downscaled. If there is only a single station (as in the demonstration folder) its index must still be provided as a list, for example [2626], rather than simply 2626. The code that translates these indices into the filenames to be read, as well as the filenames themselves, will very likely differ in particular application settings. – simix: An integer whose value must be less than the length of the long simulation file minus the length of the sequences being simulated. In the example provided above, the value given was 1000. This is the index into the long simulation at which the chosen segment begins. 8 – trendq: Specifies the quantile of the multimodel distribution to be used for the future precipitation trend simulation. – fname: The filename of the long simulation sequence. The enclosing directory name input_sim is not included but is prepended by simgen, an arrangement that can easily be modified to suit the user’s computing environment. Note that the default pathname given in the function definition may be overridden by providing an alternate in the call to gen.. – write=1: Whether or not to write output files. When running simgen for diagnostics the writing of output files may not be required. The default may be overridden as described above. This also applies to the defaults given below. – simlen=66: The desired simulation length, in years. – locate=2041: The year, in the simulated sequence, at which a specified decadal fluctuation is to begin. – xval=0: A value of zero will cause the 1950-1999 values of the simulation to replicate the observations; if the value is unity the 1950-1999 values will be simulated. – M=1: Whether or not to use the Mahalanobis distance metric in the k-NN routine, basing distance on the three-component (pr, Tmax, Tmin) vector. 
If zero, only precipitation is used for the distance computation.
• yrgen: This function takes a daily (univariate) time series and returns annualized values, as well as the indices into the daily series at which year breaks occur. Called by gen. This function requires cdms to properly interpret arrays returned by the readqc module (see below).
• acru: Called by gen, this function writes ACRU-formatted files. (Note that this is also the format of the “obshis” demonstration file.) The arguments, simdat, simlen, infile, fname and simix, refer to the simulation data to be written, the length in years of the 21st-century component of the simulation, the names of the input observational and simulation files, and the index into the latter for slicing the simulation sequence. A number of these parameters are used simply for naming the output file.
• getmoda: Provides a day-by-day list of month and day, each in two-digit numerical form, for either normal or leap years. Called by acru, for formatting the files to be written out.
• leap: Takes a four-digit integer year and returns 1 if a leap year, 0 if not. This function, as well as the one above, is necessitated by the hydrological model’s requirement that leap and normal years be differentiated. In the observational files as well as the simulations, the former will include the extra day.

If obsix is a list with more than one entry, i.e., if simulations are being generated for a network of stations, simgen does not return a value to the interactive window (it does write out the simulation files, assuming write=1). However, if a simulation is created for just a single station, in which case obsix holds a single value, then three arrays, identified in the code as datmat, fmat and scalemat, are returned. These hold the complete simulation, at daily resolution, the trend component and the non-trend component, respectively. fmat and scalemat have annual resolution.
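As an aside on the calendar helpers leap and getmoda listed above, the following is a minimal sketch of how such functions might be written. The names are taken from the text, but the signatures and the space-padded two-character field form (chosen by analogy with the month and day fields shown in Appendix A) are assumptions; the simgen implementations may differ.

```python
import calendar


def leap(year):
    """Return 1 if the four-digit year is a leap year, 0 if not."""
    return 1 if calendar.isleap(year) else 0


def getmoda(is_leap):
    """Return a list of (month, day) two-character strings covering one year.

    is_leap is the 0/1 flag returned by leap(); the two-character,
    space-padded form is an assumption about the field layout.
    """
    year = 2000 if is_leap else 2001      # any leap / normal year works as a template
    days = []
    for month in range(1, 13):
        ndays = calendar.monthrange(year, month)[1]
        days.extend(("%2d" % month, "%2d" % day) for day in range(1, ndays + 1))
    return days
```

The list returned for a leap year has 366 entries, including 29 February, and that for a normal year has 365.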
The returned arrays can be useful for diagnosis as well as for plotting the forced and unforced components of the simulation, separately or in combination.

4.2 The readqc module

The Python program readqc contains a single function, r, whose argument is the name of a file (either observational or simulated) written in the ACRU format. It returns four Python objects, designated in the code as prvar, tmaxvar, tminvar, and datmat. The first three of these are one-dimensional cdms “TransientVariable” objects, holding, besides data values, embedded time axis and calendar information, and correspond to precipitation, maximum and minimum daily temperatures, respectively. The last of the variables, datmat, is a single numpy array object (i.e., without the time axis information) holding all three variables. In all of these arrays the time resolution is daily. prvar, tmaxvar and tminvar are provided as arguments to the yrgen function in simgen to recover annualized versions of the respective variables.

4.3 The detrend2 module

The program detrend2 is a short script that provides linear detrending. It is used by simgen to remove small random trends from segments that have been extracted from the long simulated sequence. It contains a single function, dt, which calls the scipy routine linregress to perform univariate linear regression.

5 Processing sequence

We present here an overview of program flow. Additional details are provided in Greene et al. [2012] and in the simgen code itself. The line numbers provided are approximate, and may shift as comments are modified or updated.

1. Ingestion of preexisting fixed files (line 159 et seq.) Certain files, as described in Secs. 3 and 3.5, must exist prior to running simgen. These files, with the exception of the individual station observations, are read.
   • A simulation “instance” (i.e., a sequence of length simlen yr, typically a few decades) is extracted from the long simulation file (line 166 et seq.).
Although the latter has no overall trend, small random trends may appear in short extracted segments. These are removed with detrend2.
   • The multimodel mean global temperature corresponding to the simulation period is extracted (line 173 et seq., also line 223).
   • Regional mean observational records are read (line 180).
2. The 21st-century regional precipitation trend is computed, based on the specified value of trendq and parameters of the ensemble distribution (line 208 et seq.).
3. Years in the observed sequence are resampled using a k-nearest-neighbor bootstrap, in order to generate subannual variations. For coherence across the watershed, the same sequence of years must be used at all stations. The sequence is created in this fairly involved routine (lines 228–437).
4. The main loop over catchments (line 448, reading for obsno in obsix:)
   • The station record is annualized (line 451 et seq.). This is performed (a) for detrending, via regression on the multimodel mean signal and (b) for computing the degree of dependence on the regional decadal signal, by regressing the residuals from step (a) on it. In the case of temperature, coefficients from step (a) will be used to project simulated trends forward in time; for precipitation the coefficients are used to “scatter” catchment trends around the imposed value computed in step 2. Coefficients from step (b) determine the degree to which the simulated regional signal is mixed with uncorrelated noise in the simulated station record.
   • Using the regional data, trends and coefficients from the above steps, station-level simulations are generated (line 456 et seq.). These have an annual time step, and incorporate both the simulated decadal signal and the inferred future trends.
   • The annualized station-level signal is downscaled to daily resolution, by rescaling the subannual variations from the resampled sequence of observational years (line 578 et seq.).
   • Two short routines (line 688 et seq.)
check the simulated sequences for days on which the maximum temperature is less than or equal to the minimum daily temperature, and for negative values of precipitation, neither of which is acceptable to ACRU (although the temperature condition could conceivably occur in nature). Maximum and minimum temperatures may occasionally be rounded to equal values if the former exceeds the latter by a small amount. The correction consists in adding a small increment to the maximum temperature, so that it exceeds the minimum temperature by 0.25 °C. Since precipitation is scaled multiplicatively, negative values are largely avoided. However, in the admixture of uncorrelated noise with the scaled regional signal, such values may (rarely) occur, typically if there exists a long drying trend, bringing scaled values close to zero. The correction consists of setting negative precipitation values to zero.
   • The final, checked variables are assembled into arrays (line 719 et seq.) and are written out as text files with a call to acru (line 733).
5. If a single station was designated, datmat, fmat and scalemat are returned to the interactive session; otherwise simgen loops over the stations whose indices are provided in obsix, generating and writing out simulation files for the catchments designated thereby.

6 Final remarks

In this document we have tried to describe the workings of simgen. We have included details of the software, including required ancillary packages, and have described the setup of the working environment and the steps involved in preparing and running the code. Example files permit actual execution of simgen and should be adequate for testing the code, for helping to understand what it does and how, and for preparing the programmer to apply simgen in other settings. As we have emphasized, the simgen code is not a universal solution to the problem of decadal simulation, but represents a particularized realization of the decadal simulation framework outlined in GHG.
It is thus unlikely that simgen will be deployed by others in exactly its present form. Rather, it is hoped that it will prove a useful template on which to base simulation models in a diverse array of environments.

A The ACRU format

The first two lines of obshis_2626.txt are reproduced below:

203711401950 1 1 0.0P 24.6 8.4 4.9 26.63 95.00 47.00 138.2
203711401950 1 2 0.0P 29.0 12.7 5.8 27.62 92.99 34.10 138.2

In the first field, the first eight digits are an internal identifier, not utilized by simgen. After this are four digits representing the year, which for both of these lines is 1950. Then, utilizing two spaces each, are fields for the month (January) and day (first and second of January, for the two lines). After this are the three key fields, precipitation, maximum daily temperature and minimum daily temperature, which for the first record are 0.0 mm, 24.6 °C and 8.4 °C, respectively. The “P” following the precipitation values in both lines indicates that these particular data have been filled (“patched”), i.e., that the values were initially missing. The readqc module reports to the screen the number of such values in any observational record being ingested. The remaining fields are not utilized by simgen, nor is data written to them in the simulation files, which include only a single space after the minimum temperature value.

References

Greene, A. M., L. Goddard, and J. W. Hansen, A framework for the simulation of regional decadal variability for agricultural and other applications, Tech. rep., International Research Institute for Climate and Society, Palisades, NY, 2011.

Greene, A. M., M. Hellmuth, and T. Lumsden, Stochastic decadal climate simulations for the Berg and Breede Water Management Areas, Western Cape province, South Africa, Water Resour. Res., 48, 2012.

Kwon, H.-H., U. Lall, and A. F.
Khalil, Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to rainfall and temperature, Water Resour. Res., 43, 2007.

Kwon, H.-H., U. Lall, and J. Obeysekera, Simulation of daily rainfall scenarios with interannual and multidecadal climate cycles for South Florida, Stoch. Environ. Res. Risk. Assess., 23, 879–896, 2009.

Prairie, J., K. Nowak, B. Rajagopalan, U. Lall, and T. Fulp, A stochastic nonparametric approach for streamflow generation combining observational and paleoreconstructed data, Water Resour. Res., 44, 2008.

Schulze, R. E., Hydrology and Agrohydrology: A Text to Accompany the ACRU 3.00 Agrohydrological Modelling System, WRC Report TT 69/95, Water Research Commission, Pretoria, RSA, 1995.

Taylor, K. E., R. J. Stouffer, and G. A. Meehl, An overview of CMIP5 and the experiment design, Bull. Am. Met. Soc., 2011.