The simgen software package: User guide and notes

Arthur M. Greene
International Research Institute for Climate and Society
The Earth Institute, Columbia University
New York, NY
June 28, 2012
Abstract
The software simgen is provided in conjunction with the technical report,
“A framework for the simulation of regional decadal variability for agricultural
and other applications,” prepared at the request of the Climate Change, Agriculture and Food Security (CCAFS) program of the Consultative Group on
International Agricultural Research (CGIAR). (See acknowledgments in that
report for detailed attribution.) The report describes a generalized approach
to statistical climate simulation, as applied to “near-term climate change” time
horizons. Section 5 in the report describes a case study, in which the general
approach is realized in a particularized climatological and applications-oriented
setting, in the Western Cape region of South Africa. (This work was reported
in a separate publication.) The simgen software comprises the various code
elements used to produce the simulations described in that case study. Here,
we describe both the code and its manner of deployment.
As pointed out in the technical report, simulation in a particular regional
setting will require a corresponding elaboration of the simulation framework,
which effectively constitutes a template for this purpose. We discuss here how
this plays out in the specific context of the case study, in the process identifying
the degree of generality applying to specific elements of the code.
One component of the detailed scheme was implemented using the statistical
software “R.” This is also described, and the necessary references provided.
simgen itself is written in the Python programming language, and makes use
of additional external Python packages. No fees or other charges are required
for use of any of the code or external packages employed in this project, all of
which are available under various open source licenses.
1 Introduction
The technical report “A framework for the simulation of regional decadal variability for agricultural and other applications” [Greene et al., 2011, referred to hereinafter as GHG] includes discussion of a case study. This study, described in detail
in Greene et al. [2012], involved the statistical generation of an ensemble of climate
simulations for the Berg and Breede Water Management Areas in the Western Cape
province of South Africa. The simulations, comprising precipitation as well as minimum and maximum daily temperatures, were designed to drive the ACRU hydrology
model [Schulze, 1995], developed at the University of KwaZulu-Natal, South Africa,
and were part of an ambitious effort, involving multiple institutional participants, to
characterize future climate in the region of the Western Cape. The present document
describes the code developed to produce those simulations and is provided, along with
that code, as an adjunct to GHG.
As discussed in GHG, simulation of regional decadal variability will be conditioned
by a number of factors, including characteristics of the regional climate, available
observational records and follow-on modeling requirements. The code, whose main
routine is named simgen, has thus been designed with ACRU-based studies in mind.
This means, inter alia, that input routines are designed to read, and output routines
to write, ACRU-formatted files. (This format is described in Appendix A.) It is
expected that programmers adapting simgen for other settings will modify these
routines, and indeed, other aspects of the code as well, to suit the requirements of
particular applications for which the simulations will be used.
Because the various software components derive from different sources, licensing
details vary among them. However, none of the components or programs utilized are
“commercial” software, in the sense that they involve acquisition costs or licensing
fees, in particular for noncommercial use. References to the various licenses and terms
are provided in Sec. 2.3. (Disclaimer: The author is not an attorney; nothing in this
document should be construed as legal advice.)
The sections of this guide describe the various code components (Sec. 2), setup
of the computing environment (Sec. 3), individual function calls (Sec. 4) and the
sequence of operations in simgen (Sec. 5). Final remarks are provided in Sec. 6.
2 Components
The simulation process, as realized in the case study, utilizes several software components, most, but not all, of which reside in the simgen code itself. The R programming
language (http://www.r-project.org) was also employed for certain tasks. Some
functions are executed only once, in setting up the simulation environment, while
others are repeated, typically in looping over individual locations in the modeled network. Tasks accomplished with R belong to the former group, and were logically
performed “offline,” i.e., outside the scope of the simgen code.
The simgen code itself is written in Python (http://www.python.org), an object-oriented programming language that has found wide application in many scientific
and technical fields. There are a number of Python distributions, each typically
including a set of “modules,” or packages, keyed to a defined range of tasks; we
utilize one of these, described below, but also suggest alternatives, so as to facilitate
the deployment of simgen.
2.1 Python
The main simgen code, as stated, is written in Python. The code also invokes functions from a number of Python modules that are not part of the core Python language.
Important among these are numpy, which provides key mathematical functions, as
well as numerical arrays and random variables, and scipy, which supplies a linear
regression function.
A module necessary for the present version of simgen, but that may possibly be
dispensed with in certain circumstances, is cdms (Climate Data Management System),
which is part of the CDAT (Climate Data Analysis Tools) Python distribution. In
fact, CDAT provides both numpy and scipy, so if the modeler chooses to install CDAT
(available at http://www2-pcmdi.llnl.gov/cdat), all the necessary tools will be
available from the start. Version 5.2 was the version utilized for the simulations
described herein.
CDAT is available for both the Linux operating system and for Mac OS X. The
author has run it only under Linux, but the developers of CDAT are known to also use
OS X, so it is likely that this is a viable option. It may also be possible to run CDAT
using a virtual environment on Windows computers, but we lack the experience to
offer guidance here. cdms is used only internally, in order to facilitate certain data
manipulations. Input and output are both in the form of ASCII files, and most of the
computational work is performed using numpy arrays (rather than cdms “transient
variables”). Dispensing with cdms would require rewriting some of the code, however.
We also note the existence of cdat-lite, a Python package that includes cdms and
other core CDAT libraries. Available at http://pypi.python.org/pypi/cdat-lite,
it provides the necessary toolkit while avoiding the necessity of installing the (much
larger) full CDAT distribution. We have also successfully run simgen using cdat-lite
version 6.0, in a non-CDAT environment.
Note that unless modifications are made to the simgen code as it now stands, it
will be necessary for the user to obtain and install CDAT (or cdat-lite) in order to
run simgen; we do not distribute either CDAT or cdat-lite.
The simulation code is designed to be run from within an interactive Python
session: The initial call (issued from a terminal) can be to cdat itself; we happen to prefer
the ipython shell (see http://ipython.org), but this is strictly a user preference.
Once Python is started (and all of the ancillary files are in place), simulations are
generated by first importing simgen, then issuing a call to the gen routine, using
appropriate arguments.
To facilitate comprehension and usability, simgen has been extensively commented,
with an initial “docstring” (set off by triple quotes) at the head of the module, additional docstrings placed strategically throughout the code and with many individual
comment lines (lines beginning with the hash symbol #). The use of docstrings permits access to the included information via use of the Python help function, as well
as other docstring functions, from within the interactive shell, while the individual
comment lines provide a more granular description of the various code sequences.
2.2 R
A key feature of the simulations described in Greene et al. [2012] is the generation
of stochastic sequences on the annual time step. The statistical structure used for
this step in the case study is identified as a vector autoregressive (VAR) model of
order unity. This model is fit to the regional observational data, in the case study
a trivariate annualized series of length 50, using the “Dynamic Systems Estimation”
(dse) time series package for the R programming language, i.e., external to the main
simgen code. The call in R was to the routine estVARXls, which implements a
least-squares estimation of VAR parameters. A call to simulate in the dse package
was used to generate the long simulated sequence referred to in Greene et al. [2012].
Other routines were used for model checking and testing various aspects of the inferred
model, but details of these procedures will depend on the specifics of any simulation
setting, as well as the form of time series model employed. The modularity of the
simulation code permits the fitting of an infinite variety of statistical models, in R or
in other software of the modeler’s choice, and the generation of a long simulation
sequence outside the main simgen code. When simgen is run it reads this long
sequence and slices it to produce the detailed downscaled simulations.
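As a rough illustration of the offline step described above, the following Python sketch fits a first-order VAR model by least squares and then generates a long sequence from it. It is not the dse implementation used in the case study: the function names fit_var1 and simulate_var1, the Gaussian noise assumption and the synthetic "observations" are all assumptions made here for illustration.

```python
import numpy as np


def fit_var1(x):
    """Least-squares fit of x[t] = A x[t-1] + e[t].

    x is an (n_years, n_vars) array, assumed already detrended and centered.
    Returns the coefficient matrix A and the residual covariance matrix.
    """
    past, present = x[:-1], x[1:]
    # Solve past @ B = present in the least-squares sense; A is B transposed.
    b, *_ = np.linalg.lstsq(past, present, rcond=None)
    resid = present - past @ b
    sigma = np.cov(resid, rowvar=False)
    return b.T, sigma


def simulate_var1(a, sigma, n_years, seed=0):
    """Generate n_years of data from a VAR(1) model with Gaussian noise."""
    rng = np.random.default_rng(seed)
    noise = rng.multivariate_normal(np.zeros(len(a)), sigma, size=n_years)
    out = np.zeros((n_years, len(a)))
    for t in range(1, n_years):
        out[t] = a @ out[t - 1] + noise[t]
    return out


# Synthetic stand-in for the 50-year trivariate regional series (not real data).
true_a = np.array([[0.5, 0.1, 0.0], [0.0, 0.4, 0.1], [0.1, 0.0, 0.3]])
obs = simulate_var1(true_a, np.eye(3), 50, seed=1)

a_hat, sigma_hat = fit_var1(obs - obs.mean(axis=0))
long_seq = simulate_var1(a_hat, sigma_hat, 1000, seed=2)
```

A sequence generated this way could be written to an ASCII file with numpy's savetxt, which is the form in which simgen expects to find it.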
2.3 Licenses
We summarize below licensing information for the software elements included in, or
utilized by simgen, to the best of our knowledge. Users are responsible for observing
the terms of these licenses.
1. Python: Open source software, compatible with the GNU General Public License (GPL). See http://docs.python.org/license.html.
2. CDAT: Full license terms are included within the source code distribution.
Commercialization of CDAT requires notification of either the United States
Department of Energy or Lawrence Livermore National Laboratory. Certain
third-party components are distributed subject to additional licensing terms.
3. R: The R language is licensed under the GNU General Public License version
2. Some files may be covered by the GNU General Public License version 3.
See http://www.r-project.org/Licenses for details.
4. simgen: simgen is provided under the “Attribution-NonCommercial-ShareAlike
3.0 Unported” (CC BY-NC-SA 3.0) Creative Commons license. Under this
form (see http://www.creativecommons.org/licenses) commercial use (i.e.,
the redistribution of simgen or a derivative for profit) is restricted, and requires
a written license agreement. Noncommercial distribution must be on the same
terms under which simgen and its ancillary files are originally provided, and must
include proper attribution. This attribution is defined in the leading docstring
of the simgen code.
3 Preliminary setup
Some files must exist, or must be created, prior to executing the main simgen code.
These include (a) observational data, both in the form of disaggregated station-level
daily files and as a three-component regional mean series at annual time resolution,
(b) a global-mean, multimodel-mean temperature record and (c) the long stochastic
sequence from which particular simulation “instances” are drawn. As explained in
Section 2.2, the R model that generates the stochastic low-frequency realizations on
which the detailed station-level simulations are based is fit to the observational data
outside the simgen code structure. For illustrative purposes, examples of those files
which must exist a priori are provided along with the simgen code.
3.1 Observational data
Observational records may be utilized in three ways. First, when taken to represent
the regional climatology they are used for the fitting of the statistical model with
which annual-to-decadal simulations are generated. Here, catchment averages for the
three variables, reduced to annual time resolution, are used for this purpose. Second,
when a simulation “instance” is downscaled, both the spatial and temporal disaggregation steps are conditioned by the relationship between the regional signal and
the fine-scale observational data at individual stations. Third, a k-nearest-neighbor
(k-NN) resampling scheme is utilized as part of the downscaling process. It is the
daily observational record that is resampled, in one-year blocks, in this step.
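The selection of an observed year by k-NN resampling can be sketched as follows. This is a generic illustration, not the routine in simgen: the function name knn_pick_year, the value of k and the rank-based 1/j selection weights are assumptions, and the actual code may differ in all of these. The mahalanobis switch mirrors the M argument of gen described in Sec. 4.1.

```python
import numpy as np


def knn_pick_year(target, obs_annual, k=5, mahalanobis=True, rng=None):
    """Return the index of one observed year that resembles `target`.

    target:     length-3 vector (pr, Tmax, Tmin) of simulated annual values.
    obs_annual: (n_years, 3) array of observed annual values.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = min(k, len(obs_annual))
    if mahalanobis:
        diff = obs_annual - target
        inv_cov = np.linalg.inv(np.cov(obs_annual, rowvar=False))
        # Squared Mahalanobis distance of each observed year from the target.
        dist = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    else:
        # Precipitation only (assumed here to be column 0).
        dist = np.abs(obs_annual[:, 0] - target[0])
    nearest = np.argsort(dist)[:k]
    # Assumed kernel: weights proportional to 1/rank, so closer years are favored.
    weights = 1.0 / np.arange(1, k + 1)
    weights /= weights.sum()
    return int(rng.choice(nearest, p=weights))


# Synthetic example data (not observations).
rng = np.random.default_rng(0)
obs = rng.normal(size=(50, 3))
year_index = knn_pick_year(np.zeros(3), obs, k=5, rng=rng)
```

The daily values for the selected year would then be taken as a one-year block from the observational record.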
3.2 The regional record
For the case study the regional signal, representing the behavior of the study region as
a whole, consists of an average over the 171 records representing quinary-level catchment values. This signal is multivariate, the three components being precipitation
and minimum and maximum daily temperatures. This record is used by the R routine
to generate the long simulation sequence, and directly by simgen, in conjunction with
the individual catchment values, for “broadcasting” of simulated annual-to-decadal
values to individual catchments.
3.3 Multimodel mean temperature signal
For detrending, as well as for the projection of future trends, a global-mean, multimodel ensemble mean temperature record is utilized. The ensemble is composed
of a set of global climate models (GCMs) from the Coupled Model Intercomparison
Project (CMIP5) [Taylor et al., 2011]. Global temperature records from the individual GCMs are smoothed and a multimodel average is computed. For the purposes of
consistent projection, the multimodel mean signal must extend from the start time
of the observational record through the end of the simulation period, several decades
into the future, requiring a choice of scenario for the future. The case study utilizes the 4.5 W m−2 “Representative Concentration Pathway” (RCP4.5) experiment
[Taylor et al., 2011]. Like the regional mean record described above, the multimodel
mean signal is generated offline, and stored prior to runtime.
3.4 Systematic and random components
The regional mean series (Section 3.2) are detrended by regression on the multimodel
mean signal (Section 3.3), the residual being a multivariate “target” that the simulations are designed to emulate. The target, having annual time resolution, represents
annual-to-decadal variability, and must initially be screened for the identification of
possible “systematic” elements, which are operationally defined as signal components
that differ significantly from AR(1) “red” noise. As discussed in GHG, if such elements are identified they must be modeled independently, the specific form of model
depending on data characteristics and simulation requirements. Such models might
assume forms as diverse as the WARM models described by Kwon et al. [2007, 2009]
or the nonhomogeneous hidden Markov models that were applied to paleorecords of Lee’s
Ferry streamflow on the Colorado River in the American Southwest by Prairie et al.
[2008]. The range of models that might be deployed in this important step is limited
only by the range of observed regional behavior.
Deterministic elements were not identified in the regional series associated with
the case study [see Sec. 4.2.1 and Fig. 6 in Greene et al., 2012]. Since the series were
tested against an AR(1) null hypothesis, the annual-to-decadal component is thus
modeled as a multivariate stochastic (red noise) process. For this purpose a first-order vector autoregressive (VAR) model was fit, using the R dse package. After
suitable verification, this model was used to generate a single, extended simulation
sequence. This sequence is stored using the savetxt function in numpy, in the form
of an ASCII file.
3.5 File locations and names
A directory containing one possible arrangement of files is provided at
http://iri.columbia.edu/~amg/CCAFS/simgen/. Included subdirectories and the
files they contain are listed here. The programmer may of course organize files and
directories as desired, modifying pathnames in the simgen code accordingly.
• dat: This directory holds the regional time series, in “obsav.dat.”
• input_sim: This holds the single long simulated sequence. The example file,
“sim_100kyr.dat,” includes 100 kyr of simulated data.
• obs: The obs directory holds the individual (i.e., station-level) observational
data, here a single example file, named for historical reasons “obshis_2626.txt.”
In a full region-scale experiment there would be a possibly large set of such files.
The simgen code is set up to read files from the obs directory by specifying just
the four-digit identifiers before the “.txt” suffix (in the form of a Python list),
but this default can easily be modified.
• output_sim: The directory into which the output simulations are written. Here
it holds a single sample file, named “sim_100k_obshis_2626_001000.txt.” The
first part of the filename identifies the regional file from which the simulation is
derived. The “2626” label corresponds to the station whose variables are being
simulated. Finally, the “001000” identifier refers to the index into the long
simulation file at which the simulated sequence begins. That is, the long file
consists of 100 kyr of data; the “001000” signifies that the simulated sequence
uses values beginning with index 1000 into the file.
• pickled: This directory contains “tasav_cmip5_comb_sm_1901-2095.p,” the smoothed
global multimodel mean temperature series. The year range is indicated in the
file name.
• python: The python directory contains three Python scripts, “simgen9s.py”,
the main simgen code, “readqc.py”, a routine for reading ACRU-formatted files
such as “obshis_2626.txt” and “detrend2.py”, a simple linear detrender used by
simgen. The simgen version, as of this writing, is 9s, the 9 signifying the version
number and s that this is a “special” edit, where file and directory names have
been modified to reflect the demonstration environment.
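The naming conventions listed above can be expressed as two small helper functions. This is only a sketch: the helper names obs_path and output_name are hypothetical, and the underscore separators and zero-padding widths are inferred from the example filenames rather than taken from the simgen code.

```python
import os


def obs_path(index, obs_dir="obs"):
    """Path of the station file for a four-digit index (integer or string)."""
    return os.path.join(obs_dir, "obshis_%s.txt" % str(index).zfill(4))


def output_name(sim_label, index, simix):
    """Output filename: regional label, station identifier, six-digit start index."""
    return "%s_obshis_%s_%06d.txt" % (sim_label, str(index).zfill(4), simix)


# Example matching the demonstration files (separators are an assumption).
print(obs_path(2626))
print(output_name("sim_100k", 2626, 1000))
```

Adapting simgen to a different directory layout would amount to changing helpers of this kind, or the equivalent lines in the code.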
4 Functions
There are a number of function definitions within the simgen module and also within
the scripts called by simgen. They are described here: first the functions in simgen,
in the order in which they appear, and then the two external scripts imported by
simgen. Comments within the code (either lines beginning with the hash symbol #
or text enclosed by triple quotes """) provide further details.
4.1 The simgen module
• gen(obsix, simix, trendq, fname='sim_100kyr.dat',\
write=1, simlen=66, locate=2041, xval=0, M=1):
This is the main routine in simgen, calling other functions as needed to produce
the simulations. Note that the “\” character above represents a line break
and has no effect on execution. The arguments are as follows:
– obsix: A Python list of four-digit index numbers (integers or strings) corresponding to the observational files for the locations to which the regional
simulation will be downscaled. If there is only a single station (as in the
demonstration folder) its index must still be provided as a list, for example
[2626], rather than simply 2626. The code that translates these indices
into the filenames to be read, as well as the filenames themselves, will very
likely differ in particular application settings.
– simix: An integer whose value must be less than the length of the long
simulation file minus the length of the sequences being simulated. In the
example provided above, the value given was 1000. This is the index into
the long simulation at which the chosen segment begins.
– trendq: Specifies the quantile of the multimodel distribution to be used
for the future precipitation trend simulation.
– fname: The filename of the long simulation sequence. The enclosing directory name input_sim is not included but is prepended by simgen, an
arrangement that can easily be modified to suit the user’s computing environment. Note that the default pathname given in the function definition
may be overridden by providing an alternate in the call to gen.
– write=1: Whether or not to write output files. When running simgen for
diagnostics the writing of output files may not be required. The default
may be overridden as described above. This also applies to the defaults
given below.
– simlen=66: The desired simulation length, in years.
– locate=2041: The year, in the simulated sequence, at which a specified
decadal fluctuation is to begin.
– xval=0: A value of zero will cause the 1950-1999 values of the simulation
to replicate the observations; if the value is unity the 1950-1999 values will
be simulated.
– M=1: Whether or not to use the Mahalanobis distance metric in the k-NN
routine, basing distance on the three-component (pr, Tmax, Tmin) vector.
If zero, only precipitation is used for the distance computation.
• yrgen: This function takes a daily (univariate) time series and returns annualized values, as well as the indices into the daily series at which year breaks
occur. Called by gen. This function requires cdms to properly interpret arrays
returned by the readqc module (see below).
• acru: Called by gen, this function writes ACRU-formatted files. (Note that
this is also the format of the “obshis” demonstration file.) The arguments,
simdat, simlen, infile, fname and simix, refer to the simulation data to
be written, the length in years of the 21st-century component of the simulation,
names of the input observational and simulation files, and the index into the
latter for slicing the simulation sequence. A number of these parameters are
used simply for naming the output file.
• getmoda: Provides a day-by-day list of month and day, each in two-digit numerical form, for either normal or leap years. Called by acru, for formatting
the files to be written out.
• leap: Takes a four-digit integer year and returns 1 if a leap year, 0 if not.
This function and the one above are necessitated by the hydrological model’s
requirement that leap and normal years be differentiated. In the observational
files as well as the simulations, the former will include the extra day.
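The behavior described for leap and getmoda can be sketched in a few lines. The versions below are illustrative rewrites, not the code in simgen: the name month_day_list is hypothetical, and the use of the full Gregorian leap-year rule is an assumption.

```python
def leap(year):
    """Return 1 if `year` is a leap year (Gregorian rule), 0 otherwise."""
    return int(year % 4 == 0 and (year % 100 != 0 or year % 400 == 0))


def month_day_list(year):
    """Return ('MM', 'DD') string pairs for every day of `year`, in order."""
    days = [31, 28 + leap(year), 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    return [("%02d" % month, "%02d" % day)
            for month, n_days in enumerate(days, start=1)
            for day in range(1, n_days + 1)]


# A leap year yields 366 entries, including February 29.
moda_1952 = month_day_list(1952)
```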
If obsix is a list with more than one entry, i.e., if simulations are being generated
for a network of stations, simgen does not return a value to the interactive window
(it does write out the simulation files, assuming write=1). However, if a simulation
is created for just a single station, in which case obsix holds a single value, then
three arrays, identified in the code as datmat, fmat and scalemat are returned.
These hold the complete simulation, at daily resolution, the trend component and the
non-trend component, respectively. fmat and scalemat have annual resolution.
These arrays can be useful for diagnosis as well as plotting the forced and unforced
components of the simulation, separately or in combination.
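The roles of the simix and simlen arguments to gen can be illustrated with a minimal slicing helper. The name extract_instance is hypothetical, and the sketch assumes only that the long simulation is held as a sequence with one entry (or row) per year; it applies the constraint on simix stated in the argument list above.

```python
def extract_instance(long_seq, simix, simlen=66):
    """Return `simlen` consecutive years of `long_seq`, starting at index `simix`."""
    if not 0 <= simix < len(long_seq) - simlen:
        raise ValueError(
            "simix must be non-negative and less than len(long_seq) - simlen")
    return long_seq[simix:simix + simlen]


# Stand-in for a 100 kyr sequence: here just the year indices themselves.
long_seq = list(range(100000))
segment = extract_instance(long_seq, 1000)
```

The same slicing applies unchanged to a two-dimensional numpy array with years along the first axis.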
4.2 The readqc module
The Python program readqc contains a single function, r, whose argument is the
name of a file (either observational or simulated) written in the ACRU format. It
returns four Python objects, designated in the code as prvar, tmaxvar, tminvar,
and datmat. The first three of these are one-dimensional cdms “TransientVariable”
objects, holding, besides data values, embedded time axis and calendar information,
and correspond to precipitation, maximum and minimum daily temperatures, respectively. The last of the variables, datmat, is a single numpy array object (i.e., without
the time axis information) holding all three variables. In all of these arrays the time
resolution is daily. prvar, tmaxvar and tminvar are provided as arguments to the
yrgen function in simgen to recover annualized versions of the respective variables.
4.3 The detrend2 module
The program detrend2 is a short script that provides linear detrending. It is used by
simgen to remove small random trends from segments that have been extracted from
the long simulated sequence. It contains a single function, dt, which calls the scipy
routine linregress, which performs univariate linear regression.
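Linear detrending of this kind can be sketched without external packages. The function below is a pure-Python stand-in for the regression that detrend2 delegates to scipy; the name remove_linear_trend is hypothetical, and whether dt also removes the mean, as this sketch does, is an assumption.

```python
def remove_linear_trend(y):
    """Subtract the least-squares straight line (fit against index) from y."""
    n = len(y)
    x_mean = (n - 1) / 2.0
    y_mean = sum(y) / float(n)
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [v - (intercept + slope * i) for i, v in enumerate(y)]


# A series that is exactly linear detrends to (numerically) zero.
residual = remove_linear_trend([2.0 + 0.5 * i for i in range(10)])
```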
5 Processing sequence
We present here an overview of program flow. Additional details are provided in
Greene et al. [2012] and in the simgen code itself. The line numbers provided are
approximate, and may shift as comments are modified or updated.
1. Ingestion of preexisting fixed files (line 159 et seq.) Certain files, as described
in Secs. 3 and 3.5, must exist prior to running simgen. These files, with the
exception of the individual station observations, are read.
• A simulation “instance” (i.e., a sequence of length simlen yr, typically a
few decades) is extracted from the long simulation file (line 166 et seq.).
Although the latter has no overall trend, small random trends may appear
in short extracted segments. These are removed with detrend2.
• The multimodel mean global temperature corresponding to the simulation
period is extracted (line 173 et seq., also line 223).
• Regional mean observational records are read (line 180).
2. The 21st-century regional precipitation trend is computed, based on the specified
value of trendq and parameters of the ensemble distribution (line 208 et seq.).
3. Years in the observed sequence are resampled using a k-nearest-neighbor bootstrap, in order to generate subannual variations. For coherence across the watershed, the same sequence of years must be used at all stations. The sequence
is created in this fairly involved routine (lines 228–437).
4. The main loop over catchments (line 448, reading for obsno in obsix:)
• The station record is annualized (line 451 et seq.). This is performed (a)
for detrending, via regression on the multimodel mean signal and (b) for
computing the degree of dependence on the regional decadal signal, by
regressing the residuals from step (a) on it. In the case of temperature,
coefficients from step (a) will be used to project simulated trends forward
in time; for precipitation the coefficients are used to “scatter” catchment
trends around the imposed value computed in step 2. Coefficients from
step (b) determine the degree to which the simulated regional signal is
mixed with uncorrelated noise in the simulated station record.
• Using the regional data, trends and coefficients from the above steps,
station-level simulations are generated (line 456 et seq.) These have an
annual time step, and incorporate both the simulated decadal signal and
the inferred future trends.
• The annualized station-level signal is downscaled to daily resolution, by
rescaling the subannual variations from the resampled sequence of observational years (line 578 et seq.)
• Two short routines (line 688 et seq.) check the simulated sequences for
days on which the maximum temperature is less than or equal to the minimum daily temperature, and for negative values of precipitation, neither of
which is acceptable to ACRU (although the temperature condition could
conceivably occur in nature). Maximum and minimum temperatures may
occasionally be rounded to equal values if the former exceeds the latter by
a small amount. The correction consists in adding a small increment to
the maximum temperature, so that it exceeds the minimum temperature
by 0.25 °C.
Since precipitation is scaled multiplicatively, negative values are largely
avoided. However, in the admixture of uncorrelated noise with the scaled
regional signal, such values may (rarely) occur, typically if there exists
a long drying trend, bringing scaled values close to zero. The correction
consists of setting negative precipitation values to zero.
• The final, checked variables are assembled into arrays (line 719 et seq.)
and are written out as text files with a call to acru (line 733).
5. If a single station was designated, datmat, fmat and scalemat are returned to
the interactive session; otherwise simgen loops over the stations whose indices
are provided in obsix, generating and writing out simulation files for the
catchments designated thereby.
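The two consistency checks applied in step 4 before the files are written can be sketched as follows. The function name enforce_acru_constraints is hypothetical and the data are held in plain lists for simplicity; the corrections themselves (raising the maximum temperature to 0.25 °C above the minimum, and setting negative precipitation to zero) follow the description above.

```python
def enforce_acru_constraints(pr, tmax, tmin, min_gap=0.25):
    """Return corrected copies of daily precipitation and maximum temperature.

    Negative precipitation is set to zero. Where tmax <= tmin, tmax is
    raised to tmin + min_gap. Minimum temperature is left unchanged.
    """
    pr_fixed = [max(p, 0.0) for p in pr]
    tmax_fixed = [tn + min_gap if tx <= tn else tx
                  for tx, tn in zip(tmax, tmin)]
    return pr_fixed, tmax_fixed


# Made-up values: one negative precipitation, two invalid temperature pairs.
pr_ok, tmax_ok = enforce_acru_constraints(
    [-0.3, 0.0, 2.5], [20.0, 10.0, 9.0], [10.0, 10.0, 12.0])
```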
6 Final remarks
In this document we have tried to describe the workings of simgen. We have included
details of the software, including required ancillary packages, and have described the
setup of the working environment and the steps involved in preparing and running
the code. Example files permit actual execution of simgen and should be adequate
for testing the code, helping to understand what it does and how, and for preparing
the programmer to apply simgen in other settings.
As we have emphasized, the simgen code is not a universal solution to the problem
of decadal simulation, but represents a particularized realization of the decadal simulation framework outlined in GHG. It is thus unlikely that simgen will be deployed
by others in exactly its present form. Rather, it is hoped that it will prove a useful
template on which to base simulation models in a diverse array of environments.
A The ACRU format
The first two lines of obshis_2626.txt are reproduced below:
203711401950 1 1 0.0P 24.6 8.4 4.9 26.63 95.00 47.00 138.2
203711401950 1 2 0.0P 29.0 12.7 5.8 27.62 92.99 34.10 138.2
In the first field, the first eight digits are an internal identifier, not utilized by
simgen. After this are four digits representing the year, which for both of these lines
is 1950. Then, utilizing two spaces each, are fields for the month (January) and day
(first and second of January, for the two lines). After this are the three key fields,
precipitation, maximum daily temperature and minimum daily temperature, which
for the first record are 0.0 mm, 24.6 °C and 8.4 °C, respectively. The “P” following the
precipitation values in both lines indicates that these particular data have been filled
(“patched”), i.e., that the values were initially missing: The readqc module reports
to the screen the number of such values in any observational record being ingested.
The remaining fields are not utilized by simgen, nor is data written to them in the
simulation files, which include only a single space after the minimum temperature
value.
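A minimal parser for the fields described above might look like the following. This is a sketch rather than the readqc implementation: the function name parse_acru_line is hypothetical, and it assumes fixed widths for the first sixteen characters (identifier, year, month, day) and whitespace separation for the fields that follow.

```python
def parse_acru_line(line):
    """Parse the date and the three key variables from one ACRU-formatted line."""
    fields = line[16:].split()
    return {
        "id": line[:8],                      # internal identifier, unused by simgen
        "year": int(line[8:12]),
        "month": int(line[12:14]),
        "day": int(line[14:16]),
        "pr": float(fields[0].rstrip("P")),  # precipitation, mm
        "patched": fields[0].endswith("P"),  # True if the value was filled
        "tmax": float(fields[1]),            # maximum daily temperature
        "tmin": float(fields[2]),            # minimum daily temperature
    }


# First line of the demonstration file, as reproduced in this appendix.
sample = "203711401950 1 1 0.0P 24.6 8.4 4.9 26.63 95.00 47.00 138.2"
record = parse_acru_line(sample)
```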
References
Greene, A. M., L. Goddard, and J. W. Hansen, A framework for the simulation
of regional decadal variability for agricultural and other applications, Tech. rep.,
International Research Institute for Climate and Society, Palisades, NY, 2011.
Greene, A. M., M. Hellmuth, and T. Lumsden, Stochastic decadal climate simulations
for the Berg and Breede Water Management Areas, Western Cape province, South
Africa, Water Resour. Res., 48 , 2012.
Kwon, H.-H., U. Lall, and A. F. Khalil, Stochastic simulation model for nonstationary time series using an autoregressive wavelet decomposition: Applications to
rainfall and temperature, Water Resour. Res., 43 , 2007.
Kwon, H.-H., U. Lall, and J. Obeysekera, Simulation of daily rainfall scenarios with
interannual and multidecadal climate cycles for South Florida, Stoch. Environ. Res.
Risk. Assess., 23 , 879–896, 2009.
Prairie, J., K. Nowak, B. Rajagopalan, U. Lall, and T. Fulp, A stochastic nonparametric approach for streamflow generation combining observational and paleoreconstructed data, Water Resour. Res., 44 , 2008.
Schulze, R. E., Hydrology and Agrohydrology: A Text to Accompany the ACRU
3.00 Agrohydrological Modelling System, WRC Report TT 69/95 , Water Research
Commission, Pretoria, RSA, 1995.
Taylor, K. E., R. J. Stouffer, and G. A. Meehl, An overview of CMIP5 and the
experiment design, Bull. Am. Met. Soc., 2011.