Statistics with R Using Atmospheric Gases: Part 3 Defining the Model for Mauna Loa CO2

Defining the Model

The title seems to imply there is one model for the Mauna Loa CO2 data series. But there are many, some of which may be better than others at describing the data, and some better at revealing the underlying processes that generate the observations. So “defining the model” actually refers to a process of exploration.

What happened to the naïve approach? Rather than proceed as if I knew nothing about the data, I intend to explore the data series without committing to hypotheses about what physical, biological or social factors generated the data.

Properties of the Mauna Loa CO2 Series

The approach to modeling will depend on whether the modeler is working with the daily, weekly, monthly or annual data. As of 20th May 2015, there were 2139 weeks of data from May 19, 1974 to May 10, 2015. For most years there were 52 weeks, but for 7 years there were 53 weeks.

That is, 52 weeks for each of 41 years gives 2132 weeks, and 7 of those years contribute one extra week, for a total of 2139 observations.

We begin by reading the data as a vector into a spreadsheet, estimating the missing values and then preparing a graph to view the full years from the first week in 1975 to the last week in 2014.

mlco2 1975-2014
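In R (the language this series is building toward), the read-and-clean step might be sketched as follows. The column layout and the -999.99 missing-value flag follow the published ESRL weekly file format, but the sample rows here are illustrative, not actual data:

```r
# Three sample rows in the ESRL weekly layout: year, month, day,
# decimal date, CO2 in ppmv; -999.99 flags a missing week.
# The values below are illustrative only.
txt <- "1984 4 1 1984.2486 344.27
1984 4 8 1984.2678 -999.99
1984 4 15 1984.2869 345.31"
co2 <- read.table(text = txt,
                  col.names = c("yr", "mon", "day", "decimal", "ppm"))
co2$ppm[co2$ppm < -99] <- NA   # convert the missing-value flag to NA
# For the real file, the same call would take a file path instead of
# text =, with comment.char = "#" to skip the header block.
nrow(co2)                      # 3 rows, one of them now NA in ppm
```

Plotting is then a one-liner, e.g. `plot(co2$decimal, co2$ppm, type = "l")`.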

My first impressions are: the annual variations are wave-like and the long-term pattern seems to increase almost as a straight line, with very small fluctuations from year to year.

I count as a discovery that the waves appear to be equally spaced and similar in amplitude. This is something to keep in mind for next steps.

We can see that the pattern of CO2 growth and variability is different from a random walk, as described by the Penn State Department of Meteorology in the following graphic.

With a random walk, the variance is not constant over time (nor is the mean, if there is drift). Such a process, if subject to shocks, does not revert to the mean of its path. The process is not stationary, and unless it can be transformed into a stationary process, statistical tools will give spurious results.

randomwalklong_l

By contrast, the process that generates the CO2 data does revert to a mean, but it is the mean value of the long-term trend line. The series is not stationary but it is mean-reverting. As a consequence, the series can be transformed so that statistical tools give valid results. The problem is to find a method to transform the data.

Close-up Visualization

A closer look at the CO2 data reveals details of short-term patterns and noise (random variations).

Keeling 128 weeks

In this segment of the series there are three peaks and two troughs. It appears that the three peaks could be connected by a straight line.

I count the apparent pattern in peaks as a discovery to be kept in mind.

The seasonal variation seems to follow a saw-tooth pattern, which may or may not result from the upward trend.

I count as a discovery that the wave pattern might be saw-tooth.

Seasonal Pattern: Saw-tooth or Sinusoidal?

By asking this question, I am taking the first step into geophysical theory, but it is a small departure from the naïve approach because it is common knowledge that the Earth has seasons and that the changing of the seasons is related to the Sun, as demonstrated by the following graphic from Robert O’Connell’s course in Astronomy.

Dec-of-Sun-v-Date

I cannot be certain, but I can proceed by using the working hypothesis: the seasonal pattern of the Mauna Loa CO2 series is generated as a sinusoidal series. I reserve judgement on the reason for the saw-toothed appearance.

Caveat: I count as a risk that assuming a sinusoid pattern may bias my results.

Consequences of Sinusoidal Pattern in the Data

The main consequence of assuming the seasonal pattern to be sinusoidal is that statistical software provides facilities for fitting sine and cosine functions to the data.
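As a minimal sketch of what such a sine-and-cosine fit looks like in R, consider `lm()` applied to a synthetic weekly series (a linear trend plus one annual sinusoid plus noise; all numbers here are invented for illustration, not the Mauna Loa values):

```r
set.seed(1)
t <- seq(0, 40, by = 1/52)                # time in years, weekly steps
y <- 330 + 1.7 * t + 3 * sin(2 * pi * t) + rnorm(length(t), sd = 0.3)

# Regress on time plus one annual harmonic (sine and cosine terms).
fit <- lm(y ~ t + sin(2 * pi * t) + cos(2 * pi * t))

# The seasonal amplitude is recovered from the two harmonic coefficients.
amp <- sqrt(sum(coef(fit)[3:4]^2))
round(unname(coef(fit)[2]), 2)            # close to the true trend, 1.7
round(amp, 2)                             # close to the true amplitude, 3
```

The sine and cosine pair lets the regression estimate the phase of the seasonal cycle as well as its amplitude.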

Yet, before proceeding to modeling using trigonometric functions, there is a simple way to explore the data. The following figure shows a complete sine function (A1 and B1) and we know that the full cycle covers 360 degrees, equivalent to one revolution of Earth around the Sun.

sine_wave

We know that area A1 + area B1 = 0 because A1 has positive values and B1 has equal and opposite (negative) values. Also we know that shifting the window by 180 degrees subtracts A1 and adds exactly the same amount at A2.

Intuitively we can see that if we always sum over 360 degrees, then moving one degree from left to right will subtract from A1 exactly the same amount as added at A2.

We conclude that a moving average of 360 degrees (one year) along the entire waveform will always equal zero.
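This can be checked numerically in R: wherever the one-cycle window starts, the moving average of a pure sine is zero up to rounding error.

```r
deg <- 1:(5 * 360)                         # five full cycles, 1-degree steps
wave <- sin(deg * pi / 180)

# Centred 360-point (one-cycle) moving average of the waveform.
ma <- stats::filter(wave, rep(1/360, 360), sides = 2)

max(abs(ma), na.rm = TRUE)                 # effectively zero everywhere
```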

However if we apply a moving average of 52 weeks from 1975 to 2014 the total will not equal zero. If the peaks and the troughs are equally high and low, the sum of weeks 1 to 52 and the sum of weeks 2 to 53 and so on will give the average of the peaks and troughs. This will be an estimate of the trend in CO2 increase from week to week with seasonality removed. The result will be a line that represents the trend in CO2 level.
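In R the 52-week moving average can be taken with `stats::filter()`. A sketch on a synthetic series (trend plus annual sinusoid; invented numbers) shows the seasonal wave averaging out, leaving the trend:

```r
week <- 1:(52 * 40)                           # 40 synthetic years of weeks
series <- 330 + 0.033 * week + 3 * sin(2 * pi * week / 52)

# Centred 52-week moving average: the seasonal sinusoid sums to zero
# over any full cycle, so only the trend remains.
trend <- stats::filter(series, rep(1/52, 52), sides = 2)

sum(is.na(trend))                             # 51 weeks lost at the ends
```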

We cannot proceed to the end of the series because once we pass the first week in 2015 we would have only 51 weeks in our sum. We will lose 51 weeks of our series: at the beginning or end, or 25 and 26 weeks split between the two ends. (I ignore the data from part years, 1974 and 2015.)

There are other risks in using this approach. This is the real world that is being measured and not a mathematical function. We cannot be sure that the intervals in the real world equivalent to A and B, from May to September, are constant in width. Perhaps the time between maximum and minimum varies by a week or so from year to year.

Nor can we be sure that the moving average fits the amplitude of seasonal variations in the real world. Although the moving average is adaptive over time periods as long as 52 weeks, over shorter time periods it could generate artifacts.

I have to keep in mind that this analysis will give approximations for exploration only. The fluctuations and anomalies in the data series are so slight that these rough approximations may lead to false conclusions.

The following graphic shows the trend line for years 2012 through 2014. The 53rd weekly data value for 2012 was omitted. This avoided misalignment of the series. By omitting the 53rd week, the date of the vernal equinox (March 20th or 21st) remains almost aligned at about the 12th week.

Precise alignment on the dates of the vernal equinoxes, to a precision of +/- 3 days, will be done during pre-processing. The autumn equinoxes are not expected to become aligned. This may be justified on both geophysical and biophysical grounds. Physics alone tells us that solar energy increase at the vernal equinox drives the annual cycle. Thus solar energy decline at the autumnal equinox may allow the system to return to its state at the vernal equinox (relaxation).

In physical terms, this view of the system would be equivalent to a variably damped harmonic oscillator with variable periodic perturbation. This seems a long way from the naïve approach, but these ideas are simply working hypotheses and not commitments.

Trend line

seasonal variations

The trend line can be superimposed on the data series. Note that the trend line is equivalent to the zero line in the seasonal graph.

trend line plus seasonal.

The seasonal graphic does not look quite right because the series seems to be declining when the expectation was that it would be neither increasing nor declining. Further, when the trend line is superimposed on the original data, it does not appear to be centered vertically, but instead appears to pass a little below the midpoint between the maximum and minimum values. This visual impression arises mainly because the last year's peak and trough are both lower than those of the preceding two years.

However, the following graphic demonstrates alternative ways of conceptualizing the relationship between seasonal variations and the trend line.

I am using the term “outgassing” to refer to the release of CO2 into the atmosphere from all sources, including combustion and respiration, and the term “drawdown” to refer to the capture of CO2 from the atmosphere by all sinks.

Some sources may also be sinks and vice versa, depending on the season and state of Earth systems.

drawdown.or.degassing.

To explore the data, I assumed that the process of drawdown in summer is balanced by the process of outgassing (including combustion) and that therefore the trendline should pass through the mean between the peaks and troughs. I therefore imposed on the data a trendline that fits this hypothesis.

Yet, the data series itself has two intrinsic trend lines defined by the annual maximum and minimum levels of CO2 in May and September respectively. The gaps between adjacent minima and maxima could provide a measure of the varying width of the channel that contains the data series.

These three measures are non-parametric in the sense that no formal hypothesis needs to be posited concerning the shape of the annual cycle, whether sinusoidal or saw-tooth. This may or may not be useful as a way to proceed but is prompted by an article by Graham Griffiths, City University, London, UK and William Schiesser, Lehigh University, USA in Scholarpedia.
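The three measures can be sketched in R on a synthetic weekly series (invented numbers again); `tapply()` splits the series by year to pick out the annual extremes:

```r
set.seed(2)
week <- 1:(52 * 40)
year <- (week - 1) %/% 52 + 1975              # label each week with a year
ppm <- 330 + 0.033 * week + 3 * sin(2 * pi * week / 52) +
  rnorm(length(week), sd = 0.3)

peaks   <- tapply(ppm, year, max)             # upper intrinsic trend line
troughs <- tapply(ppm, year, min)             # lower intrinsic trend line
width   <- peaks - troughs                    # varying width of the channel

round(range(width), 1)                        # roughly twice the amplitude
```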

About linear and non-linear waves they say:


Some dissipation and dispersion occur naturally in most physical systems described by PDEs. Errors in magnitude are termed dissipation and errors in phase are called dispersion. These terms are defined below. The term amplification factor is used to represent the change in the magnitude of a solution over time. It can be calculated in either the time domain, by considering solution harmonics, or in the complex frequency domain by taking Fourier transforms.

…Generally, [numerical dissipation] results in the higher frequency components being damped more than lower frequency components. The effect of dissipation therefore is that sharp gradients, discontinuities or shocks in the solution tend to be smeared out, thus losing resolution…

…the Fourier components of a wave can be considered to disperse relative to each other. It therefore follows that the effect of a dispersive scheme on a wave composed of different harmonics, will be to deform the wave as it propagates. However the energy contained within the wave is not lost and travels with the group velocity. Generally, this results in higher frequency components traveling at slower speeds than the lower frequency components. The effect of dispersion therefore is that often spurious oscillations or wiggles occur in solutions with sharp gradient, discontinuity or shock effects, usually with high frequency oscillations trailing the particular effect…

The naïve approach has led me to view the CO2 series as representing global phenomena that may be subject to amplification, dissipation and dispersion, not something amenable to analysis by spreadsheet.

The pattern of increase and decrease in CO2 may reveal three things: (1) the net increase year by year, (2) the relative power of the sources, and (3) the relative power of the sinks.

That the trend line passes through the series about midway between the annual peaks and troughs is a reasonable assumption. Even so, this assumption seems to me to be a theoretical construct for estimating the net power of the sources and sinks.

An alternative way to conceptualize the series is to focus on the peaks and the troughs. The annual increase above the bottoms of the troughs shows when the sources have more power than the sinks. The annual decline below the peaks shows when the sinks have more power than the sources.

Phenology is concerned with the timing of biotic activity and events. The amount of inter-annual variation in timing of the peaks, troughs and the time interval between the peaks and troughs, if any, seems to me to be worth investigating.

Conclusions

Manipulation of the data in a spreadsheet is useful for the exploratory stage of the study. In addition to the results reported here, I explored the following:

1. Mean annual CO2 level and its variance. I found that the variance increases with the mean but did not check to see if this was in accordance with Taylor’s power law. I did a rough check and found that the percentage increase in standard deviation scales with the percentage increase in the mean, which points to compound growth in CO2 level over time.

The apparent increase in variance about the trend line appears to be an artifact of using a linear model instead of a compound growth rate model.

The fit of an arithmetic (linear) trend between year number and annual average CO2 level seems not to be significantly different from that of an exponential trend. Each regression accounted for over 99% of the variance in CO2 level. The arithmetic trend was slightly concave upwards and the exponential trend slightly concave downwards. The arithmetic increase in CO2 averaged 1.70 ppmv per year compared to the average seasonal amplitude of about 6.2 ppmv. The seasonal amplitude represented 1.7% of the average CO2 level of 362 ppmv.

2. I examined departures from the smooth trend line and found interesting anomalies. Since the methodology is crude, not a lot can be derived from this graph. The 1998 super El Niño left its mark. However, the role of the eruption at Pinatubo in June 1991 may have been exaggerated, since close inspection shows that the CO2 level was declining before the eruption occurred.

annual anomalies

The most significant lesson from analysis of the anomalies is that the difference between the maximum and minimum anomaly was only 6 ppmv, about equal to the seasonal variation, reduced in this graphic by smoothing to about one ppmv.
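The arithmetic-versus-exponential comparison under item 1 can be reproduced in miniature in R. This sketch fits both trends to synthetic annual means built with a constant compound growth rate (the rate and base level are invented, not the published figures):

```r
yr <- 1975:2014
mean_co2 <- 330 * 1.0045^(yr - 1975)      # pure compound growth, no noise

lin  <- lm(mean_co2 ~ yr)                 # arithmetic (linear) trend
expn <- lm(log(mean_co2) ~ yr)            # exponential trend, via logs

# Both fits account for over 99% of the variance, so the two trends are
# hard to distinguish; only the sign of the residual curvature differs.
c(linear = summary(lin)$r.squared, exponential = summary(expn)$r.squared)
```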

CO2 is a trace atmospheric gas that increased on average by one-third of one percent per annum between 1959 and 2014. (The percentage growth rate is identical when calculated as linear or geometric.)

The global level of CO2 as measured at Mauna Loa fluctuates seasonally about the annual trend in a range of about 1.5%. (Geometric average of the range taken from peak to trough, range defined as the ratio of annual maximum divided by annual minimum.)

The annual trend line is subject to external shocks, such as El Niño, but is mean-reverting, so that anomalies from the long-term trend line are about the same order of magnitude as the seasonal variations.

All these characteristics of CO2 data cause statisticians to refer to the series as an example of a deterministic trend that is stationary about its trend line. “Deterministic” in science implies that some external (physical and/or social) processes generate the observations and that the scientific problem is to discover what these processes are and how much each “determines” the observed resultant level of CO2 in the atmosphere and its variation temporally and spatially.

Given the location and elevation of the Mauna Loa Observatory, the atmospheric gases may be well-mixed. If so, this would have the effect of a low-pass filter that allows the ENSO signal to remain but may or may not suppress teleconnections with other oceanic basins, the Atlantic and Indian Oceans in particular. By analogy, continental-scale signals from biotic, geologic and human factors may also be filtered out.

What remains appears to be a global CO2 signal representing net emissions (sources minus sinks). Because there is so little land and population in the Southern Hemisphere, land-based signals may be taken to represent a Northern Hemisphere signal.

Decision about how to Proceed

All of these considerations have led me to decide that analysis by spreadsheet is acceptable for inspecting the main characteristics of the data series, that is, for exploration, but not for much more than exploration.

What I will need to proceed are tools powerful enough to disentangle temporal, geographical, natural and human factors in the CO2 level and its variability.

At the very least it will be necessary to use trigonometrical functions to fit trendlines and to apply Fourier analysis to detrended CO2 data series, not only the Mauna Loa series, but other series in the global network.

Years ago I converted Peter Bloomfield’s FORTRAN code for Fourier analysis into Microsoft QuickBasic, a programming package that is now obsolete. (Fourier Analysis of Time Series: An Introduction. Wiley, 1976.)

The second edition of Dr. Bloomfield’s book (published in 2000) uses the S-Plus language, a commercial version of S. There are some important differences between R and S, but much of the code for S runs unaltered in R.

A little research about R reveals why the language has become so popular with the scientific community. R is free and so are thousands of add-ons, packages, functions, code snippets and tutorials. And although R was specifically designed for statistical analysis, the language has evolved into a complete system for scientific research and presentation.

Up to this point, my analysis of the Mauna Loa CO2 series has relied on spreadsheets. I had intended to infill missing data using the R programming language, but have decided it will be more efficient to do the pre-processing using Excel and the statistical analysis using R.

Part 4 to follow: Literature Review of the Mauna Loa CO2 Series


Statistics with R Using Atmospheric Gases: Part 2 Estimating Missing Values in the Context of Modeling Mauna Loa CO2

Recap

In Part 1 of this series, I set out my interest in atmospheric CO2 and R as a programming language for statistical modeling. I explained why I chose the Mauna Loa CO2 series for learning statistics with R.

Finally, I mentioned that during the last 40 years or so, the weekly record of CO2 at Mauna Loa has 17 missing observations, about 0.8% of the total.

The Purpose of this Article

In this part of this series I propose a methodology for estimating the missing values. Here I establish my reasoning and criteria.

I wish to estimate the missing values for convenience in doing the calculations.

1. The method should not introduce artifacts: estimated values should be consistent with the original data.

2. The estimated values should not change the results of the analysis more than would gaps in observations.

A stricter criterion would be that the estimated values should not change the results more than would a data series that had no missing values. This seems impossible, since we do not know the missing values. However, there is one approach to modelling that suggests that many missing values would have no effect on the results: Fourier analysis. The R language is suitable for doing Fourier analysis and so is the ESRL CO2 data.

Rationale

The method for infilling missing values must respect the data in the context of the model used to do the analysis. If only a small percentage of values are missing, the data series can still be analyzed as a regular time series. With the Mauna Loa CO2 series, only 17 weekly values are missing out of over 2100 weeks. Thus, the very sparsity of the missing values justifies interpolation of the values.

table of missing co2 data

Apart from the exploration phase of the study, I avoid using moving averages because moving averages can introduce artifacts into analysis: I do not want to base my conclusions on artifacts that I generate by data processing.

Thought experiment

With R, I can use regression to run a line through the weekly CO2 data points. I can use the line to revise my estimate of missing data.

Data points exactly on the regression line cannot affect the shape, location or slope of the line. So estimated missing values that lie exactly on the line will be consistent with the data series and will not alter the results of the analysis.

Caveat: this holds provided we do not change the model that we use to define the line that fits the data. This caveat seems obvious and is probably the reason ESRL provides raw data, flagging the missing values but not replacing them with estimates.

I plan to back-cast (predict backwards) the weekly CO2 observations, first by week, and then by estimating values for the mid-points of every month. The mid-month estimates can be used to verify that my model results are consistent with the monthly averages provided by ESRL.

The Rationale

This seems like a lot of work to replicate what ESRL have already done. However, this rationale follows the principle: the researcher must know the data before he can do anything with it.

A researcher can come to know the data by exploring it in a naïve way. This is a difficult task in itself, because no Earth scientist or statistician will find it easy to adopt a naïve view in regard to something so familiar as the Keeling Curve.

The ESRL Mauna Loa CO2 graphic is familiar to most scientists and statisticians.

Compatibility of the Scientific Method and Data Interpolation

There is an over-riding imperative: the methodology should be acceptable to both atmospheric physicists and professional statisticians.

I follow the scientific method as advocated by Richard Feynman: First, I guess the missing values. I show how in the graphic below, which is from an Excel spreadsheet. Then I test to determine whether the guesses are supported by the observations on the whole data series.

Four missing values in one month is the worst gap that we find in the Mauna Loa CO2 record — a whole month missing, indicated by the orange markers in the graphic. This occurred twice, in 1975 and again in 1984.

ESRL provides a five-year graphic for the CO2 series that suggests a way to interpolate the missing values. The regularity of both trend and seasonal change indicates that interpolated values may contain errors that are no greater than the random noise already present in the data. The close-up graph provided by the Scripps Institution of Oceanography supports this.

I estimated the interpolated figures by using the data indicated by blue diamonds, y1 (346.17) and y6 (347.33). I assumed the growth is geometric and calculated values for y2 to y5 by growing y1 at the compound rate r = (y6/y1)^(1/5) − 1. Thus y2 = y1*(1+r) … to … y5 = y4*(1+r).

The interpolated values appear to follow a straight line, but only because the curve from y1 to y6 is so slight.
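The same calculation in R, with r written explicitly as a compound rate, (y6/y1)^(1/5) − 1, so that y2 = y1*(1+r) and so on (endpoint values from the 1984 gap):

```r
y1 <- 346.17                      # last observed week before the gap
y6 <- 347.33                      # first observed week after the gap

r <- (y6 / y1)^(1/5) - 1          # weekly compound growth rate over the gap
y <- y1 * (1 + r)^(0:5)           # y1, y2, ..., y6; y2..y5 fill the gap

round(y, 2)                       # endpoints reproduce y1 and y6 exactly
```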

graph of missing values for 1984
Atmospheric CO2 in parts per million. ESRL Mauna Loa data for 31 weeks in 1984
Dates in decimal format

The spreadsheet graphic displays about the same amount of noise as the close-up graph from the Scripps Institution.

This graphical method is just augmented guessing. Its only virtue is that it allows the analysis to proceed as if there were no missing values. Once the data is modeled, the model can be used to test the goodness of fit of the estimates of missing values and the missing values will be adjusted by using the fitted line produced by the model.

This is why I consider the method for interpolating missing values should be specific to the model used to analyze the data: change the model and the justification for estimating the missing values may no longer be valid.

Part 3 to follow: Defining the Model for Mauna Loa CO2

Statistics with R using Atmospheric Gases: Part 1 The Background to Mauna Loa CO2 Measurement

Atmospheric Gases

Anyone who begins a program of statistics for geoscience will discover that sunspots and variable stars are favoured by most authors. Soon it becomes apparent that Earth data requires a different, often more complex approach.

First, Earth data is more detailed and more plentiful. And second, Earth geoscience data often comes with a multitude of associated physical variables as well as economic and social data.

To begin my study of atmospheric gases, I chose CO2, partly because of its inherent interest for geoscience, but also because there are many associations between CO2, other physical data, and socio-economic data.

The R programming Language

Although I used a spreadsheet to pre-process data, I have started to use the R language to do statistical analysis, because R is a language and environment specifically for statistical computing and graphics.

The language is supported by the Free Software Foundation via GNU. Further information: R language

ESRL,  NOAA, and the Carbon Cycle

ESRL is the Earth System Research Laboratory of the U.S. National Oceanic & Atmospheric Administration (NOAA). Among the many research themes of ESRL is the Carbon Cycle.

Carbon, as used in the term carbon cycle, refers to all of the organic and inorganic compounds involved in the generation and storage of carbon dioxide. Study of the carbon cycle has traditionally included physics, chemistry and biology.

Today, study of the carbon cycle has expanded to include law, economics, philosophy, political science, sociology and psychology. However, I aim to confine these blogs to physics, chemistry, biology, and those social factors, such as economics, that drive the variation in CO2.

The importance of carbon dioxide is that photosynthesis by plants is the basis for most life on Earth, on land and in the ocean. Although water vapour is the most important greenhouse gas, interest in carbon dioxide has grown since Arrhenius (in 1896) calculated the effect of CO2 in modulating the heating of the Earth by solar energy, a second effect of CO2 alongside its direct role in the evolution of plant and animal life on Earth.

Carbon Dioxide (CO2) Measurements

ESRL maintains a global network to monitor Greenhouse Gases, including CO2. The Mauna Loa Observatory in Hawaii is the most famous of all stations in the global network.

Mauna Loa Observatory

The elevation of the observing site is 3397 meters (11,141 ft), far enough removed from local effects so that the observations are interpreted to represent well-mixed global density of CO2.

Mauna Loa CO2 Data

ESRL provides the Mauna Loa CO2 data in several versions. I chose the weekly data, downloaded from the ftp address provided.

I decided to begin with the CO2 data collected at Mauna Loa, because the Mauna Loa CO2 measurements are the most widely used CO2 series. The monthly series dates back to 1958 and the weekly series to 1974. The weekly data file created on May 6, 2015 has weekly averages from May 19, 1974 to April 2015, updated every week with delay of a few weeks to allow preliminary validation.

Missing Data

Geoscience data usually has missing values owing to instrumental problems or to contamination by the environment. The Mauna Loa data updated to May 6, 2015 spanned 2137 weeks, of which 17 weekly entries (0.8%) indicated missing values, a 99.2% recording rate.

The ESRL web site also provides monthly data with missing values, together with a second column that contains estimates for the missing values. However, the weekly data do not show estimates for the missing values.

This is not a setback. It’s an opportunity to develop and test an estimation methodology, which I explain and demonstrate in Part 2 of this article.

Part 2 follows: Estimating Missing Values in the Context of Modeling Mauna Loa CO2