In Part 1 of this series, I set out my interest in atmospheric CO2 and R as a programming language for statistical modeling. I explained why I chose the Mauna Loa CO2 series for learning statistics with R.
Finally, I mentioned that during the last 40 years or so, the weekly record of CO2 at Mauna Loa has 17 missing observations, about 0.8% of the total.
The Purpose of this Article
In this part of this series I propose a methodology for estimating the missing values. Here I establish my reasoning and criteria.
I wish to estimate the missing values for convenience in doing the calculations.
1. The method should not introduce artifacts: estimated values should be consistent with the original data.
2. The estimated values should not change the results of the analysis more than would gaps in observations.
A stricter criterion would be that the estimated values should not change the results more than would a data series that had no missing values. This seems impossible, since we do not know the missing values. However, there is one approach to modelling that suggests that many missing values would have no effect on the results: Fourier analysis. The R language is suitable for doing Fourier analysis and so is the ESRL CO2 data.
The method for infilling missing values must respect the data in the context of the model used to do the analysis. If only a small percentage of values are missing, the data series can still be analyzed as a regular time series. With the Mauna Loa CO2 series, only 17 weekly values are missing out of over 2100 weeks. Thus, the very sparsity of the missing values justifies interpolation of the values.
Apart from the exploration phase of the study, I avoid using moving averages because moving averages can introduce artifacts into analysis: I do not want to base my conclusions on artifacts that I generate by data processing.
With R, I can use regression to run a line through the weekly CO2 data points. I can use the line to revise my estimate of missing data.
Data points exactly on the regression line cannot affect not the shape, location or slope of the line. So estimated missing values that exactly on the line will be consistent with the data series and will not alter the results of the analysis.
Caveat: Provided we do not change the model that we use to define the line that fits the data. This caveat seems obvious and is probably the reason ESRL provides raw data, flagging the missing values but not replacing them with estimates.
I plan to back-cast (predict backwards) the weekly CO2 observations, first by week, and then by estimating values for the mid-points of every month. The mid-month estimates can be used to verify that my model results are consistent with the monthly averages provided by ESRL.
This seems like a lot of work to replicate what ESRL have already done. However, this rational follows the principle: the researcher must know the data before he can do anything with it.
A researcher can come to know the data by exploring it in a naïve way. This is a difficult task in itself, because no Earth scientist or statistician will find it easy to adopt a naïve view in regard to something so familiar as the Keeling Curve.
The ESRL Mauna Loa CO2 graphic is familiar to most scientists and statisticians.
Compatibility of the Scientific Method and Data Interpolation
There is an over-riding imperative: the methodology should be acceptable to both atmospheric physicists and professional statisticians.
I follow the scientific method as advocated by Richard Feynman: First, I guess the missing values. I show how in the graphic below which is from an Excel spreadsheet. And then I test to determine if the guesses are supported by the observations on the whole data series.
Four missing values in one month is the worst gap that we find in the Mauna Loa CO2 record — a whole month missing, indicated by the orange markers in the graphic. This occurred twice, in 1975 and again in 1984.
ESRL provides a five-year graphic for the CO2 series that suggests a way to interpolate the missing values. The regularity of both trend and seasonal change indicates that interpolated values may contain errors that are no greater than the random noise already present in the data. The close-up graph provided by the Scripps Institution of Oceanography supports this.
I estimated the interpolated figures by using the data indicated by blue diamonds, y1 (346.17) and y6 (347.33). I assumed the growth rate r is geometric and calculated values for y2 to y5 by growing the value y1 by rate r = (y6/y1)^(1/5). Thus y2 = y1*(1+r) … to … y5 = y4 * (1+r).
The interpolated values appear to follow a straight line, but only because the curve from y1 to y6 is so slight.
The spreadsheet graphic displays about the same amount of noise as in the close-up graph from the Scripps Institute.
This graphical method is just augmented guessing. Its only virtue is that it allows the analysis to proceed as if there were no missing values. Once the data is modeled, the model can be used to test the goodness of fit of the estimates of missing values and the missing values will be adjusted by using the fitted line produced by the model.
This is why I consider the method for interpolating missing values should be specific to the model used to analyze the data: change the model and the justification for the estimating the missing values may no longer be valid.
Part 3 to follow: Defining the Model for Mauna Loa CO2