Darwin Airport Temperature Data in Context

(download)

Previously, I attempted to recreate Willis Eschenbach's analysis of Darwin, Australia's historical temperature record.1 This recreation confirmed that a slight cooling trend in Darwin's combined temperature record became a pronounced warming trend after the Global Historical Climate Network (GHCN) statistically adjusted a subset of Darwin's individual records. What was left unanswered was whether such adjustments, originally intended to control for changes in land use and measurement practices, were, en masse, introducing or inflating a global warming trend.

In order to quantify the impact of the aforementioned adjustments, I analyze trends in temperature records taken from thousands of weather stations. More specifically, I examine the change in each record's trend due to the GHCN's adjustment process. I then use the results to place the adjustments made to Darwin's temperature records in context. As usual, I rely on R, a statistical programming language, for its ability to automate statistical operations. Follow along, step-by-step, and draw your own conclusions about the GHCN's adjustment process.

Computing the Trends

We begin our analysis with the assumption that raw and adjusted datasets, each containing yearly GHCN mean temperature data, have already been loaded into R.2 Next, we define a function capable of computing a trend for each temperature record present within a specified dataset.

> ComputeTrendsByRecord <- function(data) {
    # Returns a dataset containing the linear trend of each temperature
    # record in the input dataset.
    #
    # data: A dataset containing yearly GHCN mean temperature data.
    #
    trend.data <- data.frame(Id=character(), Rec=factor(), Trend=numeric(),
        Length=numeric())
    for (station.id in unique(data$Id)) {
        station.data <- subset(data, Id == station.id)
        station.recs <- split(station.data, f=station.data$Rec, drop=TRUE)
        for (rec in station.recs) {
            y <- ts(rec$MeanTemp, start=rec[1, "Year"])
            x <- time(y)
            b <- coef(lm(y ~ x))["x"]
            names(b) <- NULL
            trend.data <- rbind(trend.data,
                data.frame(Id=station.id, Rec=rec[1, "Rec"], Trend=b,
                    Length=length(y)))
        }
    }
    return(trend.data)
}

Once the above function is defined, we start computing trends.

> raw.trend.data <- ComputeTrendsByRecord(raw.data)
> adj.trend.data <- ComputeTrendsByRecord(adj.data)

With R's system.time() function, we can measure how long it takes to compute the trends. Doing so leads to the realization that computing thousands of linear regressions is very resource intensive. For instance, on an AMD Turion64 ML-32 CPU at 1.8 GHz with 384 MB system RAM running Ubuntu 8.04 it takes an elapsed time of 40 minutes to compute the raw trends and 18 minutes to compute the adjusted trends. It is faster to compute the adjusted trends because every temperature record has a raw version, but not every record has a corresponding adjusted version.3 Our aim now is to pair the trends by record via a merge.

> paired.trend.data <- merge(raw.trend.data, adj.trend.data,
    by=c("Id", "Rec"))

The head() function gives us a glimpse of what the resultant dataset looks like, for example:

> head(paired.trend.data)
           Id Rec      Trend.x Length.x      Trend.y Length.y
1 10160390000   0 -0.001393259      119  0.015645415      119
2 10160390000   1  0.013603567       64  0.043426557       64
3 10160390000   2  0.021083753       56 -0.006408558       56
4 10160390000   3  0.013808714       42  0.026670411       42
5 10160395001   0  0.038976221       56 -0.001259622       56
6 10160419000   0 -0.018078076       49 -0.018078076       49

Here, columns with .x and .y suffixes denote temperature record attributes calculated on raw and adjusted data, respectively. Our last manipulation is to add a new column that captures how much each record's trend has changed going from a pre-adjusted state to a post-adjusted state.

> paired.trend.data$DeltaTrend <- (paired.trend.data$Trend.y -
    paired.trend.data$Trend.x) * 10 # degC/decade

It should be noted that each change (delta) in trend is multiplied by a factor of 10 so that it is expressed in degrees Celsius per decade. Finally, our approach to computing trends is obviously inefficient. For instance, we expend a lot of wasted effort calculating trends for raw temperature records that have no corresponding adjusted version. Improving the code is left as an exercise for the reader.

Graphing the Trends

The best way to visualize the deltas that we have just calculated is via a histogram.

> hist(paired.trend.data$DeltaTrend,
    las=1,
    main=paste("Histogram of Changes in the Trends of",
        nrow(paired.trend.data), "Records\n",
        "due to the GHCN's Adjustment Process"),
    nclass=200,
    xlab="Degrees Celsius per Decade")

Our histogram is not complete, though. We need to draw a series of vertical lines through it, each line marking the location of a delta associated with one of Darwin Airport's temperature records. At this point, it is important to remember that Darwin Airport's station identifier is "50194120000".

> darwin.paired.trend.data <- subset(paired.trend.data,
    Id == "50194120000")
> abline(v=darwin.paired.trend.data$DeltaTrend,
    col=c("blue", "purple", "red"))
> legend("right",
    fill=c("blue", "purple", "red"),
    inset=0.1,
    legend=darwin.paired.trend.data$Rec,
    title="Darwin Record")

The final product is shown in Fig. 1 above.

Results

From Fig. 1, it is clear that the change in the trend of Darwin Record Zero (blue), Eschenbach's "smoking gun",4 is extreme; however, it is not extremal. There are 574 records with deltas greater than Darwin Record Zero, including Darwin Record Two (red). There are also 453 records that have had their trends adjusted downward more than Darwin Record Zero has had its trend adjusted upward. Of the 6,736 deltas calculated, 3,317 are greater than zero and 2,574 are less than zero. There is evidence that this difference is not due to chance alone (sign test, P<0.001).5 The mean change in trend is +0.017°C per decade (t-test, P<0.001).6 The mean change in trend weighted by record length in years is +0.018°C per decade. Finally, Darwin's temperature records are summarized as follows:

Darwin's Deterministic Trends (°C/dec)
Record Raw Adjusted Delta Years
0 -0.07150741 0.163448332 0.23495574 110
1 0.02683280 0.004114076 -0.02271872 69
2 0.15437369 0.414611679 0.26023798 42
3 0.31134085 NA NA 20
4 -0.41069865 NA NA 9

Our results replicate those found elsewhere.7, 8

An Optimistic Interpretation

An optimistic interpretation of our results is that no "smoking gun" exits. In other words, even if the adjustments to Darwin Record Zero seem bizarre, the fact is that "most adjustment[s] hardly modify the reading, and the warming and cooling adjustments end up compensating each other."9 Furthermore, the whole point of the GHCN's adjustment process is to eliminate biases in the temperature data. In this regard, the mean change in trend of +0.017°C per decade is a sign that the GHCN's adjustment process is doing what it is supposed to do: separate the signal of global warming from the noise of local, non-climatic effects.

A Skeptical Interpretation

A skeptical interpretation of our results is that the GHCN's adjustment process is not just offsetting existing biases, but creating new ones. Thus, the "smoking gun" is not the upward adjustment of Darwin Record Zero's temperature trend but the incongruous treatment of Darwin Record One and Darwin Record Two. As shown in Fig. 2, the raw version of Darwin Record Two (red) is practically a subsequence of Darwin Record One (purple). Yet, after going through the GHCN's adjustment process, the two records barely overlap. If the purpose of the GHCN's adjustment process is to correct for local changes affecting a temperature station, then, in the absence of any station metadata to the contrary, similar records should be adjusted similarly. Moreover, an effect size of +0.017°C per decade is not insignificant as it represents 31% of the +0.054°C per decade global temperature trend from 1850-2005.10

Errata

In working with the GHCN Version 2 database, I came across the following problems:

Gaps/Deletions
Some temperature records have gaps in which whole years are missing. Arguably, when such a gap is present, the offending record should probably be split into two separate records. Also, it should be noted that gaps subvert R's time series facilities since time series objects are supposed to represent observations over consecutive years.
Insertions/Repetitions
Some adjusted temperature records have extraneous observations that do not appear in their corresponding raw versions. See records from stations "30187860001", "40371826000", "40878662001", "41476726000", and "80099910001" for more details.

  1. Markus, M. (2010-01-12). Darwin Airport Temperature Data Revisited. Personal blog.
  2. Markus, M. (2010-01-12). See the section on "The Data Reloaded".
  3. This statement should be true; however, it appears that Kuwait International Airport (i.e. "21240582000") has a ghost. That is, there is a temperature record, Record Three to be exact, for which there is an adjusted version but no original version.
  4. Eschenbach, W. (2009-12-08). The Smoking Gun At Darwin Zero. Watts Up With That?
  5. Whitely, E., & Ball, J. (2002). Statistics review 6: Nonparametric methods. Critical Care, 6, 509-513. [PMID: 12493072]. Here we use a nonparametric sign test (with zeros discarded) lest someone accuse us of completely ignoring the distribution's deviation from normality. In R, use the binom.test() function.
  6. Conducting a Student's t-Test is reasonable given the paired nature of the data, the relative symmetry of the distribution, and the large sample size. In R, use the t.test() function.
  7. Gilestro, G. (2009-12-12). Lots of smoke, hardly any gun. Do climatologists falsify data? Personal blog. See also the accompanying results.txt file.
  8. Stokes, N. (2009-12-21). Darwin and GHCN Adjustments. Moyhu.
  9. Gilestro, G. (2009-12-12).
  10. Trenberth, K. E., et al. (2007). Observations: Surface and Atmospheric Climate Change. In S. Solomon, et al. (Eds.), IPCC Fourth Assessment Report (pp. 235-336). New York: Cambridge University Press. See Table 3.2 on p. 243.

Darwin Airport Temperature Data Revisited

 

(download)

Climatological organizations often statistically adjust temperature records in order to offset local, non-climatic effects. Herein, I use R, a statistical programming language, to recreate Willis Eschenbach's analysis of some questionable adjustments made to the historical temperature record of Darwin, Australia by the Global Historical Climate Network (GHCN).1 This post builds upon my previous work with the GHCN Version 2 database;2 however, I have tried to make it relatively self-contained so that one can proceed without too much backtracking.

The Data Reloaded

Assuming that we have the relevant temperature data files (i.e. v2.mean and v2.mean_adj) in our working directory,3 we begin a new R session. Next, we copy and paste the following function into the R environment:

> LoadGHCN <- function(filename, na.rm=FALSE) {
    # Returns a dataset containing yearly GHCN mean temperature data.
    #
    # filename: The name of a file containing monthly GHCN mean
    #           temperature data.
    #    na.rm: Strip missing values before calculating yearly means?
    #   
    months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep",
        "Oct", "Nov", "Dec")
    colw <- c(11, 1, 4, rep(5, 12))
    coln <- c("Id", "Rec", "Year", months)
    colc <- c("character", "factor", "numeric", rep("numeric", 12))
    data <- read.fwf(filename, colw, col.names=coln, na.strings="-9999",
        colClasses=colc)
    data$MeanTemp <- rowMeans(data[months], na.rm=na.rm)/10 # Celsius
    return(data[c("Id", "Rec", "Year", "MeanTemp")])
}

This function is just some refactored code taken from my previous post on Darwin and put into a reusable form. It is designed to read monthly GHCN mean temperature data from a file and transform said data into a more utilitarian representation. One important aspect of this function is that it allows one to specify, via the na.rm parameter, how missing monthly values should be handled when computing each yearly mean. Normally, in statistical programming languages like R and SAS, a calculation on a missing value (i.e. NA) results in a missing value. Such a conservative approach is usually desired. In our case, though, such an approach is too conservative since the monthly GHCN mean temperature data is interspersed with missing values. Thus, we load the raw and adjusted temperature data like so:

> raw.data <- LoadGHCN("v2.mean", na.rm=TRUE)
> adj.data <- LoadGHCN("v2.mean_adj", na.rm=TRUE)

It should be noted that there are more intelligent ways to handle missing values. For instance, one could use the complete.cases() function to determine the pattern of missing values per row. One could then opt to assign a missing value to a yearly mean if the row of monthly temperature values upon which it is based contains a sequence of missing values. Improving the LoadGHCN() function is left as an exercise for the reader.

Eschenbach's Darwin

Willis Eschenbach, a climate change skeptic, has created a plot, shown in Fig. 1 above, that depicts the effects of GHCN adjustments to Darwin Airport's combined temperature record.4 Our goal is to use R to recreate this plot. The first obstacle we encounter is the fact that there are several distinct temperature records for Darwin Airport. More specifically, Darwin's raw temperature data consists of five records, three of which are present in homogenized form in its adjusted dataset. We need to compute an average temperature record for Darwin on a per dataset basis. To help us do this, we define the following function:

> AverageTemperatureRecords <- function(data, station.id) {
    # Returns a time series that is the average of all of the temperature
    # records for the given temperature station.
    #
    #       data: A dataset containing yearly GHCN mean temperature data.
    # station.id: An eleven-character temperature station identifier.
    #
    station.data <- subset(data, Id == station.id)
    station.recs <- split(station.data, f=station.data$Rec, drop=TRUE)
    station.ts <- NULL
    for (rec in station.recs) {
        station.ts <- ts.union(station.ts,
            ts(rec$MeanTemp, start=rec[1, "Year"]))
    }
    if (!is.null(dim(station.ts))) {
        station.ts <- ts(rowMeans(station.ts, na.rm=TRUE),
            start=start(station.ts))
    }
    return(station.ts)
}

Now, recalling that Darwin Airport's station identifier is "50194120000", we average the temperature records in each dataset to create two separate time series.

> raw.darwin.ts <- AverageTemperatureRecords(raw.data, "50194120000")
> adj.darwin.ts <- AverageTemperatureRecords(adj.data, "50194120000")

The next problem we must overcome is that the temperature time series in Fig. 1 are presented in terms of temperature anomalies. Temperature anomalies are departures from a common baseline. In climatology, the baseline is usually a global mean temperature calculated over some localized time period (e.g. 1971–2000).5 Nevertheless, in Fig. 1, the time series are normalized with respect to two different baselines so that they both can start from zero. Eschenbach claims that this is done to highlight the effect of the adjustments.

> eschenbach.raw.ts <- raw.darwin.ts - raw.darwin.ts[1]
> eschenbach.adj.ts <- adj.darwin.ts - adj.darwin.ts[1]

Once we have the normalized time series, we can plot them.

> ts.plot(eschenbach.adj.ts, eschenbach.raw.ts,
    col=c("red", "blue"),
    gpars=list(pch=20),
    main="GHCN Raw & Adj Temperatures Darwin Airport",
    type="o",
    ylab="Temperature Anomaly (C)")

The result forms the foundation of Fig. 2 above. We are not done yet, though. We need to add trend lines. To start, we use the lm() function to fit a straight-line model to each time series. In other words, we are just doing some simple linear regressions with time as the independent variable and temperature anomaly as the dependent variable. Note that the lm() function's na.action parameter must be set to NULL to ensure that a regression's residuals and fitted values are returned as time series.

> eschenbach.raw.reg <- lm(eschenbach.raw.ts ~ time(eschenbach.raw.ts),
    na.action=NULL)
> eschenbach.adj.reg <- lm(eschenbach.adj.ts ~ time(eschenbach.adj.ts),
    na.action=NULL)

Using the coef() function on the output of the first regression, we get an estimated slope for the straight-line model of -0.007124129, which translates into a cooling of 0.007° C per year or 0.7° C per century. This agrees with the trend that Eschenbach reports for the raw temperature time series in Fig. 1. The second regression yields an estimated slope of 0.01905713, indicative of a 1.9° C per century warming. Eschenbach, on the other hand, reports a 1.2° C per century warming for the adjusted temperature time series in Fig. 1. Despite this discrepancy, we plot the regressions' fitted values so as to produce the trend lines shown in Fig. 2.

> lines(fitted(eschenbach.adj.reg), col="red")
> lines(fitted(eschenbach.raw.reg), col="blue")

The last thing we have to do is overlay the amount of adjustment.

> lines((eschenbach.adj.ts - eschenbach.raw.ts), col="black")

We finally have Fig. 2, which looks very similar to Fig. 1, Eschenbach's original plot. The biggest difference between the two figures is the fact that, in Fig. 1, the black line tracking the amount of adjustment is plotted against a scale on the right-hand side. One gripe is that Eschenbach should have calculated the amount of adjustment as (adj.darwin.ts - raw.darwin.ts). We can confirm this by looking at Fig. 3, a plot of the original time series, which shows us that the adjusted time series starts off 2.34° C below the raw time series and that the series eventually converge.

A Smoking Gun?

So, are the GHCN's adjustments inappropriately introducing or inflating a global warming trend? Is Darwin a smoking gun? The climate change skeptic may view the adjustments made to Darwin with some suspicion. Yet, there really is no way to contextualize the adjustments made to Darwin without looking at other temperature stations.


  1. Eschenbach, W. (2009-12-08). The Smoking Gun At Darwin Zero. Watts Up With That?
  2. Markus, M. (2009-12-23). Raw Darwin Airport Temperature Data with R. Personal blog.
  3. Markus, M. (2009-12-23). See the section on "Getting the Data".
  4. Eschenbach, W. (2009-12-08). Figure 7.
  5. Global Surface Temperature Anomalies. (2009-12-11). NOAA.

Raw Darwin Airport Temperature Data with R

 

(download)

Traditionally, I have done statistical analyses using SAS. SAS provides an analyst with a pretty good environment; however, the system is expensive and programming via the SAS Macro Language leaves much to be desired. R, a freely available statistical programming language, is the open source response to SAS. Up until now, I have not done much programming in R. One reason for my lack of programming in R is that I don't have any interesting data to analyze. Forutnately, a series of tit-for-tat blog posts1,2,3,4 discussing the temperature record for Australia's Darwin Airport has changed this situation. My goal here is to use R to recreate Fig. 1,5 as shown above. I hope that you will follow along.

Getting the Data

First, we need to get the temperature record for Darwin Aiport. This data is contained in the Global Historical Climate Network (GHCN) Version 2 database, which is available for download from the National Climatic Data Center (NCDC). According to the NCDC, the GHCN database "contains historical temperature, precipitation, and pressure data for thousands of land stations worldwide."6 Since we are just interested in the temperature data, all we have to do is download the following files:

v2.mean.Z
A compressed file containing monthly mean station temperatures.
v2.mean_adj.Z
A compressed file containing monthly mean station temperatures statistically adjusted to offset "nonclimatic discontinuities" such as changes in instrument shelters or local station environments.
v2.temperature.inv
An inventory file containing station metadata.

It should be noted that I downloaded all of the above files on December 18th, 2009.

Preparing the Data

Now that we have the temperature data, we need to uncompress it. On UNIX, this can be done using the uncompress command.

> uncompress v2.mean.Z
> uncompress v2.mean_adj.Z

After uncompressing the files, each file should be larger and no longer have the ".Z" ending. Next, we need to find the identifier for the Darwin Airport station by searching the inventory file using the grep command.

> grep "DARWIN" v2.temperature.inv
50194120000 DARWIN AIRPOR                  -12.40  130.87   30   17U   56FLxxCO 3A 1WATER           C

Here, the identifier is the eleven-digit number at the beginning of the row (i.e. "50194120000"). It consists of a country code (3 digits), station number (5 digits), and a modifier (3 digits). We can also confirm that we have the correct station by noting that the Darwin Airport is indeed located at 12°40'S latitude and 130°87'E longitude.

Loading the Data into R

Once we have the station identifier, we can start up our R environment7 and then load the raw temperature data from the v2.mean file into it via the read.fwf command. We can learn more about that command, or any other R command, by prefixing it with a question mark.

> ?read.fwf

Since the v2.mean file has a fixed-width format, we must specify the width, name, and class of each of its columns. Additionally, we need to tell R that missing monthly temperature values are coded as -9999.

> months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep",
    "Oct", "Nov", "Dec")
> colw <- c(11, 1, 4, rep(5, 12))
> coln <- c("Id", "Rec", "Year", months)
> colc <- c("character", "factor", "numeric", rep("numeric", 12))
> raw <- read.fwf("v2.mean", colw, col.names=coln, na.strings="-9999",
    colClasses=colc)

It may take awhile for R to load all of the data. After R finishes loading the data, we can use the summary and attributes commands to verify that the data was loaded correctly. R reports that the dataset that I am using has 595,759 rows.

Plotting the Data

Before we can actually plot the raw Darwin Airport temperature data, we must first calculate yearly mean station temperatures. We can accomplish this by using the rowMeans command to average all of the monthly temperatures on a per row (i.e. year) basis. It should be noted that each row mean is divided by a factor of 10 in order to convert it into degrees Celsius.

> raw$RawMeanTemp <- rowMeans(raw[, months], na.rm=TRUE)/10
> raw.globe <- raw[, c("Id", "Rec", "Year", "RawMeanTemp")]

Next, we must subset out the temperature data for Darwin Airport, split this data into the five individual records that make up the overall Darwin Airport temperature record, and create time series8 out of each these records.

> raw.darwin <- subset(raw.globe, Id == "50194120000")
> raw.darwin.recs <- split(raw.darwin, f=raw.darwin[, "Rec"], drop=TRUE)
> for (i in 1:length(raw.darwin.recs)) {
    varname <- paste("rec", i, sep=".")
    assign(varname, ts(raw.darwin.recs[[i]][, "RawMeanTemp"],
        start=raw.darwin.recs[[i]][1, "Year"]))
}

Finally, all that's left for us to do is to plot the time series.

> ts.plot(rec.1, rec.2, rec.3, rec.4, rec.5,
    col=c("blue", "purple", "red", "pink", "lightblue"),
    main="Fig.2 - Five Raw Darwin Temperature Records from GHCN",
    type="o",
    ylab="Temp (C)")

The result is shown in Fig. 2 above. It's not pretty, but it looks like it agrees with Fig. 1 and it should provide us with a good starting point for future analyses.


  1. Eschenbach, W. (2009-12-08). The Smoking Gun At Darwin Zero. Watts Up With That?
  2. Scepticism's limits. (2009-12-11). The Economist.
  3. Eschenbach, W. (2009-12-12). Willis: Reply to the Economist. Watts Up With That?
  4. Climate manipulation gun still notably smoke-free. (2009-12-14). The Economist.
  5. Eschenbach (2009-12-08). Figure 5.
  6. GHCN-Monthly Version 2. (2008-08-20). NOAA.
  7. R version 2.6.2 (2008-02-08). On Ubuntu, you can install R via the apt-get utility. You will want to install the r-base and r-base-dev packages.
  8. Shumway, R. H., & Stoffer, D. S. (2006). An R Time Series Tutorial. For, Time Series Analysis and Its Applications: With R Examples (2nd Ed.). New York: Springer-Verlag.