Oxygen may be a carcinogen

In inordinate amounts or forms, anything can be poison to life – even the air we breathe. But the threat seems more ominous when you consider that even in small quantities, accumulated over time, the oxygen in the air can cause cancer. Two American scientists, Kamen Simeonov and Daniel Himmelstein, have concluded exactly that after analyzing cancer-incidence data compiled between 2005 and 2009 for counties in the western United States. Their calculation doesn’t show a dramatic drop in incidence with altitude, yet the statistical methods used to refine the results indicate the relationship is there: oxygen contributes to the growth of cancerous tumors. As they write in their paper,

“As a predictor of lung cancer incidence, elevation was second only to smoking prevalence in terms of significance and effect size.

A relative-importance test in R on the data, available on Himmelstein’s GitHub, attests to this (relative-importance metrics: LMG, Pratt, first and last). Additionally,

the lung cancer association was robust to varying regression models, county stratification, and population subgrouping; additionally seven environmental correlates of elevation, such as exposure to sunlight and fine particulate matter, could not capture the association.”

Simeonov and Himmelstein found that with every 1,000 m rise in elevation, lung cancer incidence decreased by 7.23 cases per 100,000 individuals (99% confidence interval: 5.18–9.29), which is fully 12.7% of the mean incidence (56.8 per 100,000 individuals). Overall, the duo attributes a decrease of 25.2 lung cancer cases per 100,000 individuals to the “range of elevation of counties of the Western United States”. In other words,

Were the entire United States situated at the elevation of San Juan County, CO (3,473 m), we estimate 65,496 [46,855–84,136] fewer new lung cancer cases would arise per year.
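
The per-elevation figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below takes the slope (7.23 cases per 100,000 individuals per 1,000 m), its 99% confidence interval and the mean incidence (56.8) from the paper. The straight-line extrapolation from sea level in `linear_drop` is my own simplifying assumption, not the authors' regression model, and the function names are only illustrative.

```python
# Figures quoted from the paper (Simeonov & Himmelstein, PeerJ, 2015).
SLOPE = 7.23             # fewer cases per 100,000 people per 1,000 m of elevation
SLOPE_CI = (5.18, 9.29)  # 99% confidence interval for that slope
MEAN_INCIDENCE = 56.8    # mean lung cancer incidence per 100,000 people


def share_of_mean(cases_per_100k, mean=MEAN_INCIDENCE):
    """Express a per-100,000 change as a percentage of the mean incidence."""
    return 100.0 * cases_per_100k / mean


def linear_drop(elevation_m, slope=SLOPE):
    """ASSUMPTION: incidence falls linearly with elevation, measured from sea level."""
    return slope * elevation_m / 1000.0


slope_share = share_of_mean(SLOPE)
ci_share = tuple(share_of_mean(bound) for bound in SLOPE_CI)
san_juan_drop = linear_drop(3473)  # elevation of San Juan County, CO

print(f"{slope_share:.1f}% of the mean incidence per 1,000 m")
print(f"99% CI: {ci_share[0]:.1f}% to {ci_share[1]:.1f}% of the mean")
print(f"{san_juan_drop:.1f} fewer cases per 100,000 at 3,473 m (linear sketch)")
```

The first figure reproduces the 12.7% quoted in the text. The last one is only a rough extrapolation and should not be read as the paper's own estimate.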

Their paper was published in the open access journal PeerJ on January 13, 2015. The validity of the result lies in the strength of the statistical analysis backing it. Cancers are caused by a variety of agents. Respiratory cancers, in turn, are often the result of exposure to certain heavy metals, fine particulate matter or radiation, inhalation of toxic substances, and genetic predisposition. To say oxygen could be one such toxic substance requires the claimants to show its significance relative to other known carcinogens and its covariance with the incidence of cancer. Only statistics enables this. First, the data shows that the incidence of cancer dropped with increasing altitude.

My plot from the data. The grey band represents the confidence interval. Lung cancer incidence is per 100,000 individuals; elevation is in thousands of meters.

Next, the analysis indicates that the drop can’t be attributed to anything other than elevation. (‘Pearson’ is the Pearson correlation coefficient: the higher its absolute value is, the stronger the correlation.)

“Predictors displayed expected correlations such as a strong positive correlation between obesity and diabetes. Collinearity was moderate but pervasive. Elevation covaried with most variables including cancers indicating the need to adjust for covariates while carefully considering collinearity.” Credit: http://dx.doi.org/10.7717/peerj.705
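
For readers who haven’t met the Pearson coefficient before, here is a minimal sketch of how it is computed. The numbers are toy values I made up for illustration; they are not taken from the study, and the function name `pearson` is my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    # Co-variation of the two sequences, scaled by their individual spreads.
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(var_x * var_y)


# Toy data (hypothetical, NOT from the paper).
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves perfectly with x
falling = [10, 8, 6, 4, 2]   # moves perfectly against x
scattered = [3, 1, 4, 1, 5]  # only loosely related to x

for label, y in [("rising", rising), ("falling", falling), ("scattered", scattered)]:
    print(label, round(pearson(x, y), 3))
```

The coefficient always lies between -1 and 1. The sign gives the direction of the relationship and the absolute value its strength, which is why the text above speaks of the absolute value.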

To corroborate their results, the authors also showed that their statistical models picked up known risks – such as the variation of incidence with smoking and exposure to radon. However, unlike smoking, exposure to radon also varies with altitude, and the paper does not fully clarify how it eliminates the resulting confounding.

Alternatively, Van Pelt (2003) attributed “some, but not all” of the Cohen (1995) radon association to elevation. Follow-up correspondences by each author revolved around the difficulty in assigning the effect wholly to elevation or radon when both of these highly-correlated predictors remained significant (Cohen, 2004; Van Pelt, 2004). We believe that our data quality improvements, including county-specific smoking prevalences and population-weighted elevations, were responsible for wholly attributing the effect to elevation.

In fact, this admission betrays the study’s ultimate problem (and that of others like it): a profusion of influences on the final results. Cancer – lung or otherwise – can be caused by so many things. To assess its incidence in terms of a few variables – such as elevation, smoking and sunlight – could only be for the sake of convenience because, beyond a point, to think cancer could be the result of just one or two factors is to be foolishly reductionist. At the same time, this issue is typical of so many statistical investigations that it would be more productive to consider Simeonov’s and Himmelstein’s finding a springboard from which to launch more studies than to think it the final word on anything. They endorse the same view with their final admission that their study is still vulnerable to the ‘ecological fallacy’, which arises when studies of groups are assumed to be equivalent to studies of individuals when they are not. As this essay states,

Serious errors can result when an investigator makes the seemingly natural assumption that the inferences from an ecological analysis must pertain either to the individuals within the groups or to individuals across groups. A frequently cited early example of an ecological inference was Durkheim’s study of the correlation between suicide rates and religious denominations in Prussia in which the suicide rate was observed to be correlated with the number of Protestants. However, it could as well have been the Catholics who were committing suicide in largely Protestant provinces.
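
The Durkheim example is easy to reproduce with toy numbers. In the sketch below, the three provinces and all their counts are invented for illustration (they are not Durkheim’s data), and every suicide is deliberately assigned to a Catholic. The province-level correlation between the Protestant share and the suicide rate still comes out strongly positive.

```python
import math

# Hypothetical provinces (invented numbers, NOT Durkheim's data).
# Each tuple: (Protestants, Catholics, suicides among Protestants, suicides among Catholics)
provinces = [
    (2000, 8000, 0, 2),
    (5000, 5000, 0, 4),
    (8000, 2000, 0, 6),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Group-level view: one data point per province.
protestant_share = [p / (p + c) for p, c, _, _ in provinces]
suicide_rate = [(sp + sc) / (p + c) for p, c, sp, sc in provinces]
group_r = pearson(protestant_share, suicide_rate)

# Individual-level view: who actually died?
protestant_suicides = sum(sp for _, _, sp, _ in provinces)
catholic_suicides = sum(sc for _, _, _, sc in provinces)

print(f"province-level correlation: {group_r:.3f}")
print(f"suicides among Protestants: {protestant_suicides}")
print(f"suicides among Catholics:   {catholic_suicides}")
```

Reading the group-level correlation as a statement about individual Protestants would be exactly the mistake the essay describes. The same caution applies to county-level cancer data.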

2 comments

  1. Thank you, Daniel. It was cool that you put the data out there, it’s always exciting to try and replicate it, not to mention that it feels like I’m doing science, too. 😉

    I did use the relaimpo package in R. I’m sorry that the labels are missing. The nine variables are elevation, particulate matter, sunlight, UV, smoking, radon, drinking, diabetes and obesity. Here’s the code:


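    # One-time setup: install and load the relative-importance package
    install.packages("relaimpo")
    library(relaimpo)
    # Load the county-level data (the path is a placeholder)
    relimp1 <- read.csv("path-to-file", header=TRUE)
    # Linear model of lung cancer incidence on the nine predictors
    fit <- lm(lung ~ elevation + particulate + sunlight + uvb + smoking + radon + drinking + diabetes + obesity, data=relimp1)
    # Relative importance under four metrics, normalized to sum to 100%
    calc.relimp(fit, type=c("lmg","last","first","pratt"), rela=TRUE)
    # Bootstrap the same metrics (3,000 resamples) for confidence intervals
    boot <- boot.relimp(fit, b=3000, type=c("lmg","last","first","pratt"), rank=TRUE, diff=TRUE, rela=TRUE)
    booteval.relimp(boot)
    plot(booteval.relimp(boot, sort=TRUE))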

    Here’s a link to the spreadsheet I fed to R. I was surprised that diabetes showed up, too. I used only the data in your github repo.

    Re the percentages: Oops! Thanks, I’ll fix it.

  2. Thanks for the insightful and informed discussion. I linked to it from the article at PeerJ. It’s great to see someone perform original analyses using our data.

    The relative importance analysis seems like an important addition to our study. Did you use the relaimpo package in R? How did you choose the 9 predictors to include? It appears that every other variable label has been omitted from the plot, making it a bit difficult to interpret.

    I was surprised that diabetes had the highest relative importance under the first method: the correlation plot from our paper, which you show, indicates that diabetes doesn’t have the highest univariate correlation with lung cancer. I suspect your calculation included counties across the entire United States. The Southeastern United States has super high diabetes as well as lung cancer rates, which likely drives the correlation. We restricted our analysis to the 10 states of the mountainous American West.

    I noticed that in your post, “99%” was accidentally appended to two numbers (“6549699%” and “7.2399%”). 99% refers to the confidence interval level.
