In inordinate amounts or forms, anything can be poison to life – even the air we breathe. But the threat seems more ominous when you consider that even in small quantities, accumulated over time, the oxygen in the air could cause cancer. Two American scientists, Kamen Simeonov and Daniel Himmelstein, have concluded as much after analyzing cancer-incidence data compiled between 2005 and 2009 for counties across the western United States. Their calculations don’t show a dramatic drop in incidence with altitude, yet the statistical methods used to refine the results suggest the relationship is there: lung cancer incidence falls as elevation rises, which points to oxygen contributing to the growth of cancerous tumors. As they write in their paper,
“As a predictor of lung cancer incidence, elevation was second only to smoking prevalence in terms of significance and effect size.
A relative-importance test in R on the data, available on Himmelstein’s GitHub, attests to this (metrics: LMG, Pratt, first and last). Additionally,
the lung cancer association was robust to varying regression models, county stratification, and population subgrouping; additionally seven environmental correlates of elevation, such as exposure to sunlight and fine particulate matter, could not capture the association.”
Simeonov and Himmelstein found that with every 1,000 m rise in elevation, lung cancer incidence decreased by 7.23 per 100,000 individuals (99% confidence interval: 5.18–9.29), which is fully 12.7% of the mean incidence (56.8 per 100,000 individuals). Overall, the duo attributes a decrease of 25.2 lung cancer cases per 100,000 individuals to the “range of elevation of counties of the Western United States”. In other words,
Were the entire United States situated at the elevation of San Juan County, CO (3,473 m), we estimate 65,496 [99% CI: 46,855–84,136] fewer new lung cancer cases would arise per year.
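The 12.7% figure follows from dividing the per-kilometre decrease (7.23 cases per 100,000 individuals) by the mean incidence (56.8 per 100,000). A minimal sketch of that arithmetic in R, using only the numbers quoted above; the variable names are mine, not the paper's:

```r
# Figures quoted from the paper
slope_per_km   <- 7.23            # fewer cases per 100,000 for each 1,000 m of elevation
ci_99          <- c(5.18, 9.29)   # 99% confidence interval for that decrease
mean_incidence <- 56.8            # mean lung cancer incidence per 100,000

# Decrease per 1,000 m as a percentage of the mean incidence
pct_of_mean <- 100 * slope_per_km / mean_incidence
round(pct_of_mean, 1)             # 12.7

# The same conversion applied to the interval bounds
pct_ci <- 100 * ci_99 / mean_incidence
round(pct_ci, 1)                  # 9.1 16.4
```

The national estimate of avoided cases cannot be reproduced this way, since it presumably depends on each county's population and elevation, which are not quoted here.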
Next, the paper shows that the drop in incidence could not be explained by anything other than elevation. (‘Pearson’ is the Pearson correlation coefficient: the higher its absolute value, the stronger the linear correlation.)
To corroborate their results, the authors also showed that their statistical models picked up known risks, such as the variation of incidence with smoking and with exposure to radon. However, unlike smoking, exposure to radon also varies with altitude, and the paper does not clarify how it fully eliminates the resulting confounding.
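For readers who want to see what the coefficient measures, here is a small sketch in R. The numbers are hypothetical toy values made up for illustration, not data from the paper; the point is only that Pearson's r is the covariance of two variables divided by the product of their standard deviations, and that R's built-in `cor()` returns the same value.

```r
# Hypothetical toy data (NOT from the paper): elevation in km, incidence per 100,000
elevation <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)
incidence <- c(60, 58, 55, 50, 49, 45)

# Pearson's r: covariance divided by the product of the standard deviations
r_manual  <- cov(elevation, incidence) / (sd(elevation) * sd(incidence))
r_builtin <- cor(elevation, incidence, method = "pearson")

isTRUE(all.equal(r_manual, r_builtin))  # TRUE: both routes agree
r_builtin < 0                           # TRUE: incidence falls as elevation rises
abs(r_builtin)                          # strength of the linear association, between 0 and 1
```

The sign gives the direction of the relationship and the absolute value its strength, which is why the absolute value is what matters when comparing predictors.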
Alternatively, Van Pelt (2003) attributed “some, but not all” of the Cohen (1995) radon association to elevation. Follow-up correspondences by each author revolved around the difficulty in assigning the effect wholly to elevation or radon when both of these highly-correlated predictors remained significant (Cohen, 2004; Van Pelt, 2004). We believe that our data quality improvements, including county-specific smoking prevalences and population-weighted elevations, were responsible for wholly attributing the effect to elevation.
Serious errors can result when an investigator makes the seemingly natural assumption that the inferences from an ecological analysis must pertain either to the individuals within the groups or to individuals across groups. A frequently cited early example of an ecological inference was Durkheim’s study of the correlation between suicide rates and religious denominations in Prussia in which the suicide rate was observed to be correlated with the number of Protestants. However, it could as well have been the Catholics who were committing suicide in largely Protestant provinces.
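The Durkheim example can be made concrete with a toy calculation. The sketch below uses invented numbers (not Durkheim's data, and not the paper's) for four hypothetical provinces: Protestants have the same low suicide rate everywhere, while the assumed Catholic rate rises with the Protestant share of the province.

```r
# Hypothetical provinces (illustrative numbers only)
prot_share <- c(0.2, 0.4, 0.6, 0.8)   # fraction of residents who are Protestant
pop        <- 100000                  # residents per province

# Assumed individual-level rates: constant for Protestants,
# rising with the province's Protestant share for Catholics
rate_prot <- rep(10 / 100000, 4)
rate_cath <- c(10, 30, 60, 120) / 100000

suicides_prot <- pop * prot_share * rate_prot
suicides_cath <- pop * (1 - prot_share) * rate_cath
province_rate <- (suicides_prot + suicides_cath) / pop

# Province-level (ecological) correlation is strongly positive ...
eco_cor <- cor(prot_share, province_rate)
eco_cor > 0.9                         # TRUE

# ... yet pooled over all provinces, Protestants have the lower individual rate
overall_prot <- sum(suicides_prot) / sum(pop * prot_share)
overall_cath <- sum(suicides_cath) / sum(pop * (1 - prot_share))
overall_prot < overall_cath           # TRUE
```

The group-level correlation is real, but reading it as a statement about individual Protestants would be wrong in this constructed case, which is the risk any county-level analysis has to rule out.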
Thanks for the insightful and informed discussion. I linked to it from the article at PeerJ. It’s great to see someone perform original analyses using our data.
The relative importance analysis seems like an important addition to our study. Did you use the relaimpo package in R? How did you choose the 9 predictors to include? It appears that every other variable label has been omitted from the plot, making it a bit difficult to interpret.
I was surprised that diabetes had the highest relative importance under the first method: the correlation plot from our paper, which you show, indicates that diabetes doesn’t have the highest univariate correlation with lung cancer. I suspect your calculation included counties across the entire United States. The Southeastern United States has super high diabetes as well as lung cancer rates, which likely drives the correlation. We restricted our analysis to the 10 states of the mountainous American West.
I noticed that in your post, “99%” was accidentally appended to two numbers (“6549699%” and “7.2399%”). 99% refers to the confidence interval level.
Thank you, Daniel. It was cool that you put the data out there, it’s always exciting to try and replicate it, not to mention that it feels like I’m doing science, too. 😉
I did use the relaimpo package in R. I’m sorry that the labels are missing. The nine variables are elevation, particulate matter, sunlight, UV, smoking, radon, drinking, diabetes and obesity. Here’s the code:
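library(relaimpo)  # provides boot.relimp()
relimp1 <- read.csv("path-to-file", header = TRUE)
fit <- lm(lung ~ elevation + particulate + sunlight + uvb + smoking + radon + drinking + diabetes + obesity, data = relimp1)
# Bootstrap the four relative-importance metrics (3,000 resamples); rela = TRUE rescales them to sum to 100%
boot <- boot.relimp(fit, b = 3000, type = c("lmg", "last", "first", "pratt"), rank = TRUE, diff = TRUE, rela = TRUE)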
Here’s a link to the spreadsheet I fed to R. I was surprised that diabetes showed up, too. I used only the data in your github repo.
Re the percentages: Oops! Thanks, I’ll fix it.