A conference’s peer-review was found to be sort of random, but whose fault is it?

It’s not a good time for peer-review. Sure, if you’ve been a regular reader of Retraction Watch, it’s never been a good time for peer-review. But aside from that, the process has increasingly been bearing the brunt of the blame for failing to stem the publication of results that – after publication – have been found to be the product of bad research practices.

The problem may be that the reviewers are letting the ‘bad’ papers through but the bigger issue is that, while the system itself has been shown to have many flaws – not excluding personal biases – journals rely on the reviewers and naught else to stamp accepted papers with their approval. And some of those stamps, especially from Nature or Science, are weighty indeed. Now add to this muddle the NIPS wrangle, where researchers may have found that some peer-reviews are just arbitrary.

NIPS stands for the Neural Information Processing Systems (Foundation), whose annual conference was held in the second week of December 2014, in Montreal. It’s considered one of the few main conferences in the field of machine-learning. Around that time, two attendees – Corinna Cortes and Neil Lawrence – performed an experiment to judge how arbitrary the conference’s peer-review could get.

Their modus operandi was simple. All the papers submitted to the conference were peer-reviewed before they were accepted. Cortes and Lawrence then routed a tenth of all submitted papers through a second peer-review stage, and observed which papers were accepted or rejected in the second stage (according to Eric Price, NIPS ultimately accepted a paper if either group of reviewers accepted it). Their findings were distressing.

About 57%* of all papers accepted in the first review were rejected during the second review. To be sure, each stage of the review was presumably equally competent – it wasn’t as if the second stage was more stringent than the first. That said, 57% is a very big number. More than five times out of 10, a paper that one set of reviewers deemed publishable was turned down by the other. In other words, in an alternate universe, the same conference but with only the second group of reviewers in place would have been generating different knowledge.

Lawrence was also able to eliminate a possibly redeeming confounding factor, which he described in a Facebook discussion on this experiment:

… we had a look through the split decisions and didn’t find an example where the reject decision had found a ‘critical error’ that was missed by the accept. It seems that there is quite a lot of subjectivity in these things, which I suppose isn’t that surprising.

It doesn’t bode well that the NIPS conference is held in some esteem among its attendees for having one of the better reviewing processes. Including the 90% of the papers that did not go through a second peer-review, the total predetermined acceptance rate was 22%, i.e. reviewers were tasked with accepting 22 papers out of every 100 submitted. Put another way, the reviewers were rejecting 78%. And this sheds light on a more troubling aspect of their actions.

If the reviewers had been rejecting papers at random, they would’ve done so at the tasked rate of 78%. One can only hope that the reviewers at NIPS weren’t, yet the second group rejected 57% of the papers that the first group had accepted. In an absolutely non-random, logical world, this number should have been 0%. That 57% is closer to 78% than it is to 0% implies that some of the rejection was random. Hmm.
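The 78% baseline can be checked with a small simulation. This is only a sketch under an assumption that is not from the experiment itself: both groups accept each paper independently, by coin flip, at the 22% rate. The function name, the number of simulated papers and the seed are all made up for illustration.

```python
import random

def conditional_reject_rate(n_papers, accept_rate, seed=0):
    """Simulate two groups that each accept papers purely at random.

    Returns the fraction of group 1's accepted papers that group 2 rejects.
    The independence of the two groups is an assumption of this sketch.
    """
    rng = random.Random(seed)
    accepted_by_1 = 0
    then_rejected_by_2 = 0
    for _ in range(n_papers):
        g1_accepts = rng.random() < accept_rate
        g2_accepts = rng.random() < accept_rate
        if g1_accepts:
            accepted_by_1 += 1
            if not g2_accepts:
                then_rejected_by_2 += 1
    return then_rejected_by_2 / accepted_by_1

# A large hypothetical pool of papers, so the estimate is stable.
rate = conditional_reject_rate(n_papers=200_000, accept_rate=0.22)
print(f"random baseline: {rate:.1%}")  # should land close to 78%
```

Under purely random reviewing the figure settles near 78%, which is the yardstick against which the observed 57% is being compared.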

While this is definitely cause for concern, forging ahead on the basis of arbitrariness – which machine-learning theorist John Langford defines as the probability that the second group rejects a paper that the first group has accepted – wouldn’t be the right way to go about it. This is similar to the case with A/B-testing: we have a test whose outcome can be used to inform our consequent actions, but using the test itself as a basis for the solution wouldn’t be right. For example, the arbitrariness can be reduced to 0% simply by having both groups accept every nth paper – a meaningless exercise.
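To see why a 0% score on its own proves little, here is a minimal sketch of Langford’s definition as a function, applied to the every-nth-paper rule. The function names, the choice of n = 5 and the use of 166 papers are illustrative assumptions, not anything the organisers ran.

```python
def arbitrariness(group1, group2):
    """Fraction of papers accepted by group 1 that group 2 rejected.

    group1 and group2 are equal-length lists of booleans (True = accept).
    """
    if len(group1) != len(group2):
        raise ValueError("both groups must review the same papers")
    second_opinions = [g2 for g1, g2 in zip(group1, group2) if g1]
    if not second_opinions:
        raise ValueError("group 1 accepted nothing")
    rejected = sum(1 for accepted in second_opinions if not accepted)
    return rejected / len(second_opinions)

def every_nth(n_papers, n):
    """Accept every nth paper, regardless of what the paper says."""
    return [i % n == 0 for i in range(n_papers)]

g1 = every_nth(166, 5)
g2 = every_nth(166, 5)
print(arbitrariness(g1, g2))  # 0.0: perfectly consistent, yet uninformative
```

Both groups agree on every paper, so the measure drops to zero even though neither group has read anything.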

Is our goal to reduce the arbitrariness to 0% at all? You’d say ‘Yes’, but consider the volume of papers being submitted to important conferences like NIPS and the number of reviewer-hours available to evaluate them. In the history of conferences, surely some judgments must have been arbitrary for reviewers to have fulfilled their responsibilities to their employers. So you see the bigger issue: it’s not just the reviewer but also the so-called system that’s flawed.

Langford’s piece raises a similarly confounding topic:

Perhaps this means that NIPS is a very broad conference with substantial disagreement by reviewers (and attendees) about what is important? Maybe. This even seems plausible to me, given anecdotal personal experience. Perhaps small highly-focused conferences have a smaller arbitrariness?

Problems like these are necessarily difficult to solve because of the number of players involved. In fact, it wouldn’t be entirely surprising if we found that no person or institution was at fault, only the way they were all interacting with each other, and not just in fields like machine-learning. A study conducted in January 2015 found that minor biases during peer-review could result in massive changes in funding outcomes if the acceptance rate was low – such as with the annual awarding of grants by the National Institutes of Health. Even Nature is wary about the ability of its double-blind peer-review to solve the problems ailing normal ‘peer-review’.

For the near future, the only takeaway is likely to be that ambitious young scientists will have to remember that, first, acceptance – just as much as rejection – can be arbitrary and, second, that the impact factor isn’t everything. On the other hand, it doesn’t seem possible in the interim to keep from lowering our expectations of peer-reviewing itself.

*The number of papers routed to the second group after the first was 166. The overall disagreement rate was 26%, so they would have disagreed on the fates of 43. And because they were tasked with accepting 22% – which is 37 or 38 – group 1 could be said to have accepted 21 that group 2 rejected, and group 2 could be said to have accepted 22 that group 1 rejected. The rate therefore lies between 21/37 (56.8%) and 22/38 (57.9%), or about 57%.
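The footnote’s arithmetic can be retraced in a few lines. This sketch takes the footnote’s figures as given, including the accepted counts of 37 and 38 (22% of 166 is about 36.5) and the near-even split of the 43 disagreements; the variable names are made up for illustration.

```python
n_papers = 166            # papers seen by both groups
disagreement_rate = 0.26  # overall disagreement rate
accepted_counts = (37, 38)  # accepted per group, as stated in the footnote

# 166 * 0.26 = 43.16, which rounds to 43 disputed papers.
disagreements = round(n_papers * disagreement_rate)

# Split the disputes as evenly as possible between the two directions.
split = (disagreements // 2, disagreements - disagreements // 2)  # (21, 22)

low = split[0] / accepted_counts[0]   # 21 / 37
high = split[1] / accepted_counts[1]  # 22 / 38
print(disagreements, split)
print(f"{low:.1%} to {high:.1%}")
```

Both ratios round to roughly 57%, which is the headline figure.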

Hat-tip: Akshat Rathi.