Random Reflections: Cochrane and the Origins of Hierarchies

In 1972, Archie Cochrane published Effectiveness and Efficiency: Random Reflections on Health Services. In a little under 86 pages, Cochrane offers a wide-ranging but succinct account of his experience and his philosophy of evidence in clinical practice. It’s a fascinating gallop through the concerns of one of the most influential figures in medicine.

A few weeks ago, Joseph Vere and Barry Gibson published a new paper, Variation amongst hierarchies of evidence (2020), which includes a chart drawn from Vere’s excellent PhD thesis (2018), in which he identifies Effectiveness and Efficiency as the source of the first hierarchy of evidence. Admitting that this goes against the most commonly cited sources – either Campbell and Stanley (1963) or the Canadian Task Force on the Periodic Health Examination (1979) – Vere lays out an argument for considering Cochrane’s monograph as the original evidence hierarchy in its modern form.

There is no doubt that the early proponents looked to Cochrane for a great deal of the inspiration behind their movement. Cochrane and Feinstein are perhaps the most influential figures, prior to Sackett, on what became Evidence-Based Medicine. Given the value of Cochrane’s name-brand amongst this group, it is even more surprising that early hierarchies did not identify Cochrane as a source, though the common practice (outlined here and here) of referring to even the earliest hierarchies as the summation of the epidemiological literature to date may have trumped even the value of Cochrane’s good name. Presenting a hierarchy as a pseudo-objective fact may have been more valuable than identifying it with an individual’s philosophy.

Vere is no doubt correct that Cochrane’s Effectiveness and Efficiency lays the groundwork for the arguments later used by Sackett, Guyatt and others in defending hierarchies of evidence. Cochrane succinctly states first an argument to denigrate expert opinion, then blasts observational studies, and finally places RCTs at the pinnacle. In his thesis, Vere summarises some of Cochrane’s reasoning. But for students of hierarchies, it is well worth plumbing this artefact further.

Cochrane first clearly identifies the dilemma which the founders of EBM felt they faced: that expert opinion was prized above experimentation. He writes: “Two of the most striking changes in word usage in the last twenty years are the upgrading of ‘opinion’ in comparison with other types of evidence, and the downgrading of the word ‘experiment’.” (p.20) Cochrane then lays some of the blame on television interviewers, for whom getting a pithy opinion is key and the nuance of evidence is tedious. It is hard to tell whether he has identified a true historical trend towards the glorification of opinion in the media in the early 1970s, but his frustrations still ring true and could apply in equal measure to the condensing of complex evidence into a single tweet, or to the ‘hot takes’ and ‘soundbites’ which comprise so much of the media landscape. Perhaps the only thing missing is “the interest in pop singers’ views on theology” (p.20), which might now ring a little high-brow. Meanwhile, he pins the demise of ‘experiment’ on the revival of an alternative meaning of “trying anything, hence the endless references to ‘experimental’ theatres, art, architecture, and schools.” (p.20) In this regard, experiment is far from alone, the term ‘theory’ having seen a similar trajectory.

Cochrane sets himself a clear task: “The particular problem is the value of various types of evidence in testing the hypothesis.” (p.20). This looks like a clear articulation of the critical appraisal project – to find a way to appraise the value of evidence. But the link to a hierarchy is already subtly made – it is the “various types of evidence” which he wants to appraise. Cochrane shifts us away from a study-by-study appraisal model towards appraising types of studies, and hence methodologies for providing evidence. Indeed, what Cochrane is about to present might be interpreted less as a hierarchy of evidence than as a hierarchy of methodologies, in the terminology I have previously developed (2015).

Cochrane immediately criticises expert opinion. I will quote his argument in full, as it is both succinct and a clear kernel for future EBM work:

“The oldest, and probably still the commonest form of evidence proffered, is clinical opinion. This varies in value with the ability of the clinician and the width of his experience, but its value must be rated low, because there is no quantitative measurement, no attempt to discover what would have happened if the patients had had no treatment, and every possibility of bias affecting the assessment of the result. It could be described as the simplest (and worst) type of observational evidence.” (pp.20-1)

We see here the germ of arguments which predominated in the early EBM literature. It is intriguing to see that Cochrane focuses only on individual opinion here, setting aside collective expert opinion and the consensus conferences against which the likes of Sackett were keen to rebel. For hierarchies, there are three principles introduced here which have a considerable impact. First, expert opinion is distinguished from observational evidence. While Cochrane is clearly aware that expert opinion could be considered a form of observational evidence, his next section immediately drops that line. Expert opinion would forever be separate as a category. Second, Cochrane puts in place an absolute ranking here: he says “its value must be rated low” – low and not lower than other forms of evidence. The absolute vs. relative tendencies of future hierarchies would oscillate back and forth depending on how willing the authors were to accept the third element here: that there is still considerable variability within the value of evidence even at this lowest level. When Cochrane writes “This varies in value”, he offers us a glimpse of future developments towards conditional and non-categorical rankings. Importantly, Cochrane’s approach is far from hard-line here. He is clear that he is speaking only about the expert clinician’s ability to test a hypothesis about effectiveness using their clinical opinion. This is a long way from knocking down the worth of clinicians’ opinions on other matters.

Having dispatched opinion in a little under eight lines, Cochrane turns his attention to the broader category of observational evidence. Again, his discussion is so concise that it merits some substantial quotation:

“Moving up the scale at the observational level, the main changes introducing improvement are the appearance of ‘comparison’ groups, the introduction of measurement and the exclusion of possible bias from the measurements. Comparison groups as they appear in the literature are a very mixed lot. Some are positively grotesque, such as that old favourite ‘those who refused treatment’. They are usually very different from the theoretical ‘control’ group, which should be the same in all respects, which might influence the course of the disease, as the treated group. This, of course, puts a limit on the possible accuracy of this sort of investigation as we seldom if ever know all the characteristics that might influence the course of the disease. […] But even with all these sophistications observational evidence is never very satisfactory. […] Observational evidence is clearly better than opinion, but it is thoroughly unsatisfactory. All research on the effectiveness of therapy was in this unfortunate state until the early 1950s. The only exceptions were the drugs whose effect on immediate mortality were so obvious that no trials were necessary” (pp.21-2)

In the midst of this critique, Cochrane goes from observational evidence as limited, to ‘never very satisfactory’, to finally ‘thoroughly unsatisfactory’. The argument which sways him comes when he offers as an example a colleague’s rather ham-fisted study of the effectiveness of caning in reducing smoking, which he goes on to criticise on the grounds that “the results do not tell us anything at all. They are equally compatible with caning increasing, decreasing, or having no effect on cigarette consumption.” (p.21). This argument is likely familiar to students of the EBM playbook. Cochrane offers this study as an example of observational evidence at its best and shows it does not deliver useful information. But in fact, the study is far from a paragon of the best practices in observational research, even for the time. Moreover, an inconclusive result is far from unique to observational studies: indeed, the results of many RCTs are entirely compatible with the null hypothesis being true, false because of a positive effect, or false because of a negative one. Cochrane is suggesting, though, that most observational evidence will be this way. At the end, he concedes that observational evidence can suffice to establish effectiveness where the “effect on mortality were so obvious that no trials were necessary”. Yet surprisingly he still maintains that observational evidence is “never very satisfactory”. This inconsistency would bedevil Evidence-Based Medicine for half a century to come.

It is intriguing to see how strong Cochrane’s claims are here. Expert opinion’s “value must be rated low”, while observational evidence with comparison groups is “thoroughly unsatisfactory”, yet Cochrane maintains that we are still “moving up the scale” in going from opinion to observational study. It is clear why: the presence of quantification, efforts to rule out biases, and the comparison group. Cochrane makes something of a straw man of the observational researchers, but at least acknowledges the “very mixed lot” he is assessing here. This gives further suggestion that we are looking at a ranking with considerably more variability than a simple three-level hierarchy.

Finally, Cochrane addresses his preferred subject: the RCT. He writes with great and justified praise of the work of Austin Bradford Hill, suggesting a ‘Bradford Award’ for the best medical statistics paper. Incidentally, Hill might have preferred a Hill Prize, as the “Bradford” was a late addition to his moniker to distinguish himself from the Nobel laureate physiologist Archibald Hill. Bradford Hill was always known as Tony, never Austin, but the Tony Award for Medical Statistics might provoke confusion.

Cochrane then lays out the case for the RCT:

“The basic idea, like most good things, is very simple. The RCT approaches the problem of the comparability of the two groups the other way round. The idea is not to worry about the characteristics of the patients, but to be sure that the division of the patients into two groups is done by some method independent of human choice, i.e. by allocating according to some simple numerical device […] In this way the characteristics of the patients are randomized between the two groups, and it is possible to test the hypothesis that one treatment is better than another and express the results in the form of the probability of the differences found being due to chance or not.” (p.22)

Cochrane’s presentation of the RCT is simple and persuasive. But under the microscope, his phrasing is as stark as that of the most zealous proponent of randomization, albeit with a serving of subtlety. By using randomization, Cochrane says, it becomes possible to test the hypothesis and quantify the probability of the differences found being due to chance. This is surely absurd – hypothesis testing is no doubt possible without randomization. Similarly, as many philosophers have pointed out, particularly following Worrall’s lead, it is precisely the concern about the patient characteristics which helps here: it is the baseline checks for even allocation which satisfy us that the random allocation is also a balanced allocation. The randomization techniques Cochrane describes here cannot provide this assurance themselves.
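The point about balance can be made concrete with a small simulation. What follows is only a sketch under stated assumptions, not anything drawn from Cochrane or Worrall: the single baseline characteristic (age, normally distributed with mean 60 and standard deviation 12), the three-year threshold for calling two arms imbalanced, and the function names are all hypothetical choices made for illustration. It allocates patients to two arms by shuffling, a “simple numerical device” independent of human choice, and counts how often the arms nonetheless differ in mean age by more than the threshold.

```python
import random
import statistics


def allocate(ages, rng):
    """Split patients into two equal arms by random shuffle (no human choice)."""
    shuffled = list(ages)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def imbalance_rate(n_patients, threshold_years, n_trials, seed=0):
    """Fraction of simulated trials whose arms differ in mean age by more
    than threshold_years, despite purely random allocation.

    The age distribution (mean 60, SD 12) is an illustrative assumption.
    """
    rng = random.Random(seed)
    imbalanced = 0
    for _ in range(n_trials):
        ages = [rng.gauss(60, 12) for _ in range(n_patients)]
        treated, control = allocate(ages, rng)
        gap = abs(statistics.mean(treated) - statistics.mean(control))
        if gap > threshold_years:
            imbalanced += 1
    return imbalanced / n_trials


if __name__ == "__main__":
    # Compare a small, a medium and a larger trial.
    for n in (20, 200, 1000):
        rate = imbalance_rate(n, threshold_years=3.0, n_trials=1000)
        print(f"{n} patients: {rate:.1%} of random allocations imbalanced")
```

Under these assumptions, small trials come out imbalanced on age quite often and larger ones much more rarely. Randomization makes balance likely only as numbers grow; it is the baseline check, not the allocation device, which tells us whether a particular trial achieved it.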

But Cochrane here stops shy of declaring that RCT evidence is high value, the highest value, or making other such claims. He is not as bullish as his initial description might make out. He immediately follows up:

“The RCT is a very beautiful technique, of wide applicability, but as with everything else there are snags. When humans have to make observations there is always a possibility of bias.” (p.22)

He goes on to outline several such snags: something resembling treatment and selection bias, for which he prescribes double-blinding as a remedy; p-hacking and the “tendency to put too much emphasis on tests of significance” (p.23); the inferiority of trials with small sample sizes; and the tendency for large trials to produce statistically significant but clinically meaningless results. He lays out the challenges of rare diseases, of subjective outcomes, and the difficulty in measuring effect size for recurrent and chronic conditions. He also considers ethical challenges. He concludes that “All results must be examined very critically to avoid all the snags.” (p.23)
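The worry about large trials is easy to illustrate with a little arithmetic. The sketch below is an assumption-laden illustration rather than anything Cochrane himself computes: it takes a hypothetical difference of 0.5 units between arms on an outcome with a standard deviation of 15, and applies a textbook two-sample z-test (the function name and the numbers are invented for the example). The same tiny difference is nowhere near significance with 100 patients per arm, and overwhelmingly significant with 100,000.

```python
import math


def two_sample_z(mean_diff, sd, n_per_arm):
    """Two-sided z-test for a difference in means between two equal arms,
    assuming a known, common standard deviation (a simplifying assumption)."""
    se = sd * math.sqrt(2.0 / n_per_arm)        # standard error of the difference
    z = mean_diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))      # two-sided p-value under normality
    return z, p


if __name__ == "__main__":
    # Hypothetical effect: 0.5 units on a scale with SD 15.
    for n in (100, 100_000):
        z, p = two_sample_z(mean_diff=0.5, sd=15.0, n_per_arm=n)
        print(f"n per arm = {n}: z = {z:.2f}, p = {p:.3g}")
```

The effect size is identical in both runs; only the sample size changes. Whether half a unit matters to a patient is a clinical judgement which the significance test cannot make.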

He wraps up the introduction of the arguments which form the backbone of the future hierarchies with a warning against exactly such a project:

“In writing this section in praise of the RCT I do not want to give the impression that it is the only technique of any value in medical research. This would, of course, be entirely untrue. I believe, however, that the problem of evaluation is the first priority of the NHS and that for this purpose the RCT is much the most satisfactory in spite of its snags. The main job of medical administrators is to make choices between alternatives. To enable them to make the correct choices they must have accurate comparable data about the benefit and cost of the alternatives. These can really only be obtained by an adequately costed RCT.” (p.25)

This is bold, stark, declarative support for the unique role of the RCT. It is not uncritical support for any RCT, but a demand that RCTs be performed and performed well. Cochrane never mentions any grading of the quality of RCT evidence, presumably due to his serious reservations about these snags. He would most likely appreciate the GRADE framework very much.

But is this a hierarchy of evidence? In reality very little turns on the question. One thing is for sure: Cochrane’s account here is a philosophical and ideological foundation for the work of Sackett, Guyatt and others. Campbell and Stanley’s comprehensive 1963 work Experimental and Quasi-Experimental Designs for Research does much more of the heavy lifting in terms of making the case for experimental designs as reducing biases and enhancing internal validity, which would subsequently be represented hierarchically despite their clear and stated opposition to such an approach. Cochrane gives voice to a much stronger preferential approach. But his conclusion never really matches with the flirtation with a hierarchical approach just pages earlier in the chapter. While he felt happy enough to call expert opinion low value, and observational evidence unsatisfactory, his conclusion is really that only RCT evidence counts for the task of evaluating hypotheses about the effectiveness of treatments. One can no doubt read a hierarchy off this text, and Dr. Vere is right to identify the germ of the idea here. But to take Cochrane at his word, there is no hierarchy of evidence – there is only the RCT or bunk.