By definition, I begin, Alternative Medicine, I continue,
Has either not been proved to work, or been proved not to work.
Do you know what they call alternative medicine that’s been proved to work?
Medicine.
—Tim Minchin, Storm
In his beat-poem Storm, the musician and comedian Tim Minchin lays out an apparent paradox. By his criteria, there can never be evidence-based alternative medicine. As soon as there’s evidence that a treatment works, it joins scientific medicine. Minchin goes on to list aspirin as a ‘natural remedy’ since co-opted into scientific medicine—“I took a natural remedy/ Derived from the bark of the willow tree/ A painkiller virtually side-effect free”.
Minchin is exaggerating for comic effect. He knows, as we do, that no medicine is ever completely “proved to work”. He sets up a three-part classification of treatments—those proved to work, those not (yet) proved to work, and those proved not to work. This distinction is much harder to maintain when we scrap talk of “proof” and talk instead about evidence. There are some treatments for which there is strong evidence that they work, others for which there is strong evidence that they don’t, others for which there is no strong evidence either way, and quite a substantial group for which different strong sources of evidence provide contrary or even contradictory information. This problem poses a serious threat to the EBM model of evidence—a problem we’ll encounter head-on here.
But before we explore how EBM deals with conflicting evidence through the example of alternative therapies, there are two paradoxes suggested by Tim Minchin’s lines that are worth exploring. Minchin proposes that, in principle, there’s no such thing as evidence-based alternative medicine. But he’s wrong about that. There can be alternative medicines for which there is an evidence base—even, potentially, a strong, high-quality one—but which cannot or will not be integrated into scientific medicine. Beyond that, there is a second paradox: there are alternative medicines which would be judged by the standards of evidence-based medicine to be effective and supported by high-quality evidence, yet doctors still regard those same treatments as non-scientific, alternative therapies. In those cases, either EBM’s standards of evidence are misguided, or clinicians aren’t accepting and applying them. Either way there’s a problem for EBM, but here we’ll see that, by and large, the clinicians are in the right; it’s EBM’s misconfigured standards of evidence that are at the root of the apparent paradox.
The Paradox of Evidence-Based Alternative Medicine
The first paradox of evidence-based alternative medicine is the one explicitly formulated by Tim Minchin. It was expressed similarly in 1998 by two of the editors of the Journal of the American Medical Association, Phil Fontanarosa and George Lundberg: “There is no alternative medicine. There is only scientifically proven, evidence-based medicine supported by solid data or unproven medicine, for which scientific evidence is lacking”. 1 Unlike Minchin, Fontanarosa and Lundberg were not joking, and ought to have been more circumspect in speaking of ‘proof’. But their key point is clear: scientific medicine simply is evidence-based medicine, and anything else is irrelevant. Alternative medicines are definitionally excluded from scientific medicine. Either something is part of scientific medicine, and therefore scientific rather than alternative, or it is not, and can be discarded.
Minchin and the JAMA editors are both wrong to formulate this paradox. In fact, it depends on a false dichotomy. It is not true that scientific medicine is the sum of all medicines for which there is evidence that the treatment works. Even laying aside concerns about accepted treatments for which evidence is contested or lacking, there is a mismatch between the scope of scientific medicine and the range of treatments that have an evidence base. How can there be a treatment for which there’s an evidence base, yet which does not qualify as part of medicine?
There are many ways. Each corresponds to a further criterion which we apply when deciding what is and isn’t part of scientific medicine. Evidence that a treatment works is one amongst several conditions for counting as part of medicine. Where there’s evidence that a treatment works, yet it fails one of those other criteria, it too will be excluded from scientific medicine. Let’s look at just a few examples.
In March 2017, over 30,000 British citizens signed a petition calling for conversion therapy for homosexuality to be banned in the UK. 2 Conversion therapy has a torrid history. It is now associated with fundamentalist churches in America; however, in the UK in the 1950s and 1960s, LGBT people were pressured and in some cases legally compelled to undergo conversion therapy by teachers, medical practitioners and the courts. The most famous case is that of Alan Turing, the father of modern computer science and artificial intelligence research. Turing was convicted of gross indecency in 1952 after admitting to a homosexual relationship. The court ordered a hormonal conversion therapy: a year-long course of synthetic oestrogen injections which amounted to chemical castration. Turing took his own life in 1954.
‘Conversion therapy’ or ‘reparative therapy’ covers a broad range of interventions. Therapies can target behaviour or desire, trying to decrease homosexual attraction and/or increase heterosexual attraction. Behavioural therapies can be targeted at reducing homosexual behaviour and attraction through negative association—electric shock therapy and other aversion therapies, for instance, in which painful shocks, nausea-inducing drugs or revolting imagery are repeatedly administered when homosexual desires are felt. Or therapies can simply remove the person’s ability to act on homosexual desires, as in the case of Turing’s chemical castration.
Attempts to change sexual orientation, not just behaviour, can involve intensive psychotherapy. One survivor of conversion therapy, which took place in a church basement after school when he was 15 years old, recalls the range of tactics employed: “Aversion therapy, shock therapy, harassment and occasional physical abuse. Their goal was to get us to hate ourselves for being LGBTQ (most of us were gay, but the entire spectrum was represented), and they knew what they were doing … The second step of the program, they ‘rebuilt us in their image’.” 3
Research by Bartlett, Smith & King reveals that conversion therapy is alive in the UK and within the psychiatric profession. It is not a phenomenon unique to the US context or to religious and unlicensed institutions—although the means employed are usually very different. Surveying 1,328 practitioners registered with one of the professional bodies overseeing psychiatry and psychotherapy, they found that 222 (17%) of the psychiatrists and therapists admitted to having provided some form of conversion therapy to reduce patients’ homosexual feelings and behaviours. 4 The respondents voluntarily reported 413 cases in which some form of conversion therapy had been applied, most of them since 1980; 40% of those patients were seen in NHS practice. But conversion therapy is practised more routinely outside professional therapeutic contexts, where the prevalence of conversion techniques is likely to be far higher.
The response to the UK petition came from the Secretary of State for Health, Jeremy Hunt. That the reply came from the Department of Health shows that the British government classifies conversion therapy as a putative health intervention. The government did not push for a ban. Instead, Jeremy Hunt criticised the evidence base for conversion therapy: “There is no evidence that this sort of treatment is beneficial, and indeed it may well cause significant harm to some patients.” 5 A further response came from Nicola Blackwood, the Parliamentary Under Secretary of State for Public Health and Innovation. She stated: “the Government has consistently condemned gay conversion therapy, and stressed that no public money should ever be used to fund such a practice”, but that a ban was out of the question: “we consider that legislation is a blunt instrument … there is a real risk, in taking a legislative option, that we overly restrict access to therapies and capture, in any legal definition, therapies that may help some people in working through issues and feelings they have about their sexuality”. 6
Part of the reason that the government’s line comes across as strange is its focus on the lack of evidence that conversion therapy works. To be sure, the evidence base for conversion therapy is scant. The most prominent study in favour of conversion therapy was performed by Robert Spitzer and published in 2003. His study reported 200 cases of men and women changing their sexual orientation through conversion therapies. Spitzer claimed: “The majority of participants gave reports of change from a predominantly or exclusively homosexual orientation before therapy to a predominantly or exclusively heterosexual orientation in the past year.” 7 But Spitzer’s study was based only on patients’ self-reports in telephone interviews. He compiled no evidence of actual behavioural or psychological change beyond those self-reports. He made no attempt to measure self-deception. In 2012, Spitzer formally retracted his study and offered a public apology: “I believe I owe the gay community an apology for my study making unproven claims of the efficacy of reparative therapy. I also apologize to any gay person who wasted time and energy undergoing some form of reparative therapy”. 8
But the point is that the evidence base for conversion therapy is largely irrelevant. Even if conversion therapy had a high success rate and did not lead to increased rates of psychological distress and suicide, it would not and could not be part of scientific medicine. As the UK Council for Psychotherapy put it in their professional conduct guidelines: “It is exploitative for a psychotherapist to offer treatment that might ‘cure’ or ‘reduce’ same sex attraction as to do so would be offering a treatment for which there is no illness.” 9
Homosexuality is no longer regarded as a disease. The Diagnostic and Statistical Manual of Mental Disorders (DSM), regarded by many as the definitive statement of what does and does not count as a psychiatric disorder, removed homosexuality from its lists in 1974. 10 The change was prompted by meetings with gay rights activists and by the research of the psychologist Evelyn Hooker. The new diagnosis of “ego-dystonic homosexuality” was then introduced, and persisted into the 3rd edition of the DSM. Ego-dystonic desires and behaviours clash with the person’s ideal self-image. Ego-dystonic homosexuality was conceived to cover cases in which an individual’s sexual orientation conflicted with their ideals, such as religious believers whose homosexuality was in direct conflict with their beliefs. The category was removed from the DSM in 1987, and the American Psychological Association condemned its use. 11
“A treatment for which there is no illness” cannot be part of scientific medicine. The scientific canon excludes homosexuality from the category of psychological disorders, even where an individual’s sexual orientation causes them distress. This renders the evidence for or against conversion therapy irrelevant. Had there been substantial evidence that conversion therapy succeeds, it would have fallen into that gap of evidence-based alternative medicine—a therapy that cannot be part of scientific medicine, yet for which there is evidence of effect.
Treating a disease which doesn’t exist or is not recognized is one way to be disqualified from scientific status. Another is to flagrantly contradict established medical principles. A treatment might work, yet its underlying principles be incompatible with medical knowledge. In that case, one or the other must give ground. This too happened in the case of conversion therapy, and provided a second reason why the evidence, or lack thereof, was not the most relevant factor excluding the therapy. Conversion therapy rests on a fundamental precept: that it is possible to intervene to change or choose sexual orientation. This notion is not consistent with established psychological principles.
There are many other treatments which could not enter the canon of scientific medicine, even if there was a decent evidence base for their effectiveness. Many of these treatments explicitly invoke mechanisms or systems which do not exist according to mainstream medicine. The consistency and coherence of medical thought requires that these treatments remain under the “alternative” or “complementary” medicine label. Systems which are not recognized in scientific medicine include various forms of ‘vital energy’ such as qi, and systems for its manipulation.
Prominent complementary and alternative medicine researcher Edzard Ernst has played a leading role in submitting therapies such as these to randomized trials and systematic reviews, and in scrutinizing the evidence for alternative medicines. In a review conducted with colleagues in South Korea and Australia, Ernst assessed the evidence for qigong, an assembly of techniques using breathing, posture and meditation to manipulate the flow of qi. They found that the evidence base was lacking and inconclusive: “reviews were not conclusive and all were based on poor quality clinical trials. Given these important caveats, it would be unwise to draw firm conclusions about the effectiveness of qigong.” 12
Again, although the work done by Ernst and his colleagues has been informative, a lack of evidence is not the most significant factor affecting whether qigong is an alternative or a scientific therapy. Qigong explicitly sets out to manipulate a system which, according to scientific medicine, does not exist. It does so by affecting flows and channels which also do not exist. Even if qigong showed large effect sizes in trials, the question would not be whether qigong should join the scientific medicine club—it couldn’t be eligible. Rather, the research program that followed would have to explain why qigong worked without reference to those non-existent systems: was it an enhanced placebo effect, or perhaps an inadvertent effect on a recognized physiological system? If such an explanation could be found, the question would remain whether qigong required the mysticism and the scientifically unaccepted systems in order to function. If not—for instance, if the breathing and posture exercises were found to have a discernible beneficial effect on the circulatory system—then those exercises could be lifted, refined to target that benefit, and adopted into scientific medicine, divorced from their spiritual content. This second dividing line—exclude anything which is inconsistent with established anatomy, physiology and biochemistry—rules out a great many alternative therapies without the evidence base being a primary concern.
A third class of potential evidence-based alternative medicines is yet more subtle—the class of treatments which create enhanced placebo effects. For a multitude of ethical and professional reasons, doctors are generally not permitted to intentionally prescribe pure placebos. Psychic surgery, acupuncture and homeopathy can all display discernible physiological effects. However, so can treatments which are pure placebos: sugar pills, sham surgery, and sham acupuncture using retracting needles. Even if large effect sizes could be demonstrated for, say, acupuncture, it would not follow that it becomes admissible into the canon of scientific medicine. Studies demonstrating that acupuncture has no larger an effect than sham acupuncture with retractable needles effectively demonstrate that acupuncture’s benefits are due to an enhanced placebo effect. Even though acupuncture (and sham acupuncture) will produce genuine, replicable physiological effects for a range of patients, they cannot be admitted to scientific medicine, just as a sugar pill cannot be sanctioned as part of legitimate medical treatment for the flu. It is not enough to produce discernible physical effects for a treatment to be admitted to scientific medicine—it must also do so in a way that is not a pure placebo effect.
In cases like psychic surgery, acupuncture and homeopathy, the issues discussed here tend to compound. These treatments also flagrantly violate basic accepted scientific tenets, whether by postulating unrecognized systems (meridian lines and acupuncture points which regulate qi flows, for instance), by accepting physiological or biochemical theories inconsistent with firmly held ones (the law of dilution in homeopathy, for example), or by violating basic physical theory outright, as in the case of psychic surgery.
These are just some of the ways in which there could be evidence that a treatment produces a discernible physiological or psychological effect, yet without qualifying the treatment as part of scientific medicine. These grey areas allow for a distinct category of evidence-based alternative medicine (or at least alternative medicines for which there is not an absence of evidence of effect). The notion of evidence-based alternative medicine cannot be straightforwardly dismissed as a definitional impossibility. We must meet it head-on.
Levels of Evidence and Alternative Medicine
The second apparent paradox of alternative medicine is more challenging. It derives from two observations: (1) there are therapies generally classed as ‘alternative’ for which there is evidence that EBM’s systems rate as strong, high-quality or high-level, but which (2) are not accepted as part of scientific medicine by the majority of practitioners. It seems either that these practitioners don’t truly follow the principles of EBM through to their logical conclusions, or that those principles are flawed and practitioners have (at least implicitly) recognized that.
The contradiction comes from a type of hierarchy known either as a “Levels of Evidence” or a “Grades of Recommendation” hierarchy. These systems don’t just rate the strength or quality of evidence for some proposition about a treatment. They go further: they translate a judgment of evidence quality into a judgment about the overall level of evidence for the claim that a treatment works, or about how strongly doctors should recommend that treatment to their patients. The foremost and most influential hierarchies take exactly this approach. The Australian National Health and Medical Research Council’s (ANHMRC) extraordinarily influential hierarchy equates high-quality evidence with a strong positive recommendation. 13 The GRADE system uses the output of its evidence hierarchy to classify recommendations as ‘strong’, ‘weak’ or ‘no recommendation’. 14
The problem emerges when we consider what is needed to reach the highest levels in these hierarchies. All of these influential rankings place individual sources of evidence at their highest level. For GRADE, the highest level is characterized by the RCT. For the ANHMRC, it’s “A systematic review of level II studies”, where ‘Level II’ studies are RCTs. Most other Levels of Evidence tables are much the same. It’s the indefinite article that raises the big issue here: for the ANHMRC, a single systematic review of relevant trials is all that is needed to secure the highest-level rating, a single RCT the second level, and so forth. For GRADE, things are more ambiguous, but the guidance suggests that a single positive RCT within the evidence base means we rank that evidence at the highest level (adjustable for other properties like effect size and consistency). The overall message is that a single positive trial or systematic review can be viewed as the highest level of evidence, and therefore as justifying the strongest possible recommendation for a treatment.
The problems are hardly difficult to notice. What about an evidence base composed of dozens of negative trials and a single small positive trial? Surely this is an incredibly weak evidence base for recommending the treatment—perhaps worse than having no trials at all. The proponents of these hierarchies distance themselves from such obviously erroneous interpretations. Their hierarchies themselves, though, at best gloss over this detail, and at worst actively reinforce it.
Despite cautious words from EBMers, this is one way in which the paradox of evidence-based alternative medicine can arise. A trial or a systematic review which reports a false positive can be interpreted as providing the highest level of evidence and as justifying a strong positive recommendation for a treatment. This applies to all treatments, not just alternative therapies. But it can be illustrated easily in the case of alternative therapy.
In 2001, Leonard Leibovici took the medical records of all 3,393 patients who’d suffered blood infections at Rabin Medical Center in Israel between 1990 and 1996. 15 These patients had been treated between four and ten years earlier. He randomized the patients into a treatment and a control group, following proper procedures. Both groups were (obviously) blind to which group they were assigned. He then asked a colleague to say a short general prayer for the health and wellbeing of the treatment group. This was a retrospective prayer—a prayer to help those patients back in the past. When he analyzed the results, he found that length of hospitalization and duration of fever were significantly shorter in the prayer group than in the control group (p=0.01 and p=0.04 respectively)! He checked for baseline differences between the groups on various dimensions, and found no significant difference. The result was presumably due to chance alone: it would stretch even the most religiously-motivated interpreter’s credulity to suggest that God had taken the cursory prayer in this study, made a decade later, seriously enough to intervene in these patients’ outcomes (while declining to assist the control group). While he facetiously explained away the backwards causality with a diagram of angels making the universe turn, Leibovici’s point is clear: false positives happen.
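Just how easily chance alone produces results like these can be seen in a minimal simulation in the spirit of Leibovici’s exercise (the outcome distribution and the numbers below are illustrative assumptions, not his data): repeatedly split a cohort at random into two groups, test an outcome that, by construction, no intervention has touched, and count how often the difference comes out ‘significant’.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_patients, alpha = 2_000, 3_393, 0.05
hits = 0

for _ in range(n_sims):
    # One outcome (say, length of stay in days) for the whole cohort;
    # by construction, no intervention has influenced it.
    outcome = rng.exponential(scale=7.0, size=n_patients)
    # Randomly assign half the cohort to "prayer", half to control.
    prayed_for = rng.permutation(n_patients) < n_patients // 2
    _, p = stats.ttest_ind(outcome[prayed_for], outcome[~prayed_for])
    hits += p < alpha

print(f"'significant' at p<{alpha} in {hits / n_sims:.1%} of null comparisons")
```

Roughly one null comparison in twenty crosses the conventional p<0.05 threshold; test two outcomes, as Leibovici did, and the chance of at least one ‘hit’ approaches one in ten.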
Indeed, there are many ways to help false positives happen more frequently. Although Leibovici didn’t really employ many of these techniques in his study—certainly in comparison to a great number of studies whose results are taken much more seriously—he was certainly aware of them.
Does retroactive prayer work? Should we recommend it to patients? Leibovici concludes: “Remote, retroactive intercessory prayer can improve outcomes in patients with a bloodstream infection. This intervention is cost effective, probably has no adverse effects, and should be considered for clinical practice.” 16 This is relatively modest, given that he had just provided a trial which meets the criteria for the highest level of evidence as set by GRADE. A systematic review which included this trial could well meet the highest-level criteria of the ANHMRC, and would at least meet their Level II criteria—enough to merit a reasonably strong recommendation in most grading systems. Intercessory prayer is an alternative therapy by definition: it relies on non-physical systems which are not acknowledged as part of scientific medicine. But if these attempts to evaluate levels of evidence and grades of recommendation were implemented as written, it would certainly count as an evidence-based alternative therapy.
The problem begins with a simple fallacious conflation. The idea of founding medicine upon a high-quality “evidence base” is confused with the idea that we should accept claims for which there is high-quality evidence. There’s an important difference between the claim that the evidence base supports the conclusion that, say, retroactive intercessory prayer is effective, and the claim that there is strong evidence that retroactive prayer works. The latter claim is true—the Leibovici study is high-quality evidence for retroactive prayer. But the evidence base as a whole is not so supportive. The statement “There is strong evidence that prayer works” is consistent with the claim “There is strong evidence that prayer does not work”, and even with “The evidence base for prayer is overwhelmingly negative”.
Ranking Evidence and Evidence Bases
There are two ways to go about fixing this straightforward issue which affects many prominent hierarchies. The first is to stop translating claims about the quality or strength of the evidence provided by a single study into claims about the overall level of evidence or the strength of recommendation. For sure, this limits the practical applicability of a hierarchy. But it blocks an obvious move: to keep running trials until one comes back positive, and use that to satisfy the hierarchy’s demand for high-level evidence.
The second option is to stop attempting to evaluate evidence from a single study, and instead evaluate only the entire evidence base. That is, instead of asking whether there is a positive RCT, and if so ranking the evidence highly, a hierarchy would have to evaluate all of the studies performed. Only if the majority, or perhaps the highest-quality few, amongst these studies were positive would a recommendation be justified. Now, this gives the user a much bigger task. They can no longer identify a single study of interest and check its worth against a hierarchy. They must instead identify all studies on a question, and use the hierarchy’s criteria to evaluate that whole evidence base. Essentially, they have to perform their own systematic review. Or, to save time, they might rely on someone else to do this: someone performs a systematic review of all trials, and that and only that can form the basis of a strong recommendation. No hierarchy has yet gone so far as to limit recommendations to this form of meta-evidence alone.
But even this won’t solve the problem. The lone false positive result was only the first of many fatal flaws in the levels of evidence approach. The others afflict hierarchies which assess whole evidence bases, and attempts to base recommendations on systematic reviews, too. We’ll meet these broader problems now.
Creating False Positives: The cmRCT Story
In the Leibovici prayer study case, I mentioned that there are many ways to engineer a false positive—that is, to manipulate trial design to make a positive result more likely, even when the treatment has little or no effect. In some cases, the EBM movement has tried to address these techniques by including conditions or modifiers in their hierarchies which refuse to categorize evidence at the highest levels if those techniques are used, or if the biases they create are detected. For example, GRADE down-grades evidence where there is “serious risk of bias”. 17 Other hierarchies specify that the highest-level evidence must be “properly performed” or “well designed”, ostensibly to exclude studies where the design or execution massages the results towards a positive outcome.
Yet it is still relatively easy to design trials which outwardly satisfy the major requirements of hierarchies, but which are systematically skewed towards positive findings. The most important criteria for achieving the highest ranking are that the trial is genuinely and properly randomized; that intention-to-treat analysis is used; and that the control and intervention populations do not diverge too far in their baseline characteristics (age, gender, race, severity of condition, and so on). To see how a study can satisfy these criteria yet be systematically skewed, we’ll look at one example: the cmRCT.
The cmRCT, or “Cohort Multiple Randomised Controlled Trial”, otherwise known as the ‘Patient Cohort’ RCT, or as “TwiCs” (Trials Within Cohorts), is a study design created by Clare Relton and her colleagues at the University of Sheffield. Relton, a homeopathic practitioner and researcher, crafted the cmRCT design as part of her PhD project investigating whether homeopathy was an effective treatment for menopausal hot flushes. 18 Along with her colleagues, she published the study design in the British Medical Journal in 2010. 19
The cmRCT design claims to combine the advantages of observational studies with the rigor of the RCT. For all practical purposes, it satisfies the criteria for an RCT according to most hierarchies, and so gains that high-level ranking. In a cmRCT, the researchers first recruit a large cohort of patients into an observational study. In Relton’s pilot cmRCT on menopausal hot flushes, 856 women aged 45-64 were recruited. The cohort is then observed over a long period, as in a longitudinal observational study, with relevant data regularly recorded.
During this time, RCTs may be performed on a subset of the population. Unusually, the approach allows for multiple RCTs to be performed on the cohort, either in series or parallel. An RCT will choose a subset of the cohort to study according to set inclusion and exclusion criteria. In the case of Relton’s menopausal hot flushes study, a subset of 48 patients was identified in this way. This subset is then randomly allocated to either the control or experimental arm—in this case, 24 to each.
A novel consent design is then used, similar to a design previously presented by Zelen. Only the patients who are randomized to the experimental arm—to receive the homeopathic treatment—are asked to consent to the trial. The control group is never informed that they are part of a trial or asked to consent. The ethical justification here is less controversial than in the ECLS case, though, because the control patients have already consented to have their data collected and studied by consenting to be a part of the cohort study.
The great boon of this design is that it spares researchers from having to obtain consent from patients for a study in which they don’t know whether they’ll receive a placebo or the real thing. Relton describes her difficult experiences as a practising homeopath taking part in a clinical trial earlier in her career: having to explain to patients that they would have a 50-50 chance of receiving the “real” treatment, and finding that many patients would walk away, preferring to receive the guaranteed treatment instead. Recruitment is always a challenge for clinical trials, especially double-blind placebo-controlled trials in which patients must consent to the chance of receiving no active treatment for their condition.
But within this feature lies a serious and systematic bias which skews the trial results in favour of the experimental treatment—in this case, the homeopathic treatment. There is good evidence for the existence of a “trial effect”—the phenomenon that patients fare better when involved in trials than otherwise. 20 This effect is similar to the better-known ‘Hawthorne Effect’, in which workers perform better and become more efficient when they know they are being observed. 21 Patients understand the social expectations of a trial: that the researchers want to detect an effect for the treatment, especially where the researchers are practising homeopaths. Patients tend to comply with these expectations, reporting larger effects and greater satisfaction with the treatment. Self-reported measures like quality-of-life questionnaires are particularly susceptible. There is also evidence that patients experience augmented placebo effects (and the negative version, nocebo effects) when they believe that they are taking part in a trial. The size of a placebo effect tends to be related to how dramatic the intervention is, and how it is presented. 22 In a clinical trial, the intervention is presented in dramatic terms: “the experimental intervention”. The environment of a trial has connotations of novelty, dramatic effect and innovation. The rigmarole of the consent process and the trappings of scientific objectivity can all contribute to this enhanced placebo effect.
The expectation of greater benefits is buttressed by the ‘therapeutic misconception’. 23 This is the mistaken idea that trialists have the same obligations as an ordinary clinician—to look after your health and give you the treatment which is best for you. This is not the case. Researchers are exempt from all kinds of duties, which allows them to provide treatments known or believed to be ineffective, like placebos, where doctors in ordinary practice should not. This misconception leads many patients to believe that the experimental treatment would only be prescribed if it was widely believed to be effective.
Patients participating in trials also tend to receive much closer care, more regular meetings with doctors, and routine monitoring of health outcomes, which means that problems are detected and treated more quickly. 24 Patients who have more encounters with medical practitioners are also more likely to be satisfied with the level of care they receive, and that satisfaction can feed back into their reports of their health and wellbeing.
In a cmRCT, only the patients given the experimental treatment are aware that they are in a trial. Only those patients will receive these additional benefits in the form of augmented placebo effects, a greater level of care, and heightened expectations. Only they receive the social pressure to report more positive outcomes. In a situation in which augmented placebo effects are a serious concern, and separating any real effect from a placebo effect is absolutely central to establishing whether the treatment has a viable role within scientific medicine, this approach introduces a bias which undermines any interpretation of results in terms of a real effect attributable to the treatment.
The cmRCT shows one example of a trial design which ticks the theoretical boxes of hierarchies of evidence. It genuinely is a randomized controlled trial. But it lacks the rigorous controls for bias that EBM proponents hope an RCT will ensure. The first lesson for EBM from the cmRCT is that randomization alone is far from a guarantee of balance and freedom from bias. The second is that the generic label of ‘RCT’ covers a huge range of design and implementation. Relton’s 48-patient study of menopausal hot flushes is far removed from some of the giant multi-centre double-blind studies out there, and it is untenable to classify them together within a single ‘level’ of evidence.
The cmRCT will systematically privilege the experimental treatment. We should expect to see many positive findings amongst cmRCTs, even where the treatment being tested is no better than placebo. Worse still, we’ll have no good way of telling whether a positive result is due to the trial-effect bias the design introduces, or to a real effect. The cmRCT is an example of an effective false-positive generator—a method so configured as to produce an abnormally large number of positive results, even in the absence of effective treatments.
Composing a Positive Evidence Base
We’ve seen how specific study design choices can make a false positive much more likely. If trials on a treatment are mainly conducted with designs of this kind, it would be relatively simple to compose an evidence base filled with positive findings even in the absence of any beneficial effect. At that point, as John Ioannidis put it, positive trials and systematic reviews merely measure the extent of the bias in the field. 25 False-positive generators are one component of a methodology that can create a positive evidence base sufficient to satisfy even a reinforced levels of evidence approach of the kind described above.
There are many more components. These are not features of purely pathological fields of study, but endemic across medical research. The first and perhaps best-known phenomenon is publication bias. The basic form of publication bias occurs when negative results are less likely to be published than positive findings. 26 This happens for a multitude of interlocking reasons. Researchers may not feel that negative findings reflect well on them, or may feel that the findings are not interesting or relevant, and therefore fail to even attempt to publish the results. Funding bodies, particularly pharmaceutical companies, may suppress publication of negative results which could harm commercial interests, or simply lack any incentive to publish negative findings if they have moved on from a project. 27 Sponsors can often have a veto over potential publication decisions—and even when they do not, pharmaceutical companies have been known to use litigation to prevent, deter or delay publication of negative results. 28 Although journal editors and reviewers are increasingly aware of the problem, historically they may have been less favourable towards devoting page space to negative findings. Peer review may be more difficult to pass without a catchy headline figure. 29 Publication bias appears to be endemic in the medical literature.
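How a publication filter distorts the literature can be made concrete with a toy simulation (the filter probabilities and trial sizes below are illustrative assumptions, not estimates from any real field): simulate many trials of a treatment with no effect at all, always publish the significant positive results, and let the rest reach print only occasionally.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_per_arm = 500, 50   # a hypothetical field of small trials
published = []

for _ in range(n_studies):
    # By construction the treatment does nothing: both arms share one mean.
    treated = rng.normal(0.0, 1.0, n_per_arm)
    control = rng.normal(0.0, 1.0, n_per_arm)
    diff = treated.mean() - control.mean()
    _, p = stats.ttest_ind(treated, control)
    # Publication filter: significant positive results always appear in
    # print; null and negative results make it out one time in ten.
    if (diff > 0 and p < 0.05) or rng.random() < 0.10:
        published.append(diff)

print(f"true effect: 0.0 | mean published effect: {np.mean(published):+.2f}")
```

Even though every individual trial here is unbiased, the published record, the only part a reviewer ever sees, shows a spurious positive effect.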
Lack of publication of negative findings is not the only way in which the literature is positively skewed. There is a time lag between studies being completed and published, during which the study is written up, submitted to journals and reviewed. There is substantial evidence, however, that this lag is considerably longer for negative findings. 30 This may be due to the “file drawer effect”, 31 in which researchers are less willing to devote time to writing up a negative trial, to rejection and resubmission at journals, which can be extremely time-consuming, or to hurdles imposed by vested interests. 32 The longer a study takes to reach its audience, the less impact it can have, and the more likely it is that a positive narrative imposed by a handful of positive results comes to dominate clinical opinion. Once a positive result comes to the attention of clinicians, it can be very hard for negative findings to dispel it. 33 As John Ioannidis found, many of the most highly-cited and influential research papers contain results which were subsequently contradicted by other studies. 34
A related phenomenon is selective outcome reporting. In any given study, researchers will likely measure many different outcomes, using different metrics. This makes methodological and pragmatic sense—while you have the subjects, you might as well gather as much information as possible and try to get a full sense of the effects of a treatment. However, researchers do not necessarily write up and present the data they collected for every outcome. Researchers may feel that the negative findings are simply uninteresting. Negative results on one outcome alongside positive results on another can be hard to explain, and threaten the coherence of a paper. There may be pressure to suppress findings which don’t fit a positive narrative. More perniciously still, it is possible to present a trial which reported primarily negative results as a positive trial, by reporting the data only for the outcomes where a positive association was found. This really matters. As we’ve seen, false positives happen. With enough outcomes measured, the probability of at least one positive result, even for a completely inefficacious treatment, approaches certainty. If a study measured dozens of outcomes and found a single positive result, it would most likely be disregarded as a false positive (unless there was a very clear rationale for taking it seriously—and even then most likely only after replication elsewhere). But if the same study was presented as a single positive correlation, without the information that dozens of other outcomes were tested and came back negative, then it could pass convincingly as a genuine association.
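The arithmetic behind “approaches certainty” is worth making explicit. Assuming, for simplicity, that each outcome is tested independently at the conventional 5% significance threshold, the probability that a completely inefficacious treatment scores at least one ‘significant’ result is 1 − (1 − 0.05)^n:

```python
alpha = 0.05
for n_outcomes in (1, 5, 10, 20, 60):
    p_any = 1 - (1 - alpha) ** n_outcomes
    print(f"{n_outcomes:>2} outcomes measured -> "
          f"P(at least one false positive) = {p_any:.2f}")
# 1 -> 0.05, 5 -> 0.23, 10 -> 0.40, 20 -> 0.64, 60 -> 0.95
```

Real outcomes are rarely fully independent, so these figures overstate the risk somewhat, but the qualitative point stands: report only the single ‘hit’, and the reader has no way of making this correction.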
In combination, these techniques and biases can construct an evidence base which is manipulated, intentionally or accidentally, to create an impression of a positive result and to mute the corrective effect of fair tests. Systematic reviews can be systematically skewed. As a result, it is consistent to hold that medical practice should be based on high-quality evidence bases, and also that an evidence base consisting of multiple high-quality-rated trials supporting the claim that a treatment is effective does not justify the use of that treatment. Levels of evidence and grades of recommendation hierarchies which require systematic reviews, or an appraisal of the entire evidence base, are at risk of judging a systematically biased set of trials to be high-quality.
Diversifying Evidence Bases
We’ve seen that an evidence base could be constructed to meet the criteria of a hierarchy which ranks or grades recommendations on the basis of an appraisal of a whole evidence base. Such an evidence base could be constructed in the absence of any real treatment effect. This dissolves the paradox: practitioners can be justified in refusing to classify a treatment as part of scientific medicine despite an evidence base which a hierarchy rates as positive and high-quality. The hierarchy is prone to give misleading or unwarranted judgements of the quality, strength and even direction of an evidence base.
What can be done? The most potent approach is to harness the importance of diversity within evidence bases. A diverse evidence base is harder to construct and harder to manipulate than a homogeneous one. It is therefore less likely that a diverse evidence base has been deliberately or accidentally manufactured to mislead. The biases to which an evidence base rated high-quality by a Levels of Evidence or Grades of Recommendation hierarchy remains susceptible are less able to permeate a diverse evidence base, for a range of reasons.
What is a diverse evidence base? There are many ways in which an evidence base can be diverse. First, there is methodological diversity. Methodologically diverse evidence bases include studies of a range of types – perhaps including RCTs as well as cohort studies, case-control studies, outcomes research, mechanistic and laboratory studies, subgroup analyses, and predictive modeling. Methodological diversity allows us to cross-reference elements of the explanatory theories which we invoke when analyzing an evidence base. For instance, suppose I am concerned that some of the trials I’ve been looking at overstate the benefits of a treatment because they selectively recruit patients who are likely to benefit from the treatment and unlikely to benefit from the control treatment. Here, I can use observational designs and outcomes research to evaluate this explanation of the trial results—I can check whether the theory about differential benefit is borne out in the data, and can check whether the findings in the trial stand up in the wild. Perhaps I’m concerned that the estimates of the average treatment effect found in a trial won’t generalize to a particular population. Here, I might look to mechanistic studies and laboratory evidence to see whether there are disruptions to established pathways of effect to substantiate or disconfirm my concerns.
Methodological diversity also imposes barriers against biases which could permeate a more homogeneous evidence base. RCT methodology hopes to decrease the likelihood of biases such as selection bias (in which patients with different baseline chances of experiencing the outcome of interest are differentially allocated to the treatment groups) and treatment bias (in which patients in the control and experimental group are treated differently in ways other than the treatments of interest—for instance, if the experimental patients are more intensively screened and monitored). But it can’t guarantee to eliminate these biases because some of the methodological features which decrease the chance of bias are probabilistic (double-blinding makes it less likely that treatment bias occurs but doesn’t rule it out; randomization rules out selection bias in the infinite long run but can’t guarantee it in any individual case), and because others can be subverted by a suitably wily investigator or indeed by patients, analysts, and practitioners involved in patient care but unrelated to the investigators (randomization can be subverted or faked; blinding can be broken; figures can be fudged).
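The probabilistic character of randomization is easy to underestimate, and a small simulation makes the point (the 30% prevalence and the trial sizes are illustrative assumptions): randomize cohorts of various sizes 1:1 and count how often a binary prognostic factor ends up noticeably imbalanced between the arms.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sims = 20_000

def share_imbalanced(n_patients: int, threshold: float = 0.10) -> float:
    """Fraction of 1:1 randomized trials whose two arms differ by more
    than `threshold` in the prevalence of a binary prognostic factor."""
    half = n_patients // 2
    # The factor (say, severe disease) is present in 30% of patients.
    severe = rng.random((n_sims, 2 * half)) < 0.30
    gap = np.abs(severe[:, :half].mean(axis=1) - severe[:, half:].mean(axis=1))
    return float((gap > threshold).mean())

for n in (48, 200, 2000):
    print(f"n={n:>4}: arms differ by >10 points in "
          f"{share_imbalanced(n):.0%} of trials")
```

At the scale of Relton’s 48-patient trial, sizeable imbalances are closer to the rule than the exception; randomization delivers the balance it promises only as trials grow large.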
But different study designs are subject to different risks of bias. RCTs can often be better at controlling for certain kinds of biases, but exposed to others. RCT designs are susceptible to biased selection criteria, recruiting from particular subsets of the patient population, typically male, middle-aged, and otherwise healthy. These biases don’t so much throw off the trial’s estimate of the average treatment effect in its sample as throw off generalizations from the results to recommendations in the wider patient population, leading to biased treatment practices which privilege certain populations at the expense of others. Other forms of study may be more susceptible to selection biases and treatment biases, but less susceptible to selective recruitment. For instance, outcomes research applied across a broad patient population may provide data which redresses some of these issues, as might broad pragmatic studies. RCTs are also, as we’ve seen, prone to publication bias and selective reporting, in part because of the high costs involved and the consequent reliance on industrial sponsorship. Costs and feasibility also limit the extent to which RCTs can provide longitudinal outcome data, and the scale at which side-effects can be reported, skewing the RCT evidence base systematically towards shorter-term effects and towards positive outcomes (or the prevention of negative outcomes, as with stroke-prevention treatments) rather than side-effects. These are biases of evidence bases composed of RCTs, rather than of the RCTs themselves. A diverse evidence base may correct for such biases by incorporating evidence from observational sources. We may also be able to integrate studies of publication bias in the field directly into the evidence base, whether to demonstrate the robustness of trial results against publication bias, to establish cause for concern, or even to try to correct for its effects in the final analysis.
Evidence bases can also be diverse in terms of their contributors. Who is producing the evidence which composes the evidence base? What are their characteristics? Do they have affiliations, funding sources, or other potential conflicts of interest which might introduce biases, consciously or unconsciously, or might prompt the use of designs and approaches which favour a particular finding? Ideally, we’d want to see an evidence base with contributions from researchers with varied perspectives, potentially from a range of specialisms and disciplinary traditions, adopting different approaches to minimize bias in their studies, asking different kinds of questions and with different sources of funding and potential sources of bias. Where evidence bases are composed of studies by a diverse base of contributors, in terms of their funding, institutions, commitments, perspectives and research approaches, yet results remain robust and consistent across the evidence base, explanations in terms of systematic bias become less and less plausible. Where a diverse evidence base shows variation in results, and that variation correlates with the characteristics of the contributors, that suggests particular explanations for the outcomes, which affects which recommendations can be justified, and which can then be investigated further.
There are other elements of diversity within evidence bases: we’d want to see a diversity of study locations, study populations, techniques to control for bias, and so on. All else being equal, we should have greater confidence in a consensus result from a more diverse evidence base than from a less diverse one. A truly well-configured tool for assessing the quality of an evidence base would emphasise the importance of this diversity and thereby mitigate the risks of a single systematic bias producing a false positive (or false negative) result which permeates the entire base.
A particular and significant component of a diverse evidence base is mechanistic and laboratory studies. The EBM movement is very sceptical of the evidential value of understanding the mechanism by which a treatment works, and of laboratory studies which show treatment effects at the biochemical level rather than at the macro-level of patient experiences. There is good reason for this scepticism: the EBM movement has primarily focused on identifying the overall average treatment effect in some population, and mechanistic studies don’t add much to that project. The existence of a plausible mechanism, and the presence of effects at the biochemical level, does not mean that the treatment will work for the outcomes that matter to patients – pain, mobility, morbidity, mortality, and so on. But these studies do offer relevant corroboration for explanations of study results, and in particular of variation in patients’ responses to treatment.
They also offer corroboration for the plausibility of study results. If a treatment does not show biochemical effects consistent with the reported effect on patient-oriented outcomes, then this reduces the plausibility of a positive trial result. Where trial results are positive but mechanisms are in doubt, we must be more sceptical, and treat threats of bias very seriously. This is an important driver of scepticism towards the results of trials of alternative therapies. The lack of any plausible biological mechanism by which treatments such as homeopathy could create the effects which some trials report makes explanations in terms of enhanced placebo effects, systematic bias, and false positives more plausible, and explanations in terms of a specific effect of the treatment less plausible.
There is an important asymmetry here between positive and negative evidence. Alexander Bird argues that, even if a plausible mechanism is weak as positive evidence for a treatment effect, the lack of mechanistic plausibility can be very strong evidence against it. 35 The interaction between strong mechanistic evidence against biological plausibility, and the evidence of systematic biases within the positive evidence base, provides a mutually-reinforcing interpretation of the evidence bases for many alternative therapies.
To take one powerful example: multivitamin supplements have been defended by some alternative medicine proponents even for healthy adults without nutritional deficiencies. “Megavitamin” therapy (also known as “orthomolecular medicine” 36) involves massive doses of vitamin supplements, often containing five or even ten times the recommended daily amount, 37 commonly of vitamins C, E and β-carotene. Some trials suggest that large doses of vitamin E, in particular, may reduce the risk of cardiovascular events, 38 although there are many contradictory results. 39
But mechanistic evidence goes a long way to undermine claims of a treatment effect for megavitamin dosing, especially of an effect beyond that of conventional vitamin supplements or balanced dietary interventions. Laboratory studies suggest that the body’s homeostatic mechanisms react to surplus vitamin intake by metabolizing and excreting the excess, maintaining a stable level of the vitamin in the relevant organs. 40 Megadoses of water-soluble vitamins such as vitamin C, niacin and folic acid have little effect on the concentration of the vitamin in the organs once normal levels of function are reached. 41 Where the metabolism and excretion of the megadose fails or is interrupted, though, harms due to excessive accumulation result, as studies have repeatedly demonstrated. 42 Meanwhile, surplus fat-soluble vitamins (e.g. A, D and K) are stored rather than excreted, which can also lead to toxic accumulation. 43 Similarly, megadosing with microminerals like zinc has been shown to be harmful by physiological studies which demonstrate interruptions of normal metabolic pathways (e.g. disruption of the metabolism of copper and iron). 44
A biologically plausible mechanism, alongside physiological studies and laboratory results which lay out coherent and responsive pathways of action, is an important component of a strong evidence base. RCTs alone cannot secure a plausible biological mechanism, even when the results are positive, because it is always possible that the results are due to bias. Given that biological implausibility is strong evidence against positive results being genuine, the absence of evidence of biological plausibility is an important lacuna in any evidence base. Providing plausible mechanisms that respond appropriately to intervention helps to argue against the hypothesis that the correlations between treatment and effect are spurious. Hierarchies which do not at least make the absence of evidence of biological implausibility a prerequisite for a high ‘level of evidence’ rating are misguided. A fully formed evidence base appraisal process should go further still, and incorporate the relevant results of mechanistic studies. As yet, no hierarchy which attempts to assess evidence bases has met either of these requirements. They remain subject to manipulation because they fail to take account of mechanistic counterevidence, and fail to acknowledge the asymmetry between evidence for and evidence against the plausibility of a treatment effect.
Conclusion
The apparent “paradox” of evidence-based alternative medicine dissolves. There are reasons other than a lack of evidence which might consign a treatment to the ‘alternative’ category. Positive evidence that a treatment has an effect is not sufficient for it to count as a part of scientific medicine.
But the paradox exposes a deep flaw in the way in which hierarchies of evidence, including the prominent GRADE approach, assess evidence bases for recommendations. It is too easy to fix or fudge an evidence base to meet the criteria for a strong recommendation. Because these rankings have focused on providing a single kind of evidence (albeit often in bulk) to substantiate claims of effects, they can be misled by one or two significant biases which affect that form of evidence, either at the study level (as in the case of the cmRCT design and its trial effect bias) or at the meta-level of the evidence base (as in the case of publication bias, selective reporting and selective recruitment).
These approaches neglect the benefits of a diverse evidence base, which makes it significantly harder for a single source of bias to corrupt a significant proportion of the base. Much like genetic diversity within a population protecting against eradication by a single disease, methodological and contributor diversity within an evidence base helps to defend against systematic bias from a single source. Within a diverse evidence base, we can use the interactions between different sources to try to identify threats of bias and factor our understanding of those threats into our analysis and into our recommendations.
What’s more, these approaches to evidence base appraisal have omitted an important asymmetry: just because mechanistic evidence doesn’t go very far towards showing that a treatment has an effect, that doesn’t mean it cannot constitute strong evidence against the plausibility of trial results purporting to show effects. Given that bias is always a possible alternative explanation for a positive result, and systematic bias for a preponderance of positive results, omitting sources which could provide evidence that there is no effect to be found, and thus that bias may well be holding sway, is a significant and unfortunate omission. At the very least, approaches to rating the quality of evidence bases should take account of the plausibility of a biological mechanism. These matters have been particularly salient in debates over alternative medicines, where plausible biological mechanisms are often hard to find. But they are by no means restricted to that domain.
To solve the problem of apparently evidence-based alternative medicine, the only real option is to reject the approach to evidence base appraisal associated with organizations such as the ANHMRC and GRADE in favour of an approach which recognizes the importance of diversity to the quality of an evidence base, and acknowledges the role that mechanistic plausibility plays.
Bibliography
- Adair, John G. ‘The Hawthorne Effect: A Reconsideration of the Methodological Artifact’. Journal of Applied Psychology 69, no. 2 (1984): 334.
- American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders. 2nd ed., 7th printing. Washington, D.C.: APA, 1974.
- Appelbaum, Paul S, Loren H Roth, Charles W Lidz, Paul Benson, and William Winslade. ‘False Hopes and Best Data: Consent to Research and the Therapeutic Misconception’. Hastings Center Report 17, no. 2 (1987): 20–24.
- Australian National Health and Medical Research Council (ANHMRC). Procedures and Requirements for Meeting the 2011 NHMRC Standard. Melbourne: National Health and Medical Research Council, 2011.
- Balshem, H., M. Helfand, H. J. Schunemann, A. D. Oxman, R. Kunz, J. Brozek, G. E. Vist, et al. ‘GRADE Guidelines: 3. Rating the Quality of Evidence’. J Clin Epidemiol 64, no. 4 (April 2011): 401–6. https://doi.org/10.1016/j.jclinepi.2010.07.015.
- Bartlett, Annie, Glenn Smith, and Michael King. ‘The Response of Mental Health Professionals to Clients Seeking Help to Change or Redirect Same-Sex Sexual Orientation’. BMC Psychiatry 9 (26 March 2009): 11. https://doi.org/10.1186/1471-244X-9-11.
- Bird, A. ‘What Can Philosophy Tell Us about Evidence-Based Medicine? An Assessment of Jeremy Howick’s The Philosophy of Evidence-Based Medicine’. International Journal of Person Centered Medicine 1, no. 4 (2011): 642–48.
- Blackwood, Nicola. ‘Letter to the Petitions Committee, Helen Jones MP’, 23 March 2017. https://www.parliament.uk/documents/commons-committees/petitions/Letter-from-Chair-to-Secretary-of-State-for-Health-and-reply-March17.pdf?utm_source=Petition&utm_medium=email&utm_campaign=174988&utm_content=chair_leter_DOH.
- Braunholtz, D. A., S. J. L. Edwards, and R. J. Lilford. ‘Are Randomized Clinical Trials Good for Us (in the Short Term)? Evidence for a “Trial Effect”’. Journal of Clinical Epidemiology 54, no. 3 (2001): 217–24.
- Easterbrook, Phillipa J, R Gopalan, JA Berlin, and David R Matthews. ‘Publication Bias in Clinical Research’. Lancet 337, no. 8746 (1991): 867–72.
- Fortmann, S. P., et al. ‘Vitamin and Mineral Supplements in the Primary Prevention of Cardiovascular Disease and Cancer: An Updated Systematic Evidence Review for the US Preventive Services Task Force’. Annals of Internal Medicine 159, no. 12 (2013): 824–34.
- Fox, R E. ‘Proceedings of the American Psychological Association, Incorporated, for the Year 1987: Minutes of the Annual Meeting of the Council of Representatives: Use of Diagnoses “Homosexuality” & “Ego-Dystonic Homosexuality”’. American Psychologist 43 (1988): 508–31.
- Guallar, E., et al. ‘Enough Is Enough: Stop Wasting Money on Vitamin and Mineral Supplements’. Annals of Internal Medicine 159, no. 12 (2013): 850–51.
- Guyatt, G. H., A. D. Oxman, G. Vist, R. Kunz, J. Brozek, P. Alonso-Coello, V. Montori, et al. ‘GRADE Guidelines: 4. Rating the Quality of Evidence–Study Limitations (Risk of Bias)’. J Clin Epidemiol 64, no. 4 (April 2011): 407–15. https://doi.org/10.1016/j.jclinepi.2010.07.017.
- Hathcock, J. N., et al. ‘Risk Assessment for Vitamin D’. The American Journal of Clinical Nutrition 85, no. 1 (2007): 6–18.
- Ioannidis, John PA. ‘Contradicted and Initially Stronger Effects in Highly Cited Clinical Research’. JAMA: The Journal of the American Medical Association 294, no. 2 (2005): 218–28.
- ———. ‘Effect of the Statistical Significance of Results on the Time to Completion and Publication of Randomized Efficacy Trials’. JAMA 279, no. 4 (1998): 281–86.
- ———. ‘Why Most Published Research Findings Are False’. PLoS Med 2, no. 8 (2005): e124.
- Kaptchuk, Ted J, Peter Goldman, David A Stone, and William B Stason. ‘Do Medical Devices Have Enhanced Placebo Effects?’ Journal of Clinical Epidemiology 53, no. 8 (2000): 786–92.
- Lee, Myeong Soo, Byeongsang Oh, and Edzard Ernst. ‘Qigong for Healthcare: An Overview of Systematic Reviews’. JRSM Short Reports 2, no. 2 (7 February 2011). https://doi.org/10.1258/shorts.2010.010091.
- Leibovici, L. ‘Effects of Remote, Retroactive Intercessory Prayer on Outcomes in Patients with Bloodstream Infection: Randomised Controlled Trial’. BMJ 323, no. 7327 (22 December 2001): 1450–51.
- Lewis, R. ‘Dietary Supplements’. In Encyclopedia of Pseudoscience, edited by M. Shermer and P. Linse, 1:85–92. Santa Barbara, California: ABC-CLIO, 2002.
- Lexchin, Joel, and Donald W Light. ‘Commercial Bias in Medical Journals: Commercial Influence and the Content of Medical Journals’. BMJ: British Medical Journal 332, no. 7555 (2006): 1444.
- Lipton, M. A., et al. ‘Megavitamin and Orthomolecular Therapy in Psychiatry’. American Psychiatric Association Task Force Report, no. 7 (1973): 54.
- Mahoney, Michael J. ‘Publication Prejudices: An Experimental Study of Confirmatory Bias in the Peer Review System’. Cognitive Therapy and Research 1, no. 2 (1977): 161–75.
- Nichols, James Michael. ‘A Survivor Of Gay Conversion Therapy Shares His Chilling Story’. Huffington Post, 17 November 2016, sec. Queer Voices. http://www.huffingtonpost.com/entry/realities-of-conversion-therapy_us_582b6cf2e4b01d8a014aea66.
- ‘Petition: Make Offering Gay Conversion Therapy a Criminal Offence in the UK’. Petitions – UK Government and Parliament, 3 May 2017. https://petition.parliament.uk/petitions/174988.
- Relton, C. ‘A New Design for Pragmatic RCTs: A “Patient Cohort” RCT of Treatment by a Homeopath for Menopausal Hot Flushes’. PhD thesis, University of Sheffield, 2009. ISRCTN 0287542.
- Relton, C., David Torgerson, Alicia O’Cathain, and Jon Nicholl. ‘Rethinking Pragmatic Randomised Controlled Trials: Introducing the “Cohort Multiple Randomised Controlled Trial” Design’. BMJ 340 (2010): c1066.
- Rennie, Drummond. ‘Thyroid Storm’. JAMA: The Journal of the American Medical Association 277, no. 15 (1997): 1238–43.
- Rosenthal, Robert. ‘The File Drawer Problem and Tolerance for Null Results’. Psychological Bulletin 86, no. 3 (1979): 638.
- Sauve, R. S., et al. ‘Megavitamin and Megamineral Therapy in Childhood’. Canadian Medical Association Journal 143, no. 10 (1990): 1009–13.
- Spector, R., and C. E. Johanson. ‘Vitamin Transport and Homeostasis in Mammalian Brain: Focus on Vitamins B and E’. J Neurochem 103, no. 2 (2007): 425–38.
- Spector, R. ‘Science and Pseudoscience in Adult Nutrition Research and Practice’. Skeptical Inquirer 33, no. 3 (2009): 35–41.
- Spitzer, Robert L. ‘Can Some Gay Men and Lesbians Change Their Sexual Orientation? 200 Participants Reporting a Change from Homosexual to Heterosexual Orientation’. Archives of Sexual Behavior 32, no. 5 (1 October 2003): 403–17. https://doi.org/10.1023/A:1025647527010.
- ———. ‘Spitzer Reassesses His 2003 Study of Reparative Therapy of Homosexuality’. Archives of Sexual Behavior 41, no. 4 (August 2012): 757. https://doi.org/10.1007/s10508-012-9966-y.
- Stephens, N. G., et al. ‘Randomised Controlled Trial of Vitamin E in Patients with Coronary Disease: Cambridge Heart Antioxidant Study (CHAOS)’. The Lancet 347, no. 9004 (1996): 781–86.
- Stern, Jerome M, and R John Simes. ‘Publication Bias: Evidence of Delayed Publication in a Cohort Study of Clinical Research Projects’. BMJ: British Medical Journal 315, no. 7109 (1997): 640.
- Tatsioni, Athina, Nikolaos G Bonitsis, and John PA Ioannidis. ‘Persistence of Contradicted Claims in the Literature’. JAMA 298, no. 21 (2007): 2517–26.
- UK Council for Psychotherapy (UKCP). ‘UKCP’s Ethical Principles and Codes of Professional Conduct: Guidance on the Practice of Psychological Therapies That Pathologise and/or Seek to Eliminate or Reduce Same Sex Attraction’. London: UKCP, 28 February 2011.