The Dismal Disease: Temozolomide and the Interaction of Evidence

An abridged version of this article is also available here.


Blockbuster drugs are rare. To be a blockbuster, a drug must shift over $1bn worth per year. There are thousands of drugs on the market, but only a few hundred blockbusters. Yet between them, they make up more than a third of the revenue of pharmaceutical companies. As in Hollywood, blockbusters are big business, and everyone wants to discover a new one. For Aston University and Cancer Research UK, discovering a blockbuster was unlikely and transformative.

Temozolomide cleared the $1bn threshold in 2008, earning $1.02bn for Schering-Plough under the names of Temodar and Temodal. (1) It is a chemotherapy drug used to treat glioblastoma multiforme (GBM), one of the most common and aggressive brain tumours. In the 1970s, Professor Malcolm Stevens and his group at Aston University (which he himself called an “unfashionable” institution) set out looking for nothing in particular. When research student Robert Stone joined Malcolm Stevens’ team, Stevens told him simply to “make some interesting molecules”.(2) They were searching for molecules with properties that could be harnessed in new and unanticipated ways—as Stevens put it, “the most precious commodities in drug discovery cancer research: novel molecules with novel biological properties”.(3) By the early 1980s, Stevens’ group were experimenting with compounds with multiple nitrogen atoms – easy to synthesise and sporting a range of interesting biological and chemical properties. By 1987, they’d developed the molecule which became the modern temozolomide.

Stevens and his colleagues wrote, in 1997, that the process of developing a new drug was a “mixture of intelligence and guesswork, dogged persistence and a major element of luck.”(4) Experimentation, innovation and encouraged imagination combined to give a new molecule that might succeed where others had struggled. Elsewhere, Stevens puts this across as a mixture of science, art and chance. He says he has “antennae attuned to recognise what are potentially cunning molecules”.(5)

Whenever luck is at work, expect a back catalogue of failed efforts. Stevens’ group at Aston knew that drugs which promise much in theory, and drugs that work in rodents and other animals often fail in “the only animal that really matters—the patient suffering from cancer”.(6) When Robert Stone synthesized a new nitrogen ring compound in the early 1980s, Stevens’ team found it had cancer-fighting capabilities in a range of tumours in mice. They named the interesting molecule Azolastone. “It was a great name, incorporating the azo group, the name of the lab (Aston) and of the student who synthesised it”, Stevens remembers.(7)

From the outset, things went sour for azolastone. At the American Association for Cancer Research conference in 1983, the team presented their finding that a single dose of azolastone had curative effects on most mouse tumours. They spoke to a mostly empty room, shelved in the last session of the day, which was entitled ‘Miscellaneous’. “It was not a good omen”.(8) Then May & Baker pharmaceuticals, part-funders of the work, forced them to change the name, believing it was too similar to a big-selling antihistamine, Azelastine. They feared the scandal that would break if a hay-fever sufferer was accidentally handed a potent chemotherapy drug. Despite strong showings in animals, the rechristened mitozolomide fared poorly in human subjects.(9) Effects on tumours were limited. Far more problematic were the extremely harmful effects on bone marrow. Suppression of bone marrow activity caused massive drops in platelet counts in the blood of test subjects, and put an end to mitozolomide’s promise.

This setback undermined the confidence of the pharmaceutical backers. A competitor sardonically dubbed it “Azo-last-one”. But seeing the potential in mitozolomide and hopeful of a breakthrough with a new variant, Stevens’ team gained support from Cancer Research’s Clinical Trial Committee, starved at the time for new drugs to test. Bringing a blockbuster to market through Cancer Research, funded, as Stevens puts it, by “running whist drives and flag days out in every village in the country”,(10) is a remarkable achievement, of which both groups are clearly proud. Learning from the problems with mitozolomide, they synthesised a new variant, temozolomide. The Phase I/II studies, which had reported by 1997, were a success. Temozolomide appeared to have the capacity to produce substantial improvements in survival figures for glioblastoma patients. Better still, it didn’t have the disastrous effects on bone marrow of mitozolomide.

Early tests were carried out at Charing Cross Hospital, where investigator Ed Newlands discovered that although single doses had little effect, a schedule of five daily doses of temozolomide could produce large effects. “One or two Phase I patients saw almost biblical cures”(11), recalled Sally Burtles, Cancer Research’s director of drug development. The drug seemed particularly potent against brain tumours. This was an exciting finding. Brain cancers had few treatment options, especially when it came to highly aggressive forms like glioblastoma. The team were confident, but remained realistic about the nature of their discovery. Stevens recognised that creating temozolomide was “neither a triumph of rational drug design nor does it result from any outstanding biological insights”.(12) Rather, it was a case of perseverance and successful collaboration between “bench” researchers (the chemists and pharmacologists whose detailed understanding of the properties of these “cunning molecules” allowed them to reach the clinical sphere) and the clinicians and researchers who drove the treatment forward. The interaction between bench or laboratory research and clinical research is downplayed by Evidence-Based Medicine. It is one which was fundamental, though, to the success of temozolomide, as we will see.

Although temozolomide had surpassed compounds like mitozolomide and demonstrated some eye-catching effects and relative safety in humans, bringing it through large-scale Phase III trials to get it licensed for widespread use for glioblastoma patients would be far more challenging. Many promising drugs which provoke great excitement and forecasts of dramatic effects fail this stringent test. Many chemotherapy drugs for glioblastoma had already met that disappointment. Cancer Research licensed the drug to Schering-Plough, and Phase III began under their banner. Given the drug’s significant promise for a cancer where no effective chemotherapy was available, and its current reputation and sales as a blockbuster, one might assume that it aced these tests. Indeed, the FDA approved temozolomide in 2005, and by 2007, NICE had recommended it as a first-line treatment for glioblastoma patients. Temozolomide was rolled out to thousands of cancer sufferers globally, and Cancer Research ploughed their Schering-Plough royalties back into their charity’s research. Alongside the bake sales and tombolas, Cancer Research UK had a blockbuster drug to fund pioneering work like that of Malcolm Stevens’ team.

But the reality was quite different. It was not the “textbook example” of drug development it has sometimes been called. For all its successes with regulators and in the marketplace, temozolomide’s interactions with clinical research were far from the overwhelmingly positive results one might expect. In fact, if the Evidence-Based Medicine hierarchical model of clinical evidence truly held sway in this case, as in many others, temozolomide might never have made it out of the door. Cancer Research and Schering-Plough would have lost a blockbuster, but more importantly, thousands of glioblastoma patients would have faced aggressive tumours without chemotherapy. Why did it fare relatively badly according to EBM’s standards, and yet reach a massive market seemingly unimpeded by those results? What does it mean for the credibility of Evidence-Based Medicine’s model that a blockbuster drug has its position based on evidence the model either disparages or can’t adequately assess?




Glioblastoma Multiforme (or GBM) is a dismal disease. It is also amongst the most common adult brain tumours. Glioblastoma is extremely aggressive and badly understood. The reasons why it occurs are opaque. The prognosis is very poor. Only around 3% of patients survive to reach 5 years after their diagnosis. (13) Most patients die within a year. As well as growing fast, it returns time and again. Even with surgery followed by radiotherapy and chemotherapy, and even if those procedures apparently remove all trace of the cancer, it almost always returns, and often very quickly. A complete recovery is therefore very unlikely, given the combination of rapid degeneration, lack of effective treatments, and high rate of recurrence.

Into this picture—or an even bleaker version before temozolomide became the standard chemotherapy intervention—stepped Malcolm Stevens and his team with their “cunning molecule”, temozolomide. Success in combating the dismal disease would be a huge breakthrough, especially given its prevalence. But for clinical trial designers, temozolomide is a problematic proposition. Even if the cancer goes into remission, it will most likely return. In some patients, the cancer spreads too rapidly for temozolomide to do its work. The reality was that even if temozolomide worked well, most patients who took it would still lose their fight with glioblastoma—either in that round or the next one.

Compare the stark picture of glioblastoma with, say, the various forms of thyroid cancer. When a thyroid tumour is detected, the standard response is to remove the thyroid gland, with radio- and chemotherapy only needed where the cancer is very advanced or has spread elsewhere. Here, 10-year survival rates in developed countries reach upwards of 90%. (14) The two scenarios are vastly different, and so must be the kinds of measurements made when evaluating treatments. It might seem that the goal of every cancer treatment is the same: remission, restoring the patient to a normal life and life expectancy. But to apply that same standard when assessing temozolomide as when assessing thyroidectomy would be far too demanding to even observe the kinds of benefits that can be achieved against GBM.

In oncology, a standard practice is to classify response to treatment in one of four ways: complete response, partial response, stable disease, and disease progression. A complete response is full remission. The cancer is gone. A partial response involves shrinking of the tumour. The disease is not eradicated, but its spread has been contained and reversed. When the disease is stable, tumour growth has halted. The disease is, for the time being, contained. When the disease progresses, the cancer is growing or spreading to other systems.

Which responses are failures? Complete response is a great success, and partial responses are usually also classed as successes. But does containing the disease count as failure? Can a treatment be successful even if the disease spreads and progresses? The U.S. National Cancer Institute’s standards define ‘treatment failure’ straightforwardly as stable disease or disease progression. If a treatment can’t shrink a tumour, it has failed, even if it stops the tumour spreading. But when faced with a rapid, aggressive disease like glioblastoma, the NCI standard risks being unable to capture the best effects we can achieve. What of a situation in which the best option is usually only able to slow down the growth of the tumour, not to halt or reverse it? The disease progresses, but less rapidly than it would otherwise do. Such slowing may nevertheless extend the patient’s life. Their quality of life too may decline less rapidly as the tumour spreads more slowly. Such effects are sometimes called “marginal benefits”, although the term ‘marginal’ should not make us think they can’t be substantial and significant for patients and their families.

The standard treatment framework for cancer therapies is not calibrated for this line of reasoning. (Oncologists facing cancers with poor odds of complete or partial recovery know this perfectly well, and operate differently. As we will see, they have their own distinct standards of evidence and measurements for success.) Standardly, clinical guidelines distinguish between “first-line” and “second-line” treatments. A first-line treatment is the one which is believed to be most likely to have the largest effects. Note here that there is often tension between the most likely and the largest effect. It’s not always the case that a treatment that can bring about major successes will work for all, most, or even a good percentage of patients. And it’s not always the case that the treatment which has the highest probability of eliciting at least some positive response across a broad group of patients is also the one capable of getting the best responses or the highest proportion of complete responses for the individuals in that group. Again, oncologists are aware of this, and much work is devoted to breaking down first-line treatments by specific forms of cancer, and according to other measurable disease markers. This kind of stratification of first-line guidance proves vital to understanding how temozolomide became part of first-line treatment for glioblastoma. It is also vital to seeing how the RCT-focused model of EBM fails in modern cancer treatment.

First-line treatments are the go-to course of action. These are usually very tightly specified and well-tested regimens. When first-line treatment fails, the “second-line” treatments can be tried instead. These are often more loosely defined and less well supported by trial evidence. As Roger Stupp put it, when asking what the best second-line therapy for a cancer is, “you will get as many different answers from so-called experts as the number of experts you ask”.(15) But this means that knowing when a treatment has “failed” is necessary to following the protocol. When has temozolomide in combination with surgery to remove a glioblastoma failed? When the cancer returns? If the cancer is spreading? None of the above?

On top of this problem, the first-line/second-line distinction causes another problem for those testing treatments like temozolomide. A carefully specified treatment regimen is necessary for a proper controlled trial. Tight control of the dosage and schedule is a must. Otherwise, a positive result would not clearly indicate whether any form of temozolomide regime or just one of the various regimens in the trial was responsible. But with a fast-moving and poorly understood disease like glioblastoma, inflexibility is a problem. It’s not just the beneficial effects of temozolomide that might require a change in dosage, schedule or number of rounds of treatment. Like most chemotherapy, temozolomide involves some serious side-effects, including nausea, hair loss, fatigue, and suppressed platelet and white blood cell counts which can make patients very vulnerable to infections. Flexibility helps in managing these effects, and in decreasing the likelihood that patients will decide to terminate treatment early due to unmanageable adverse effects.

So how should the performance of temozolomide be measured? A standard measure is 5-year survival. But a study of 5-year survival rates would be extremely difficult to use to test for a beneficial effect. With the extremely high relapse rate, even patients for whom temozolomide worked well in the first instance would be unlikely to survive for 5 years. The survival percentages are low enough that excessively large trials would be needed to attribute a detectable increase in 5-year survival to temozolomide. But more importantly, this is not the realistic goal of the treatment. Rather, the goal is to slow patients’ decline and extend their life expectancy. A different measure, median survival, is usually employed instead. This much less optimistic sounding figure measures not how many patients survive to a particular point, but the point at which half of the patients studied have died.
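The point about trial size can be made concrete with a rough sketch. It uses a textbook normal-approximation formula for comparing two proportions, which is an assumption of this illustration and not the method of any trial discussed here. The 3% baseline comes from the 5-year survival figure above; every other rate is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def patients_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate patients needed per arm to detect a difference
    between two survival proportions (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_treated - p_control) ** 2)


# 5-year survival: 3% baseline (from the text) against a hypothetical 5%.
rare_endpoint = patients_per_arm(0.03, 0.05)

# A hypothetical earlier endpoint where survival is less rare: 10% against 25%.
common_endpoint = patients_per_arm(0.10, 0.25)

print(rare_endpoint, common_endpoint)
```

Under these assumed figures, the rare endpoint calls for well over a thousand patients per arm, while the more common one needs roughly a hundred. The exact numbers matter less than the order-of-magnitude gap.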

Glioblastoma has a median survival of around one year. That means that half of all GBM patients die within a year of diagnosis. Long-term survival rates are close to zero. Median survival as a figure can be particularly useful here as a primary measure—in combination with 2 and 5-year survival rates as secondary measures—because it puts the emphasis on increasing life expectancy without demanding that patients recover entirely to count as a success for the treatment. But one of the drawbacks of this kind of measure is that the further fates of the other half of the population, the ones who outlive the others, aren’t taken into account. So, if a treatment had a major effect for a small group of patients, it might go unnoticed by this measurement. The 2-year and 5-year survival rates could help to catch that possibility.
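A toy calculation shows how a large benefit for a minority can leave the median untouched while moving the 2-year survival rate. The survival times below are invented for illustration and are not trial data. The sketch also ignores censoring, which a real survival analysis would have to handle.

```python
from statistics import median


def two_year_survival(months):
    """Fraction of patients still alive at 24 months."""
    return sum(m >= 24 for m in months) / len(months)


# Hypothetical survival times in months for ten patients per group.
control = [4, 6, 8, 10, 12, 12, 14, 16, 18, 20]

# The same cohort, except that the three longest survivors respond strongly.
treated = [4, 6, 8, 10, 12, 12, 14, 30, 36, 48]

print(median(control), median(treated))                        # 12.0 12.0
print(two_year_survival(control), two_year_survival(treated))  # 0.0 0.3
```

Both groups have a median of 12 months, because nothing changed for the half of patients who died first. Only the 2-year figure registers the three strong responders.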


Testing Temozolomide


Phase I and Phase II trials were completed, and temozolomide came out showing plentiful promise. There was excitement. But just like mitozolomide flunking out in the early phases, very few new treatments pass the third stage—large scale randomized controlled trials. Professor Roger Stupp, the Swiss neuro-oncologist who spearheaded temozolomide’s Phase III trial, darkly jokes: “I often state that I stay in drug development because I like to use the agents when they are still considered effective or promising.” (16)

Starting in 2000, Stupp brought together groups based at the European Organisation for Research and Treatment of Cancer (EORTC) and the National Cancer Institute of Canada (NCIC) to begin a big multicentre trial. (17) Their trial studied whether temozolomide in combination with radiotherapy improved median survival for newly-diagnosed glioblastoma patients, compared to radiotherapy alone. These are quite favourable terms for a test of a new treatment. There was no additional placebo for the control group—just the radiotherapy. One might suspect, then, that even without strong benefits for the temozolomide group, the trial might give a slight positive result. The knowledge that they were receiving chemotherapy and the hope of receiving a much-hyped experimental treatment might help the patients to survive longer on its own. There was no way the control or experimental group could be blinded to the treatment they were receiving, and absence of blinding will also tend to favour the experimental treatment in many ways.

Otherwise, the trial ticked most Evidence-Based Medicine boxes. From a relatively large sample size of 573 patients, to thorough checks for balanced baseline characteristics between the control and experimental group, to intention-to-treat analysis, most markers of evidence quality were met. This would be a major test of whether temozolomide could deliver on its promise as a glioblastoma chemo drug. A negative result would most likely kill the drug, and Malcolm Stevens’ line of research, entirely.

The patients in the experimental arm got six weeks of radiotherapy each weekday, along with a daily dose of temozolomide. They then had a four-week break, before starting up to six courses of temozolomide treatment. Each course lasted 28 days, with 5 days of temozolomide followed by 23 days off.

But glioblastoma doesn’t wait. 85% of the experimental group patients made it through the six weeks of radiotherapy plus chemotherapy, but by the time the initial radiotherapy had been completed, only 78% of the patients were willing or able to continue with the trial. Those who dropped out at this stage mainly did so due to disease progression. With nearly a quarter of patients already lost to the rapid advance of the disease, 53% of those who remained then discontinued their temozolomide treatment before receiving the full 6 rounds of chemotherapy. Only 105 patients received the full course of the drug. Again, most dropped out due to the progression of their disease, and a few due to the toxic side-effects of the drug.

All of these patients counted in the final analysis, whether they made it through the treatment or not. This is a basic principle of intention-to-treat (ITT) analysis. Evidence-Based Medicine’s standards of evidence require that trials use ITT analysis. There is a very good reason: attrition bias. Attrition is the process of patients dropping out of trials. If we only count the patients who stuck out the course, this creates a major bias. The patients who stay are the ones whose disease did not progress seriously enough to kill them or to cause them to withdraw, and whose bodies (and spirits) could better tolerate the side-effects. Without the patients who died or dropped out before they got the full run of temozolomide treatments being included in the analysis, the trial would essentially cherry-pick the patients most likely to benefit from temozolomide, and those most likely to survive longer independently of it: the healthier cases. Evidence-Based Medicine here asks a particular question, and grounds its requirements on that question: ‘What’s the chance of temozolomide making a difference compared to radiation alone for a newly-diagnosed patient?’ To answer that question properly, we cannot first deduct all the patients who wouldn’t make it through a full course of the drug.
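The difference between counting everyone and counting only those who finished can be sketched with a handful of invented patients. The records below are hypothetical, chosen to make the bias visible. They are not drawn from the trial, and the function and field names are illustrative only.

```python
from statistics import median

# Hypothetical records: (arm, completed_full_course, survival_months).
patients = [
    ("treatment", False, 3), ("treatment", False, 5), ("treatment", False, 7),
    ("treatment", True, 14), ("treatment", True, 18), ("treatment", True, 22),
    ("control", True, 4), ("control", True, 6), ("control", True, 8),
    ("control", True, 12), ("control", True, 15), ("control", True, 19),
]


def median_survival(rows, arm, completers_only=False):
    """Median survival for one arm, optionally dropping non-completers."""
    times = [t for a, done, t in rows
             if a == arm and (done or not completers_only)]
    return median(times)


control_median = median_survival(patients, "control")
# Intention-to-treat: everyone randomised to the arm is counted.
itt_treatment = median_survival(patients, "treatment")
# Completers only: early deaths and dropouts are silently discarded.
completers_treatment = median_survival(patients, "treatment", completers_only=True)

print(control_median, itt_treatment, completers_treatment)  # 10.0 10.5 18
```

In this toy data the intention-to-treat comparison shows a gain of half a month, while the completers-only comparison suggests eight months. The second figure mostly reflects who was well enough to finish the course, which is the cherry-picking that ITT analysis is meant to prevent.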

As ever with glioblastoma, the results make for tough reading. After 28 months, 480 of the original 573 patients had died: 84%. Median survival in the control group was 12.1 months. For those in the temozolomide group, it was 14.6 months. Median survival, then, was around two and a half months longer with temozolomide.

Given the serious toxic side-effects of the drug, the chance of opportunistic infections, and the burden of the hospital regime, for many patients facing a short life expectancy, temozolomide might not seem an appealing proposition. Worse, for insurers and healthcare providers, both private and national, this increase in life expectancy might look too small to justify the costs involved in treatment. Temozolomide had faced a trial partly rigged in its favour, and come out with marginal results. For most EBM analyses, this result would probably mean little was heard of Malcolm Stevens’ team’s “cunning molecule” again.

In the report of the trial, published in the New England Journal of Medicine in 2004, Stupp and his colleagues sound relatively optimistic: “In conclusion, the addition of temozolomide to radiotherapy early in the course of glioblastoma provides a statistically significant and clinically meaningful survival benefit. Nevertheless, the challenge remains to improve clinical outcomes further.”(18) One of the factors that underpinned this interpretation was an increase in the rate of 2-year survival associated with temozolomide. 10% of control group patients were alive after 2 years, but 27% of temozolomide patients reached the 2-year mark. This large increase might be surprising given that median survival remains so short. But the combination of these findings suggests that what is happening bears out the concerns about using median survival as a measurement. A minority of patients (possibly around 17% of them, given the change in survival rates) were experiencing a big effect on life expectancy due to the temozolomide treatment. That effect was not strongly reflected in the median survival statistic because it was swamped by the majority who gained little if anything from the drug.

On the face of it, Roger Stupp and his colleagues’ international trial gave not so much an evidence base for using temozolomide as a first-line treatment for glioblastoma, as evidence that something more complex was occurring. The interaction between their two tiers of findings produced a result more interesting and significant than either alone, and answered a question the trial was not set up to ask. Most promisingly, if temozolomide was helping a small group of patients, they might be identified. If they were identified, temozolomide might be used to bring them major benefits. And perhaps, further down the road, oncologists might be able to use what they’d learned about who benefits from temozolomide to help other patients get those benefits too, or to understand the dismal disease.


MGMT and the re-analysis


When Roger Stupp and his colleagues published their results in the New England Journal of Medicine, they came alongside an editorial by prominent neuro-oncologist Lisa DeAngelis. (19) She declared “a new beginning” in the treatment of brain tumours. She focused on temozolomide as “a substantial step forward” to address the “dismal outcomes” of glioblastoma. She lamented that median survival for glioblastoma had advanced very little since the 1970s, when the Brain Tumor Study Group reported median survival at 10 months. (20) She compared temozolomide favourably to the range of chemotherapy drugs which had previously been tried and failed.

Her enthusiastic response was not shared by all. For instance, one letter to the New England editors by Dr. Robert Aitken is damning of DeAngelis’s optimism: “I am surprised that Dr. DeAngelis […] chose “A New Beginning” as the subtitle of her editorial […] I would have chosen the more sanguine expression “Is That All?”.”(21) He goes on to argue that a slim 2.5-month increase in median survival, a small advance on the 1978 median survival figures, is “not a very auspicious ‘beginning’”, and that after 27 years of failed experiments with a class of chemotherapy drugs, the researchers are “barking up the wrong trees”.

Lisa DeAngelis provided some fuel for this fire in her editorial. The dismal disease was not in retreat. The 2-year survival rates in Stupp et al.’s trial looked much higher than oncologists would expect, in the control arm no less than the experimental arm. They reported 10% 2-year survival for radiotherapy alone. DeAngelis had several explanations, and increased quality of treatment was not required to explain the discrepancy. The trial participants had been selected to be more likely than the average to survive. They were relatively young, all under 70. Most had received some form of surgery to reduce the size of their tumour, which indicated that they were in good enough health to withstand surgery. These selection criteria make a lot of sense. The more that disease progression prevented participants from receiving the full course of treatment, the more of an uphill struggle it would be to discern the effects of temozolomide. On top of those factors, temozolomide has a less toxic side effect profile, which means patients on the new drug might be able to take it for longer, explaining at least part of why they fared better than patients outside the trial on average.

Had the trial results been published alone, this response might have taken root as the default view of temozolomide, like many other chemo drugs before it. But the Stupp et al. study was packaged in this special edition alongside a reanalysis of the trial data performed by Monika Hegi and her colleagues. (22) Hegi crafted a translational study to investigate a different question: ‘Which patients benefit from temozolomide?’

To understand her study, and the reasons for performing it and taking its findings seriously enough to cast a new and positive light on the Stupp trial data, we must delve into the details of what temozolomide does. Temozolomide is an alkylating agent. It affects DNA. Damaging a cancerous cell’s DNA can be a powerful way to stop tumour growth. Cancer cells divide and spread quickly. Attacking their DNA and introducing errors when it is copied during cell division could slow or halt the tumour growth and even cause the cancerous cells to destroy themselves.

But targeting DNA comes with costs. Most alkylating agents cannot target exclusively the intended cancerous cells, so the DNA of other body cells is often affected in the bargain. This lack of targeting makes alkylating agents unfashionable in modern oncology. (23) That temozolomide could be so successful despite this indicates the desperation in glioblastoma treatment. Other cells that divide frequently—in the bone marrow and hair follicles, for instance—are often damaged. Drugs like mitozolomide had unpredictable and uncontrollable harmful effects on bone marrow function and blood cell production. This produces knock-on effects on the immune system, making some patients extremely vulnerable to opportunistic infections. It also explains why bone marrow can be seriously affected in some cases by temozolomide, and why hair loss (attacking the hair follicles), nausea and diarrhoea (attacking cells in the digestive tracts) are common side-effects.

Our bodies have mechanisms that repair damage to DNA done by alkylating agents. The protein O6-alkylguanine-DNA-alkyltransferase (AGT) counteracts some of the effects of alkylating agents. This protein is coded for by the MGMT gene. It had already been hypothesized that AGT would repair the damage that temozolomide did in tumours. When repaired, the cells would not die. The body’s self-defense mechanisms would undo the harms deliberately inflicted on the cancerous cells. Malcolm Stevens and his colleagues clearly foresaw this issue. In their 1997 paper, they named AGT repair of the DNA lesions created by temozolomide as the first challenge to the drug working in live patients.(24)

When it fixes the damaged DNA, AGT is consumed in the process; it is known as a suicide enzyme. So, cells need to produce more to repair ongoing damage. The MGMT gene which allows more AGT to be produced, though, can be “switched off”. The process of methylation can effectively silence specific genes, and silencing genes which inhibit abnormal growth is often an early stage in cancer development. In some patients, the promoter of the MGMT gene in the cancerous cells may be methylated. In those patients, the cells won’t be able to replenish AGT to repair the damage done to their tumour by temozolomide. The cells should be much more vulnerable to temozolomide’s effects: more likely to fail in cell division and destroy themselves. The idea, then: turn a fact about the patient or a feature of their cancer to the advantage of the chemotherapy drug.

These were the high responders, a fortunate foreseeable group of patients who might benefit much more than the average glioblastoma patients. Their tumour had a vulnerability that could be exploited to stop it growing and kill its cells. On the other hand, if a patient’s tumour cells were producing large quantities of AGT—the MGMT gene was not silenced—then the oncologist would suspect that the damage temozolomide did to the tumour would be swiftly repaired. For those unfortunate patients, temozolomide was not likely to work: they were the low responders.

Enter Professor Monika Hegi of the Laboratory of Tumor Biology and Genetics in Lausanne, Switzerland. Her lab produces translational research, bringing the underlying mechanisms of tumour growth into clinical focus. For glioblastoma, she says, “we need to identify predictive factors for response to therapy and discover new targets for future therapies.” (25) Hegi and her colleagues tested tumour tissues from the patients in the Stupp trial, and determined which had the MGMT gene methylated, and which did not. They re-analysed the trial data, performing a subgroup analysis. Patients in the control and experimental groups were divided up according to whether they were MGMT-active or not. The hypothesis was that the difference in median survival between the control and experimental group would be much higher amongst those patients who were MGMT-inactive than amongst those who were MGMT-active. In other words, they’d find a larger effect in the patients whose MGMT gene had been silenced.

The study was not as large as the original data-set—only 307 of the original 573 patients had tumour specimens taken. Of those, 206 were able to be classified as MGMT-active or inactive. Of those 206, 92 were MGMT-inactive and 114 were MGMT-active. So just under half of the patients whose epigenetics could be analysed happened to fall into the predicted high-responder category, and just over half were predicted to be low responders. (26) The MGMT-active and inactive patients were well-balanced between the control and experimental groups.

The first highly interesting result was that whether the patient received temozolomide or not, the median survival in the MGMT-inactive group was higher than in the MGMT-active group. Going in, the hypothesis had only been that MGMT status would matter for outcomes when given temozolomide. But in fact, patients whose MGMT gene was silenced were doing better across the board. There are many reasons why this could have happened. It is possible that the patients who had tumor samples taken and whose MGMT status could be determined had been healthier going in. It could be that the MGMT-inactive tumours are a distinct, less virulent variant of glioblastoma than the MGMT-active ones. It might be that the radiotherapy treatment was also doing DNA damage which AGT was helping to repair, or that MGMT-inactivation correlated with decreased function in other aspects of DNA-repair, making those tumours more vulnerable to other treatments, not just temozolomide. Finally, as one commentator pointed out, after disease progression, patients in the control group were offered temozolomide treatment—so it might be that the improved survival figures were due to temozolomide in both groups. (27)

The headline result for the prospects of temozolomide, though, was the gulf between the effects of temozolomide on median survival across the two subgroups. Remember that when viewed as an undifferentiated block, temozolomide patients’ median survival was 14.6 months, compared to 12.1 in the radiotherapy-only group. When only MGMT-inactive patients—the prospective high responders—were considered, the temozolomide group’s median survival was 21.7 months, with the control group at 15.3. Temozolomide treatment seemed to have increased the median survival by over half a year—perhaps more if the high result in the control group was also due in part to receiving temozolomide after their cancer progressed.

By contrast, the low responders’ outcomes were well within the margin of error of no effect at all. The MGMT-active temozolomide group had a median survival of 12.7 months, compared to 11.8 months in the control group. As Malcolm Stevens’ team had anticipated, the desired effects were negated where the cells were readily producing an AGT antidote to temozolomide’s tumour poison.
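The arithmetic behind these comparisons can be laid out explicitly. The sketch below uses only the median survival figures quoted above; the dictionary layout and the function name are illustrative choices, not anything taken from Hegi’s paper. A difference in medians is a crude summary and says nothing about uncertainty, so this shows the size of the gaps rather than their statistical significance.

```python
# Median survival in months, as quoted in the text
# (Stupp's trial overall, then Hegi's two subgroups).
MEDIAN_SURVIVAL = {
    "all patients": {"temozolomide": 14.6, "control": 12.1},
    "MGMT-inactive": {"temozolomide": 21.7, "control": 15.3},
    "MGMT-active": {"temozolomide": 12.7, "control": 11.8},
}


def median_gain(group: str) -> float:
    """Gap in median survival (months) between temozolomide and control arms."""
    arms = MEDIAN_SURVIVAL[group]
    # Round to one decimal place to match the precision of the quoted figures.
    return round(arms["temozolomide"] - arms["control"], 1)


gains = {group: median_gain(group) for group in MEDIAN_SURVIVAL}

if __name__ == "__main__":
    for group, gain in gains.items():
        print(f"{group}: +{gain} months")
```

On these figures the gap is 2.5 months for the undifferentiated trial population, 6.4 months for the MGMT-inactive subgroup and 0.9 months for the MGMT-active subgroup.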

But, as with the original trial, it is the two-year survival rates which provide the starkest comparison, pushing the plausibility of the MGMT interpretation. Again, MGMT inactivity is a big deal on its own. Of the MGMT-inactive patients in the control group, 22.7% were alive at the two-year mark, most by then having taken temozolomide too. In the MGMT-active patients who didn’t receive temozolomide, none survived at two years. Temozolomide had still affected two-year survival for the MGMT-active group. Compared to no survivors in the control follow-up, 13.8% of the MGMT-active temozolomide patients were alive at two years. But this was still way below the benchmark set by the control patients who were MGMT-inactive. On this measure, MGMT status was clearly extremely important. But the most impressive figure was the survival rate for the MGMT-inactive patients who received temozolomide, the core high response group—46% survived to two years.


Subgroups and Meta-Analysis


How does a hierarchy of evidence judge this study? The answer depends on what kind of study you think it is. The data comes from an RCT. But the study itself is certainly not an RCT. It could be conceptualized in two ways—as a subgroup analysis or as a form of meta-analysis. Even though many of EBM’s hierarchies rank meta-analysis as the highest level of evidence, they definitely don’t mean this kind of meta-analysis. In fact, no matter which way you look at it, Hegi et al.’s study would be considered weak, low-quality evidence.

Hegi’s work is most easily categorized as a subgroup analysis. After the fact, you take the data, divide it into two subgroups, and compare their results. Even when (unlike Hegi’s study) performed on all the data from an RCT, this kind of study is not randomized. As the Cochrane Collaboration put it: “Subgroup analyses are observational by nature and are not based on randomized comparisons”.(28) Patients weren’t randomly allocated to have or not have the MGMT gene activated. This means that features of the patients other than their gene activity and their treatment group could be correlated with which subgroup they are in. For example, imagine that younger patients were more likely to have MGMT-inactive tumours than older patients. This would mean an imbalanced comparison between the groups. Other features of the tumour might also correlate with it being MGMT-inactive, as the ‘two kinds of glioblastoma’ hypothesis runs. According to Evidence Based Medicine’s own logic, the best way (maybe even the only way) to achieve balanced comparisons in which nothing is correlated with treatment is to randomize. But introducing subgrouping to the mix takes away that benefit. Hegi’s study must drop down the hierarchy to, at best, the level of an observational study.

Many Evidence Based Medicine proponents (and, for that matter, critics) are skeptical of subgroup analyses. They are right to be skeptical in general. There are lots of ways to abuse subgroup analysis as a tool. Perhaps the most pernicious is through data mining. There is big money to be made in getting a treatment like temozolomide onto the market. Publications and career prospects for the researchers also hinge, too often, on massaging a positive spin out of results. What’s a researcher to do if the results come back showing no evidence of a marketable and appealing effect? Look for a subgroup which does show an effect. And if you look at hundreds or thousands of subgroups, you’ll eventually stumble on one, probably several, in which the treatment comes back showing a significant effect. If you can parlay that subgroup into a viable market for the drug through some plausible enough concocted story, you can pass off the result as a success for all involved. The treatment isn’t ineffective; it’s just choosy.

This problem is particularly apparent where genetics is involved. There are somewhere in the region of 20,000 genes in the human genome. Nowhere else is the scope for data mining so vast. An unscrupulous or uninformed analyst could keep picking genetic variations to test against the data, sure in the knowledge that they would find a strong correlation eventually.
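A simplified calculation shows why indiscriminate subgroup hunting is so dangerous. Assume, purely for illustration, that the treatment has no real effect in any subgroup, that every subgroup test is independent of the others, and that each uses the conventional 5% significance threshold. None of these assumptions is drawn from the studies discussed here, and real genetic markers are rarely independent, so the numbers are indicative only.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests comes
    back 'significant' at level `alpha` when there is no real effect anywhere."""
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # Chance that every test stays non-significant is (1 - alpha) ** num_tests;
    # the complement is the chance of at least one spurious 'finding'.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(f"{k:>5} subgroups tested: {chance_of_false_positive(k):.3f}")
```

Under these assumptions a single pre-specified test carries a 5% risk of a spurious result, ten tests carry a risk of about 40%, and a hundred tests make at least one false positive all but certain.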

On the face of it, there’s lots to worry about in the temozolomide case. Stupp’s study was funded by Schering-Plough, who stood to make a lot of money. Stupp’s own career was propelled forward as the man responsible for a breakthrough on a previously intransigent disease. Hegi’s analysis was performed separately after the fact, not built into the study from the outset. Even the way things were presented, with the study and the reanalysis coupled in the same edition of the same journal to ensure the positive spin was immediately received, may rankle.

But just because subgroup analysis can be abused to the extent that many are worthless or even exploitative, does not mean that no subgroup analyses can provide powerful and important information. The same potential for manipulation and the same problematic incentives are very much present in randomized trials, and indeed their very power and credibility makes them a major target for manipulation. This alone does not warrant downgrading all subgroup analyses—rather, we have to go case by case and see whether each instance is credible.

There are several reasons to think Hegi’s analysis is credible. First, Hegi was not testing a big array of genes against the data, hunting indiscriminately for a correlation. She chose the specific gene methylation and tested only that. Second, she had a very good reason for her choice, namely the reasoning behind how temozolomide would work, which was not cooked up to explain the subgroup analysis, but clearly formulated and explained beforehand by the drug’s discoverers, amongst others. Third, commercial interests would likely be actively harmed, not benefited, by her work. Hegi concludes: “patients whose tumors are not methylated at the MGMT promoter appear to derive little or no benefit from the addition of temozolomide to radiotherapy. For these patients, alternative treatments with a different mechanism of action or methods of inhibiting MGMT should be developed.”(29) Temozolomide had left Stupp’s trial with a weak result, but one sufficient, in the absence of any viable alternatives, to get it immediately licensed by the FDA. Hegi was suggesting that the treatment barely worked, if it worked at all, for over half of its prospective market.

Most importantly, though, the best defense against data mining is replication. If the association between a gene and an outcome appears in multiple studies on different data sets by independent investigators, it’s increasingly unlikely to be a chance artefact. Not only had Hegi and Stupp found the same effect (albeit at a smaller scale) the year previously when applying the same subgroup analysis to Stupp’s small Phase II of temozolomide(30), but the link had been shown in a preliminary study by Manel Esteller back in 2000. (31)

Each of these studies in isolation may be weak evidence in favour of the link between MGMT inactivation, temozolomide and survival. But put together, they form far stronger evidence than the sum of their parts. The interaction of the three sources is synergistic: they reinforce one another by removing (at least) one potential way to explain their findings away, namely data mining.

But even more importantly, these findings simply could not be achieved through a randomized trial. There is no good way to answer the question, ‘Who benefits from temozolomide?’, using an RCT design. We can’t randomise some patients to be MGMT-inactive and others to be MGMT-active. If we want this kind of information—and the example of temozolomide makes the case that we definitely do—then we will have to turn to non-randomised sources to get it.

Why can’t we just do another RCT to answer this question? The problem can be seen by thinking about Hegi’s design in a different way. Rather than thinking of it as taking an RCT and splitting it into subgroups, imagine her work as taking what was originally one RCT and separating it out into two different RCTs. Let’s call them the Active Trial and the Inactive Trial. In the Active Trial, Stupp’s researchers enrolled only MGMT-active patients, randomized them, and then Hegi and her colleagues compared their outcomes. In the separate Inactive Trial, Stupp’s researchers enrolled MGMT-inactive patients, randomized them, and Hegi and co. did that analysis. Two trials, two sets of results, two average effects of temozolomide compared to radiotherapy alone, in two different populations. There is not too much difference between this re-imagined scenario and what Stupp and Hegi actually did, or between it and what would be done if two entirely new trials were commissioned to answer the question.

But the problem is the same. It does not handily evaporate when reworded. We are now comparing the results of two trials with one another. One trial, Active Trial, shows an average treatment effect of a 0.9-month increase in median survival. Another trial, Inactive Trial, shows a 6.4-month increase. But if we want to know about which factors make a difference in temozolomide treatment, we can’t stop there. We have to make some kind of comparison between the results of the trials. Statistical methods of comparing results across trials are called meta-analyses.

This may sound like the EBM hierarchical approach has an out. Reclassify Hegi’s work as a form of meta-analysis, which ranks at the top of many (but far from all) hierarchies, and it can be called high-quality work. But meta-analysis comes in many forms. Evidence Based Medicine has been very specific about the kind of meta-analyses their hierarchies rank highly. These are meta-analyses performed as part of a systematic review: meta-analyses designed to estimate the average treatment effect in a broad population. In other words, a big RCT made up of other smaller RCTs, not designed to tease out differences in results in different populations.

A systematic review is a process of gathering together all of the data from studies of a particular type relating to a predefined question. For example, a researcher could perform a systematic review of the question of whether temozolomide is more effective than radiotherapy alone for glioblastoma patients. The researcher would use a clearly specified protocol to search the medical literature—systematically—for any trials relating to temozolomide. They gather those studies together, and make judgments informed by the trials taken together as a data set. When several trials are brought together in this way, the hope is, the researcher can identify the true effect of temozolomide more precisely and confidently because she has a much larger data set.

If the study results are comparable enough, the researcher might perform a meta-analysis. There are many techniques available to do this. The most straightforward is to take a weighted average of the effect sizes found in each trial. Usually, the trials are weighted by size—so, a larger trial (which is expected to give a more precise estimate of the effect and be less vulnerable to bias) counts for more than a smaller trial which is, in theory, more vulnerable to biases. This pools the data from each trial, to try to answer the same question the individual trials attempted to answer, but with more data, precision and confidence.
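A minimal sketch of this kind of pooling follows, weighting each trial by its size as described above. The trial names, patient numbers and effect sizes are invented for illustration and do not correspond to any real temozolomide trials. Many published meta-analyses weight by inverse variance rather than raw sample size; size weighting is used here only because it is the scheme the text describes.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    name: str
    n_patients: int
    effect: float  # e.g. gain in median survival, in months


def pooled_effect(trials: list[Trial]) -> float:
    """Sample-size-weighted average of the effect sizes across trials."""
    total_patients = sum(t.n_patients for t in trials)
    if total_patients == 0:
        raise ValueError("cannot pool trials with no patients")
    return sum(t.n_patients * t.effect for t in trials) / total_patients


# Hypothetical trials: every number here is made up for the example.
trials = [
    Trial("Trial A", 500, 2.0),
    Trial("Trial B", 100, 5.0),
    Trial("Trial C", 400, 3.0),
]
pooled = pooled_effect(trials)

if __name__ == "__main__":
    print(f"Pooled effect: {pooled:.2f} months")
```

Note how the small Trial B, despite reporting by far the largest effect, barely moves the pooled estimate: the answer is a single average, with any differences between trial populations smoothed away.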

This kind of meta-analysis provides the evidence that EBM classifies as high-quality. It amounts to an amalgamation of RCT results. Meta-analysis is almost never ranked higher than the ranking of the individual RCTs that compose it—and for good reason. A bunch of biased trials giving flawed results will only be amplified in their biases when gathered together. In many ways, a meta-analysis of poor-quality studies will be even worse—as the Cochrane Handbook plainly recognizes: “If bias is present in each (or some) of the individual studies, meta-analysis will simply compound the errors, and produce a ‘wrong’ result that may be interpreted as having more credibility”.(32)

For temozolomide, what matters is that comparing the results of Active Trial and Inactive Trial to investigate who benefits from temozolomide is certainly not a meta-analysis of this kind. We are not amalgamating the results of Active Trial and Inactive Trial, averaging them, and producing a single average result. In fact, doing that would be nothing more than stating the overall results Stupp et al. found in their trial! Clearly, if Hegi’s study was a meta-analysis, it wouldn’t be one as EBM hierarchies think of them.

Instead, this second breed of meta-analysis sees trial results as data points, and characteristics of the trial population as the variables of interest. Looking at between-trial variation is very similar to looking at within-trial variation. Rather than dividing patients in existing trials into subgroups, this kind of meta-analysis asks whether the effect sizes found in trials are correlated with properties of those trials—properties such as the proportion of patients who were MGMT-inactive.

Let’s imagine that 10 trials of temozolomide against radiotherapy alone had been performed. It just so happens that one trial was done only on MGMT-active patients (Active Trial) and one only on MGMT-inactive patients (Inactive Trial). The other 8 were done on a mixture. But, as luck would have it, the proportion of MGMT-inactive patients in each study was different. In some it was the population-level average of around 45%, but in others it was as high as 75% or as low as 15%. If we suspect MGMT inactivity correlates with high response to temozolomide, we’d make a simple prediction: the higher the percentage of patients who were MGMT-inactive, the larger the effect the trial will show. In particular, we’d expect the largest effect size in Inactive Trial and the smallest in Active Trial. This is the kind of meta-analysis that can be performed to study which features of patients or their diseases are linked to high and low response to treatment.
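The ten-trial thought experiment can be sketched as a simple regression of effect size on the proportion of MGMT-inactive patients. Everything in the data below is hypothetical: only the two endpoint effects (0.9 months for the all-active trial, 6.4 months for the all-inactive trial) echo figures from Hegi’s subgroups, and the eight mixed trials are invented. The fit is an ordinary, unweighted least-squares line; a real meta-regression would normally weight trials by their precision.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical trials. First entry plays the role of Active Trial (0% inactive),
# last entry the role of Inactive Trial (100% inactive).
proportion_inactive = [0.00, 0.15, 0.30, 0.40, 0.45, 0.45, 0.50, 0.60, 0.75, 1.00]
effect_months = [0.9, 2.0, 2.3, 3.4, 3.1, 3.6, 3.5, 4.5, 4.8, 6.4]

slope, intercept = fit_line(proportion_inactive, effect_months)

if __name__ == "__main__":
    print(f"Estimated effect with no MGMT-inactive patients: {intercept:.2f} months")
    print(f"Extra effect per 10 points of MGMT-inactive share: {slope / 10:.2f} months")
```

A clearly positive slope is what the MGMT hypothesis predicts. The trial results are the data points here, and the patient mix of each trial is the explanatory variable, which is exactly what distinguishes this design from the pooling kind of meta-analysis.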

Make no mistake: this 10-trial meta-analysis is an observational study. It takes 10 RCTs as its input, but it has not performed any form of randomization. The variable of interest is the proportion of patients who are MGMT inactive. We didn’t randomize some trials to be 75% inactive and others to be 15% inactive. The comparison is possible only by happenstance. The randomization that happened inside each individual trial is not particularly important here. If we rejected Hegi’s study because it’s non-randomized and then went out to do randomized trials in a bunch of different populations, and then compared them, this still would not be a randomized study. It would remain an observational study, and suffer all the penalties in EBM’s evidence calculus that come along with that.

Could we do a randomized study on this topic? Technically, yes. We could recruit a bunch of researchers to plan a bunch of trials, and then randomly assign each to recruit a specific proportion of patients with MGMT-inactivation, from 0% up to 100%. But for randomization to accrue any of the benefits that EBM proponents hope it will have, using the law of large numbers to balance other variables roughly evenly between the studies, we’d need to conduct a huge number of trials. Certainly, we’d need enough trials to create massive evidential overkill, along with a huge waste of resources. The resources would be particularly wasted because, as a reader of Hegi’s study and well-conducted meta-analyses of between-trial variation will see, observational studies like this can provide powerful evidence to answer questions about which patients benefit the most and the least.

So, questions about who benefits and how much don’t require randomized trials to provide strong evidence for an answer. Randomized trials can sometimes provide the data which feeds into these studies, as in the case of Hegi’s subgroup analysis or our imagined meta-analysis. But again this is not always necessary. Sometimes, observational studies can feed these approaches. Outcomes research can be a really powerful way to gather a large amount of data. Resources like the Swedish Biologics Register gather together data about patients, their diseases, their treatment, and their outcomes across large, often nationwide, populations. This data can then be interrogated to answer questions about the relationship between patients’ features, treatments and outcomes. If a large international database of cancer patients, their tumour genetics, treatment data, and survival existed, for example, researchers could study that data to determine whether MGMT activity in temozolomide-treated patients correlated with outcome. In combination with Hegi’s study, a positive result in outcomes research of that nature would be yet stronger evidence, and further synergize, enhancing confidence in the finding.

Prospective, but non-random, studies can also be performed. The idea that MGMT-inactivation is a key predictor of response to treatment forms the first strand of a predictive model of temozolomide’s effects. To test this model, we can recruit patients newly diagnosed with glioblastoma, test their tumours for MGMT expression, and then make a prediction about the probability that they’ll survive to two years. If the outcomes match the predictions, then the study provides evidence for the accuracy and predictive power of the model, and evidence for the underlying assumptions.

Of course, the relationship between MGMT-inactivation and temozolomide responsiveness might prove not to be causal. A fully-fledged RCT cannot be performed here because MGMT-activation might correlate with some other factor which actually influences glioblastoma survival. As long as those variables aren’t separable, the causal link can’t be proven. For instance, the idea that there might be two (or more) kinds of glioblastoma, one of which is MGMT-inactive, and which is less virulent, might provide an alternative explanation. (33) Randomization can do nothing to prevent this. Deriving the underlying data from randomized trials would not remove the relationship between MGMT-activity and tumour type or allow us to attribute causation more precisely.

The idea that another factor might underlie the correlation between MGMT-activity and temozolomide response is only worrying if the relationship between that factor and MGMT-activity is weak. If everyone who has the less virulent tumour has MGMT-inactivation, and everyone who has the more virulent tumour doesn’t, then from a treatment perspective there’s really no important difference between saying ‘Temozolomide works better in MGMT-inactive patients’ and saying ‘Temozolomide works better in patients with the less virulent tumour type’. The difference only matters if new projects are undertaken to inhibit MGMT expression in MGMT-active tumours. If the MGMT-activity status was a red herring, those projects will probably fail. If they fail, though, this might just lend further evidence to flesh out the taxonomy of glioblastoma types.

But if the association is imperfect, there’s a bigger problem. Imagine we found that a new drug worked really well on long-haired patients, and really badly on short-haired ones, on average. Suppose we acted on that information, and gave the drug only to long-haired patients. Short-haired patients don’t get the drug—it won’t work for them, so it’s not worth the costs to the health service or the side-effects for the patients. But in reality, it’s not hair length that mattered. Actually, the drug works very well on women, and very badly for men. Women are more likely to have long hair, men less so. So there was a correlation between hair length and response to the drug. Because we thought to test hair length and didn’t think to test sex, we end up giving the drug to a bunch of patients for whom it won’t work very well—long-haired men—and denying it to people who really would’ve benefited—short-haired women. The worry is that MGMT-inactivity is like having long hair, and there’s something else analogous to biological sex which is the real reason why temozolomide works for some and not for others, but which is loosely associated with MGMT status.
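The hair-length example can be made concrete with a small worked calculation. All of the numbers are invented: a population of 1,000 patients in which women are mostly long-haired and men mostly short-haired, and a drug whose true response rate depends on sex alone (assumed here to be 80% for women and 10% for men).

```python
# Hypothetical population: (sex, hair length) -> number of patients.
POPULATION = {
    ("female", "long"): 400,
    ("female", "short"): 100,
    ("male", "long"): 100,
    ("male", "short"): 400,
}

# Assumed true response rates: they depend on sex only, never on hair length.
RESPONSE_RATE_BY_SEX = {"female": 0.8, "male": 0.1}


def response_rate(hair: str) -> float:
    """Expected response rate among all patients with the given hair length."""
    patients = sum(n for (_, h), n in POPULATION.items() if h == hair)
    responders = sum(
        n * RESPONSE_RATE_BY_SEX[sex]
        for (sex, h), n in POPULATION.items()
        if h == hair
    )
    return responders / patients


long_rate = response_rate("long")
short_rate = response_rate("short")

# Consequences of a policy that treats long-haired patients only.
missed_responders = POPULATION[("female", "short")] * RESPONSE_RATE_BY_SEX["female"]
treated_non_responders = POPULATION[("male", "long")] * (1 - RESPONSE_RATE_BY_SEX["male"])

if __name__ == "__main__":
    print(f"Response rate, long hair:  {long_rate:.0%}")
    print(f"Response rate, short hair: {short_rate:.0%}")
    print(f"Expected responders denied the drug: {missed_responders:.0f}")
    print(f"Expected non-responders given the drug: {treated_non_responders:.0f}")
```

In this made-up population hair length looks like a strong predictor (66% against 24%), even though it has no causal role at all. Acting on it would deny the drug to around 80 short-haired women who would have responded, and give it to around 90 long-haired men who would not.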

So should we be giving temozolomide to everyone, or just the high responders? Hegi and Stupp co-authored a paper asking this question in 2015. (34) They wrote that: “By continuing to treat the majority of MGMT unmethylated patients with [temozolomide], we are missing an opportunity to do better.” Let’s assume, for the sake of argument, that in low responders, the narrow effect on median survival is not worth the costs—both financially and in terms of side-effects. Are Hegi and Stupp right that we should withdraw temozolomide for those patients, and instead “try a potentially efficacious new agent”?

The answer depends on how confident we are that MGMT-inactivation is the underlying determinant of high response. If MGMT status is only loosely related to temozolomide response, then withdrawing temozolomide from MGMT-active patients is like refusing to give our new drug to short-haired women. There is one powerful and obvious defense: in the absence of evidence of any superior predictor of temozolomide effect, we have to go with the best we have. Given how good MGMT-inactivation has proven to be as a predictor of response in Hegi’s study and in many subsequent studies, we are confident that there are low and high responders, and using MGMT-inactivity to predict high-response does a lot better than anything else we have. We shouldn’t rest there, but it makes a good starting point.

A second response is to draw on mechanistic evidence. Evidence relating to underlying biological mechanisms has been relegated to the bottom of every hierarchy in which it appears. Elsewhere, EBM proponents have argued that it doesn’t count as evidence at all. The idea is that work by Malcolm Stevens and his ilk is not admissible evidence. It is important, of course, for creating new drugs and innovations which can then be tested. But only the trials of those new drugs in humans count as evidence.

This position might make sense when it comes to the big question every EBMer wants to ask: what is the average treatment effect, and is that effect better than nothing? Just the fact that there’s a well-understood mechanism underpinning how the treatment should work doesn’t mean it in fact will work, as any number of cases have shown. But that is not the question we’re asking here.

When you look at the hair length and biological sex case, it’s obvious that hair length is not the real cause, sex is. Why do you already believe that, even without any details about the drug? Because sex is the kind of thing that affects whether or not drugs work, and hair length isn’t. In other words, there’s no plausible mechanism by which the length of your hair can affect whether this drug works. The absence of a plausible mechanism can be killer counter-evidence. But there’s always going to be a deeper explanation than just biological sex of why the drug works for women and not men: structural, genetic or hormonal features that women have and men lack (or vice versa) and that affect whether the treatment works. Knowledge of the mechanism and evidence that the availability of a particular protein, like AGT, interrupts the way a drug works, then, could act as evidence that the correlation between MGMT-inactivation and responsiveness to temozolomide is causal. These two sources of evidence synergistically interact. Together, they provide a much stronger reason to believe that MGMT-inactive patients are high responders and MGMT-active patients are low responders than we could account for by looking at the quality of both separately. Two potentially “low-quality” sources can mutually reinforce each other to the extent that their conclusions are extremely compelling. Just as the mechanistic evidence from Stevens and others acts as evidence that Hegi wasn’t data mining but was testing relevant correlations, it also acts as evidence that the correlation is more likely to be causal, and less likely to be due to some third variable which is related to both MGMT and temozolomide response.

The lesson for analysts of evidence from temozolomide, and from the relationship between the work of Stevens, Stupp and Hegi, is the importance of interaction. Hierarchies evaluate every study individually, on its own methodological merits. In the temozolomide case, this would have been a disastrous mistake. Temozolomide is not likely to be unique in this respect. It’s through the light that Hegi’s study shines on Stupp’s data that we see the power of temozolomide as a treatment for glioblastoma. It’s in the relationship between her work and Stevens’ mechanistic evidence and detailed study of the causal process by which temozolomide works that we can be confident in predicting high and low responses to temozolomide. It’s through understanding evidence interaction that we can turn independent studies of varying levels of “quality” into a strong, compelling picture that clinicians can use to predict effects, and patients can use in making major decisions about their care plan. There was no room and no need to evaluate each study independently in the bargain. It’s not really possible to say anything useful or meaningful about Hegi’s work without the evidence from Stevens, Esteller, and others. In the temozolomide case, a hierarchical approach to evidence wouldn’t just get its judgment of temozolomide and its recommendations to doctors, patients and regulators wrong. It would get the way medical evidence works wrong.


What is the value of Stupp’s Trial?


RCTs are called the ‘gold standard’ of clinical evidence. They have been equated with a ‘high level’ of evidence, with ‘high quality’ evidence and with ‘strong’ evidence. As they go, Stupp and his colleagues produced a decent trial. It was large, the control and experimental groups were well-matched for variables believed to affect outcomes, and the analysis was performed by the book for Evidence Based Medicine’s standards. It was not perfect—the trial couldn’t be blinded, and a trial can always be larger. But if anything is, it’s a good candidate for the ‘high-quality’ and ‘strong evidence’ labels that hierarchies assign.

But evidence is always evidence for or against some theory. The theory Stupp’s trial was designed to evaluate was that temozolomide would have a greater effect on median survival than radiotherapy alone. It provided evidence for that claim—an increase in median survival, albeit a modest one. But given what we know from Hegi’s reanalysis and the other studies which have shown the power of MGMT as a predictor, is this useful information at all? Does anyone benefit from knowing that on average in a mixed population of both high and low responders, MGMT-inactive and MGMT-active patients, temozolomide outperforms radiotherapy alone?

It seems not. Patients don’t need this information. They need to know whether they have an MGMT-inactive tumour or not, and then they need the information Hegi’s analysis gives them about the difference temozolomide is likely to make for them. Doctors also don’t need it. Once they have Hegi’s information, they know that they need to test their patients’ tumour genetics and treat accordingly. Regulators might need the information. Some regulators license a treatment for first-line use only once it has passed Phase III large-scale RCT trials and shown a positive average treatment effect. But this seems like a mistaken approach to regulation, putting too much faith in EBM’s hierarchical approach. It’s Hegi’s reanalysis, more than Stupp’s trial, that shows the potential of temozolomide, and that is what regulators should be interested in. Should state healthcare providers and insurers want this information? Again, only if they’ve put too much stock in the ‘all-or-nothing’ approach. Why should a treatment need to be provided by the state or an insurer for every glioblastoma patient, when it predictably only works well for an identifiable few? Why should they refuse a treatment to a predictable high-responder because the average effectiveness in groups that include low-responders is marginal? Basing the decision to pay for or not to pay for temozolomide on whether the patient has glioblastoma, instead of whether they have MGMT-inactive glioblastoma, is misguided in the light of Hegi’s work, and would be an error only justified by relying on the average treatment effect data from Stupp’s study instead.

In fact, once the information from Hegi’s study is available, nobody really needs the information about the average treatment effect in the broad population provided by Stupp’s trial. Decisions made on the basis of that data will be worse decisions than decisions made on Hegi’s figures alone, or by drawing on Hegi’s figures and the other studies that have asked how big a treatment effect one can expect for MGMT-active vs. MGMT-inactive patients. That ‘14.6 months’ prognosis for temozolomide patients, and that ‘increase of 2.5 months compared to radiotherapy alone’ estimate of the treatment effect, are not clinically useful once we know that there is predictable variation in effect size. Another firm lesson from the temozolomide case, then: once there’s evidence of predictable variation in effects, information about the average effect is not valuable.

Does Stupp’s trial provide high-quality evidence? It provided a high-quality data set for Hegi to analyse and provide evidence for a different claim about effect sizes. But providing reliable data is not the same as providing strong, high-quality evidence. Stupp’s trial provided pretty strong evidence for a claim which is not clinically important. Whether it also provided ‘high-quality’ evidence depends on what you think is necessary for ‘quality’. Is it enough to provide strong evidence for a claim whether or not that claim has clinical significance? Or to be high-quality evidence, does a study’s result need to be applicable and useful as well as reliable? If your approach to medicine says that you should pay attention to the quality evidence first (or only the quality evidence), as EBM has in the past, then you had better hope that clinical importance is part of the notion of quality.

RCT evidence will still be clinically important where the effects of a treatment are homogeneous—where patients have the same or similar responses to the treatment. That usefulness, though, is contingent on the evidence that the effect is homogeneous. So the RCT evidence is not useful on its own: the evidence about variation (i.e. that there isn’t any) is necessary before the RCT evidence becomes clinically significant. Where the effects are heterogeneous—varied, as in the temozolomide case—it’s the evidence about variation and the effect-sizes achieved in those groups which is important. There, the RCT evidence is useful only as a data-set to be interrogated, and is not necessarily especially useful just because it is RCT data, because the methods used to extract useful information from that data set don’t preserve the benefits of randomisation. Either way, RCT evidence alone is not useful. The evidence concerning the distribution of effects is needed to make the RCT data clinically important.

So we have seen that it makes little sense to evaluate evidence from lower down the hierarchy in isolation. The picture that comes together of temozolomide is compelling because of the interaction of a range of sources. Nor would it make sense to judge RCT evidence as high quality in isolation. RCT evidence is clinically significant only where other evidence, not from randomised studies directly, makes it so.

The final case in which RCT evidence alone might be clinically useful is where there is no evidence at all about variation. If doctors simply don’t know whether a treatment’s effects are homogeneous or heterogeneous, then they can do no better than forecasting an average treatment effect as seen in a clinical trial. This is reasonable behavior, but hardly constitutes either a strong or a high-quality evidence base for their prediction. Given that the fundamental task of a clinician in advising their patient about treatment choice is to recommend the treatment most likely to produce the best prognosis for them, the absence of evidence about the distribution of treatment effects from an evidence base makes it a weak, low-quality basis for their work. If Evidence Based Medicine seeks to improve clinical practice and ensure doctors base the core of their work on strong evidence, then evidence about variation must be a central plank.

In the temozolomide case, there are two consequences of under-appreciating variation. We miss the fact that there are a group of patients for whom the treatment works far better than the average effect predicts. We also miss the fact that there are a group for whom the treatment is more likely an imposition than an improvement. As Stupp and Hegi noted in their 2015 editorial, “Patients with unmethylated GBM are in need of better treatments. This population not only offers the opportunity to test novel treatments but actually requires—more than other patients—that they be offered innovative therapies.”(35) MGMT-active patients are poorly served by the one-size-fits-all mindset encouraged by relying only on RCT evidence. The opportunity to test much-needed new approaches is being missed due to the rationale that temozolomide improves survival on average, so must be available to all. Taking Stupp and Hegi’s work seriously means admitting that temozolomide isn’t the best answer for every patient. Part of the reason that glioblastoma remains so dismal is that the now-standard treatment is not suitable for over half its recipients.




Temozolomide treatment became the norm for glioblastoma sufferers. A review by Derek Johnson and Brian O’Neill in 2012 of “the temozolomide era” in glioblastoma treatment pinpointed 2005 as the sea-change in treatment, caused by “a pivotal phase III clinical trial which showed that temozolomide chemotherapy plus radiation was more effective than radiation alone”—Stupp’s trial. (36) No other study impacted the field so decisively.

Their review found that median survival has indeed increased since temozolomide treatment took hold. The few months of increase in median survival found in Stupp’s trial is echoed in the wider population data. But glioblastoma remains a dismal disease. Young patients with glioblastoma have a fighting chance. Median survival for 20-29 year olds is pushing towards three years. But for older patients—including many excluded from the scope of Stupp’s original study—the figures are grim. Median survival for the over 80s is 5.6 months from diagnosis.

Johnson and O’Neill criticised the fact that the figure of 14.6 months median survival has become the starting point for predictions of survival time. Rightly, they point out that the oldest patients included in the trial were 70 years old. The average age of a glioblastoma sufferer is 60. (37) Their data alone indicates that age is a stronger predictive factor than any other in survival with glioblastoma. They warn that clinicians, relying on RCT results in narrow healthy populations, consistently overestimate the prognosis for glioblastoma patients and cancer patients in general. This can be devastating for patients, friends and family, who then are unprepared for an unexpectedly rapid decline. In Stupp and co.’s defense, they never claimed to be providing a figure for these prognostications. They were answering the questions that the EBM model prioritises: Does it work better than nothing, on average, in a pre-specified population? What is the average effect? But their headline figure has clearly caught on, while other information remains systemically undervalued. A quick search of the glioblastoma pages of cancer support sites shows ‘14.6 months’ is repeated ubiquitously. The answers clinicians need in order to improve their forecasts are minimized and omitted, in no small measure because outcomes research like Johnson and O’Neill’s valuable review features at the bottom of evidence hierarchies—if it appears at all.

Beyond temozolomide, research findings on glioblastoma have been dismal, too. Roger Stupp was central in the development of another therapy licensed by the FDA, which uses tumour treating fields of electromagnetic waves. (38) The “Stupp Protocol”, based on the treatment regimen from the 2004 trial, became and remains the global standard for glioblastoma care. Despite his personal successes, Stupp is disappointed by the lack of progress. In a 2013 interview, he pointed to the slew of negative research findings in the last decade. (39) Intensifying and extending temozolomide treatments did not improve results. Nor did adding new drugs, despite all their earlier promise. Responding to yet another negative finding in 2014, he wrote: “The path to improved treatment of glioblastoma remains paved with disappointments and unconfirmed promises.”(40)

Stupp and his team push on, though, investigating new approaches to treatment. Stupp sees the breakthrough with respect to the role of MGMT in glioblastoma treatment as a model for future work. In 2011, he said: “I think the challenge is really that we still, not only in glioblastoma, but in oncology at large, treat the majority of patients with a one-size-fits-all approach. I think the challenges are to be more individualized, to be able to identify the patient who should be treated with chemotherapy A vs chemotherapy B”.(41) For all his work with the Evidence Based Medicine gold-standard of large-scale trials, Roger Stupp has witnessed little in the way of advances. He is not walking away from the RCT, but seeking to supplement it with a broader base of evidence that can allow oncologists and pharmacologists to ask different kinds of questions. He is interested, for instance, in how glioblastomas change when they return. Investigational studies can try to answer this question, comparing tissue samples at the genetic and molecular levels from surgically removed first tumours with subsequent tumour samples from patients whose cancer has returned. These studies will always rank at the lowest quality levels of EBM’s hierarchies, but could nonetheless be integral to advancing glioblastoma treatment.

In 2009, Stupp, Hegi and their colleagues reviewed the outcomes of all their original trial participants, five years after the publication of their trial, and up to nine years after the initial diagnosis for some of the participants. (42) All but eight patients in the control group had died (97%). In the temozolomide group, 33 of the 287 who enrolled had survived. Prospects for glioblastoma were indeed dismal, but the cunning molecule that began in Malcolm Stevens’ lab at Aston University in the 1980s had saved dozens of lives.


  • Aitken, Robert D. ‘Treatment of Brain Tumors: Letters to the Editor’. New England Journal of Medicine 352 (2 June 2005).
  • Arney, Kat, and Malcolm Stevens. ‘The Story of Temozolomide (Video Transcript)’. Cancer Research UK – Science Blog, 18 July 2013.
  • DeAngelis, Lisa M. ‘Chemotherapy for Brain Tumors — A New Beginning’. Editorial, 8 October 2009.
  • Esteller, Manel, Jesus Garcia-Foncillas, Esther Andion, Steven N. Goodman, Oscar F. Hidalgo, Vicente Vanaclocha, Stephen B. Baylin, and James G. Herman. ‘Inactivation of the DNA-Repair Gene MGMT and the Clinical Response of Gliomas to Alkylating Agents’. New England Journal of Medicine 343, no. 19 (9 November 2000): 1350–54. doi:10.1056/NEJM200011093431901.
  • Hegi, Monika E. ‘Laboratory of Tumor Biology and Genetics: Department of Clinical Neuroscience – CHUV’. Accessed 5 June 2017.
  • Hegi, Monika E., Annie-Claire Diserens, Sophie Godard, Pierre-Yves Dietrich, Luca Regli, Sandrine Ostermann, Philippe Otten, Guy Van Melle, Nicolas de Tribolet, and Roger Stupp. ‘Clinical Trial Substantiates the Predictive Value of O-6-Methylguanine-DNA Methyltransferase Promoter Methylation in Glioblastoma Patients Treated with Temozolomide’. Clinical Cancer Research: An Official Journal of the American Association for Cancer Research 10, no. 6 (15 March 2004): 1871–74.
  • Hegi, Monika E., Annie-Claire Diserens, Thierry Gorlia, Marie-France Hamou, Nicolas de Tribolet, Michael Weller, Johan M. Kros, et al. ‘MGMT Gene Silencing and Benefit from Temozolomide in Glioblastoma’. New England Journal of Medicine 352, no. 10 (10 March 2005): 997–1003. doi:10.1056/NEJMoa043331.
  • Hegi, Monika E., and Roger Stupp. ‘Withholding Temozolomide in Glioblastoma Patients with Unmethylated MGMT Promoter—still a Dilemma?’ Neuro-Oncology 17, no. 11 (1 November 2015): 1425–27. doi:10.1093/neuonc/nov198.
  • Higgins, J. P. T., and S. Green, eds. Cochrane Handbook for Systematic Reviews of Interventions. Version 5.1.0. The Cochrane Collaboration, 2011. Accessed 01/04/15.
  • Johnson, Derek R., and Brian Patrick O’Neill. ‘Glioblastoma Survival in the United States before and during the Temozolomide Era’. Journal of Neuro-Oncology 107, no. 2 (1 April 2012): 359–64. doi:10.1007/s11060-011-0749-4.
  • Newlands, E. S., G Blackledge, J A Slack, C Goddard, C J Brindley, L Holden, and M F Stevens. ‘Phase I Clinical Trial of Mitozolomide.’ Cancer Treatment Reports 69, no. 7–8 (1985): 801–5.
  • Newlands, E. S., M. F. G. Stevens, S. R. Wedge, R. T. Wheelhouse, and C. Brock. ‘Temozolomide: A Review of Its Discovery, Chemical Properties, Pre-Clinical Development and Clinical Trials’. Cancer Treatment Reviews 23, no. 1 (1 January 1997): 35–61. doi:10.1016/S0305-7372(97)90019-0.
  • Sansom, Clare. ‘Temozolomide – Birth of a Blockbuster’. Chemistry World July 2009 (26 June 2009): 48–50.
  • Seiter, Karen. ‘Treatment of Brain Tumors: Letters to the Editor’. New England Journal of Medicine 352 (2 June 2005).
  • Stevens, Malcolm. ‘Malcolm Stevens – School of Pharmacy’. Accessed 5 June 2017.
  • Stevens, Malcolm F. G. ‘Temozolomide: From Cytotoxic to Molecularly Targeted Agent’. In Cancer Drug Design and Discovery (Second Edition), edited by Stephen Neidle, 145–64. San Diego: Academic Press, 2014. doi:10.1016/B978-0-12-396521-9.00005-X.
  • Stewart, Bernard W, and Christopher P Wild. World Cancer Report 2014: World Health Organisation, International Agency for Research on Cancer. Geneva, Switzerland: WHO Press, 2015.
  • Stupp, Roger. ‘Bevacizumab for Newly Diagnosed Glioblastoma: More Disappointment’. PracticeUpdate, 6 March 2014.
  • Stupp, Roger. ‘My Approach to Glioblastoma’. PracticeUpdate, 9 August 2011.
  • Stupp, Roger, Monika E Hegi, Warren P Mason, Martin J van den Bent, Martin JB Taphoorn, Robert C Janzer, Samuel K Ludwin, et al. ‘Effects of Radiotherapy with Concomitant and Adjuvant Temozolomide versus Radiotherapy Alone on Survival in Glioblastoma in a Randomised Phase III Study: 5-Year Analysis of the EORTC-NCIC Trial’. The Lancet Oncology 10, no. 5 (May 2009): 459–66. doi:10.1016/S1470-2045(09)70025-7.
  • Stupp, Roger, Warren P. Mason, Martin J. van den Bent, Michael Weller, Barbara Fisher, Martin J.B. Taphoorn, Karl Belanger, et al. ‘Radiotherapy plus Concomitant and Adjuvant Temozolomide for Glioblastoma’. New England Journal of Medicine 352, no. 10 (10 March 2005): 987–96. doi:10.1056/NEJMoa043330.
  • Stupp, Roger, Eric T. Wong, Andrew A. Kanner, David Steinberg, Herbert Engelhard, Volkmar Heidecke, Eilon D. Kirson, et al. ‘NovoTTF-100A versus Physician’s Choice Chemotherapy in Recurrent Glioblastoma: A Randomised Phase III Trial of a Novel Treatment Modality’. European Journal of Cancer 48, no. 14 (September 2012): 2192–2202. doi:10.1016/j.ejca.2012.04.011.
  • Walker, Michael D., Eben Alexander, William E. Hunt, Collin S. MacCarty, M. Stephen Mahaley, John Mealey, Horace A. Norrell, et al. ‘Evaluation of BCNU And/Or Radiotherapy in the Treatment of Anaplastic Gliomas’. Special Supplements 112, no. 2 (7 May 2009): 333–43. doi:10.3171/jns.1978.49.3.0333@sup.2010.112.issue-2.
  • Zoeller, Lauren, and Roger Stupp. ‘Glioblastoma: Despite All the Disappointment, There Has Been Progress’. PracticeUpdate, 7 November 2013.