The Dismal Disease: Temozolomide and the Interaction of Evidence

“I think the challenge is really that we still, not only in glioblastoma, but in oncology at large, treat the majority of patients with a one-size-fits-all approach.”

— Roger Stupp ⁽¹⁾

Blockbuster drugs are rare. To be a blockbuster, a drug must generate over $1bn in sales in a single year. There are thousands of drugs on the market, but only a few hundred blockbusters. Yet between them, they make up more than a third of pharmaceutical revenues. Temozolomide cleared the $1bn threshold in 2008, earning $1.02bn. ⁽²⁾ It is a chemotherapy drug used to treat glioblastoma multiforme, one of the most common and aggressive brain tumours.

Its conspicuous success at the height of the “Evidence-Based Medicine” (EBM) movement makes it an intriguing case study of how ideas of what characterizes strong, high-quality medical evidence affect practice. EBM advanced a vision of medical evidence which prioritizes Randomized Controlled Trials (RCTs) and de-emphasizes non-randomized studies, mechanistic evidence, and clinical experience. In particular, EBM employs hierarchies of evidence.⁽³⁾ These hierarchies rank the strength and quality of evidence based on the methodology of the study. RCTs (and meta-analyses and systematic reviews of RCTs) dominate the highest levels of these hierarchies. Other forms of evidence are relegated or excluded entirely. The quality of each study is assessed individually.

For a drug like temozolomide to thrive during EBM’s heyday, one might suspect that the evidence for it is characteristically strong and high-quality. Far from it. Temozolomide leapt into prominence based largely on one trial which showed a marginal effect, coupled with a re-analysis using techniques which EBM has downgraded and denigrated. This chapter demonstrates how a range of sources of evidence, far from the pinnacle of EBM’s hierarchies, combined to create a well-founded understanding of how, and for whom, temozolomide works. It also shows how reliance on the EBM model has undermined attempts to translate that understanding into practice. The hierarchical model is misguided because it assesses the quality and strength of every individual piece of evidence in isolation. Evidence interacts. The combination of two or more sources which individually would be far from compelling can synergize to create a picture clearer and more compelling than the sum of their individual contributions. The hierarchical approach is far too limiting: in the kinds of questions it allows researchers to ask and answer, in the kinds of evidence considered, and in failing to treat evidence bases as an interacting whole.

Temozolomide and Glioblastoma

Glioblastoma multiforme is a dismal disease. It is extremely aggressive and poorly understood. The reasons why it occurs are opaque. It is also amongst the most common adult brain tumours. The prognosis is dire. Just 3% of patients survive to 5 years from diagnosis. ⁽⁴⁾ Most die within a year. It spreads quickly and recurs time and again. Rapid degeneration, a lack of treatments, and high recurrence rates combine to make complete recovery unlikely.

Into this picture stepped Malcolm Stevens and his team with their “cunning molecule”⁽⁵⁾, temozolomide. Stevens’ team at Aston University were searching for molecules with properties that could be harnessed in new and unanticipated ways. By 1987, they’d developed the molecule which became temozolomide. Phase I and II studies, which had reported by 1997, were a success. Temozolomide produced substantial improvements in survival figures for a select few glioblastoma patients. Better still, it avoided the disastrous effects on bone marrow of some previous contenders. “One or two Phase I patients saw almost biblical cures”⁽⁶⁾, recalled Sally Burtles, Cancer Research’s director of drug development.

There was excitement for a potential weapon against glioblastoma. But few promising treatments pass the third stage: large-scale randomized controlled trials. Professor Roger Stupp, the Swiss neuro-oncologist who spearheaded temozolomide’s Phase III trial, darkly jokes: “I often state that I stay in drug development because I like to use the agents when they are still considered effective or promising.”⁽⁷⁾ In 2000, Stupp coordinated a large multi-centre trial, studying whether temozolomide in combination with radiotherapy improved median survival for newly-diagnosed glioblastoma patients, compared to radiotherapy alone. ⁽⁸⁾ These are favourable terms for a trial: there was no additional placebo or comparator for the control group, just radiotherapy. The trial ticked most other Evidence-Based Medicine checkboxes: a relatively large sample size (573 patients), thorough checks for balanced baseline characteristics, and intention-to-treat analysis. This would be a crucial test of whether temozolomide could deliver on its promise.

The patients in the experimental arm received six weeks of radiotherapy each weekday, along with a daily dose of temozolomide. A four-week break followed, before up to six courses of temozolomide treatment. Each course lasted 28 days, with 5 days of temozolomide followed by 23 days off.

But glioblastoma doesn’t wait. Only 85% of the experimental group patients made it through the initial six-week period. Only 78% were willing or able to continue with the trial. Having lost nearly a quarter of patients to the rapid advance of the disease, 53% then discontinued temozolomide treatment before receiving the full 6 rounds. 105 patients received the full course of the drug. Most dropped out due to the progression of their disease, and a few due to toxic side-effects.

As ever with glioblastoma, the results make for tough reading. By 28 months, 480 of the original 573 patients had died: 84%. Median survival in the control group was 12.1 months. For those in the temozolomide group, it was 14.6 months. Adding temozolomide, then, was associated with an increase in median survival of around two and a half months.

Given the serious side-effects, the chance of opportunistic infections, and the burden of the hospital regime, temozolomide might not seem an appealing proposition. Worse, for insurers and healthcare providers this increase in life expectancy might look too small to justify the costs involved in treatment. Temozolomide had faced a trial partly designed in its favour, and come out with marginal results. For most EBM analyses, this result would probably mean little was heard of Malcolm Stevens’ team’s “cunning molecule” again.

In the report of the trial, published in the New England Journal of Medicine in 2004, Stupp and his colleagues sound relatively optimistic, calling it: “a statistically significant and clinically meaningful survival benefit.”⁽⁹⁾ One factor underpinning this interpretation was an increase in a secondary outcome, 2-year survival rates, associated with temozolomide. 10% of control group patients were alive after 2 years, but 27% of temozolomide patients reached the 2-year mark. This large increase might be surprising given that median survival remains so short. But the combination of these findings suggests an alternative explanation. A minority of patients were experiencing a big effect on life expectancy. That effect was not being reflected in the median survival statistic because the effect was swamped by the majority who didn’t gain much if anything. If half or more of the patients weren’t benefitting, the median survival wouldn’t budge.
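The way a large benefit for a minority can hide behind a modest shift in the median can be seen in a toy cohort. The sketch below is purely illustrative: the ten survival times, the choice of three responders, and the 14-month gain are all invented assumptions, not trial data.

```python
from statistics import median

# Hypothetical survival times in months for a ten-patient control arm.
control = [4, 6, 8, 10, 12, 12, 14, 16, 18, 26]

# Assumption: three patients (positions 2, 5 and 8) are high responders
# who each gain 14 months; the other seven gain nothing.
responders = {2, 5, 8}
gain = 14
treated = [t + gain if i in responders else t for i, t in enumerate(control)]

def two_year_rate(times, cutoff=24):
    """Fraction of patients surviving at least `cutoff` months."""
    return sum(t >= cutoff for t in times) / len(times)

median_gain = median(treated) - median(control)
rate_gain = two_year_rate(treated) - two_year_rate(control)

print(f"median: {median(control)} -> {median(treated)} months")
print(f"two-year survival: {two_year_rate(control):.0%} -> {two_year_rate(treated):.0%}")
```

In this invented cohort each responder gains 14 months, yet the median moves by only 3, while the share of patients alive at two years triples. The median is insensitive to how much the responders gain; the two-year rate is not.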

On the face of it, Stupp’s trial gave not so much an evidence base for temozolomide, as evidence that something more complex was occurring. The interaction between their two findings produced a result more interesting and significant than either alone, and answered a question. Most promisingly, if temozolomide greatly helped some patients, they might be identified, and temozolomide might bring them major benefits.

MGMT and Hegi’s re-analysis

When Stupp’s team published their results, they came alongside an editorial by prominent neuro-oncologist Dr. Lisa DeAngelis. ⁽¹⁰⁾ She declared “a new beginning” in the treatment of brain tumours. She focused on temozolomide as “a substantial step forward” to address the “dismal outcomes” of glioblastoma. She lamented that median survival for glioblastoma has advanced very little since the 1970s, when the Brain Tumor Study Group reported median survival at 10 months. ⁽¹¹⁾

Her enthusiastic response was not shared by all. For instance, one response by Dr. Robert Aitken is damning of DeAngelis’s optimism: “I am surprised that Dr. DeAngelis […] chose “A New Beginning” as the subtitle of her editorial […] I would have chosen the more sanguine expression “Is That All?”.”⁽¹²⁾

He argues that a slim 2.5-month increase in median survival, and the small advance on the 1978 median survival figures, is “not a very auspicious “beginning.””

Had the trial results been published alone, this response might have taken root as the default view of temozolomide. But Stupp’s study was packaged alongside a reanalysis of the trial data performed by Monika Hegi and her colleagues. ⁽¹³⁾ Hegi crafted a translational study to investigate a different question: ‘Which patients benefit from temozolomide?’ To understand her study, and the reasons for performing it and taking its findings seriously enough to cast a new and positive light on the trial data, we must investigate how temozolomide works.

Temozolomide is an alkylating agent. It affects DNA. Damaging a cancerous cell’s DNA can be a powerful way to inhibit tumour growth. Cancer cells divide and spread quickly. Attacking their DNA and introducing errors in transcription could slow or halt the tumour growth and even cause cell death.

Our bodies have mechanisms that repair damage to DNA. The protein AGT counteracts some of the effects of alkylating agents. This protein is coded for by the MGMT gene. It had already been hypothesized that AGT would repair the damage that temozolomide did in tumours. Stevens and his colleagues foresaw this issue. In their 1997 paper, they named AGT repair of the DNA lesions created by temozolomide as the primary challenge to the drug working. ⁽¹⁴⁾

When it fixes the damaged DNA, AGT is consumed in the process—it’s a suicide enzyme. Cells need to produce more to repair ongoing damage. The MGMT gene which allows more AGT to be produced can be “switched off” in some cells. Methylation effectively silences specific genes. In some patients, the MGMT gene promoter in the cancerous cells is methylated. Those cells can’t synthesize AGT to repair the damage done by temozolomide, and should be more vulnerable to temozolomide.

These were the high responders: a fortunate group of patients who might foreseeably benefit much more than average. But, if a patient’s tumour cells were producing large quantities of AGT—the MGMT gene was active—then the damage temozolomide did might be swiftly repaired. For those unfortunate patients, temozolomide was not likely to work: they were the low responders.

Hegi and her colleagues tested tumour tissues from the patients in Stupp’s trial, and determined which had the MGMT gene methylated. They re-analysed the trial data, performing a subgroup analysis according to whether patients were MGMT-active or inactive. They hypothesized that the difference in median survival between the control and experimental group would be much higher amongst those patients who were MGMT-inactive than amongst those who were MGMT-active. Just under half of the patients whose epigenetics could be analyzed happened to fall into the predicted high-responder category, and just over half were predicted to be low responders. ⁽¹⁵⁾ The MGMT-active and inactive patients were well-balanced between the control and experimental groups.

The first intriguing result was that whether the patient received temozolomide or not, the median survival in the MGMT-inactive group was higher than in the MGMT-active group. Going in, the hypothesis was only that MGMT status would matter for outcomes when given temozolomide. But MGMT-inactive patients did better across the board. There are many reasons why this could have happened. Perhaps MGMT-inactive tumours are a distinct, less virulent variant of glioblastoma. Maybe radiotherapy was also doing DNA damage which AGT was helping to repair, or MGMT-inactivation correlated with decreased function in other aspects of DNA-repair, making those tumours more vulnerable to other treatments. Finally, after disease progression, patients in the control group were offered temozolomide treatment—so improved survival figures might be due to temozolomide in both groups. ⁽¹⁶⁾

The headline result for temozolomide, though, was the gulf between the effects of temozolomide on median survival across the two subgroups. When viewed as an undifferentiated block, temozolomide patients’ median survival was 14.6 months, compared to 12.1 in the radiotherapy-only group. When only MGMT-inactive patients (the prospective high responders) were considered, the temozolomide group’s median survival was 21.7 months, with the control at 15.3. Temozolomide treatment seemed to have increased the median survival by over half a year. By contrast, the low responders’ outcomes were well within the margin of error of temozolomide having no effect at all. The MGMT-active temozolomide group had a median survival of 12.7 months, compared to 11.8 months in the control. As Malcolm Stevens’ team had anticipated, the desired effects were negated where the cells were producing an AGT antidote to temozolomide’s tumour poison.

But, as with the original trial, it is the two-year survival rates which provide the starkest comparison. Again, MGMT inactivity alone is important. Of the MGMT-inactive patients in the control group, 22.7% were alive at the two-year mark (most, by then, having taken temozolomide too). In the MGMT-active patients who didn’t receive temozolomide, none survived at two years. Compared to no survivors in the control follow-up, 13.8% of the MGMT-active temozolomide patients were alive at two years. But this was still way below the benchmark set by the control patients who were MGMT-inactive. On this measure, MGMT status was clearly extremely important. But the most impressive figure was the survival rate for the MGMT-inactive patients who received temozolomide, the core high response group—46% survived to two years.
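The subgroup figures quoted above are easier to compare side by side. The short sketch below simply tabulates them and takes the differences between arms; the numbers are those given in the text, while the dictionary layout and the label `tmz` are illustrative choices.

```python
# Figures as reported in the text (medians in months, two-year survival in %).
results = {
    "MGMT-inactive": {"median": {"tmz": 21.7, "control": 15.3},
                      "two_year": {"tmz": 46.0, "control": 22.7}},
    "MGMT-active":   {"median": {"tmz": 12.7, "control": 11.8},
                      "two_year": {"tmz": 13.8, "control": 0.0}},
}

def gain(arms):
    """Difference between the temozolomide arm and the control arm."""
    return round(arms["tmz"] - arms["control"], 1)

gains = {group: {measure: gain(arms) for measure, arms in measures.items()}
         for group, measures in results.items()}

for group, g in gains.items():
    print(f"{group}: +{g['median']} months median, +{g['two_year']} points at two years")
```

These are raw differences between reported point estimates, with no confidence intervals attached, so they show the shape of the result rather than its statistical weight.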

Subgroups and Meta-Analysis

How does a hierarchy of evidence judge Hegi’s study? That depends on what kind of study it is. The data comes from an RCT. But the study itself is certainly not an RCT. It could be conceptualized in two ways—as a subgroup analysis or as an odd form of meta-analysis. Although many of EBM’s hierarchies rank meta-analysis as the highest level of evidence, they definitely don’t mean this kind of meta-analysis. No matter which way you look at it, Hegi’s study would be considered weak, low-quality evidence by EBM hierarchies.

Hegi’s work is most easily categorized as a subgroup analysis. Take the data, divide it into two subgroups, and compare their results. This kind of study is not randomized. As the Cochrane Collaboration put it: “Subgroup analyses are observational by nature and are not based on randomized comparisons.”⁽¹⁷⁾ Patients weren’t randomly allocated to have or not have the MGMT gene activated. Features of the patients other than their gene activity and their treatment group could be correlated with which subgroup they are in. Other features of the tumour might correlate with it being MGMT-inactive, as the ‘two kinds of glioblastoma’ hypothesis runs. According to EBM’s own logic, the best way to achieve balanced comparisons in which nothing is correlated with treatment is to randomize. Introducing subgrouping to the mix removes that benefit. Hegi’s study must drop down the hierarchy to a low ranking at best.

Many EBM proponents are skeptical of subgroup analyses. They are right to be skeptical in general. There are many ways to abuse subgroup analysis. Perhaps the most pernicious is through data mining. If you look at dozens (or even hundreds or thousands) of subgroups, you’ll eventually stumble upon one in which the treatment comes back showing a significant effect. If you can parlay that subgroup into a viable market for the drug through some plausible enough concocted story, you can pass off the result as a success. The treatment isn’t ineffective; it’s just choosy.
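How quickly data mining produces a spurious ‘success’ can be put in rough numbers. The sketch below assumes, for simplicity, that the subgroup tests are independent, that the drug truly has no effect in any subgroup, and that the conventional 5% significance threshold is used. Real subgroups overlap, so the figures are indicative only.

```python
def chance_of_spurious_hit(num_subgroups, alpha=0.05):
    """Probability that at least one of `num_subgroups` independent tests
    is 'significant' at level `alpha` when the drug has no effect in any."""
    return 1 - (1 - alpha) ** num_subgroups

for k in (1, 14, 100):
    print(k, round(chance_of_spurious_hit(k), 3))
```

Under these assumptions, testing fourteen subgroups already makes a false positive more likely than not, and a hundred makes one all but certain. A single, pre-specified subgroup carries no such inflation.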

On the face of it, there’s lots to worry about in the temozolomide case. Stupp’s study was funded by Schering-Plough pharmaceuticals, who stood to make a lot of money. Stupp’s own career was propelled forward as the man responsible for a breakthrough on a previously intractable disease. Hegi’s analysis was performed separately after the fact, not built into the study from the outset. Even the way things were presented may rankle: the study was coupled with the reanalysis in the same edition of the same journal, ensuring that the positive spin was immediately received.

But just because subgroup analysis can be abused does not mean that no subgroup analyses can provide important information. The same potential for manipulation and the same problematic incentives are very much present in randomized trials, as the next chapter shows. This alone does not warrant downgrading all subgroup analyses—rather, we must go case-by-case and see whether each instance is credible. Hegi et al.’s analysis seems just that.

First, Hegi was not testing various subgroups, hunting indiscriminately for correlations. She chose the specific gene methylation and tested only that. Second, she had good reasons for her choice which were clearly formulated and explained beforehand by the drug’s discoverers, amongst others. Third, commercial interests might be actively harmed by her work. Hegi concludes: “patients whose tumors are not methylated at the MGMT promoter appear to derive little or no benefit from the addition of temozolomide to radiotherapy. For these patients, alternative treatments with a different mechanism of action or methods of inhibiting MGMT should be developed.”⁽¹⁸⁾ Temozolomide had left Stupp’s trial with a weak result, but one sufficient—in the absence of viable alternatives—to get it immediately licensed by the FDA. Hegi was suggesting that the treatment barely worked for over half of its prospective market.

But the best defense against data mining is replication. If the association between a gene and an outcome appears in multiple studies on different data sets by independent investigators, it’s increasingly unlikely to be a chance artefact. Not only had Hegi and Stupp found the same effect (at a smaller scale) the previous year when applying the same subgroup analysis to Stupp’s small Phase II study of temozolomide⁽¹⁹⁾, but the link had been shown in a preliminary study by Manel Esteller in 2000. ⁽²⁰⁾

Each of these studies in isolation may be weak evidence in favour of the links between MGMT inactivation, temozolomide and survival. But combined, they form far stronger evidence than the sum of their parts. The interaction of the three sources is synergistic—they support and reinforce one another by removing potential ways to explain their findings away.

Even more importantly, these findings simply could not be achieved through a randomized trial. There is no way to answer the question, ‘Who benefits from temozolomide?’, using an RCT. If we want this kind of information—and the example of temozolomide makes the case that we do—then we will have to turn to non-randomised sources to get it.

Why can’t we just do another RCT to answer this question? The problem can be seen by thinking about Hegi’s design in a different way: a form of meta-analysis. Reimagine her work as taking what was originally one RCT and separating it out into two different RCTs. Let’s call them the Active Trial and the Inactive Trial. In the Active Trial, Stupp’s researchers enrolled only MGMT-active patients, randomized them, and then Hegi and colleagues compared their outcomes. In the Inactive Trial, Stupp’s researchers enrolled MGMT-inactive patients, randomized them, and Hegi and colleagues performed the analysis. Two trials, two sets of results, two average effects of temozolomide compared to radiotherapy alone, in two different populations. There is not a huge difference between this reimagined scenario and what Stupp and Hegi actually did, or between it and commissioning two new trials to answer the question.

But the problem remains. We are now comparing the results of two trials. One trial, Active Trial, shows an average treatment effect of a 0.9-month increase in median survival. Another trial, Inactive Trial, shows a 6.4-month increase. But to find out which factors make a difference in temozolomide treatment, we can’t stop there. We must make some kind of comparison between the results of the trials. Statistical methods of comparing results across trials are called meta-analyses.

This may sound like the EBM hierarchical approach has an out. Reclassify Hegi’s work as a form of meta-analysis, which ranks at the top of many hierarchies, and it can be deemed high-quality. But meta-analysis comes in many forms. EBM proponents are specific about the kind of meta-analyses their hierarchies rank highly. These are meta-analyses performed as part of a systematic review: meta-analyses designed to estimate the average treatment effect in a broad population. In other words, a big RCT made up of other smaller RCTs, not designed to tease out differences between them.

A systematic review gathers together all the data from RCTs relating to a predefined question. When several trials are brought together, the hope is, the researcher can identify the true effect more precisely and confidently because she has a much larger data set. If the studies are comparable enough, the researcher might perform a meta-analysis. There are many techniques available. The most straightforward is to take a weighted average of the effect sizes found in each trial. This pools the data from each trial, to try to answer the same question the individual trials attempted to answer, but with more data, precision and confidence. This kind of meta-analysis provides the evidence that hierarchies deem ‘high-quality’.
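The pooling step can be sketched in a few lines. The text specifies only a weighted average; the version below assumes inverse-variance weights, one common fixed-effect choice, and the two trials and their standard errors are hypothetical.

```python
from math import sqrt

def pooled_effect(effects, std_errors):
    """Inverse-variance weighted average of per-trial effect estimates."""
    weights = [1 / se ** 2 for se in std_errors]
    estimate = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return estimate, pooled_se

# Hypothetical trials: survival gains in months and their standard errors.
effects = [1.0, 3.0]
std_errors = [1.0, 0.5]
estimate, pooled_se = pooled_effect(effects, std_errors)
print(estimate, pooled_se)
```

The more precise trial pulls the pooled estimate towards its own result, and the pooled standard error is smaller than either trial’s. The output is a single number: whatever distinguished the trials from one another has been averaged away.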

Comparing the results of Active Trial and Inactive Trial to investigate who benefits from temozolomide is certainly not a meta-analysis of this kind. We are not amalgamating the results of Active Trial and Inactive Trial, averaging them, and producing a single average result. In fact, doing that would be nothing more than stating the overall results found in Stupp’s trial! Clearly, if Hegi’s study was a meta-analysis, it wouldn’t be one as EBM hierarchies imagine them. Instead, this breed of meta-analysis sees trial results as data points, and characteristics of the trial population as variables of interest. Studying between-trial variation is similar to analyzing within-trial variation. Rather than dividing patients in existing trials into subgroups, such meta-analyses ask whether the effect sizes found in trials are correlated with properties of those trials—such as the proportion of patients who were MGMT-inactive.

Imagine that 10 trials of temozolomide against radiotherapy alone had been performed. It so happens that one trial was done only on MGMT-active patients (Active Trial) and one only on MGMT-inactive patients (Inactive Trial). The others were done on a mixture. Fortunately, the proportion of MGMT-inactive patients in each varied, from 15% up to 75%. We make a simple prediction: the higher the percentage of patients who were MGMT-inactive, the larger the effect the trial will show. We’d expect the largest effect size in Inactive Trial and the smallest in Active Trial.
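This imagined ten-trial comparison amounts to a simple meta-regression, and a sketch makes its structure plain. Everything below is invented for illustration: the case mixes, the trial-level noise, and the stylized assumption that a trial’s effect blends the two subgroup gains (0.9 and 6.4 months) in proportion to its share of MGMT-inactive patients.

```python
# Proportion of MGMT-inactive patients in ten hypothetical trials:
# Active Trial (0.0), Inactive Trial (1.0), and eight mixed trials.
proportions = [0.0, 0.15, 0.25, 0.35, 0.45, 0.5, 0.55, 0.65, 0.75, 1.0]

# Stylized assumption: each trial's effect blends the subgroup gains
# (0.9 and 6.4 months) according to its case mix, plus made-up
# trial-level noise. Real medians need not combine linearly.
noise = [0.3, -0.4, 0.2, -0.1, 0.5, -0.3, 0.1, -0.2, 0.4, -0.5]
effects = [0.9 + (6.4 - 0.9) * p + e for p, e in zip(proportions, noise)]

def least_squares(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = least_squares(proportions, effects)
print(f"effect = {intercept:.2f} + {slope:.2f} * proportion_inactive")
```

Each trial enters the regression as one data point, and the fitted slope estimates how much the effect grows with the share of MGMT-inactive patients. Nothing in the code assigns the proportions at random: they are simply whatever each trial happened to recruit.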

Make no mistake: this 10-trial meta-analysis is an observational study. It takes 10 RCTs as its input, but has not performed any randomization. The variable of interest is the proportion of patients who are MGMT inactive. No one randomized some trials to be 75% inactive and others 15%. The comparison is possible only by happenstance. The randomization that happened inside each individual trial is not particularly important here. If we rejected Hegi’s study because it is non-randomised and then performed randomized trials in a bunch of different populations, and then compared them, this still would not be a randomized study. It would remain an observational study, and suffer all the penalties in EBM’s evidence calculus.

Could we do a randomized study on this topic? Technically, yes. We could recruit a bunch of researchers to plan a bunch of trials, and then randomly assign each to recruit a specific proportion of patients with MGMT-inactivation. But we’d need to conduct a huge number of trials, at a huge cost in resources. That cost would be particularly wasteful because, as a reader of Hegi’s study will see, observational studies like this can provide powerful evidence to answer questions about variation in treatments’ effects.

Of course, the relationship between MGMT-inactivation and temozolomide responsiveness might prove not to be causal. The true reason a fully-fledged RCT cannot be performed here is that MGMT-activation might correlate with some other factor which actually influences glioblastoma survival. As long as those variables aren’t separable, the causal link can’t be proven. For instance, the idea that there might be two (or more) kinds of glioblastoma, one of which is MGMT-inactive, and which is less virulent, might provide an alternative explanation. ⁽²¹⁾ Randomization can do nothing to prevent this. Deriving the underlying data from randomized trials would not remove the relationship between MGMT-activity and tumour type or allow us to attribute causation more precisely.

The idea that another factor might underlie the correlation between MGMT-activity and temozolomide response is only worrying if the relationship between that true causal factor and MGMT-activation is weak. If everyone who has the less virulent tumour has MGMT-inactivation, and everyone who has the more virulent tumour doesn’t, then from a treatment perspective there’s really no important difference between saying ‘Temozolomide works better in MGMT-inactive patients’ and saying ‘Temozolomide works better in patients with the less virulent tumour type’. The difference only matters if new projects are undertaken—for instance, to inhibit MGMT expression in MGMT-active tumours. If the MGMT-activity status was a red herring, those projects will probably fail. If they fail, though, this might just lend further evidence to flesh out the taxonomy of glioblastoma types. That knowledge would ultimately build and clarify our picture of the effects of temozolomide.

But if the association is imperfect, there’s a bigger problem. Imagine we found that a new drug worked well on long-haired patients, and badly on short-haired ones, on average. Suppose we acted on that information, and gave the drug only to long-haired patients. Short-haired patients don’t get the drug—it won’t work very well for them, so it’s not worth the costs to the health service or the side-effects for the patients. But in reality, it’s not hair length that mattered. The drug works very well on women, and very badly for men. Women are more likely to have long hair, men less so. So, there was a correlation between hair length and response to the drug. Because we thought to test hair length and didn’t think to test sex, we end up giving the drug to a bunch of patients for whom it won’t work very well—long-haired men—and denying it to people who really would’ve benefited—short-haired women. The worry is that MGMT-inactivity is like having long hair, and there’s something else analogous to biological sex which is the real reason why temozolomide works for some and not for others, but which is loosely associated with MGMT status.
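The hair-length example can be made concrete with a toy population. The counts below are invented, and the drug is assumed, for simplicity, to work for every woman and for no man.

```python
# Hypothetical population counts, keyed by (sex, hair length).
population = {
    ("female", "long"): 40, ("female", "short"): 10,
    ("male", "long"): 10,   ("male", "short"): 40,
}

def responds(sex):
    """Toy assumption: the drug works for women and not for men."""
    return sex == "female"

def response_rate(hair):
    """Share of responders among patients with the given hair length."""
    total = sum(n for (_, h), n in population.items() if h == hair)
    hits = sum(n for (s, h), n in population.items() if h == hair and responds(s))
    return hits / total

# Policy under test: treat long-haired patients only.
treated_without_benefit = population[("male", "long")]
denied_despite_benefit = population[("female", "short")]

print(response_rate("long"), response_rate("short"))
print(treated_without_benefit, denied_despite_benefit)
```

Hair length looks like a strong predictor here (an 80% response rate against 20%), yet the hair-based policy treats ten men who cannot benefit and turns away ten women who would. The looser the link between the proxy and the true cause, the larger both groups become.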

So, should we be giving temozolomide to everyone, or just the high responders? Hegi and Stupp co-authored a paper asking this question in 2015. ⁽²²⁾ They wrote that: “By continuing to treat the majority of MGMT unmethylated patients with [temozolomide], we are missing an opportunity to do better.” Let’s assume, for the sake of argument, that in MGMT-active patients, the narrow effect on median survival is usually not worth the costs—both financially and in terms of side-effects. Are Hegi and Stupp right that we should withdraw temozolomide for those patients, and instead “try a potentially efficacious new agent”?

The answer depends on how confident we are that MGMT-inactivation is the underlying determinant of high response. If MGMT status is only loosely related to temozolomide response, then withdrawing temozolomide from MGMT-active patients is like refusing to give our new drug to short-haired women. There is one powerful and obvious defense: in the absence of evidence of any superior predictor of temozolomide effect, we must go with the best we have. Given how well MGMT-inactivation predicted responsiveness in Hegi’s study and in previous and subsequent studies, we are sure that there are low and high responders, and MGMT-inactivity predicts high-response better than anything else we have.

A second response is to draw on mechanistic evidence. Evidence relating to underlying biological mechanisms has been relegated to the bottom of every hierarchy in which it appears. The idea is that work by Malcolm Stevens and his ilk is not admissible evidence. It is important, of course, for creating new drugs and innovations which can then be tested. But only the trials of those new drugs in humans count as evidence. This position might make sense when it comes to the big question every EBMer wants to ask: what is the average treatment effect, and is that effect better than nothing? Just the fact that there’s a well-understood mechanism underpinning how the treatment should work doesn’t mean it will work. But that is not the question we’re asking here.

When you look at the hair length and biological sex case, it’s obvious that hair length is not the real cause; sex probably is. Why? Because sex is the kind of thing that affects whether drugs work, and hair length isn’t. There’s no plausible mechanism by which hair affects whether a drug works. As the next chapter will show, the absence of a plausible mechanism can be powerful counter-evidence. There will be a deeper explanation than just biological sex of why the drug works for women and not men. In temozolomide’s case, knowledge of the mechanism and evidence that the availability of AGT interrupts the way a drug works was strong evidence that the correlation between MGMT-inactivation and responsiveness to temozolomide is causal. These two sources of evidence synergize. Together, they provide a stronger reason to believe that MGMT-inactive patients are high responders than we could account for by appraising each separately. Two potentially “low-quality” individual sources can be mutually reinforcing to the extent that their conclusions are extremely compelling. Just as the strong mechanistic evidence from Stevens’ lab acts as evidence that Hegi wasn’t data mining but testing relevant correlations, so it acts as evidence that the correlation is more likely to be causal, not due to some third variable which is related to both MGMT and temozolomide response.

What is the value of Stupp’s trial?

RCTs are called the ‘gold standard’ of clinical evidence. They have been equated with a ‘high level’ of evidence, with ‘high quality’ evidence and with ‘strong’ evidence. ⁽²³⁾ As they go, Stupp and his colleagues produced a decent trial. It was large, the control and experimental groups were well-matched for variables believed to affect outcomes, and the analysis was performed by the book according to Evidence-Based Medicine’s standards. It was not perfect: the trial couldn’t be blinded, and a trial can always be larger. But if anything is, it’s a good candidate for the ‘high-quality’ and ‘strong evidence’ labels that hierarchies assign.

But evidence is always evidence for or against some theory. The theory Stupp’s trial was designed to evaluate was that radiotherapy plus temozolomide would produce longer median survival than radiotherapy alone. It provided evidence for that claim: an increase in median survival, albeit a modest one. But given what we know from Hegi’s reanalysis and the other studies which have shown the power of MGMT as a predictor, is this useful information at all? Does anyone benefit from knowing that on average, in a mixed population of high and low responders, MGMT-inactive and MGMT-active patients, adding temozolomide outperforms radiotherapy alone?

It seems not. Patients don’t need this information. They need to know whether they have an MGMT-inactive tumour or not, and then they need the information Hegi’s analysis gives them about the difference temozolomide is likely to make for them. Doctors also don’t need it. Once they have Hegi’s information, they know that they need to test their patients’ tumour genetics and treat accordingly. Regulators might need the information, but only if their systems and procedures are ill-equipped to handle a case like temozolomide. Some regulators license a treatment for first-line use only once it has passed large-scale Phase III RCTs and shown a positive average treatment effect. But this seems like a mistaken approach to regulation, one that puts too much faith in EBM’s hierarchical approach. It’s Hegi’s reanalysis, more than Stupp’s trial, that shows the potency of temozolomide, and that is what regulators should be interested in. Should state healthcare providers and insurers want this information? Again, only if they’ve put too much stock in the ‘all-or-nothing’ approach. Why should the state or an insurer have to provide a treatment to every glioblastoma patient when it works well only for a predictable subset? Why should they refuse a treatment to a predictable high-responder because the average effectiveness in groups that include low-responders is marginal? Basing the decision to pay for temozolomide on whether the patient has glioblastoma, rather than on whether they have MGMT-inactive glioblastoma, is misguided in the light of Hegi’s reanalysis. It would be an error born of over-reliance on the average treatment effect from Stupp’s study.

In fact, once the information from Hegi’s study is available, the information about the average treatment effect in the broad population provided by Stupp’s trial is no longer clinically useful. Decisions made on the basis of that data will be worse than decisions made on Hegi’s figures alone, or by drawing on Hegi’s figures and the other studies that have asked how large a treatment effect one can expect for MGMT-active vs. MGMT-inactive patients. Once we believe that temozolomide works at all, we don’t need the average treatment effect finding from Stupp’s trial. The ‘14.6 months’ prognosis for temozolomide patients and the ‘increase of 2.5 months compared to radiotherapy alone’ estimate of the treatment effect are not clinically useful once we know that there is predictable variation in effect size. A firm lesson from the temozolomide case, then: once there’s strong evidence of predictable variation in effects, information about the average effect is not valuable.
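The lesson can be illustrated with a toy calculation. The sketch below rests on loudly stated assumptions: the subgroup shares and benefit figures are invented for the example and are not taken from Stupp’s or Hegi’s data, and the names `Subgroup`, `average_effect` and `mean_forecast_error` are hypothetical. It treats the pooled effect as a share-weighted mean of subgroup benefits (a simplification, since trials report medians) and asks how far off a forecast is when every patient is quoted the pooled figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subgroup:
    name: str
    share: float           # fraction of the trial population (shares sum to 1)
    benefit_months: float  # HYPOTHETICAL survival gain from adding the drug


def average_effect(subgroups):
    """Share-weighted mean benefit across the whole mixed population."""
    total_share = sum(g.share for g in subgroups)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("subgroup shares must sum to 1")
    return sum(g.share * g.benefit_months for g in subgroups)


def mean_forecast_error(subgroups, forecast_for):
    """Average absolute gap between each patient's forecast and their
    subgroup's benefit; forecast_for maps a Subgroup to a forecast."""
    return sum(g.share * abs(g.benefit_months - forecast_for(g)) for g in subgroups)


# Illustrative numbers only; NOT the trial's results.
groups = [
    Subgroup("MGMT-inactive (high responders)", share=0.45, benefit_months=6.0),
    Subgroup("MGMT-active (low responders)", share=0.55, benefit_months=1.0),
]

pooled = average_effect(groups)

# Forecast 1: quote every patient the pooled average.
error_using_average = mean_forecast_error(groups, lambda g: pooled)

# Forecast 2: quote each patient the figure for their own subgroup.
error_using_subgroup = mean_forecast_error(groups, lambda g: g.benefit_months)

print(f"pooled average benefit: {pooled:.2f} months")
print(f"mean error, average-based forecast: {error_using_average:.2f} months")
print(f"mean error, subgroup-based forecast: {error_using_subgroup:.2f} months")
```

With these made-up inputs the pooled average is 3.25 months, a figure that describes neither subgroup, and quoting it to every patient is off by roughly 2.5 months on average. The subgroup-based forecast has zero error here only because the toy model assumes no variation within subgroups; real data would leave some residual error in both forecasts.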

Does Stupp’s trial provide high-quality evidence? Well, it provided a high-quality data set for Hegi to analyse and to use in providing evidence for a different claim about effect sizes. But providing reliable data is not the same as providing strong, high-quality evidence. Stupp’s trial provided pretty strong evidence for a claim which turns out not to be clinically important. Whether it also provided ‘high-quality’ evidence depends on what you think is necessary for ‘quality’. Is it enough to provide strong evidence for a claim whether or not that claim has clinical significance? Or to be high-quality evidence, does a study’s result need to be applicable and useful as well as reliable? If your approach to medicine says that you should pay attention to the high-quality evidence first (or only to the high-quality evidence), as EBM has in the past, then you had better hope that clinical importance is integral to the notion of quality.

RCT evidence will still be clinically important where the effects of a treatment are homogeneous: where patients have the same or similar responses to the treatment. That usefulness, though, is dependent on the evidence that the effect is homogeneous. The RCT evidence is not useful on its own: the evidence about variation (i.e. that there isn’t any) is necessary for the RCT evidence to become clinically significant. Where the effects are heterogeneous (varied, as in the temozolomide case), it’s the evidence about variation and the effect sizes achieved in those groups which is important. There, the RCT evidence is useful only as a data set to be interrogated. It is not necessarily especially useful just because it is RCT data, because the methods used to extract useful information from that data set don’t preserve the benefits of randomisation. Either way, RCT evidence alone is not clinically useful. The evidence concerning the distribution of effects is needed to make the RCT data clinically important.

The final case in which RCT evidence alone might be clinically useful is where there is no evidence at all about variation. Imagine that Hegi, Stupp, Esteller and their colleagues had never analysed the variability of temozolomide’s effects. Under these circumstances, clinicians can do no better than forecasting an average treatment effect as seen in the clinical trial. This is reasonable behavior, but hardly constitutes either a strong or a high-quality evidence base for their prediction. Given that the fundamental task of a clinician in advising their patient about treatment choice is to recommend the treatment most likely to produce the best prognosis for them, the absence of evidence about the distribution of treatment effects makes their evidence-base a weak, low-quality basis for their work. If Evidence Based Medicine seeks to improve clinical practice and ensure doctors base the core of their work on strong evidence, then evidence about variation must be a central plank.

In the temozolomide case, there are two consequences of under-appreciating variation. We miss the fact that there is a group of patients for whom the treatment works far better than the average effect predicts. We also miss the fact that there is a group for whom the treatment is more likely an imposition than an improvement. The latter point remains under-appreciated. As Stupp and Hegi noted in their 2015 editorial, “Patients with unmethylated GBM are in need of better treatments. This population not only offers the opportunity to test novel treatments but actually requires—more than other patients—that they be offered innovative therapies”.⁽²⁴⁾ MGMT-active patients are poorly served by the one-size-fits-all mindset encouraged by relying only on RCT evidence. The opportunity to test much-needed new approaches is being missed because of the rationale that temozolomide improves survival on average, so must be available to all. Taking Stupp and Hegi’s work seriously means admitting that temozolomide isn’t the best answer for every patient. Part of the reason that glioblastoma remains so dismal is that the now-standard treatment is not suitable for over half its recipients.

Conclusion:

The pivotal lesson for analysts of evidence from temozolomide is the importance of interaction. Evidence hierarchies have trained practitioners to evaluate every study individually, on its own methodological merits. In the temozolomide case, this is a significant mistake, and temozolomide is not likely to be unique in this respect. It’s through the light that Hegi’s study shines on Stupp’s data that we see the power of temozolomide as a treatment for glioblastoma. The relationship between her work and the mechanistic evidence of the causal processes of temozolomide generates confidence in predicting high and low responses to temozolomide. Prior prediction and independent replication enhance the causal inference. Through understanding interacting evidence, we can turn independent studies of varying levels of “quality” into a well-justified, compelling model that clinicians can use to predict effects, and patients can employ in making informed decisions about their care. Nothing in that process required evaluating each study in isolation. There is no reason to attempt to assess Stupp’s or Hegi’s work independently of the evidence from Stevens, Esteller, and others. In the temozolomide case, a hierarchical approach to evidence doesn’t just get its judgment of temozolomide and its recommendations to doctors, patients and regulators wrong: it gets the way medical evidence works wrong.

Postscript:

Temozolomide treatment became the norm for glioblastoma sufferers. A review by Derek Johnson and Brian O’Neill in 2012 of “the temozolomide era” in glioblastoma treatment pinpointed 2005 as the sea-change in treatment, caused by “a pivotal phase III clinical trial which showed that temozolomide chemotherapy plus radiation was more effective than radiation alone”—Stupp’s study. ⁽²⁵⁾ No other study impacted the field so decisively.

They found that median survival had indeed increased since temozolomide treatment took hold. The increase of a few months in median survival found in Stupp’s trial is echoed in the wider population data. But Johnson and O’Neill criticised the fact that the figure of 14.6 months median survival has become the starting point for predictions of survival time. They warned that clinicians, relying on RCT results from narrow, relatively healthy trial populations, consistently overestimate the prognosis for glioblastoma patients and cancer patients in general. This can be devastating for patients, friends and family facing an unexpectedly rapid decline. The answers clinicians need in order to improve their forecasts are minimized or omitted.

Despite his personal successes, Stupp is disappointed by the lack of progress. In a 2013 interview, he pointed to the slew of negative research findings in the last decade.⁽²⁶⁾ Intensifying and extending temozolomide treatments did not improve results. Nor did adding new drugs, despite all their earlier promise. Responding to yet another negative finding in 2014, he wrote: “The path to improved treatment of glioblastoma remains paved with disappointments and unconfirmed promises.”⁽²⁷⁾

Stupp and his team push on, though, investigating new approaches to treatment. Stupp sees the breakthrough with respect to the role of MGMT in glioblastoma treatment as a model for future work. In 2011, he wrote: “I think the challenge is really that we still, not only in glioblastoma, but in oncology at large, treat the majority of patients with a one-size-fits-all approach. I think the challenges are to be more individualized, to be able to identify the patient who should be treated with chemotherapy A vs chemotherapy B”.⁽²⁸⁾

In 2009, Stupp, Hegi and their colleagues reviewed the outcomes of all their original trial participants in a five-year analysis, four years after the publication of their trial and up to nine years after the initial diagnosis for some of the participants.⁽²⁹⁾ All but eight patients in the control group had died (97%). In the temozolomide group, 33 of the 287 who enrolled were still alive. Prospects for glioblastoma were indeed dismal, but the “cunning molecule” had saved dozens of lives.

References:

Aitken, Robert D. ‘Treatment of Brain Tumors: Letters to the Editor’. New England Journal of Medicine 352 (2 June 2005).
Balshem, H. M., et al. ‘GRADE guidelines: 3. Rating the quality of evidence.’ J Clin Epidemiol (2011) 64: 401–6.
DeAngelis, Lisa M. ‘Chemotherapy for Brain Tumors — A New Beginning’. Editorial. New England Journal of Medicine, 8 October 2009. doi:10.1056/NEJMe058010. http://www.nejm.org/doi/full/10.1056/NEJMe058010.
Esteller, Manel, Jesus Garcia-Foncillas, Esther Andion, Steven N. Goodman, Oscar F. Hidalgo, Vicente Vanaclocha, Stephen B. Baylin, and James G. Herman. ‘Inactivation of the DNA-Repair Gene MGMT and the Clinical Response of Gliomas to Alkylating Agents’. New England Journal of Medicine 343, no. 19 (9 November 2000): 1350–54. doi:10.1056/NEJM200011093431901.
Guyatt, G.H., and D. Rennie. ‘The philosophy of evidence-based medicine.’ In Users’ Guides to the Medical Literature, ed. G. H. Guyatt and D. Rennie, 9–16. (2008) New York: McGraw Hill Medical.
Hegi, Monika E., Annie-Claire Diserens, Sophie Godard, Pierre-Yves Dietrich, Luca Regli, Sandrine Ostermann, Philippe Otten, Guy Van Melle, Nicolas de Tribolet, and Roger Stupp. ‘Clinical Trial Substantiates the Predictive Value of O-6-Methylguanine-DNA Methyltransferase Promoter Methylation in Glioblastoma Patients Treated with Temozolomide’. Clinical Cancer Research: An Official Journal of the American Association for Cancer Research 10, no. 6 (15 March 2004): 1871–74.
Hegi, Monika E., Annie-Claire Diserens, Thierry Gorlia, Marie-France Hamou, Nicolas de Tribolet, Michael Weller, Johan M. Kros, et al. ‘MGMT Gene Silencing and Benefit from Temozolomide in Glioblastoma’. New England Journal of Medicine 352, no. 10 (10 March 2005): 997–1003. doi:10.1056/NEJMoa043331.
Hegi, Monika E., and Roger Stupp. ‘Withholding Temozolomide in Glioblastoma Patients with Unmethylated MGMT Promoter—still a Dilemma?’ Neuro-Oncology 17, no. 11 (1 November 2015): 1425–27. doi:10.1093/neuonc/nov198.
Higgins, J. P. T., and S. Green, eds. Cochrane Handbook for Systematic Reviews of Interventions. Version 5.1.0. The Cochrane Collaboration, 2011. Available at http://handbook.cochrane.org (accessed 01/04/15).
Johnson, Derek R., and Brian Patrick O’Neill. ‘Glioblastoma Survival in the United States before and during the Temozolomide Era’. Journal of Neuro-Oncology 107, no. 2 (1 April 2012): 359–64. doi:10.1007/s11060-011-0749-4.
Newlands, E. S., M. F. G. Stevens, S. R. Wedge, R. T. Wheelhouse, and C. Brock. ‘Temozolomide: A Review of Its Discovery, Chemical Properties, Pre-Clinical Development and Clinical Trials’. Cancer Treatment Reviews 23, no. 1 (1 January 1997): 35–61. doi:10.1016/S0305-7372(97)90019-0.
Sackett, D.L. et al. ‘Evidence-based medicine: how to practice and teach EBM’. 2nd ed. (2000) Edinburgh: Churchill Livingstone.
Sansom, Clare. ‘Temozolomide – Birth of a Blockbuster’. Chemistry World July 2009 (26 June 2009): 48–50.
Seiter, Karen. ‘Treatment of Brain Tumors: Letters to the Editor’. New England Journal of Medicine 352 (2 June 2005).
Stevens, Malcolm. ‘Malcolm Stevens – School of Pharmacy’. Accessed 5 June 2017. https://www.nottingham.ac.uk/pharmacy/people/malcolm.stevens.
Stevens, Malcolm F. G. ‘Temozolomide: From Cytotoxic to Molecularly Targeted Agent’. In Cancer Drug Design and Discovery (Second Edition), edited by Stephen Neidle, 145–64. San Diego: Academic Press, 2014. doi:10.1016/B978-0-12-396521-9.00005-X.
Stewart, Bernard W, and Christopher P Wild. World Cancer Report 2014: World Health Organisation, International Agency for Research on Cancer. Geneva, Switzerland: WHO Press, 2015.
Stupp, Roger. ‘My Approach to Glioblastoma’. PracticeUpdate, 9 August 2011. http://www.practiceupdate.com/content/my-approach-to-glioblastoma/17399.
Stupp, Roger. ‘Bevacizumab for Newly Diagnosed Glioblastoma: More Disappointment’. PracticeUpdate, 6 March 2014.
Stupp, Roger, Monika E Hegi, Warren P Mason, Martin J van den Bent, Martin JB Taphoorn, Robert C Janzer, Samuel K Ludwin, et al. ‘Effects of Radiotherapy with Concomitant and Adjuvant Temozolomide versus Radiotherapy Alone on Survival in Glioblastoma in a Randomised Phase III Study: 5-Year Analysis of the EORTC-NCIC Trial’. The Lancet Oncology 10, no. 5 (May 2009): 459–66. doi:10.1016/S1470-2045(09)70025-7.
Stupp, Roger, Warren P. Mason, Martin J. van den Bent, Michael Weller, Barbara Fisher, Martin J.B. Taphoorn, Karl Belanger, et al. ‘Radiotherapy plus Concomitant and Adjuvant Temozolomide for Glioblastoma’. New England Journal of Medicine 352, no. 10 (10 March 2005): 987–96. doi:10.1056/NEJMoa043330.
Walker, Michael D., Eben Alexander, William E. Hunt, Collin S. MacCarty, M. Stephen Mahaley, John Mealey, Horace A. Norrell, et al. ‘Evaluation of BCNU And/Or Radiotherapy in the Treatment of Anaplastic Gliomas’. Special Supplements 112, no. 2 (7 May 2009): 333–43.
Zoeller, Lauren, and Roger Stupp. ‘Glioblastoma: Despite All the Disappointment, There Has Been Progress’. PracticeUpdate, 7 November 2013.