Bias and the Myth of the Objective Average

“Suppose you have cancer and you have to choose between a black box AI surgeon that cannot explain how it works but has a 90% cure rate and a human surgeon with an 80% cure rate. Do you want the AI surgeon to be illegal?”

Geoffrey Hinton, 20/02/20

Geoff Hinton, neural network pioneer, posed this apparently rhetorical question to Twitter. The answer is not as clear-cut as Hinton appeared to assume, and exemplifies an overgeneralised approach to medical evidence. AI researcher Kate Crawford’s response puts it succinctly: “Let’s talk about the 10%. What if the AI surgeon was trained on data that oversampled white men (as per many controlled trials)? And it consistently produces worse outcomes for Black people and women? Seems like it matters who “you” are in this hypothetical”.

Crawford is spot on here, and she identifies a problem which foregrounds how ignoring heterogeneous treatment effects threatens to exacerbate and even create bias in medicine. These biases frequently enter medicine through screening protocols which remove women, older patients, children, and racial minorities from clinical trials. But bias can also enter into the interpretation of medical evidence when we assume that all treatments operate homogeneously. If two treatments have heterogeneous treatment effects, but one shows a higher average treatment effect than the other, then the orthodoxy of Evidence-Based Medicine would likely hold that we should exclusively use the ‘more effective’ one. On that view, the answer to Hinton’s question would be: certainly not – but maybe we should think about banning the human. But where treatment effects are heterogeneous, varying according to some features of the patients, the groups of patients who benefit most from the ‘inferior’ treatment, or least from the ‘superior’ treatment, will be underserved by the medical orthodoxy.

A better answer to Geoff Hinton’s rhetorical question depends on who “you” are, on what properties you have, and on which properties of patients are associated with failure – in both the AI and the human surgeon. But the problem runs deeper still than the potential for bias in the training data identified by Crawford. The dominant model of clinical testing is simply incapable of answering, or even asking, the kind of question Crawford raises here. The question is what overlap there is between the 10% who are not cured by the AI surgeon and the 20% who are not cured by the human. There are a few possibilities.

For one, it could be the case that there is a single common predictive variable or set of variables which affects the likelihood of surgical success. For example, the size of the tumour or the severity of the condition might be the main determinant of success across all patients, regardless of whether they are treated by the human or the AI surgeon. In that case, perhaps 20% of cases are too severe for the human surgeon to cure, but the AI surgeon does much better and only the 10% of most severe cases are beyond its skills. The AI failures are a proper subset of the human failures. In that case, the answer looks clear: always use the AI surgeon. Or, better still, if the relationship between severity and success is strong enough and well understood, use the AI surgeon unless the condition is too severe for there to be any reasonable chance of success.

This situation is the simplest case, the world of homogeneous treatment effects. The assumption that treatments have homogeneous effects across diverse populations underpins Hinton’s tweet. But this is not a reliable assumption. In most domains of healthcare – and especially in surgical interventions and in oncology, compounded at the intersection of the two – heterogeneity is the norm. Hinton should not feel too abashed here: his assumption is widely shared across models of evidence in medicine and is the statistical orthodoxy. But since clinical realities are rarely so obliging, we should look beyond the simplest scenario.

Alternatively, the AI and the human surgeon’s success rates might be determined (at least in part) by different variables. For example, suppose that the AI surgeon’s success rate is based on how typical the case is – how much it resembles the cases in its training data, perhaps. The more the tumour’s location, size and shape resemble the average tumour, the better able the AI is to excise it. In only the 10% most atypical cases does the AI fail. Otherwise, it succeeds.

By contrast, suppose the human surgeon’s success rate is based on something else. Maybe it is based on size: the human surgeon does well with large tumours but is much less capable of excising smaller ones. Now the decision is less obvious. For any fairly typical tumour we will use the AI surgeon. When the tumour is small and atypical, perhaps we’ll still plump for the robot – though we would probably want to research success rates in that subtype more extensively as a priority. Yet when the tumour is large but is in an unusual position or shape, we might decide to deploy the person. This strategy uses the AI for the cases it is good at and the human for the cases where the AI is not well configured but the human is. Where neither is particularly adept, it uses whichever system has the higher success rate. This approach is highly likely to outperform using either system alone. If we can achieve a cure rate above 90% with this hybrid strategy, then we should. It will be particularly successful at mitigating bias in treatment outcomes where the variables which correlate with success in one or both of the treatment options are themselves correlated with broader patient features, be they physiology, gender, age, ethnicity, socioeconomic status, or otherwise.
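To make the arithmetic concrete, here is a minimal Python sketch of the hybrid strategy. Everything in it is an illustrative assumption rather than data: a hypothetical grid of patients whose atypicality and size percentiles are independent, an AI that fails deterministically on the 10% most atypical tumours, and a human who fails deterministically on the 20% smallest. The function names are invented for the example.

```python
from itertools import product

# Hypothetical population: one patient for every combination of an
# atypicality percentile (0-99) and a tumour-size percentile (0-99).
patients = list(product(range(100), range(100)))  # (atypicality, size)

def ai_cures(atypicality, size):
    # Assumption: the AI fails only on the 10% most atypical tumours.
    return atypicality < 90

def human_cures(atypicality, size):
    # Assumption: the human fails only on the 20% smallest tumours.
    return size >= 20

def hybrid_cures(atypicality, size):
    # Route typical tumours to the AI and atypical ones to the human.
    if atypicality < 90:
        return ai_cures(atypicality, size)
    return human_cures(atypicality, size)

def cure_rate(strategy):
    # Fraction of the hypothetical population cured under a strategy.
    return sum(strategy(a, s) for a, s in patients) / len(patients)

ai_rate = cure_rate(ai_cures)
human_rate = cure_rate(human_cures)
hybrid_rate = cure_rate(hybrid_cures)
print(ai_rate, human_rate, hybrid_rate)
```

Under these assumptions the hybrid cures 98% of patients, against 90% for the AI alone and 80% for the human alone. The remaining 2% are the small, atypical tumours that neither surgeon handles well. Real failures are not deterministic and real features are rarely independent, so the size of the gain is illustrative only.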

But the human surgeon is not just one individual with their scalpel. The ‘human surgeon’ is not even a single human. They are a surgical team. Actually, ‘the human surgeon’ is many different teams, distributed across the globe. The 80% success rate for the human surgical teams represents the average success rate across surgical teams. Some teams might vastly outperform the 90% successful AI and should not be taken off cases in its favour. At the very least, we should not contemplate banning human teams in favour of the AI before we have checked whether any surgical team can outperform the machine.

But that 80% success rate also probably represents different teams which have different determinants of success. One team might be very good with larger tumours but tend to fail on small tumours. One team might be vexed by atypical tumours but unaffected by size. Another might tend to fail in patients with comorbidities (conditions affecting them other than the cancer). Some might cherry-pick easier cases and pad their stats (and we should carefully verify that this did not happen in the data which backs up the AI’s 90% figure). Others might specialise in the most difficult cases. And so on. The combined effect of these different tendencies is that the success rate averages 80% overall and appears roughly equal from patient to patient. In the abstract, we might then ignore determinants of success when considering the human surgeons, and just say that there’s an 80% chance of success regardless of patient properties.

This would be a costly mistake. If the chance of success is not random for a given surgical team (i.e., it doesn’t just fail in 1 of 5 patients by luck alone), each individual surgical team has correlates of success. So, the optimal decision for any given patient depends on the success determinants of the available surgical teams. In fact, even if every team had an 80% success rate, it may still be possible to improve significantly upon the 80% success rate by matching the surgical candidate to the surgical team with the best chance of success for that candidate. The 80% figure represents only a neutral strategy which omits one of the most powerful evidence-led approaches to tailoring treatment. Indeed, it may be that the most effective role for AI in improving surgical outcomes is outside the theatre, in facilitating a matching process between surgical teams (whether AI, human or hybrid) and the tumours they are most adept at handling.
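The claim that matching can beat 80% even when every team averages 80% can be checked with a small worked example. The Python sketch below assumes two hypothetical teams, one sensitive to tumour size and one sensitive to comorbidities, along with made-up patient shares and success probabilities chosen so that each team averages exactly 80%. None of these figures comes from real data.

```python
from fractions import Fraction as F

# Hypothetical share of patients in each stratum: (tumour size, comorbidity).
shares = {
    ("small", "comorbid"): F(4, 100),
    ("small", "none"): F(16, 100),
    ("large", "comorbid"): F(16, 100),
    ("large", "none"): F(64, 100),
}

def size_sensitive_team(size, comorbidity):
    # Assumed success probability: weak on small tumours, strong otherwise.
    return F(4, 10) if size == "small" else F(9, 10)

def comorbidity_sensitive_team(size, comorbidity):
    # Assumed success probability: weak with comorbidities, strong otherwise.
    return F(4, 10) if comorbidity == "comorbid" else F(9, 10)

teams = [size_sensitive_team, comorbidity_sensitive_team]

def average_rate(team):
    # Success rate if this team treats every patient regardless of stratum.
    return sum(share * team(*stratum) for stratum, share in shares.items())

def matched_rate():
    # Success rate if each stratum is routed to its best-suited team.
    return sum(
        share * max(team(*stratum) for team in teams)
        for stratum, share in shares.items()
    )

for team in teams:
    print(team.__name__, float(average_rate(team)))
print("matched", float(matched_rate()))
```

With these invented numbers each team averages 80% on its own, while routing each patient to the better-suited team yields 88%. The gain comes entirely from using information about who fails with whom, which the single 80% figure hides.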

The problem is that we may have very little or no information about the determinants of surgical success at the level of the individual team. The sample sizes of procedures conducted by each team may be small. With only 1 in 5 procedures failing on average, the dataset of failed procedures is even smaller. It would be easy to overgeneralise from such a small sample, so often the best we can do may be to treat surgical team failure as if it were random, even if we believe it is not. Improving clinical outcomes would likely be possible with greater research into determinants of success by surgical teams, and the variation between teams’ performance.
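One rough way to see how little a small sample can tell us is to put an interval around a team’s observed success rate. The Python sketch below uses the Wilson score interval, a standard choice for proportions that is brought in here purely for illustration. The procedure counts (20 successes in 25 procedures, and 2,000 in 2,500) are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a success proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical single team: 20 cures in 25 procedures (80% observed).
small_low, small_high = wilson_interval(20, 25)

# Hypothetical pooled data: 2,000 cures in 2,500 procedures (80% observed).
large_low, large_high = wilson_interval(2000, 2500)

print(f"single team: {small_low:.3f} to {small_high:.3f}")
print(f"pooled data: {large_low:.3f} to {large_high:.3f}")
```

For the single team the interval is wide enough to include 90%, so on this evidence we could not say whether that team is worse than the AI. For the pooled data the interval sits well below 90%. Subgroup analysis within one team's handful of failures would be more uncertain still.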

So, suppose we treat surgical teams’ performance as if it were random. We should not treat the AI surgeon’s performance as if it were random unless we are similarly unable to learn anything about the patients for whom the surgical procedure fails. Where the AI surgeon is functionally the same surgeon in all of its procedures, not an assemblage of distinct teams, this may be much easier, even if the AI surgeon has performed fewer procedures than the human-only surgical teams.

It may be possible to draw some hypotheses about the AI’s performance based on the training data alone. We can look at who is under-represented in the training set, and which kinds of tumours are infrequent within the data. We should expect that patients and tumours which differ markedly from those in the training data will be underserved by the AI. One issue, though, as Hinton notes, is that many algorithms are effectively black boxes. The networks which underpin the hypothetical AI surgeon’s performance are extremely complex and difficult for human observers to interpret. It may be practically impossible to understand how the AI functions and therefore to make good inferences about its likely failures. If such a black box algorithm is used, all we can do is look to the outputs: which patients are cured and which are not. It would be too quick to jump from under-representation in the training data to the assumption that failure rates will be higher in such patients in practice. We would need to study the question through outcomes research.

In fairness, the AI is not the only black box involved here. The brains of surgeons are black boxes to us, too. The surgeon might appear to be able to tell us why they did a certain thing during a procedure. But they are really telling us why they think they did it, in hindsight. What is actually going on in the surgeon’s mind is hard to understand with any conviction and hard to extrapolate from their descriptions with confidence. So, while the black box problem certainly applies to the AI surgeon, we should not assume this is a unique problem for AI in medicine. In fact, the argument made here is predicated on our having insufficient evidence to unpack the effective black box that results when the human surgical success rate of 80% is composed of the varied success rates and success determinants of many different human surgical teams.

The combination, then, of the limitations of data at the individual team level and the black box structure of most deep learning algorithms often leaves us reliant on the outcomes data alone. If we want to know whether the AI surgeon or the human surgical team is the better option for the individual, then we primarily need to look to the data of past success and failure rates.

A first step is to look at the overall success rates: 80% for the human surgeon and 90% for the AI. If we can go no further than that, then the AI surgeon is the one to choose. But if we can stratify the outcomes data by relevant patient features and make comparisons within those groups alone, we may see different comparisons emerging. If, as Crawford envisions, the AI surgeon’s success rate in Black people and women (and, if we adopt an intersectional approach, likely particularly in Black women) is significantly lower and the success rate in white men significantly higher, then that might be enough to close the gap, to the point that for Black women the human surgeon is the better choice. Of course, this depends on the assumption that the human surgeons too are no worse (or perhaps better) when operating on Black women. If there is a bias in both sets of training, the human and the AI, then it is perhaps unlikely that the humans can outperform the AI even then. There is a tendency for AI to replicate human biases because those biases are enshrined in training data, and we should not assume that AI bias will not track human bias.
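A stratified comparison of this kind can be sketched in a few lines of Python. The counts below are invented so that the headline rates match the hypothetical (90% for the AI, 80% for the human). The subgroup labels and the assumption that the human teams perform equally in both groups are illustrative choices, not findings.

```python
# Invented counts of (cured, treated) for each surgeon and subgroup.
outcomes = {
    "ai": {"well_represented": (760, 800), "under_represented": (140, 200)},
    "human": {"well_represented": (640, 800), "under_represented": (160, 200)},
}

def overall_rate(surgeon):
    # Pooled cure rate across all subgroups.
    cured = sum(c for c, _ in outcomes[surgeon].values())
    treated = sum(n for _, n in outcomes[surgeon].values())
    return cured / treated

def subgroup_rate(surgeon, group):
    # Cure rate within a single subgroup.
    cured, treated = outcomes[surgeon][group]
    return cured / treated

def better_surgeon(group):
    # Surgeon with the higher cure rate in the given subgroup.
    return max(outcomes, key=lambda s: subgroup_rate(s, group))

for surgeon in outcomes:
    print(surgeon, "overall", overall_rate(surgeon))
for group in ("well_represented", "under_represented"):
    print(group, "->", better_surgeon(group))
```

In this made-up dataset the AI wins overall and in the well-represented group, but the human is the better choice for the under-represented group (80% against 70%). A real analysis would also need to check that the subgroup differences are larger than chance variation would produce.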

The two-tweet debate between Crawford and Hinton exemplifies two approaches to medical evidence. Hinton’s approach assumes that population-level data provides the best information for determining practice. The AI surgeon is better than the human at the broadest population level. Crawford thinks that we should stratify more and try to discover subgroup variation because she knows that AI tends to produce skewed results based on the representativeness of its training data. Crawford’s approach moves us away from the focus on large population-wide studies which is usually advocated by proponents of Evidence-Based Medicine. The concerns she raised can only be answered by a more targeted and stratified approach to evidence, of the kind epitomised by the movement towards ‘Precision Medicine’ or ‘Personalised Medicine’.

There is a final irony here: the use of machine learning approaches is perhaps the most fertile ground for the development of Precision Medicine, which depends on the use of big data to drive an ever more fine-grained taxonomy of patients and their conditions. A Precision Medicine approach would likely start elsewhere – by rejecting the simple taxonomy of ‘the tumour’ which we are treating in this hypothetical situation. It might begin from the assumption that if success rates for the surgery are so variable, there might be underlying differences between patients’ tumours which account for (at least some of) the difference between success and failure for the surgeons. Precision Medicine might seek to bring deep learning to bear on the question of tumour stratification, classifying patients into groups according to their features and to features of their tumours, and then looking to see whether some groups respond to the surgery while others do not. Perhaps the AI surgeon is a good advance, but AI stratification would be better, allowing us to target treatment where it is most likely to produce the desired cure and to avoid a risky procedure where it is not.

Latest edit: 06/09/2022.