With my recent move to a Data Science team, & acknowledging that my recent reading has been overwhelmingly data-optimistic, I thought I’d take a recommendation from an analyst colleague of a book that is quite the opposite – “Weapons of Math Destruction”, by Cathy O’Neil. Below is my summary of the book, followed by a one-page nine-question anti-WMD framework inspired by it, and finally a hypothetical worked example of applying the framework to an algorithm in the RAF (Machine Learning in Promotions). The views expressed mainly reflect content provided by Cathy (hence the predominantly US examples, rather than UK – not all will be generalisable to the UK), with a couple of my personal observations – none represent my employer’s views.
“Weapons of Math Destruction” is written by Cathy O’Neil, a data scientist who draws on both her own personal experiences working in e-commerce & a number of illustrative case studies to outline how big data and the algorithms powered by it are increasingly being used as “Weapons of Math Destruction” (‘WMD’ for short).
In short, not every algorithm or model (if these terms are confusing, see below) is a WMD – only those that cause harm, usually on account of three hallmark features: their opacity, scale & the damage they cause. This damage seems most acutely felt by the least fortunate, usually unknowingly, and often seems to perpetuate vicious cycles of poverty, inequality, and incarceration. Let’s dive into some examples.
Assessing Teachers’ value-add
One of the first and most shocking examples of a WMD Cathy cites is Washington DC Schools’ “IMPACT model”. It came about in 2007, as a result of plans DC’s mayor set out to turn around the city’s underperforming schools. Their hypothesis was that a key reason for schools’ underperformance was a small minority of poor teachers. It followed that by identifying and firing the worst teachers, the average quality of teaching would rise. Though pretty uncompromising, it doesn’t sound like an entirely malign plan.
To implement this plan, they collected data on teachers’ performance so they could identify the bottom 2%. Their algorithm allocated 50% of its score to the teacher’s “value-add” to students. Put simply, if a child predicted a B gets an A, the teacher has a positive “value-add” score, and vice versa if a child predicted a D gets an E.
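A toy sketch of how such a “value-add” score might be totted up (my illustration, not the real IMPACT model – the grade scale and scoring are assumptions):

```python
# The teacher's score is the average gap between each pupil's actual and
# predicted grade (grade scale & scoring are my illustrative assumptions).
GRADE_POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

def value_add(predicted, actual):
    """Mean of (actual - predicted) grade points across the class."""
    gaps = [GRADE_POINTS[a] - GRADE_POINTS[p] for p, a in zip(predicted, actual)]
    return sum(gaps) / len(gaps)

# The B-predicted pupil who gets an A (+1) cancels out the D-predicted pupil
# who slips to an E (-1):
print(value_add(["B", "D"], ["A", "E"]))  # → 0.0
```

Even in this toy form, notice that everything besides the teacher has already been squeezed out of the score.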
Though this sounds like an intuitively fair way of assessing teacher quality, it turned out to be anything but, leading to such crazy variance in scores that one teacher went from scoring 6/100 one year to 96/100 the next. The teacher in question clearly had not changed from a complete liability into a once-in-a-generation genius in the space of a year, so what had gone wrong?
Firstly, this is a great example of how big data is often used in an overly-reductionist way – associating a complex phenomenon (i.e. kids’ educational attainment) with a single factor (teacher performance) simply because it’s convenient to do so (or ignorance, perhaps). Kids’ educational attainment is of course in part due to teachers, but it’s also due to the child’s personal motivation, mental health, curriculum design, home circumstances (a reviewer of this article kindly pointed to many studies confirming this which are summarised here) and a great number of other things. In this case, even though teachers’ jobs were on the line, the algorithm wasn’t made to account for the complexities of real life; instead, real life was reduced to an overly-simplistic score to suit the purpose of efficiently sacking 2% of teachers a year.
Secondly, this faux-mathematical approach is laughable statistically, on account of its lack of statistical power. Big data is more statistically powerful at making predictions when the data is, well… big. Small samples (i.e. a class size of 20-40) do not qualify as big data and can often display large variance over time – meaning that one year a teacher might have a dreamy class of highly engaged students to whom they’re able to add significant value, whilst the next year’s class has a disruptive child and a couple of children with difficult home circumstances, and suddenly the balance is tipped. Should an algorithm be sacking teachers who have tough classes this year? What a perverse incentive to put teachers off going to the more challenging schools where they might truly be able to make a difference, because there is a risk that an algorithm will lead to their sacking in year one.
These two features combine to make the algorithm harmful (trait 1 of a WMD) to teachers – they could be sacked because of an overly simplistic algorithm and just having a tough class one year. Imagine the number of good teachers who lost the confidence ever to teach again after being falsely labelled as in the bottom 2% of teachers. Unfortunately when this did happen, it happened at scale across the entirety of DC (trait 2 of a WMD). Though there was of course uproar, teachers had no way to challenge the decisions because of the algorithm’s opacity (trait 3 of a WMD) – justified on account of the algorithm being ‘proprietary’. This meant teachers weren’t able to delve deeply into why exactly they had been sacked, nor, therefore, appeal the decision.
Race and the poverty cycle
Perhaps the most topical subject that comes up time and again in this book is WMDs that unfairly harm the poor and ethnic minorities. Below, I’ll try to bring a few of the examples given in the book into a single narrative.
Let’s imagine you’re an 18-year old black male from a rough neighbourhood in the Bronx.
Through no fault of your own you find yourself with little money, a low standard of education and little in the way of prospects. If you were asked who was to blame for your situation, I doubt algorithms would so much as cross your mind.
You wake up in the morning & check your social media – as usual, ads for payday loans are plastered all over your newsfeed – you probably think this is normal, that everyone else sees these ads, but actually you’re selectively targeted for these based upon your demographics (presumed low educational attainment, low level of savings & income). Just a few blocks away in the nicer part of town, little to your knowledge, the lads your age don’t ever see those predatory loans – they get ads for low-interest rate loans from the reputable banks their parents set them up with as children.
You lie in bed daydreaming. You’ve always wanted a car and without one it’s hard to get a better paying job outside of the neighbourhood, and your savings will almost cover it. You deliberate the decision for a while then later that week you find a car that fits, take the loan and buy it.
Unfortunately, you hadn’t really accounted for just how expensive insurance is – way more than for your white middle-class friend from high school. Perhaps he was exaggerating the deal he got? Unbeknownst to you, this is because many car insurance quotes in the US take into account your credit rating. No doubt when the insurers built their algorithms, they felt that credit rating might be a proxy (a surrogate) for reliability, or that it correlated with insurance claims in some way. But this misuse of a proxy prejudices against the poor, as they will typically have lower credit ratings – but why would being poor make you a worse driver? Cathy cites one shocking example where a study found that car insurance was far more expensive for someone with a completely clean driving record & a low credit score than for someone with a drink-driving conviction & a high credit score.
Having taken out another payday loan to cover the insurance, you miss a payment – you know your credit rating will take a hit, so you seek to find a job ASAP to get yourself out of this spiral – it should be easy now you have a car and can be more flexible. You apply for dozens of jobs in the local area, being rejected from many of them. Though you put it down to your lack of experience in the service industry, an algorithm may have more to do with it than you think. Many employers take into account your credit score (derived from an algorithm that is roughly 35% payment history, 30% amount owed, 15% length of history, 10% new credit, 10% types of credit used) as part of their application process as a proxy for how reliable you are – presuming that if you can keep up with your loan repayments, you’ll be reliable at work. Unfortunately this often rules out the people who most need jobs (those in debt) from getting them – and highlights the scale at which this algorithm (the credit score) unintentionally creates harm by propagating the cycle of poverty, whilst you probably don’t even realise it’s happening.
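The weighted recipe above can be sketched in a few lines (the weights come from the text; the component scores and the 300-850 scaling are my illustrative assumptions):

```python
# Credit score as a weighted sum of components (weights as described above;
# the 0-1 component scores and 300-850 scaling are illustrative assumptions).
WEIGHTS = {
    "payment_history": 0.35,
    "amount_owed": 0.30,
    "length_of_history": 0.15,
    "new_credit": 0.10,
    "credit_mix": 0.10,
}

def credit_score(components):
    """Weighted sum of 0-1 component scores, scaled to a 300-850 band."""
    raw = sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
    return round(300 + raw * 550)

# One missed payday-loan repayment dents payment history - the largest weight:
before = credit_score({"payment_history": 0.9, "amount_owed": 0.6,
                       "length_of_history": 0.3, "new_credit": 0.5,
                       "credit_mix": 0.5})
after = credit_score({"payment_history": 0.6, "amount_owed": 0.6,
                      "length_of_history": 0.3, "new_credit": 0.5,
                      "credit_mix": 0.5})
print(before, after)  # → 652 594
```

Note what the number can never see: why the payment was missed, or that the score itself is now shutting its owner out of the jobs that would fix it.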
Finally you secure a job in a famous-coffee-chain I won’t name in this blog. It’s not your dream job, and you’ve only got 24-hours a week but hey – that’s fine, because you’ll get a second job to fill in the time and supplement your earnings.
Unfortunately this proves impossible – because your schedule seems horrendous. Damn the boss must hate you – or perhaps, you ponder, it’s some tactic to weed out those who aren’t committed? You seem to only have shifts at opening time (0500-0900), closing time (2000-2200) and seemingly random day shifts usually at weekends.
You guessed it, our famous-coffee-chain uses an algorithm for its scheduling – which optimises for profit by minimising staffing, responding dynamically to historical demand, weather forecasts, and local events. This means you only know your shifts a couple of days in advance, and most shifts fall around the opening-up/commuting times and then closing-up time, despite the fact that this schedule is exhausting, precludes you getting another job, and forces two commutes a day. You raise it with the manager, but he points out that at such fine margins, they need a mathematical model to dictate scheduling – it wouldn’t be cost-efficient to keep you all in over the quiet mid-morning or mid-afternoon periods. The mathematical model has been ruthlessly trained to optimise shift patterns for cost-efficiency at the expense of staff wellbeing & retention. And why wouldn’t it be? There’s a queue of replacement candidates at the door should they need them.
You’re exhausted. Weekends are written off with opening and closing shifts, you get 5 hours sleep between these shifts, and you’re only actually working 4-6 hours a day so the pay is terrible. You can’t take on a second job due to the unpredictability of the shifts and your monthly repayments on the loan seem to take barely anything off the amount outstanding since the interest rate is sky high.
You finally have an evening off. One of your friends is hosting a little get together – a few beers whilst watching the basketball. You take a couple of your dad’s beers from the fridge, stuff them in your bag, and make your way over to his house just around the corner – cracking your bottle with excitement and taking a few hard-earned sips on the way over.
Unfortunately on the way, for the 3rd time this year, you’re stopped and frisked by the police. This isn’t an isolated case – a study by NY Civil Liberties Union found that though 14-21yr old black & latino males made up ~4% of the population, they accounted for ~40% of the stop & frisk checks. 90% turn out to be innocent, but many of the remaining 10% are done for petty crimes such as underage drinking or carrying a joint. Crimes that rich white kids do every weekend in college frat parties, but never end up charged for.
You’d be forgiven for thinking that there was no model underlying this, but unfortunately you’re probably wrong. Police departments (like almost any other public department) have to be careful with their spending, focussing it as efficiently and as effectively as they can on tackling crime. Recently, some police forces have turned to predictive modelling software to direct their policing efforts to where the most crime is. Though this sounds like a good idea, let me just play through how this actually pans out. A rough predominantly black neighborhood will have more historic crimes in it – partly due to it being a low-income high-unemployment area, and partly some might argue, due to historic Police bias towards black people. The model is trained on this data and therefore focuses police attention on this area.
Now, of course policing this area more and conducting more stop-and-frisk searches increases the number of “hits” for the model – the number of crimes picked up. All the while, in the rich white area, people may be abusing class A drugs, underage drinking or perhaps vandalising property, yet these crimes aren’t picked up, because less policing is focussed there – due to the algorithm. So the algorithm ends up feeding its own sick version of reality – focussing police efforts on the poor neighbourhoods and picking up large numbers of often petty crimes whilst ignoring the better-off neighbourhoods – training a model that is increasingly unbalanced towards the poor neighbourhoods. That’s not even to mention that the more petty offences punished within these neighbourhoods, the more incarceration, and the less the prospect of locals ever getting a job and breaking the poverty cycle.
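This feedback loop is easy to demonstrate with a toy simulation (my sketch, not from the book): give two neighbourhoods an identical true crime rate but a skewed historical record, allocate patrols in proportion to recorded crime, and watch the skew never self-correct:

```python
# Two neighbourhoods with the SAME true crime rate, but A starts with more
# recorded crime (historic over-policing). Patrols follow the model's
# predictions, and you only record the crimes you are present to see.
recorded = {"A": 60.0, "B": 40.0}   # historic recorded crimes (A over-policed)
TRUE_RATE = 0.05                    # identical underlying rate in both areas
PATROL_CAPACITY = 2000              # stops the force can make per year

for year in range(10):
    total = sum(recorded.values())
    patrols = {hood: PATROL_CAPACITY * n / total for hood, n in recorded.items()}
    for hood in recorded:
        recorded[hood] += patrols[hood] * TRUE_RATE  # crimes you happen to catch

share_A = recorded["A"] / sum(recorded.values())
print(round(share_A, 2))  # → 0.6: A still "looks" 50% more criminal than B
```

Under these assumptions A’s share of recorded crime is locked at 60% forever, despite identical true rates – and any model retrained on those records will keep sending the patrols there.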
Back to the story… so after you’re stopped you resist arrest by running off – in hindsight it’s a bad decision, but in the moment you’re terrified at the thought of the effect a criminal record would have on your employment prospects. You’re apprehended though, and end up in a courtroom the next week. You know that historically race has played a major part in sentencing – sentences imposed on black men in the federal system are 20% higher than those for white men convicted of similar crimes – so you flag it with your legal aid. Not a problem anymore, she replies – there’s an algorithm to aid in sentencing (at the time of print of WMD these were being used in 24 states) to make it fairer.
The problem, though, is that these algorithms are opaque to the people they are applied to, exist at scale and cause real harm – by perpetuating the cycle of poor, and largely black, people being incarcerated.
How? Let’s explore the example of the “LSI-R” (Level of Service Inventory – Revised). Prisoners fill out the questionnaire and a simple algorithm (totting up scores from the different questions) then judges them to be at either low-, medium- or high-risk of reoffending if released – and this, in some states, is used to determine sentence lengths. Though you’re not overtly asked your race, imagine how different the results of this risk score would be for a white middle-class man vs a black working-class man, regardless of crime. One question asks about “the first time you were ever involved with the police” – the white guy has probably never been stopped and frisked; the black guy has quite possibly been stopped and frisked a number of times that year, just because of his neighbourhood & the colour of his skin. Another question asks whether any of their friends or family have criminal records – which is, of course, more likely for a black person from a poor neighbourhood than a white middle-class person. So instead of your sentence length being judged purely on you and your crime, it reflects a system which makes it ten times more likely that you were stopped and frisked as a young black man, and something else you have no control over – your friends’ and family’s criminal records.
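A hypothetical tot-up scorer in the spirit of the LSI-R (the questions and weights below are my inventions, not the real instrument) makes the problem concrete – neither of the first two questions is about the crime itself:

```python
# Each "yes" adds a weight; the total is banded into low/medium/high risk.
# Questions and weights are illustrative inventions, not the real LSI-R.
QUESTIONS = {
    "first_police_contact_before_16": 2,  # proxies stop-and-frisk exposure
    "friends_or_family_with_record": 2,   # proxies neighbourhood, not the crime
    "prior_convictions": 3,
    "currently_unemployed": 1,
}

def risk_band(answers):
    score = sum(weight for q, weight in QUESTIONS.items() if answers[q])
    if score >= 5:
        return "high"
    return "medium" if score >= 3 else "low"

# Two men convicted of the same offence, from different worlds:
bronx_defendant = {"first_police_contact_before_16": True,
                   "friends_or_family_with_record": True,
                   "prior_convictions": False,
                   "currently_unemployed": True}
suburb_defendant = {q: False for q in QUESTIONS}

print(risk_band(bronx_defendant), risk_band(suburb_defendant))  # → high low
```

Same crime, no prior convictions on either side – but the postcode-shaped questions alone push one defendant into the “high-risk” band.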
What you measure you become
Having spoken about the inappropriate use of proxies to judge the likelihood of an outcome – such as whether your friends/family have a criminal record being used in part to determine your sentence length – I want to explore one other example Cathy gives, of proxies damaging the university education of all members of society.
Back in the 1980s, a second-tier news provider, “U.S. News”, decided to run a new feature – university rankings. Their journalists, who we can probably all agree should not be defining the shape of national education, defined 15 proxy metrics for measuring the quality of US universities. They then took these, mashed them together through an algorithm they’d designed on the back of a fag packet, and output a national ranking.
The ranking became extremely popular, and became living proof (albeit a twisted and unfortunate one) of one of my maxims – “what you measure, you become”.
Now, one might argue that this could only have driven up the quality of the educational offer. Let’s explore why this hasn’t been the case in the US.
First of all it’s important to acknowledge just how crucial these rankings became – slip in the rankings and suddenly universities would find themselves in a vicious spiral – a lower ranking meant lower quality of applicants, lower pull of high quality professors and less money donated by alumni to their alma mater. Over the coming years their ranking would be in free-fall. The opposite effect could start a virtuous cycle to the elite. And so Universities poured extraordinary quantities of cash into upping their ranking scores – focussing on optimising their offering to satisfy the ranking model – a model, just to recap, not informed by academics or even students themselves, but by a few journalists in a second-tier news publication – designed to sell papers.
The first point to stress about the pure madness of this ranking system is that proxies are not necessarily directly related to the outcome they intend to predict. To give but one example – one of the fifteen proxies used for the U.S. News ranking system was the “SAT scores of incoming students” – a higher average SAT score among incoming applicants was believed by the journalists to reflect a higher quality of candidates – and thus bumped up your ranking. The problem with this is of course two-fold – firstly, those at poor state schools may be very academically talented but score relatively poorly in the SAT compared to those in expensive private schools – use of this proxy motivates universities to bump up the minimum SAT score requirement, sacrificing diversity and the opportunity for very bright individuals from less well-off backgrounds to attend their universities. Secondly, it motivated some absolutely crazy responses – Baylor University paid for all of their incoming students (after they’d been accepted!) to re-sit their SAT, to see if they could get a better score – imagine the sheer cost of administering this, for absolutely zero educational benefit, simply to boost the ranking score – to feed an algorithm.
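A sketch of how such a proxy-mashing ranking behaves (weights and universities invented for illustration): bumping a single proxy – average SAT – raises the score without any change whatsoever to teaching quality:

```python
# A ranking score as a weighted mash of proxies (all numbers invented).
WEIGHTS = {"avg_sat": 0.4, "graduation_rate": 0.3, "alumni_giving": 0.3}

def ranking_score(university):
    """Weighted sum of 0-1 proxy metrics - note that cost appears nowhere."""
    return sum(WEIGHTS[m] * university[m] for m in WEIGHTS)

before = {"avg_sat": 0.60, "graduation_rate": 0.80, "alumni_giving": 0.50}
# Spend heavily on SAT re-sits & raised SAT floors: one proxy moves,
# the education on offer doesn't.
after = dict(before, avg_sat=0.75)

print(round(ranking_score(before), 2), round(ranking_score(after), 2))  # → 0.63 0.69
```

The optimisation target is the weighted sum, so money flows to whichever proxy is cheapest to move – and the students footing the bill aren’t in the formula at all.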
And who foots this cost, who bears the harm of this algorithm entirely unknowingly? With US university costs going up by 500% from 1985-2013, it’s obvious – the students do. And again this leads to the poor disproportionately missing out on the opportunity of a decent education, put off by the promise of extreme debt that is often simply funding universities’ desire to satisfy an algorithm. Is cost one of the 15 metrics by which universities are ranked, you ask? Of course not.
As a side note (not from the book) – this ranking system might indirectly lead to the end of tens or hundreds of universities over the next year or so, hastened by the COVID pandemic. The current edtech (education tech – such as massive open online courses – ‘MOOCs’) revolution means that students can now get a world-class education from the comfort of their own home for a tiny fraction of the price – and with social distancing rules, universities can no longer offer the face-to-face added value you paid over the odds for previously. These “MOOCs” are undercutting, by a frankly gargantuan amount, a university system which has optimised itself to satisfy proxies that are ill-related to educational quality whilst utterly ignoring cost. It seems likely that many of them, with the exception of those who have a strong brand or unique offering, might struggle to survive.
So what would have been a better way to do this? Well, the Obama administration did try to create a rejigged ranking system, but the pushback was fierce – these universities had spent years trying to orient around these metrics, after all. So instead the US education department simply released a whole load of data on each university online, so students could ask the data the questions that mattered to them – things like class size, employment rate post graduation, average debt held by graduating classes. It’s transparent, controlled by the user, and personal – Cathy labels it as ‘the opposite of a WMD’.
Developing an algorithm checklist
Hopefully, these cases have illustrated how mathematics – whether dressed up as an ‘algorithm’, a ‘model’, ‘machine learning’ or ‘AI’ – has the potential to be harmful & opaque, at scale. However, there is no doubt in my mind that such models/algorithms can also be very useful and will only increase in their use – just check out my podcast on AI & book review of ‘AI Superpowers’ for some of my thoughts on the positive uses of mathematics.
So how can we build great models, whilst being cognisant of the traps that we might fall into? I’ve tried to make my own little 9 point checklist for doing so (see overleaf, printed on a separate page so you can print it and pin it on your wall if that’s your thing) and in the next section we’ll road-test it with an example.
RAF case study – Machine Learning in the promotions board
Finally, let’s work through an example of putting these principles into action – taking the use of machine learning (ML) in promotions, an idea that has been explored by some large organisations. I stress that this is absolutely NOT a criticism of this approach to promotions – I’m actually beyond delighted that we’re exploring the use of cutting edge technology to aid in HR processes – nor do I know the ins and outs of what the results of these explorations were – I will just explore a hypothetical example of what might happen. Let’s use the checklist.
1a. Why is the algorithm necessary?
Well, RAF promotions are decided by a board, based upon individuals’ scores on their yearly appraisal, and two portions of free-text written feedback by their boss and boss’s boss. Reading the many thousands of reports is very time consuming, so if an ML algorithm can analyse the reports and sort them into a rough ranking, the process might be more time-efficient (saving £’s), but more importantly it will give the humans on the board more time to evaluate the borderline cases (improving promotion decisions).
1b. What are the assumptions that the algorithm relies upon (document proxies and the confidence you have in their relationship to your ultimate goal)?
The algorithm does of course rely on some assumptions, two of which I’m going to take time to point out:
- Firstly, that sentiment analysis (the likely predominant means of ranking these reports) of the written report is an accurate proxy variable in predicting how appropriate a person is for promotion. This may be a stretch… as simply having a more verbose boss who is more effusive in their description of you could increase your chance of promotion, based upon the principles of sentiment analysis.
- Secondly, that training the algorithm on the past 4 years of reports would predict the type of people we want to promote next year. This of course could be completely untrue too – what if we decided that this year, we wanted more technical people taking leadership roles as we transformed to a more technologically capable air force. Of course we’d need to build in a corrective aspect to the algorithm that sought technical competence.
To prove these points, let’s run an example. In the RAF, we are looking to promote those with a technical background, as we seek to become more digitally enabled. Unfortunately, though a human can tell that candidate A might have the right skills to be eligible for promotion by reading this in her report: “turned around a desperately struggling IT department and kicked off cloud migration” – an online sentiment analyser gets it completely wrong – giving a 60% confidence that the sentiment of this sentence is negative (see below screenshot). Candidate B might be an average IT worker who is rugby-obsessed, and so his report contained “did a phenomenal job of leading an exceedingly successful local rugby team” – the sentiment analyser believing there’s a 94% chance this is positive (see screenshot). So our sentiment analyser flunks the candidate who’s transformed an IT department, and promotes the rugby-nut up the pecking order. Of course, the way the algorithm was implemented would have to be done in a way which guarded against this.
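A deliberately naive keyword-based scorer (my toy, not the analyser in the screenshots) reproduces the same failure mode – it rewards gushing adjectives and punishes negative-sounding words, with no grasp of what was actually achieved:

```python
# Score = (count of gushing words) - (count of negative-sounding words).
# Word lists are illustrative; real sentiment models are subtler but can
# still fall into exactly this trap.
POSITIVE = {"phenomenal", "successful", "exceptional", "superb", "excellent"}
NEGATIVE = {"struggling", "desperately", "failed", "poor"}

def naive_sentiment(report):
    words = report.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

candidate_a = ("turned around a desperately struggling IT department "
               "and kicked off cloud migration")
candidate_b = ("did a phenomenal job of leading an exceedingly "
               "successful local rugby team")

print(naive_sentiment(candidate_a), naive_sentiment(candidate_b))  # → -2 2
```

The words describing the hardest, most valuable achievement (“desperately struggling”) are exactly the ones that sink candidate A’s score.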
1c. Are there any features that will perpetuate existing biases? (illustrate thoroughness by explaining the results of a variety of example user-interactions with the algorithm)
There are two obvious concerns that spring to mind. Firstly, we already know that the RAF higher ranks are predominantly white and male – and many would argue that this lack of diversity harms us. One would like to hope that there is no bias against either in promotion boards, but we would have to be conscious that if there was, an algorithm might be trained to adopt this bias too. Even if no human biases toward white males existed in the training set, just consider the raw numbers of representation – it’s not difficult to imagine the situation where the algorithm notices that only 5% of the reports it reads containing the words “womens team for [x]” get promoted, whereas 40% of those with the term “mens team for [x]” get promoted. Unless corrected for, the algorithm might now incorrectly take this as a causal relationship (i.e. that gender determines promotion probability), instead of simply a reflection of the male-heavy gender split within the RAF. To address this, the reports would need any gender-indicating words (i.e. womens/mens, she/he, female/male) replaced with a non-gender-specific word.
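That redaction step might look something like this (the word list is illustrative, not exhaustive – a real pipeline would need a much fuller lexicon and care with context):

```python
import re

# Gender-indicating words and neutral stand-ins (illustrative, not exhaustive).
GENDER_NEUTRAL = {
    "she": "they", "he": "they",
    "her": "their", "his": "their",
    "female": "person", "male": "person",
    "womens": "service", "mens": "service",
}

def redact_gender(text):
    """Swap gender-indicating words for neutral tokens before the model sees them."""
    pattern = r"\b(" + "|".join(GENDER_NEUTRAL) + r")\b"
    return re.sub(pattern,
                  lambda m: GENDER_NEUTRAL[m.group(0).lower()],
                  text, flags=re.IGNORECASE)

print(redact_gender("She captained the womens team for hockey"))
# → they captained the service team for hockey
```

Redaction is only a first line of defence, though – a model can still infer gender from other correlated phrases, which is why the validation on edge cases described below matters.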
The second concern that springs to mind is that if the algorithm is trained to favour those with glowing and strongly positive reports (a feature of sentiment analysis algorithms, which might well provide a large part of the score), it might bias against those whose report is written by a boss whose first language isn’t English, or whose boss has a tendency to be understated. Imagine the Gurkha regiment, for example. A native Nepalese Gurkha officer writes an understated report for a bright young female officer, which is even less glowing since the Gurkha superior has a limited arsenal of complimentary words (i.e. exceptional, superb etc.) – the young officer barely gets a look-in on the promotions pile despite high performance, simply because the algorithm biases against those whose bosses are non-native English speakers and less verbose.
As per the wording in 1c, a key process to guard against these biases would be to validate the algorithm on past data, to see how it performs – critically looking at its performance on edge cases – how the algorithm scored the female Gurkha officer above, or a black female engineer from a tough upbringing compared to a similar-performing white male from Eton in the Rifles, amongst many other examples. Any discrepancies should be addressed at this point in the algorithm’s design, rather than when it’s been let loose on the career prospects of these people in real life.
1d. How will a feedback loop be created to correct for mistakes made by the algorithm, or the requirement to shift to a new decision paradigm?
Needless to say, regular human reviews of the algorithm’s performance, particularly on edge cases & appeals, will be required. This would allow dispassionate recalibration of the algorithm should it be found to be selecting against diversity (one reviewer [PhD in AI] noted on a draft: “Diversity in data is key! More data doesn’t mean better algorithms, more diverse data does.”), or in response to changes in people strategy (i.e. “we want a greater representation of people with technical skills to be promoted”). The strategy would also help identify those who were gaming the algorithm (1e: Could you game the algorithm, and if so, how can this be prevented?), so those loopholes could be closed (if that is feasible – one reviewer noted that those writing performance reports can already game a non-AI promotion system by writing overly-positive reports for underperformers – the algorithm cannot be held to impossible standards).
2a. Is there an easily-navigable appeals process for judgements made by the algorithm?
2b. In responses to appeals, is the algorithm’s decision process revealed, & can it realistically be understood by someone with GCSE-level maths?
Though a requirement for transparency seems obviously essential, it’s oft-neglected in the name of maintaining ‘proprietary’ algorithms – protecting profits at the cost of causing harm. Ensuring that the bright young female Gurkha officer can appeal after finding that her less able white male peer from the Rifles was promoted is crucial.
The appeals process will ideally need to be automated – providing a quick & transparent account of how the process had ranked her, including the contribution of the algorithm, at a level that doesn’t gloss over the details nor baffle the recipient (i.e. understandable to someone with GCSE maths). This sort of system HAS to be developed before release, because it will be challenged – and any poorly handled case will be jumped all over by the press – and rightfully so. If the challenge is maintained after a transparent response, a human HR specialist will need to explore the case manually at pace. Not only will this be critical to maintaining confidence in the system and revisiting cases where the algorithm’s contribution may have been wrong, but it will also be critical to closing the feedback loop – if the algorithm made a mistake, it should be altered to avoid repeating it.
3a. Does the implementation plan incorporate a representative pilot study before roll-out, which compares the algorithm with humans and robustly demonstrates fairness & non-inferiority?
In terms of scale, I’m not going to delve into pilot design in this article (we’re already at 10 pages…!) but suffice to say the process should be tested prospectively, with sufficient statistical power to confidently assess whether it is (or isn’t) better than human grading – with a critical focus on edge cases to ensure avoidance of bias.
3b. Who will own the algorithm and how will its use be limited for other means?
A ‘custodian’ of the algorithm would be required to ensure the algorithm’s use was only rarely (if ever) granted to other applications, to avoid ‘use creep’. Just as credit scores have been inappropriately used as a proxy for reliability in job application processes, the temptation would come to use the promotions algorithm in other decisions. Imagine if the RAF started paying employees like Google does – paying high performers significantly more – there might be some temptation to use individuals’ promotions algorithm scores for this ranking. The problem is, you’re then using the proxy of promotion suitability as a proxy for pay-rise eligibility. A proxy as a proxy. This would, of course, be inappropriate – imagine the world-class data engineer with no leadership qualities. Top data engineers are often paid well into six figures in the private sector on account of their incredible importance in enabling data-driven decisions – compromising her pay simply because an algorithm assessing suitability for promotion says so would be a poor decision, and would likely lead to the loss of her from the military.
I close this section by reasserting that I don’t claim to know whether using ML as part of the RAF promotion board is a good thing – simply that, based on my reading this book, the framework outlined above might be a good way of thinking about implementing it.
Hopefully you’ve enjoyed this longer-than-I-expected canter through the oft-neglected dark side of algorithms. We’ve covered the key features of WMDs – that they cause harm at scale, whilst avoiding transparency – though they’re often designed with the best intentions. Algorithms (particularly Machine Learning ones!) are sexy and seem almost reassuringly mathematical/scientific, and they doubtless will have an increasing role to play in automating boring aspects of our lives as well as creating value in a whole wealth of different ways – my key takeaway is just to think twice about the unintended consequences they can have and ensure to add safeguards against this – hopefully the framework above provides a good starting point for doing so.