SPEAKER: So welcome to the second in our series of talks on ethical and social issues in AI. Our speaker today is Jon Kleinberg. For those of you who don't know, and I suspect many of you do, Jon is a multiply-awarded computer science professor.
He is a member of the National Academy of Engineering, the National Academy of Sciences, and the American Academy of Arts and Sciences. He is a recipient of the MacArthur Genius Award. He has umpteen Best Paper Awards in practically every conference in computer science, it feels like. And best of all, for you guys, he gives wonderful talks, as you're about to see.
He's been looking recently with collaborators at issues of fairness in machine learning. And rather than taking up more of his time, I'll introduce Jon.
JON KLEINBERG: Thank you very much.
SPEAKER: Sorry. I meant to say, we have a sign-up sheet just in case you want to get on the mailing list to find out about our next six talks. So I'll start one at this end and one at that end.
JON KLEINBERG: OK. Well, thanks very much. Thanks very much for coming out at this late hour to this series. And thanks so much to Joe and Bard for the idea of putting together this series of lectures. It's really an exciting idea and exactly the right moment in time, I think, to be having people come and talk about these issues.
And one thing I think hopefully you might see, if you come to several of them, is that, really, it's useful to have different perspectives get brought to bear on this. And I think you'll see a number of different takes on these questions over time. I guess thanks also to CIS for putting this together, to Workday for its generous gift sponsoring it.
So I want to talk about-- in this range of topics around the new issues that we're facing with the rise of extremely high-performance artificial intelligence, some of the kinds of questions that arise around the fairness of the algorithms that are being created.
Now, the question, of course, is, what do I mean by fairness? But that's, in a sense, what almost the entire talk is going to be about: what do we actually mean by fairness? And I'll talk about some cases where it wasn't entirely clear what we meant by fairness going in, and what the world meant by fairness when they engaged in questions about it. And so much of this talk will be about how, in trying to form careful definitions of what was going on, we were able to get some insight into the basic questions.
So most of the talk, I'll be describing some joint work with Sendhil Mullainathan and Manish Raghavan. Sendhil is an economics professor at Harvard, who was actually an undergrad here in the class of 1993. And Manish is a PhD student here at Cornell-- also, I guess, literally here in this room-- who's been working on a number of these issues. And then at the end, I'll talk about some other issues that this links back to, and into some joint work I've done with my PhD student, Maithra Raghu.
OK. So in order to talk about, well, what do we mean by fairness, let's set the stage with, what are the kinds of problems that we're thinking about algorithms solving here? And the basic kind of problem that I'm thinking about is cases where an algorithm is trying to classify somebody based on personal data that it has about them.
Now, this classification can take lots of forms, but we're going to think about questions where the algorithm is trying to form some estimate of the probability that you'll do something in the future. What might you possibly do in the future? Well, it could be, for example, whether you're going to enjoy a movie that it recommends to you on Netflix.
So this is a screen from Netflix. This is actually an image from the Netflix tech blog, where they were talking about some enhancement to their recommendation system. And they're sort of bragging that with even better features about your past profile, your past viewing history, what you've rated, it can give even more accurate predictions about whether you will like the movie The Next Three Days.
And their guess for Xavier, who's an engineer at Netflix, is that Xavier is going to give this 4.2 stars out of 5. So it's making a prediction about the probability that Xavier will do something roughly two hours from now-- namely, say that he liked this movie.
OK. So this has the kind of ingredients that we want, namely there's a person, a living, breathing person, behind this doing things. But the algorithm doesn't, of course, have access to the living, breathing person. It has access to some digital trace of them, which it's going to turn into some kind of feature set, a feature vector, about them.
Based on that, it's going to run some algorithm, and it's going to decide, are they in the positive category or the negative category? They're going to like the movie, or they're not going to like the movie. And it's really, actually, going to put a probability on this. It believes that there is an 85% chance that they are in the positive class. So: person, feature vector, the data it has, positive class, negative class, some probabilities.
Now, what's interesting-- and this is a point my colleague Sendhil likes to make-- is that a huge amount of work is going into recommending movies for you on Netflix. But there's a sort of enormous arbitrage opportunity, because there are a whole bunch of decisions being made in the world that are arguably much more consequential, where data science has so far been arguably less directly brought to bear-- a huge amount of quantitative work, but less exactly in this paradigm.
For example, applying for a job. This too has the kind of features that we had in mind. Here's a resume. So here is-- there is a living, breathing person behind this resume. But this is some kind of set of summary statistics that they're presenting to an employer.
And the parallels are kind of similar. On Netflix, you didn't actually create a resume, but think of your viewing history as one: what you liked and didn't like, what time of day you watch things, and all sorts of data you might not even realize you're giving to Netflix. Here, you assembled it manually. You created a resume that said, these are the features that are relevant to me.
The person evaluating this job application is trying to estimate a probability. They're trying to say, will you be a successful employee at this company, to some first approximation? They may be estimating lots of other things, but that's certainly part of their calculation. OK?
Certainly this is not part of a giant machine learning pipeline in nearly as automated a way as the Netflix situation. Certainly people are bringing, as I said, quantitative tools to bear, but it's not being dropped into some gigantic machine learning system that spits out a probability estimate, typically. But this is certainly a growth area, where we're seeing lots of people talking about, can we automate these sorts of recruiting decisions?
If you're here at Cornell, you may have filled out something like the Common Application for college-- if you're a student here. And this too has some of this flavor. You create some features about yourself that get transmitted to the university, and they're engaging in a range of things. I'll come back to the range of things they might be thinking about. But one thing they might be thinking about is, what is the chance you're going to succeed in the rigorous academic program we have here at Cornell?
OK. So maybe they'll think about other things also, and I will come back to this one. OK. So that's my premise, that many problems, where we're bringing machine learning and artificial intelligence techniques to bear, involve estimating the probability of some person's future outcome about something. I'll take this tiny hand and slide it to somewhere less distracting.
So we've seen three examples-- online content, are you going to like this movie, are you going to like this ad that we show you; hiring; education. I could mention many others. When you fill out a credit card application, they're very deliberately estimating, what is the probability that you're not going to default on this loan that we're effectively making? A medical test: given a feature vector, what is the probability that you're going to exhibit this condition in the future?
So Sendhil, Manish, and I were thinking about this issue, maybe, generally, but also, specifically, on yet another case where we're beginning to see some of these ideas showing up. And that's in the criminal justice system. So at the moment, if you look over the course of a year in the United States, several million people will be arrested and go through a pretrial hearing, where the law instructs the judge to try, in assigning bail, to estimate: what is the chance that this person will return for their court appearance without committing a crime in the meantime?
And if they believe that's a very high probability, then they should assign bail. And they're making that decision as a prediction. They're not evaluating the guilt or innocence of this person. That's going to come later at some eventual trial, if one happens. They're simply saying, what is the chance they'll come back for their court appearance without committing a crime? They're being asked to make a prediction.
And similarly, when judges engage in things like sentencing, like parole, those are arguably more complex decisions, as specified under the law, than bail. But all of them involve some component of prediction. What is the chance that this person will commit another crime in the future?
And the judge has at their disposal a set of features. So they have the prior arrest history of this person. At what age did they commit crimes in the past? A number of other things. And so it's reasonable to ask the question, are judges predicting as well as could be hoped for? Or are there quantitative tools that might help them?
OK. Now, as we start to get into, maybe, these more and more consequential decisions, where people are estimating probabilities about a future outcome, we begin wondering, how should we think about contrasts in these probability estimates? Maybe systematic contrasts across different groups, potentially groups defined by race, by gender, ethnicity, socioeconomic status, and other things. In my hiring application, is it making probability estimates for male applicants and female applicants, for example, that are consistent with the data that it has? Or are somehow biases coming into its decision-making?
Now, the interesting question here is-- you could ask, how could biases be coming into its decision-making? It's an algorithm that's being given data. But the point is, in all of these domains, this data reflects facts about the world that have resulted in the construction of this data.
So if I'm looking at-- say in the case of criminal justice, because that's what we'll be talking about, if I'm looking at past history of arrests, then I'm learning something about whether this person grew up in an over-policed area, where people were being arrested for different levels of crimes. That may not be uniform across even the jurisdiction that you're looking at. So there are all sorts of ways in which even the features that we have at our disposal might be reflecting biases upstream in the way they were constructed.
And so it's reasonable to ask, are these going to be reflected in the way probability estimates get formed? And we could think about this on several levels. We can think about it at the level of individuals. We can think about it at the level of groups.
So we were thinking about these questions. And some of these questions are hard to get at, because the data you would need to ask them is residing in different places. There are the decisions that the decision-maker is making. There's where the features came from. There's what happened downstream of the decision. Was it a good decision or a bad decision?
And really, in order to actually evaluate these kinds of questions, you would need to assemble all of these things. You need to bring them all together so that people can look at it all at once and make some kind of evaluation.
People were worried about these kinds of things; there was a lot of discussion. But it has been hard to get your hands on a really concrete instance where someone could say, this is what's going wrong. And then at the beginning of last summer, May of 2016, the investigative journalism organization ProPublica released this very thought-provoking report on the internet, entitled Machine Bias, which basically was a very careful look at a particular quantitative risk tool being used in the criminal justice system. And they made the argument that this tool was exhibiting racial bias.
In some sense, we were already thinking about quantitative tools in the criminal justice system and what the risks were. And so this was a perfect case study, and we wanted to delve into it and understand what's going on. And so what I want to tell you about in this first part of the talk is how we delved into it and how we tried thinking about it-- and, in some sense, the power of trying to come up with the right definitions to understand what the controversy was actually about: what was being claimed, what the rebuttals were, and how the whole thing could be resolved.
And in the process, we'll just-- we'll fall back on these same ingredients, that there are people here, in this case, defendants in a criminal procedure. A feature vector is going to be produced, in this case, their past history of arrest, for example. In the end, they either are in the positive group or the negative group. As in, when they're released, they either go on to commit another crime, or they don't.
And we'd like to understand whether we're estimating the probabilities of these things fairly. OK. The defendants come in with data, get released. We see, did they commit a crime?
OK. So there was this risk tool called COMPAS, which was built under a sort of proprietary, closed algorithm by a company called Northpointe. And what ProPublica had done was collect a lot of data from jurisdictions where it was used, and they found the following two things. And this was their argument that there was bias in the way these probability estimates were being formed. So these are the two facts under discussion. Then we'll go and try to unpack these facts to figure out what's going on.
So defendants come in. They get assigned a probability of future crime. This is the algorithm's best guess. But then mostly people do go out into the world. And we actually see, well, did they go on within some observation window and actually commit a future crime, or at least get arrested for a future crime?
OK. So what they found was that African-American defendants who didn't subsequently reoffend had higher average scores-- a high score means a high probability of crime-- than white defendants who also didn't reoffend. And on the flip side, white defendants who did reoffend, who did go out and commit a future crime, had lower average scores at the time the score was computed than African-American defendants who went on to reoffend.
This seems like, by its very nature, unfairness, because this first bullet is a whole bunch of people, none of whom did anything wrong in the future. They got released. None of them did anything wrong. And despite that, the African-American defendants had a higher average score than the white defendants. And the flip side held for the people who went out and did do something wrong.
OK. So this was the claim. They did a bunch of things here that made it easy to have a community-wide discussion about this. The data that they had, they posted on GitHub, along with the code that they had used. A bunch of people, including Manish, went and grabbed the data and reanalyzed what they had done. And you could see that this effect really does stand out, that this is actually what's going on.
So there was a huge amount of discussion about this. Various court cases, including a Wisconsin Supreme Court case that summer, invoked this report, saying, we have to be careful about judges using quantitative tools if they have this kind of bias in them. And again, no one was arguing that this company had set out to engage in bias. It's not that they created this tool in order to reach biased outcomes. It's that the data, by its nature, somehow reflected bias that was in the world, and this tool was somehow reflecting it back into the system.
Although, in truth, because it was a proprietary algorithm, no one had an exact mechanism for what was happening. But that was the mental model. OK. So obviously, the makers of the tool had something to say about this, because they were the subject of a huge amount of criticism.
And actually, a whole bunch of people who worked in the area that constructs these tools, which is essentially an actuarial kind of area, estimating probabilities of future events as you would do, say, in the insurance industry, a lot of them came back and said, well, you're not thinking about this the way we thought about this when we constructed it. When we constructed it, the gold standard for such a tool is that the scores should be well calibrated. So let me tell you what well calibrated means.
OK. So well calibrated means I'm going to take these people and map them to features. So everyone is going to have some feature vector of data. And then, as a function of that feature vector, I'm going to put them in a bin corresponding to a certain level of risk. So maybe my risk tool has four levels of risk: you can be a 10% risk, 20% risk, 40%, or 60%. So we're going to put people in these bins. The red dots are the ones who subsequently go on to reoffend, and the white circles are the ones who subsequently don't.
And it's well calibrated if I look inside each bin, say the 0.1 bin. Well, what should 10% mean? A 10% probability means there's a 10% chance this person goes on to reoffend. So if I look at 10 of these people, one of them should go on to reoffend in expectation. If I look at 100, 10 of them should go on to reoffend. So in here, 10% of all the people should, in fact, go on to reoffend. And in this bin, 20% of them should, and 40% here, and 60% here.
So in general, if I look at everyone who's assigned a score of s for some number, an s fraction should indeed go on and commit a crime in the future. That's what well calibrated should mean. In other words, 20% means 20%.
And moreover, when you looked at the data, and you looked at how they constructed this tool, this had been the gold standard that they were aiming for. And so it was actually constructed so that this was true for groups of different races. So if you look at all African-American defendants assigned, say, 20%, then 20% of them went on to reoffend. And similarly for white defendants assigned a score of 20%.
And that was the sense in which, the makers of the COMPAS risk tool argued, this tool was race blind: a score of s meant the same thing for all the different groups under consideration. 20% of the people labeled 20% went on to commit a crime, and that was true regardless of whether they were white defendants, African-American defendants, and so forth.
OK. So calibration, that was the gold standard. And I should mention, calibration is something that you see when you're trying to analyze these predictive tools across a lot of domains. When you go to build a probability estimate of some future event, it's good to check that your estimates are calibrated.
The pedagogical example for this is weather forecasters. If you're trying to evaluate a weather forecaster, you should look at all the days in which they said there was a 20% chance of rain. Say, look at the past 100 of them and ask yourself, on 20 of those days, did it rain the next day? Because that would mean that their forecast was actually calibrated, that 20% actually means 20%.
This is something I can't evaluate on a one-shot basis. If you tell me it's 20% once, well, it's either going to rain or it's not. But if you tell me it's 20% on many, many, many days, I can start to eventually ask, are your numbers calibrated?
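The calibration check just described can be sketched in a few lines of Python. The forecast history here is invented purely for illustration:

```python
# Group past forecasts by their stated probability and compare each
# stated probability to the observed frequency of the event. For a
# calibrated forecaster, the two should (roughly) agree.
from collections import defaultdict

def calibration_report(forecasts):
    """forecasts: list of (stated_probability, event_happened) pairs."""
    by_score = defaultdict(list)
    for prob, happened in forecasts:
        by_score[prob].append(happened)
    return {prob: sum(outcomes) / len(outcomes)
            for prob, outcomes in by_score.items()}

# Ten days at "20% chance of rain"; it rained on exactly 2 of them.
history = [(0.2, True)] * 2 + [(0.2, False)] * 8
print(calibration_report(history))  # {0.2: 0.2} -- 20% really meant 20%
```

With real forecast logs you would of course tolerate sampling noise rather than demand exact equality.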
OK. The funny thing was, we had calibration, but somehow ProPublica's objection still stood. So first of all, is calibration a reasonable definition? Well, you'd certainly want calibration. And let me just give an example of what would happen if you didn't have calibration. Suppose we used an uncalibrated rule. Let me take a different example. Let me go back to hiring.
OK. So here's my toy example of what could go wrong with an uncalibrated rule. Suppose, for example, animal hospitals are hiring veterinarians using a score. The score is the probability this person will end up being a good veterinarian. And the way they do that is they create these bins, and they just hire all the people who are in the highest bin, the ones who have the highest probability of being a good veterinarian.
Now, suppose we don't have calibration with respect to, say, gender. So suppose that female veterinarians assigned a score of 60% are actually more likely to be good vets than male veterinarians assigned 60%. Then 60% does not mean the same thing for the two genders. That's what it means for the rule to be uncalibrated.
Well then, in fact, if you're a client with a sick animal bringing it to the veterinarian, it would be rational for you to actually choose your veterinarian based on their gender, because that's what the rule says. The rule says, of the people they hired, the female vets are actually more likely to be good vets than the male vets, because they used an uncalibrated rule.
And in fact, in some sense, the only way to get around that would be to use a calibrated rule-- one which says, the vets we hire are 60% likely to be outstanding vets, and that's true regardless of gender. So don't bother trying to use gender when you choose your vet. We've already taken care of that [INAUDIBLE]. That's the sense in which calibration is something you would like when you're assigning probabilities, because otherwise, stuff like this happens: you're actually incentivizing your audience to start using characteristics that you'd rather they not use.
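To see the incentive problem concretely, here is a small made-up population: a nominal "60%" top bin whose true success rates differ by gender (62% vs. 58%). Nothing here is real hiring data; the numbers are chosen only so that the bin's overall label looks right while the within-group rates diverge.

```python
# A toy version of the uncalibrated-vet scenario: the top bin averages
# out to its 60% label, but the label means different things for the
# two gender groups -- so a client gains by conditioning on gender.
def success_rate(hired):
    return sum(1 for v in hired if v["good"]) / len(hired)

top_bin = (
    [{"gender": "F", "good": g} for g in [True] * 62 + [False] * 38] +
    [{"gender": "M", "good": g} for g in [True] * 58 + [False] * 42]
)
women = [v for v in top_bin if v["gender"] == "F"]
men = [v for v in top_bin if v["gender"] == "M"]

print(success_rate(top_bin))  # 0.60 overall: the bin's label looks fine
print(success_rate(women))    # 0.62
print(success_rate(men))      # 0.58 -- choosing by gender is now rational
```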
OK. So, good. We had calibration. But that didn't erase ProPublica's objections. ProPublica wasn't complaining about a lack of calibration. They were complaining that, among a whole bunch of people who did nothing wrong in the future, one group got assigned high risk scores, and one group got assigned low risk scores. So the fact that it's calibrated sounds good, but how could it be that calibration, which sounds like what we want, didn't make that other problem go away? That's kind of weird.
So this is where we started thinking, why can't we just have it all? Let's keep calibration. We'd like that. But let's try to see, could you fix this problem where all these people who didn't go on to reoffend had different average scores, depending on which group they belonged to? Can we patch that part and preserve calibration?
OK. So in order to figure out if we could do that, we said, OK, let's try to formalize what's going on. So I'll try to make my terminology just a bit more precise. Although it's basically what we've been talking about.
So I think, here's what we've been talking about so far. So I have an individual, like a living, breathing individual, a Med school graduate who I'm trying to hire as a doctor, a defendant standing in court, whatever. And the two salient characteristics about that person are one, they're either going to turn out to be a positive or a negative instance. They will either be a good doctor or not. They either will reoffend or not. I'm trying to put a probability on that.
And they also belong to group A or group B. So I'm concerned about discrepancies based on gender, based on race or ethnicity, or some characteristics like that. As part of my machine learning pipeline, I turn that person into a feature vector, which is what the algorithm is going to use. That's all happening inside some box that we're not going to open up, for purposes of this lecture.
Instead, we're going to think about what happens when the feature vector comes out. I'd like to assign it into one of these bins. And it's up to the risk tool how it does that, and also, what the scores on these bins are.
So you might have four bins labeled 10%, 20%, 40%, 60%, but you could do whatever you want. So bin b will have some score, v sub b, which is just the probability estimate I assign: everyone who lands in that bin is going to have that as their probability.
OK. So I think I've covered all this. They're positive or negative. They belong to group A or B. They have a feature vector. And the risk score is just any function that maps those vectors into the bins, each bin has a score. All right. And now, based on that, we can pretty easily say what our properties are. And this is, again, just going back carefully through the terminology.
So calibration within groups means for each group, A or B, and for each bin b, the fraction of people in bin b who are positive, or the expected fraction anyway, is v sub b. That's a sense in which 20% means 20% for each group. That's calibration.
And then there were these other things that ProPublica highlighted. And we'll call those balance for the positive class and negative class. So balance for the positive class means that the average score of the positive members of group A, the ones who were going to go on to reoffend for example, should equal the average score of the positive members in group B.
And similarly for the negative class: the average score of the negative members of group A should equal the average score of the negative members of group B. This is the one that says, all the people in this third bullet didn't go on to do anything wrong, so they should get the same average score. It shouldn't be that one group is getting judged as higher risk when none of them actually did anything wrong.
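To make the three definitions concrete, here is a minimal sketch in Python. The groups, bins, and counts are invented for illustration, not taken from the COMPAS data; they're chosen so that the score is calibrated within both groups and yet the negative-class balance condition fails anyway.

```python
from collections import defaultdict

def calibration_within_groups(people):
    """Fraction of positives in each (group, score) bin; calibration holds
    when this fraction equals the bin's score for every group."""
    bins = defaultdict(list)
    for group, positive, score in people:
        bins[(group, score)].append(positive)
    return {key: sum(v) / len(v) for key, v in bins.items()}

def balance(people, positive_class=True):
    """Average score of the positive (or negative) members of each group;
    balance holds when these averages agree across groups."""
    scores = defaultdict(list)
    for group, positive, score in people:
        if positive == positive_class:
            scores[group].append(score)
    return {g: sum(s) / len(s) for g, s in scores.items()}

def make(group, score, positives, negatives):
    # One bin's worth of hypothetical people: (group, reoffended?, score).
    return ([(group, True, score)] * positives +
            [(group, False, score)] * negatives)

# Every (group, bin) pair matches its label: 0.6 and 0.2 for both groups...
people = (make("A", 0.6, 12, 8) + make("A", 0.2, 2, 8) +
          make("B", 0.6, 6, 4) + make("B", 0.2, 4, 16))
print(calibration_within_groups(people))
# ...yet the non-reoffenders in group A average 0.40 while those in
# group B average 0.28: calibrated, but negative-class balance fails.
print(balance(people, positive_class=False))
```

The failure is driven entirely by the two groups having different base rates, which is exactly the tension the theorem later in the talk pins down.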
And what was interesting was-- actually, even by the time we got to here, we felt like we understood the problem slightly better, because somehow in the heat of all the news articles, blog posts, rebuttals, and so forth--
Is that like a we need to do something alarm? Or is that just a FYI kind of alarm?
What do you think?
Oh, yeah? OK.
All right. What's people's preference? I can keep talking with the beeping. Do that? All right. Let's do that.
OK. There really was a lot of energy invested in this, a lot of passion behind these arguments. We actually felt like we'd already reached some point here that was useful, because what we learned was that you could actually write down these definitions. And at the least, what everyone was asking for was different.
So ProPublica was objecting about these things; they wanted these, and these seemed like good things. And the people creating the risk tools were counter-arguing, well, we have this property, and that's a nice thing also. But in some sense, the two arguments were somewhat going past each other. These are two distinct properties. And at least we can now understand, we're talking about different kinds of guarantees.
So the next question then would be, all right, now we're back to where we were. Can we have all of them? They're all nice properties. Let's just throw out this risk tool and try to make a better risk tool, which has all of them. Wouldn't that be a nice idea? OK. So that's the plan. Yes?
AUDIENCE: I have a question about the [INAUDIBLE] example and the race example. So with the vet example, [INAUDIBLE] males that would be more likely to be a bad vet. And the race example of white person in the probably-won't-reoffend group is more likely to reoffend.
JON KLEINBERG: Good question. And this is a sense in which these definitions all resemble each other. In some sense, the vet example was a failure of calibration. And the ProPublica criminal justice score example was a failure of balance. So the vet one was, it turned out that the 60% bin had the property that maybe it was 60% over all people.
But 62% of the female vets were good doctors, and 58% of the male vets were good doctors. And so that averaged out to 60%, but it was actually different for the two groups. And that was a breakdown in calibration. ProPublica complained that, if I looked at the people who turned out to be good-- the people who didn't reoffend-- oh, it just got so quiet in here. All right.
I actually once taught in here from 1:25 to 2:15. And there was like a fan that was on some timer, very, very low hum which you didn't really notice. And then at exactly 2:00, it would always end. And that was always like a profound moment in the lecture, because you would say something, and then it would get really quiet. Anyway.
AUDIENCE: Somebody else called it in.
JON KLEINBERG: All right. Thank you.
So here, it's really-- the people who didn't reoffend had different averages. But it's exactly because those two things somehow look similar [INAUDIBLE]. Yeah, thanks. That's a good question.
So we started thinking, when could we have all these properties? Well, there are some cases when you can have all these properties. Here are two simple cases in which you could achieve all three of these properties, which you want.
One would be if I had perfect prediction. Suppose it were the case that I take a person and-- oh, wow. Really? It's back. Good. I boil the person down to a feature vector. And based on that feature vector, I am absolutely sure that they're in the negative class or the positive class.
This is something that won't really ever happen in real life, unless you have an unbelievably simple problem, problems at the level of, like, arithmetic or something. But I mean we're talking about things where there genuinely is uncertainty about the outcome. But still, we're trying to figure out when is this achievable. And when you're tracing a mathematical definition to its endpoint, you want to figure out, OK, well, when can I really do this?
If I had perfect prediction, then, based on the feature vector, I would assign a score of 0 to the people who-- where the feature vector said they're always negative class. I assign a probability of 1 to the ones where the feature vectors said they were always in the positive class. I would have calibration, because 0 means 0 and 1 means 1. And I'd actually have balance, because the people who are going to go on to reoffend are exactly the ones where I said 100%, and conversely.
OK. So I could achieve it in the case of perfect prediction-- not so interesting, because that's only if I have a really easy problem on my hands. A little more interesting: if I had equal base rates between the two groups. So say I looked at the two groups, A and B, and in each of them, globally, say 35% of the people were going to reoffend, in both groups A and B. Well then, I always at least have a way to create a risk score with all the properties we want. I just declare everybody a 0.35.
I create one giant bin. I put everyone in it, and I say 0.35. That's calibrated because for both groups, 35% of the ones in that giant bin are going to reoffend because that's the whole population. And there's also equal averages because I've given everyone 35%.
So granted, I could have done that by just looking it up in, like, a lookup table. I'm not adding value with that rule. But at least I'm giving you something which satisfies all three properties. OK. So at least we put down an anchor. We've said, all right, sometimes, you can actually achieve all three of these.
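As a sanity check, here's a tiny sketch (with invented numbers) of the equal-base-rates case just described: one giant bin labeled with the common base rate satisfies calibration and both balance conditions at once.

```python
def single_bin_rule(people):
    """Give everyone the overall base rate as their score -- the trivial
    one-giant-bin risk tool described above."""
    base_rate = sum(1 for _, positive in people if positive) / len(people)
    return [(group, positive, base_rate) for group, positive in people]

# Two groups, each with a 35% base rate.
people = ([("A", True)] * 35 + [("A", False)] * 65 +
          [("B", True)] * 35 + [("B", False)] * 65)
scored = single_bin_rule(people)
print(scored[0])  # ('A', True, 0.35): everyone lands in one 0.35 bin
```

Within each group, 35% of the people in the single bin reoffend, so the bin's label is calibrated; and since every score is 0.35, the positives and the negatives of both groups trivially have equal average scores.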
And then we thought, OK, can we work our way in from the edges. We have these two trivial cases in which you can do it. Can we now enrich the set of cases? And we tried that, and we had surprisingly limited success. And ultimately, we proved a theorem, which says, if you-- any time you can achieve all three of these properties, one of those two things happens. We must have either perfect prediction or equal base rates.
Either you can perfectly assign 0 and 1 to everybody, or you're in a situation where you can assign equal base-- the uniform base rate to everyone. And that's the answer. And in no other case can you get all three of these properties.
So this told us-- well, it told us, first of all, that this debate that was going on between the people concerned about bias being led by ProPublica and the people creating these risk tools, that what they were asking for could not all be simultaneously achieved. Because in the jurisdictions they were looking at, we didn't have equal base rates. And no one was going to get perfect prediction for this problem.
And so what everyone collectively wanted was unachievable. There was an-- as the title of the talk says-- an inherent trade-off in this problem. You had to choose, out of all of these satisfying definitions of fairness, which ones do you want to satisfy and which ones do you not want to satisfy, because you're not going to achieve all of them.
So a few comments on what does this mean. So again, we felt like just by trying to write down what everyone was asking for and then reasoning about it-- and not necessarily even going back to the data that they had, not trying our own experiments or analysis, but if I sort of reason about what they were all asking for-- you could actually end up at a point where you're like, all right, what they're arguing about is not something that's actually satisfiable.
So, a few notes. One is-- so when you prove things are impossible in computer science, sometimes you're talking about computational hardness. You're saying, there is no efficient algorithm that can compute that. That is not what I'm talking about here, because no matter how much computation you do, you just literally can't find a way to map feature vectors to bins with probabilities that gets this to happen.
It's actually also-- sometimes, when you talk about impossibility, you're like, we don't have enough data to make accurate-enough estimates of these probabilities to do this. That may happen in real situations. But we're up against a bigger problem, which is that even in the limit of arbitrarily much data, that is not my problem. My problem is really just-- it's impossible for me to assign numbers to get all of these quantities to balance out in a way that satisfies these definitions.
And as a result, it's really not exactly a theorem about algorithms. Any decision-making procedure that you want, the algorithms, or the human decision-makers who are currently doing this, are going to suffer-- are going to experience this trade-off. In some sense, it's, therefore, more an impossibility theorem in a style that we see both in computer science and also in the mathematical social sciences.
So in computer science, we see this in areas like distributed computing, for example, which writes down conditions that you would like your system to satisfy. And with enough conditions, you discover these are not all simultaneously achievable. In economics, we have famous examples like Arrow's theorem on voting, which basically says, for some simple properties that you would like your system of elections to satisfy, there is no election system that can actually satisfy those.
A lot of these are theorems that are quite deep, and some of these proofs are quite complicated. In a sense, I would say this is a smaller observation in all of those respects. But it really clarified for us what was going on in the debates around this particular thing.
OK. So I wanted to mention a few things. And actually, I want to spend a couple slides on-- yes? Question.
AUDIENCE: I have just a question about these properties. So it's not quite that obvious to me that these properties are really properties you definitely want to have. So in particular, it's not obvious to me that average, reasoning about the average, is the right thing to do. In the average, I'm just wondering if you've thought in that direction. [INAUDIBLE]
JON KLEINBERG: So, average. I see. I think if I spend too long on one slide, and there you go. Good to know. Yes, all of these are things you could question. I did a little work to try [INAUDIBLE] calibration is a good thing. Although, we could debate that.
Average. Yeah. I guess I would say the following. If I think about the probability estimate that I assign to somebody as some kind of level of harm that I've done to them-- if you're not going to do anything wrong, and I assign you a score of 30%, that does some level of harm to you. Because later when people go to evaluate you, they're like, well, this person is pretty risky. They're 30%, which is unfair because you're not going to do anything wrong.
Often, if we think about social welfare, I say, let me sum the harm over this whole group. So in place of average, think total. Then the total harm I've done to one group is a different impact than the-- so if I'm thinking like, I would like equal impact on these two groups, total is kind of-- but, yes. One could certainly ask about other ways of aggregating it. And then one might get different kinds of things. Totally.
I wanted to spend a couple of slides because, as I said, the proof of this is not actually that hard. Like, in some sense, the decision was, let's write down the definitions carefully. Let's write down the guarantees carefully. And once you've done all that, you can actually then work out, why is this not possible? So I wanted to spend a couple slides on the proof. I promise not to make that too complicated. Then we'll pop back up to this level.
The proof. If you're spectating, looking at what the proof has to do, one thing you might notice is that the two cases differ almost in dimensionality. This sort of says, perfect prediction means I have to pin everything down to 0 and 1. Equal base rates feels more like a one-dimensional condition where I-- anyway. So it's going to have to do some kind of gluing together of a 0, 1 thing with an arbitrary base rate.
OK. So let's do that. So in some sense, this turns the corner. Actually, we were in touch with Julia Angwin, the lead reporter on the ProPublica story, throughout this whole thing. And she was kind enough to actually write up a second story talking about how, in the end, there was something mathematically inevitable about this controversy. So this was posted at the very, very end of December last year.
And she actually surveyed-- it's an interesting article to check out, because she surveys a bunch of groups that were thinking about similar things. So actually, Alex Chouldechova at Carnegie Mellon-- she is a statistician-- and a group at Stanford were looking at similar issues where they were actually not thinking about probability estimates, but about yes-no decisions. Like, I must say yes or no, and I want to do something that tries to balance the probability that I'm wrong when I say yes and--
AUDIENCE: Did you call about the door?
JON KLEINBERG: Someone called about the door. There he is. And then there was another group that was doing-- [INAUDIBLE], Eric Price, and [INAUDIBLE] from the CS theory community-- who essentially jettisoned calibration and asked, how well could we do if we just wanted to equalize the balance for the positive class and the balance for the negative class? I'd like to have as good performance as I could balancing those two things.
And then they get this interesting optimization problem. That becomes a constraint. When I make up a probability estimate, it must satisfy those two balance conditions. It has equal impact on those-- on the two groups. And then I want as good performance as I can. And they have an interesting paper on that as well. OK. So let me plunge into two slides on the proof, just a sketch of, how do we prove this fact?
AUDIENCE: [INAUDIBLE] Were the other two successful? Yes, those two papers.
JON KLEINBERG: So these people, again, reached negative results. So basically, in the yes-no world, you run into a similar problem to the one in the probability-estimate world. Here, you're successful in the sense that if you throw out calibration, you can always achieve the other two. And so now, I have a non-empty region I'm optimizing over, and I try to find the optimum. Yes. Good question. So this is a nice way to actually, at least, get started.
Good. So what [INAUDIBLE]. OK. So let me just sketch out how the proof works. OK. So let's go back to-- we have bins. We've put people into bins. And we're trying to achieve calibration and balance for the positive and negative classes. And let's assume we've conceptually achieved all three, and we're going to try to derive that we must have either perfect prediction or equal base rates.
So we've achieved all three. That gives us some things that we can conclude about how we've done it. So first of all, let's see. So there are two groups, A and B. So let's say t is equal to either A or B. Let's say n sub t is the number of people in group t. And k sub t is the number of people in the positive class in group t. OK. So k sub t over n sub t, that ratio is the base rate. That's the fraction of people in group t who are in the positive class.
Now, if you think about what the calibration condition is buying me, it's saying, take this bin that has 10 people, where the score assigned to that bin is 0.1. So there's one positive instance in that bin, because 1/10 of 10 is 1. And also, the total score I hand out to that bin is also 1, because I give out 10 copies of 0.1.
Similarly, I look over here, I give out 10 copies of 0.2. So I give a total score of 2 to this bin, and there are two red items. Here, I give out five copies of 0.4. That's 2, and I have two red items. So in fact, what calibration really buys us is that if I look at a bin, the total score handed out to everyone in that bin is equal to the expected number of people in that bin who actually belong to the positive class. That those two things are the same, that's really what calibration says.
And then I could add up that fact over all the bins and discover that over the whole group, the total score handed out is equal to the expected number of people who were actually in the positive class. So the total score handed out is this number [INAUDIBLE]. And that's what calibration is doing. It's really equating score with outcome, because it's calibrated.
OK. So what does that say? So remember, Nt is the total number of people in the group. kt is the number of people in the positive class. So by calibration, kt is also the total score in that group.
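The bin arithmetic can be checked directly. The three bins below reproduce the slides' illustrative numbers (10 people at 0.1, 10 at 0.2, 5 at 0.4); the point is that summing scores bin by bin recovers k sub t, the number of positives.

```python
# Under calibration, the total score handed out in a group equals the
# (expected) number of positives. Bins mirror the talk's example; each
# bin is (score assigned to the bin, list of 0/1 outcomes in the bin).
bins = [
    (0.1, [1] + [0] * 9),     # 10 people, 1 positive -> calibrated at 0.1
    (0.2, [1, 1] + [0] * 8),  # 10 people, 2 positives -> calibrated at 0.2
    (0.4, [1, 1, 0, 0, 0]),   #  5 people, 2 positives -> calibrated at 0.4
]

# Per-bin calibration: the fraction of positives equals the bin's score.
for score, outcomes in bins:
    assert abs(sum(outcomes) / len(outcomes) - score) < 1e-9

# Summed over all bins: total score handed out == number of positives.
total_score = sum(score * len(outcomes) for score, outcomes in bins)
total_positives = sum(sum(outcomes) for _, outcomes in bins)
assert abs(total_score - total_positives) < 1e-9
```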
OK. That's calibration. Total score is kt. Now let's talk about this balance for the positive and negative class. So that says that the average score of a person in the negative class is some number. Let's call it x. And the average score of a person in the positive class is some number. Let's call it y.
And the point is, those are the same numbers whichever group I'm talking about. So this is not x sub t and y sub t depending on the group. Both groups have the same x, and they have the same y, same average x and y. So here's one way to write the total score in group t.
Well, there are k sub t people who are positive, and they get an average score of y. So they contribute kt times y. And then the rest of them, Nt minus kt are negative, and they contribute an average score of x. So the total score is just this, the negative people times their contribution plus the positive people times their contribution.
And we know, thanks to calibration, that that equals k sub t. So that's like one equation that's using both things I know. I have balance, so this is the same for all groups. I have this.
But the issue now is, this defines a line in xy space whose slope depends on the base rate of the group. So in order to have the same x and y satisfy the equation for group A and the equation for group B-- I have an equation for group A, and I have an equation for group B, because I have two of them-- x and y are going to have to satisfy those two equations at the same time.
There are two possibilities. Either the base rates are different, and then these lines have different slopes. And they only meet at one point. That one point is where x is 0 and y is 1. In other words, I have perfect prediction.
Or they have the same slope, in which case you can actually pick anything you want. They actually lie on top of each other, but in that case, I have equal base rates. So it's my one-- my point condition and my one-dimensional condition. It says, either of these two lines, which are the ones defined by this, they always meet here. But if they meet anywhere else, it has to be because they have the same slope, equal base rates. And one has to fill in some extra details, but not a lot of extra details. That's pretty much the whole proof.
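A minimal numeric sketch of this step: dividing the identity k sub t times y plus (n sub t minus k sub t) times x equals k sub t through by n sub t gives (1 - p) x + p y = p, where p is the group's base rate. With two different base rates (the 0.35 and 0.50 below are invented), solving the resulting 2x2 system lands exactly at x = 0, y = 1.

```python
# Intersect the two lines (1 - p) * x + p * y = p for base rates
# p_a != p_b, where x is the average score of the negative class and
# y of the positive class.
def solve_balance(p_a, p_b):
    """Solve the 2x2 linear system via Cramer's rule; returns (x, y)."""
    a11, a12, b1 = 1 - p_a, p_a, p_a
    a21, a22, b2 = 1 - p_b, p_b, p_b
    det = a11 * a22 - a12 * a21  # equals p_b - p_a: nonzero iff rates differ
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - b1 * a21) / det
    return x, y

x, y = solve_balance(0.35, 0.50)  # unequal (invented) base rates
assert abs(x - 0) < 1e-9  # negatives must average 0 ...
assert abs(y - 1) < 1e-9  # ... and positives 1: perfect prediction
```

When the base rates are equal, the determinant is zero and the two lines coincide, which is exactly the other special case.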
So again, as I said, it's sort of a-- once we decided to write down the definitions and figure out what the properties were, we could then just follow this through straight to the end. And in this respect, I guess I should also include a shout-out to Cynthia [? Durocher ?], who read our original proof in a draft of the paper, which was somehow much longer than this, actually, and said, this is strange. There must be a much shorter proof of this. So that motivated us to go and do that.
JON KLEINBERG: Yes. Two groups was for simplicity. But if I had multiple groups, I'd have the same thing. Yes, there wasn't anything special about the two groups. That is totally true.
So a couple of other things to notice here. One is, we thought maybe the way-- whenever you have an impossibility result like this, you say, is there some catch? Is there some way out? Like, maybe this isn't as bad as we think, because if we just think about the conditions differently, we'll realize that by softening them a bit, everything will be OK.
One way I could do that, the natural thing you would try, would be to have an approximate version of these guarantees. Of course, it's too much to ask for exact calibration, exact equal averages. That won't happen anyway. Let's just ask for everything to be approximately right.
So what would that mean? So for calibration, it would mean that when I look at everyone who ends up in the 20% bin, approximately 20% of them go on to reoffend, say between 19.5 and 20.5. You pick your tolerance, some tiny margin of error, and that'll be the approximate guarantee. So there is a hidden epsilon in here, approximately v sub b times 1 plus or minus some little epsilon.
Similarly, balance for the positive class could just mean that the average score in group A is within a factor of 1 plus or minus epsilon of the average score in group B, and similarly for [INAUDIBLE]. But this turned out not to be the way out. So with a more complicated proof, one can show that, actually, if all three of these properties can be approximately achieved, then, in fact, you're in an approximate version of one of the two special cases.
You either have approximately perfect prediction, where every feature vector tells you that either there's a 99% chance they're positive or a 1% chance. Or you have approximately equal base rates, where the two base rates are basically 35% and 35.5%. And so you can't really get out of it that way. If you want all these approximately, you have to have one of the two special cases approximately.
So I should say that-- and when I went back and talked with Julia Angwin, who was like, so, well, what is the way out of it, my own take on it is, there is something fundamentally incompatible in what these things are asking for. There's not some clever way to think about this differently that says, in fact, we can have all these things. It says, if the base rates are quite different, then what's challenging about fairness in that case is somehow embedded in the tension between these definitions.
Now, you can certainly do things. So for example, the [INAUDIBLE] Price and [INAUDIBLE] paper that I mentioned throws away calibration, says, OK, let's not worry about calibration. Now let's optimize with the other two. So I could take one away completely. But it's an interesting question to figure out what else we could be doing. Yes?
AUDIENCE: Assuming they went for calibration, is it accidental that the average behavior was worse for the African-Americans?
JON KLEINBERG: So in the jurisdiction they were looking at, the base rate was higher. So you would expect the average to be higher in the group with the higher base rate. And so in that jurisdiction, wherever the higher base rate was, you're going to have-- it's going to typically be unbalanced that way. I'm not quite saying that is a necessary thing, but that's going to be typical [INAUDIBLE]. That's a good question. Yes?
AUDIENCE: [INAUDIBLE] by looking at this as an asymmetric problem, like positive, negative not being the same problem like [INAUDIBLE]?
JON KLEINBERG: Yeah. So I think that's going to be one thing that one can definitely do. So you could say, there are a bunch of cases where I'm more concerned about falsely labeling someone high risk or falsely labeling someone low risk. And then you could say, maybe I want calibration, and I want one of these two balance conditions, the one that's more important. Maybe I want some kind of weighted sum of them where I'm going to deviate on both sides, but I'm going to try to-- I am going to say, this one is 10 times-- a violation here is 10 times as bad as a violation in the other direction.
And you could imagine that different settings-- so falsely labeling someone who doesn't go on to reoffend as high risk could be viewed as worse. And so I might put more weight on that. If this were, say, a medical risk score, I could say, accidentally missing someone who should have been tested for some condition is the worst thing. And so I'm going to focus on that and not the other one. So I think that is another way to go about it, to say, in different domains, these three may not all carry equal weight. Absolutely. Yeah?
AUDIENCE: Can we assume that the bias is somehow implicit in the feature vector that was chosen?
JON KLEINBERG: Can we assume the bias implicit in the feature vector was chosen? It's a good question. I mean, to the extent that it reflects differences in the base rate. I think what's happening is, you can certainly have situations where this could be made even worse by the fact that the features are somehow biased in their construction.
In a sense, this isn't relying on that, in that once the base rates are different, it's not requiring that the features somehow be wrong in order for this to happen. But at some level, yes. There has to be something going on saying the difference in base rates is being translated into this. Yes?
AUDIENCE: So when I think it from [INAUDIBLE] kind of perspective, like a machine learning perspective, what it means is that the inherent [INAUDIBLE] at some [INAUDIBLE] limit, which I couldn't incorporate into my [INAUDIBLE]
JON KLEINBERG: The data reflected a difference in base rate that one can't make go away if you have to be both calibrated and equally balanced.
AUDIENCE: I am [INAUDIBLE] to a different solution. Is that [INAUDIBLE]
JON KLEINBERG: Again, I would think about it as this holds, even if I just have a bunch of numbers, and I'm trying to get the averages to all balance out. But in the end, it's sort of that I can't make the difference in base rate go away across all these properties. Yes? [INAUDIBLE]
AUDIENCE: [INAUDIBLE] in most calibration in this average [INAUDIBLE] is that your rating should be discriminating. Like, you shouldn't [INAUDIBLE] same number. That's just not a good risk assessment.
JON KLEINBERG: Yeah.
AUDIENCE: And so nowhere [INAUDIBLE] did you say that somehow [INAUDIBLE] risk assessment that differentiates between people is better than one that does not.
JON KLEINBERG: Right. In some sense because I was aiming low, in the sense that you already couldn't achieve it here. But actually, my next slide will get to that. When you have equal base rates, I'd like to have more power rather than less power. So let me come to that in the next slide. Yes. Good question. Yes?
AUDIENCE: Going back to [INAUDIBLE], how do we know that we want this type of [INAUDIBLE]. So maybe [INAUDIBLE] that are labels or not. So what we're actually capturing instead of recidivism rate is arrest rate.
JON KLEINBERG: So that's the important part, which I mentioned, in passing, that in the end, what we're evaluating is, was the person arrested for a new crime? And so in some sense, we've talked about the features being biased by construction. But in a lot of the cases, the labels are biased by construction.
Because when I release someone, say, pretrial, and then I try to figure out, did they commit a crime before their court appearance, it's important to notice, I am not determining, did they commit a crime [INAUDIBLE]? I am determining, did they get arrested for a crime before their court appearance? And so if they're being released into neighborhoods with very different rates of policing, I'm actually getting biased labels.
So that's a very important thing to remember, that it's both the features and the labels that are biased by construction, all of which is working even more against this. This is saying, even if I don't have those problems, I have something else going on. But that's certainly happening. [INAUDIBLE]?
AUDIENCE: [INAUDIBLE] in A was 100% and then B you had some distribution. [INAUDIBLE] so you don't have-- wouldn't you very clearly not want to have the [INAUDIBLE]
JON KLEINBERG: So you could certainly say that-- and this gets back to, by asking for something which is calibrated and has equal averages, I'm basically asking to say, we should not-- we should somehow violate the [INAUDIBLE]. So all of these are examples where there's a difference in the base rate that I can't make go away somehow. Yes, all of these are good questions.
This question of perfect prediction never happens. Equal base rates-- what'd I say about equal base rates? When you have equal base rates, I always could give you a rule that satisfies all three properties. That's just not a very interesting rule. It just assigns the base rate to everybody. And sure, that satisfies all the properties-- it's not what I wanted, because it didn't actually give me any information that I didn't already have.
So what could we do there? So let's actually dig into this case, because there are a lot of settings where you can imagine the base rates are going to be maybe close to equal in a classification problem. I'd like to do more. So here, actually, there are some interesting open questions. I'll tell you one negative result, but then I'll mention [INAUDIBLE] questions that are-- so obviously if both groups have the same base rate p, I can always create a single bin with the score p that satisfies everything, right? Everyone has 35%.
But you could ask, is there a non-trivial risk score assignment, where not everyone gets a score of p? Even just anything other than that? Just give me, like, two bins or something.
So it turns out that now we descend from the world of absolute impossibility to the world of computational hardness. It turns out that, if I have to actually take everyone with the same feature vector and put them into a single bin, it now becomes computationally hard to actually solve this problem.
And actually, for those of you who have taken some course in algorithms or discrete optimization that has NP-hard problems in it, basically doing this is as hard as the knapsack problem, because essentially what I'm trying to do when I create two bins to get calibration to work out is I'm trying to sort of get a whole bunch of numbers to add up just right in a way that maintains calibration and yet uses two bins. And that's actually a hard problem.
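To give the flavor of why this is a number-packing problem, here is a toy brute-force search for a non-trivial two-bin split. The instance and the simplified calibration check (each bin must have the same positive fraction in both groups; the balance conditions are omitted) are assumptions for illustration, not the paper's actual reduction. The point is that in general one is searching over exponentially many subsets of feature vectors.

```python
from fractions import Fraction as F
from itertools import combinations

# Toy instance: each feature vector carries group counts and positive
# counts (nA, kA, nB, kB). Both groups have overall base rate 7/20 = 0.35.
vectors = {
    "v1": dict(nA=10, kA=2, nB=20, kB=4),   # positive rate 0.2 in both groups
    "v2": dict(nA=10, kA=5, nB=20, kB=10),  # positive rate 0.5 in both groups
}

def bin_stats(subset):
    """Total counts for the feature vectors assigned to one bin."""
    return tuple(sum(vectors[v][key] for v in subset)
                 for key in ("nA", "kA", "nB", "kB"))

def calibrated_two_bins(subset, rest):
    """Return the two bin scores if each bin has the same positive
    fraction in both groups (our simplified calibration), else None."""
    scores = []
    for s in (subset, rest):
        nA, kA, nB, kB = bin_stats(s)
        if nA == 0 or nB == 0 or F(kA, nA) != F(kB, nB):
            return None
        scores.append(F(kA, nA))
    return scores

# Brute force over all ways to split the feature vectors into two bins.
names = sorted(vectors)
solutions = []
for r in range(1, len(names)):
    for subset in combinations(names, r):
        rest = tuple(v for v in names if v not in subset)
        scores = calibrated_two_bins(subset, rest)
        if scores and scores[0] != scores[1]:  # non-trivial: distinct scores
            solutions.append((subset, scores))
```

On this tiny instance the search succeeds (putting v1 and v2 in separate bins gives scores 0.2 and 0.5), but the number of candidate splits doubles with every additional feature vector, which is the knapsack-style obstruction.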
Some things we don't understand about this-- as computationally hard problems go, these number-addition problems, things like the knapsack problem, are not among the hardest-in-practice problems. And in fact, there are some nice results that date back to people like David Shmoys and others-- David being in the audience as well-- which show some even provable senses in which you can actually do quite well with problems like this.
So one nice question here is, can one, in the case of equal base rates, which I'd argue is a very, very natural case, can one actually, at least in some approximate sense, find things that have much more informational power and are still sort of approximately balanced across the groups? And I think that would be sort of a nice intermediate outcome, because what it would effectively say is that, yes, when the base rates are very different, there's this gap that you can't make go away under all these definitions.
But when the base rates are not very different, maybe somehow, at least computationally in practice or even in some approximate sense, we can actually get the conditions that we want. And so in the [INAUDIBLE] case, when the base rates are close enough, some of these guarantees might kick in.
OK. So with that, I'll wrap things up. Yes.
AUDIENCE: One more question. So you have this assumption of bins, and you're sort of discretizing the problem. Is that a fundamental assumption?
JON KLEINBERG: A good question. Yeah, so right, what's going on with bins? So I'm allowing sort of an arbitrary set of bins here, and so you could do it more finely or less finely. At the level of abstraction I've been talking about, you could ask, what about a bin free world in which I just sort of give everyone their own number?
In practice, bins are useful just because, in order to sort of achieve-- well, now we get into the fact that I have a finite amount of data I'm working with. And so what people tend to want to do is achieve sort of a calibrated set where I say, OK, this is a big enough set that I can sort of evaluate the set with a single probability. But yeah, at the level of abstraction I'm working with, that's less apparent.
So I think the question of, for the amount of data I have, what's the right number of bins, is a very interesting question that would bring different issues into play-- questions about how big my sample size is compared to the estimates I make.
So let me just sort of wrap up with these thoughts. So again, this has been about these inherent trade-offs and, again, about taking a debate that, as of last summer, was quite messy, carried out at different levels of dialogue, and trying to pull out definitions of what people were talking about, write down the conditions, and then understand what was in tension and what wasn't.
And in the end, actually, sort of doing something that we found more clarifying than we expected when we went in, identifying these trade-offs, which again leaves open the questions of, how do we get past these trade-offs, right? Do we drop some of the conditions? Do we weaken some of the conditions? Do we focus in on the place where the problem may be most tractable, like when the base rates are equal? A whole bunch of possibilities for how we might make progress going forward.
And I think it's useful as we think about the different versions of this problem to think about the fact that there are many, many settings where we're going to take data about individuals, quite complex data, and then form some kind of probability estimates-- to pull back to some of the applications at the beginning, in settings like hiring, like school admissions, like lending, like medical testing, like engaging with online content, and like the criminal justice system as here.
I'll mention also that there are a number of other interesting questions in this space. I said at the beginning that on the final slide I would invoke some work I've been doing with another one of my PhD students, Maithra Raghu, where you ask questions not about the individual but actually about the group you're assembling. And the point being rather than simply saying, I admitted the right people, say, to college, you could say, I admitted the right set of people.
So if you think about something like college admissions, or even hiring for a company, you could argue that I can't evaluate the quality of my hiring, the quality of my admissions one person at any time, because in the end, I'm going to actually be thinking about the set of people who show up, the set of people who are actually working for my company, the set of people who arrive on campus as the entering freshman class.
And it turns out that, actually, if I'm going to be evaluating that, then these kinds of risk scores I'm building turn out to become much more complicated. And this was really identified by the social scientist Scott Page, who pointed out that when I'm evaluating these sorts of sets-- a company, a freshman class, and so forth-- evaluating people one at a time by an individual score can actually be misleading about how good the group is on all sorts of measures.
And so that leads to a whole set of other questions about the power of risk scores. And these sort of more general things where I'm evaluating the whole group I've assembled. But that will be for another time, and some other talk. And for now, I'll stop here. Thanks very much.
SPEAKER: We have time for a few brief questions. Jon, I'll let you [INAUDIBLE]
JON KLEINBERG: OK, sounds good. Question back there, yeah?
AUDIENCE: So what if you don't know what groups you're trying to make sure that, like--
JON KLEINBERG: Yeah.
AUDIENCE: [INAUDIBLE] those groups. What if it's a group of one? [INAUDIBLE]
JON KLEINBERG: Yeah, so a good question. Right. So I could talk about I'd like to balance things across different groups. And it's been sort of key here that we've sort of defined upfront these are the groups where I'm looking for balance or for calibration. I think there are a lot of interesting questions where, either the groups-- either there are multiple overlapping groups, which is already more complicated even if I know them. Or, the groups are somehow implicit and might emerge from the analysis I'm doing.
And so I think there are a lot of missing ties to questions in machine learning where the groups almost become sort of like the hypothesis class in some kind of a learning-type problem. And then we could ask about the sort of richness of that set. And I think there are a lot of interesting questions there. Let's see. Yeah?
AUDIENCE: So this may not apply in the case of criminal records, but in the case of [INAUDIBLE] definitely have people who will write resumes tailored to specific [INAUDIBLE]--
JON KLEINBERG: Yeah, yeah.
AUDIENCE: --aren't playing the game as hard.
JON KLEINBERG: Yep.
AUDIENCE: So is there a way to account for that so that it's [INAUDIBLE]
JON KLEINBERG: Yeah, so there is a-- so the question was, certainly in the case of hiring, people will actually write their resumes in a certain way to kind of-- and then other people will sort of work less hard.
So there actually is this rich area of machine learning called adversarial classification, where people actually look at what's the outcome of essentially this sort of two-player game between the employer and the applicant, in which the applicant is trying to make themselves look as good as possible, and now the employer is trying to make decisions knowing that that's what they're doing. And in a sense, some kind of maybe equilibrium emerges from that where, because I'm going to make decisions knowing that you're doing that, you're going to behave in a certain way, and we try to find sort of what's the equilibrium after that.
And you're right that this can happen more in some domains than in other domains. In some sense, this area got a lot of momentum from the growth of web search in the 1990s, which was all about adversarial classification, that once your position in the Google ranking is worth millions of dollars potentially for high-volume queries, you're going to be creating your web page with the Google ranking algorithm in mind. And they're going to be creating the ranking algorithm with the knowledge that people are going to be optimizing for the ranking algorithm. And you end up with a very complicated dynamic, which is in fact what you see in actual search. What do you think? Should we do one more?
JON KLEINBERG: Bart?
AUDIENCE: So overall, do you think it's better to have an algorithm to make your decisions or [INAUDIBLE]
JON KLEINBERG: I mean, it totally depends on the context, right?
AUDIENCE: In this case?
JON KLEINBERG: Oh, in this case? I mean, so this is something that we're thinking about at the moment. This has been about the kind of concerns and the potential trade-offs. We've also been doing a bunch of work on the question of, how do you-- are there better ways to do risk scores?
I mean, I think the point is, in this as in many domains, certainly for the foreseeable future, something where there's an interplay between the human decision maker and the quantitative tool is going to be very important, because it's very clear when you look at the data, say, on pre-trial release, that judges are letting out some predictably extremely risky people. As in, they're just making mistakes. And these mistakes are detectable by the algorithm.
But there are several things you could do when you notice that. A simple one is to say: if a small percentage of people account for a lot of the risk, and the judge makes mistakes on those, the algorithm could simply sit at the elbow of the judge, flag 3% or 5% of the cases the judge sees, and say, these seem like particularly risky ones; maybe you want to think harder about these, or maybe even, here's my recommendation on these. The other 95% of cases you don't actually touch.
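This "flag the riskiest few percent" idea can be sketched in a few lines of Python. The function name and the fraction are illustrative, not part of any deployed tool described in the talk.

```python
import numpy as np

def flag_riskiest(scores, frac=0.05):
    """Return a boolean mask flagging the top `frac` of cases by
    predicted risk; the remaining cases are left entirely to the judge."""
    n_flag = max(1, int(round(frac * len(scores))))
    order = np.argsort(scores)[::-1]  # indices from highest to lowest risk
    mask = np.zeros(len(scores), dtype=bool)
    mask[order[:n_flag]] = True
    return mask

# Hypothetical risk scores for five cases; flag the top 40% (two cases).
flags = flag_riskiest(np.array([0.1, 0.9, 0.3, 0.8, 0.2]), frac=0.4)
print(flags)
```

The point of the design is that the tool intervenes only on the small slice of cases where the data suggests human error is most costly, rather than replacing the judge across the board.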
And so in some sense, one of [INAUDIBLE] analogies here is: before we think about fully self-driving cars, at least think about driver assist, the thing on the car that flashes a big red light when you're about to back into something or crash into something. Right? We're missing even driver assist in these cases, where the algorithm knows that this person looks a lot like a whole bunch of people who went out and committed crimes, but doesn't even provide that information to the judge.
So I think there's a whole spectrum in between only-the-human and only-the-algorithm. In a domain as complex as this, that's a false dichotomy in a sense. There's a lot of space in between for the algorithm to provide assistance.
AUDIENCE: So it sounds like an obvious question, but it's not. I looked up the ProPublica article from December that you mentioned.
JON KLEINBERG: Yes.
AUDIENCE: And the last paragraph was frightening. It says, despite all of the work that's been done on this, the people who produce this COMPAS tool are not making any changes. And the people who use the tool are not going to make any changes. So what do you think the ultimate impact of this work will be?
JON KLEINBERG: Yeah, so--
--whatever the impact was, it hadn't been realized as of December 30, 2016, when they were still-- yeah, I mean, in some sense that's also Julia Angwin engaging in some regret that she didn't-- because she actually did a fair amount with her first article, in that there were court cases where the judicial ruling referenced the need to put some circumscription around how people actually make use of these tools.
So I think it raised awareness of the dangers in policy circles where there hadn't been any. And that was already true of the ProPublica article before we ever showed up on the scene.
I think, actually, ironically, some of this is a story about municipal procurement policies, which initially sounds like a non sequitur. But what's going on is you have offices in city governments where someone is told, go buy us a risk tool so that we can improve the quality of what we're doing in the criminal courts. They find a company, and they buy a risk tool.
This tool is completely proprietary. You don't know what's going on inside it, and so we don't really know why it's making these decisions. We can only evaluate it as a sort of black box from the outside.
And so I think to the extent that this kind of work puts academics, who think in terms of algorithms you can describe, with code that's actually open source or uses public libraries, in touch with the people in city governments who are making these decisions about what kind of quantitative tool to put inside a court system; to the extent it short-circuits the long chain of interactions between them and lets those groups talk to each other directly, as has actually been happening over the past few months; I think that's a sign that there is opportunity for progress, if we look at the outcomes of those kinds of conversations. That's the hope.
SPEAKER: Maybe at this point we should just thank the speaker.
Recent discussion in the public sphere about algorithmic classification has involved tension between competing notions of what it means for such a classification to be fair to different groups.
Jon Kleinberg, the Tisch University Professor of Computer Science, considers several key fairness conditions that lie at the heart of these debates, and discusses recent research establishing that, except in highly constrained special cases, no method can satisfy all of these conditions simultaneously. These results suggest some of the ways in which key notions of fairness are incompatible with each other, and hence provide a framework for thinking about the trade-offs between them. His talk is part of a series on "The Emergence of Intelligent Machines: Challenges and Opportunities."
Kleinberg's research focuses on issues at the interface of networks and information, with an emphasis on the social and information networks that underpin the Web and other on-line media. His work has been supported by an NSF Career Award, an ONR Young Investigator Award, a MacArthur Foundation Fellowship, a Packard Foundation Fellowship, a Simons Investigator Award, a Sloan Foundation Fellowship, and grants from Facebook, Google, Yahoo!, the ARO, and the NSF. He is a member of the National Academy of Sciences, the National Academy of Engineering, and the American Academy of Arts and Sciences.