First, we have several estimates of the relative weights of level noise and pattern noise. Overall, it appears that pattern noise contributes more than level noise. In the insurance company of chapter 2, for instance, differences between underwriters in the average of the premiums they set accounted for only 20% of total system noise; the remaining 80% was pattern noise. Among the federal judges of chapter 6, level noise (differences in average severity) represented slightly less than half of total system noise; pattern noise was the larger component. In the punitive damages experiment, the total amount of system noise varied widely depending on the scale used (punitive intent, outrage, or damages in dollars), but the share of pattern noise in that total was roughly constant: it accounted for 63%, 62%, and 61% of total system noise for the three scales used in the study. Other studies we will review in part 5, notably on personnel decisions, are consistent with this tentative conclusion.

The fact that in these studies level noise is generally not the larger component of system noise is already an important message, because level noise is the only form of noise that organizations can (sometimes) monitor without conducting noise audits. When cases are assigned more or less randomly to individual professionals, the differences in the average level of their decisions provide evidence of level noise. For example, studies of patent offices observed large differences in the average propensity of examiners to grant patents, with subsequent effects on the incidence of litigation about these patents. Similarly, case officers in child protection services vary in their propensity to place children in foster care, with long-term consequences for the children’s welfare. These observations are based solely on an estimation of level noise. If there is more pattern noise than level noise, then these already-shocking findings understate the magnitude of the noise problem by at least a factor of two. (There are exceptions to this tentative rule. The scandalous variability in the decisions of asylum judges is almost certainly due more to level noise than to pattern noise, which we suspect is large as well.)

The next step is to analyze pattern noise by separating its two components. There are good reasons to assume that stable pattern noise, rather than occasion noise, is the dominant component. The audit of the sentences of federal judges illustrates our reasoning. Start with the extreme possibility that all pattern noise is transient. On that assumption, sentencing would be unstable and inconsistent over time, to an extent that we find implausible: we would have to expect that the average difference between judgments of the same case by the same judge on different occasions is about 2.8 years. The variability of average sentencing among judges is already shocking. The same variability in the sentences of an individual judge over occasions would be grotesque. It seems more reasonable to conclude that judges differ in their reactions to different defendants and different crimes and that these differences are highly personal but stable. To quantify more precisely how much of pattern noise is stable and how much is occasion noise, we need studies in which the same judges make two independent assessments of each case.
As we have noted, obtaining two independent judgments is generally impossible in studies of judgment, because it is difficult to guarantee that the second judgment of a case is truly independent of the first. Especially when the judgment is complex, there is a high probability that the individual will recognize the problem and repeat the original judgment.

A group of researchers at Princeton, led by Alexander Todorov, has designed clever experimental techniques to overcome this problem. They recruited participants from Amazon Mechanical Turk, a site where individuals provide short-term services, such as answering questionnaires, and are paid for their time. In one experiment, participants viewed pictures of faces (generated by a computer program, but perfectly indistinguishable from the faces of real people) and rated them on various attributes, such as likability and trustworthiness. The experiment was repeated, with the same faces and the same respondents, one week later.

It is fair to expect less consensus in this experiment than in professional judgments such as those of sentencing judges. Everyone might agree that some people are extremely attractive and that others are extremely unattractive, but across a significant range, we expect reactions to faces to be largely idiosyncratic. Indeed, there was little agreement among observers: on the ratings of trustworthiness, for instance, differences among pictures accounted for only 18% of the variance of judgments. The remaining 82% of the variance was noise. It is also fair to expect less stability in these judgments, because the quality of judgments made by participants who are paid to answer questions online is often substantially lower than in professional settings. Nevertheless, the largest component of noise was stable pattern noise. The second largest component of noise was level noise—that is, differences among observers in their average ratings of trustworthiness. Occasion noise, though still substantial, was the smallest component.

The researchers reached the same conclusions when they asked participants to make other judgments—about preferences among cars or foods, for example, or on questions that are closer to what we call professional judgments. For instance, in a replication of the study of punitive damages discussed in chapter 15, participants rated their punitive intent in ten cases of personal injury, on two separate occasions separated by a week. Here again, stable pattern noise was the largest component. In all these studies, individuals generally did not agree with one another, but they remained quite stable in their judgments. This “consistency without consensus,” in the researchers’ words, provides clear evidence of stable pattern noise.

The strongest evidence for the role of stable patterns comes from the large study of bail judges we mentioned in chapter 10. In one part of this exceptional study, the authors created a statistical model that simulated how each judge used the available cues to decide whether to grant bail. They built custom-made models of 173 judges. Then they applied the simulated judges to make decisions about 141,833 cases, yielding 173 decisions for each case—a total of more than 24 million decisions.
At our request, the authors generously carried out a special analysis in which they separated the variance of the judgments into three components: the “true” variance of the average decisions for each of the cases, the level noise created by differences among judges in their propensity to grant bail, and the remaining pattern noise. This analysis is relevant to our argument because pattern noise, as measured in this study, is entirely stable. The random variability of occasion noise is not represented, because this is an analysis of models that predict a judge’s decision. Only the verifiably stable individual rules of prediction are included. The conclusion was unequivocal: this stable pattern noise was almost four times larger than level noise (stable pattern noise accounted for 26%, and level noise 7%, of total variance). The stable, idiosyncratic individual patterns of judgment that could be identified were much larger than the differences in across-the-board severity.

All this evidence is consistent with the research on occasion noise that we reviewed in chapter 7: while the existence of occasion noise is surprising and even disturbing, there is no indication that within-person variability is larger than between-person differences. The most important component of system noise is the one we had initially neglected: stable pattern noise, the variability among judges in their judgments of particular cases.

Given the relative scarcity of relevant research, our conclusions are tentative, but they do reflect a change in how we think about noise—and about how to tackle it. In principle at least, level noise—or simple, across-the-board differences between judges—should be a relatively easy problem to measure and address. If there are abnormally “tough” graders, “cautious” child custody officers, or “risk-averse” loan officers, the organizations that employ them could aim to equalize the average level of their judgments. Universities, for instance, address this problem when they require professors to abide by a predetermined distribution of grades within each class.

Unfortunately, as we now realize, focusing on level noise misses a large part of what individual differences are about. Noise is mostly a product not of level differences but of interactions: how different judges deal with particular defendants, how different teachers deal with particular students, how different social workers deal with particular families, how different leaders deal with particular visions of the future. Noise is mostly a by-product of our uniqueness, of our “judgment personality.” Reducing level noise is still a worthwhile objective, but attaining only this objective would leave most of the problem of system noise without a solution.
This is a much longer excerpt today, but it is the quintessential part of the book and can’t be neglected, so do take some time to read through it. I’ll walk you through the key message, as I had to re-read it a few times myself to understand it.
Essentially, to generalize: the biggest contributor to noise is pattern noise. Let me reiterate. There are two main kinds of noise that contribute to overall system noise: level noise, the differences between judges in how harsh or lenient they are on average, and pattern noise, the differences in how each judge reacts to particular cases.
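In the book’s framing these components combine as variances (squared noise), which is how the percentages in the excerpt are computed:

System noise² = Level noise² + Pattern noise²
Pattern noise² = Stable pattern noise² + Occasion noise²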
Every human is different, and while we each have a general level of judgement (harsher or more lenient), the biggest contributor to overall system noise is pattern noise. Yet this is the component that is hard to measure: we usually study the average outcomes of different judges, not how each judge handles each individual case.
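To make this concrete, here is a minimal sketch of how the split can be computed when you do have every judge’s judgment of every case. The ratings matrix is entirely made up; only the arithmetic matters. It mirrors the variance decomposition the book relies on, not any specific dataset from it.

```python
import numpy as np

# Hypothetical ratings: rows are judges, columns are cases (made-up numbers).
Y = np.array([
    [4.0, 6.0, 5.0, 7.0],
    [5.0, 5.5, 6.5, 6.0],
    [3.0, 7.0, 4.5, 8.0],
])

case_means = Y.mean(axis=0)    # average judgment of each case
judge_means = Y.mean(axis=1)   # average severity of each judge
grand_mean = Y.mean()

# System noise: disagreement among judges about the same case,
# i.e. the variance across judges, averaged over cases.
system_noise = ((Y - case_means) ** 2).mean()

# Level noise: variance of the judges' average levels.
level_noise = ((judge_means - grand_mean) ** 2).mean()

# Pattern noise: what remains once case and judge averages are removed.
residual = Y - case_means - judge_means[:, None] + grand_mean
pattern_noise = (residual ** 2).mean()

# In variance terms, the two components add up to total system noise.
assert np.isclose(system_noise, level_noise + pattern_noise)
print(f"level noise share:   {level_noise / system_noise:.0%}")
print(f"pattern noise share: {pattern_noise / system_noise:.0%}")
```

Level noise comes from the row (judge) averages alone; pattern noise is everything that is left once both the case averages and the judge averages are stripped out, which is exactly why you cannot see it if you only compare judges’ averages.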
Most efforts to deal with biased or inconsistent judgments come from studies that only look at level noise, as it is easier to measure and study, leaving out the component that contributes the most variability. We tend to analyze the differences between judges’ overall averages, but we should also be studying the judgments they make in particular cases or situations. To do so is to acknowledge and understand that each of us is unique in our characteristics: differences in environment or situation can trigger drastically different responses even from two seemingly similar people.
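Finally, separating the stable part of pattern noise from occasion noise takes the design the excerpt describes: the same judges rating the same cases on two occasions. Extending the sketch above with a second occasion (again with made-up numbers, and skipping the small-sample corrections a proper ANOVA would apply):

```python
import numpy as np

# Hypothetical ratings: judges x cases x two occasions (made-up numbers).
Y = np.array([
    [[4.0, 4.5], [6.0, 5.5], [5.0, 5.0], [7.0, 6.5]],
    [[5.0, 5.5], [5.5, 6.0], [6.5, 6.0], [6.0, 6.5]],
    [[3.0, 3.5], [7.0, 7.5], [4.5, 4.0], [8.0, 7.5]],
])

# Occasion noise: the same judge, the same case, a different day.
# Half the mean squared gap between the two occasions estimates its variance.
occasion_noise = ((Y[..., 0] - Y[..., 1]) ** 2).mean() / 2

# Pattern noise pooled over both occasions, computed as in the sketch above.
case_means = Y.mean(axis=(0, 2))
judge_means = Y.mean(axis=(1, 2))
grand_mean = Y.mean()
residual = (Y - case_means[None, :, None]
              - judge_means[:, None, None] + grand_mean)
total_pattern_noise = (residual ** 2).mean()

# Stable pattern noise: roughly, the pattern noise left after removing
# the occasion-to-occasion wobble (ignoring small-sample corrections).
stable_pattern_noise = total_pattern_noise - occasion_noise

print(f"occasion noise:       {occasion_noise:.3f}")
print(f"stable pattern noise: {stable_pattern_noise:.3f}")
```

The gap between a judge’s two ratings of the same case estimates occasion noise; whatever pattern noise remains after subtracting it is, roughly, the stable pattern noise that the book argues is the largest component of all.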