Meet the data analyst putting the perpetrators of genocide in prison

10 Dec 2024

Written by Tristan Free (Senior Editor)

“A heavy stare from a scary guy is only so scary when he’s sitting in the defense chair in handcuffs.” As a science editor, I have many fascinating soundbites from my interviews. None of them, however, have been quite this confronting.

Where many would balk, sitting in the witness box across from a genocidal military leader, Patrick Ball, Director of Research at the Human Rights Data Analysis Group (CA, USA), instead sees the facts in front of him, assesses them and reacts appropriately. This may be the natural byproduct of over thirty years’ experience developing statistical analysis and mathematical models for the investigation and prosecution of the perpetrators of genocide and human rights violations, but speaking to Patrick, it is clear that his love for data runs just as deep as his sense of justice and respect for the victims of human rights violations.

It is for these qualities, and for his tireless work in this field, that Patrick was awarded this year’s John Maddox Prize from Sense About Science and Nature (both London, UK), where we got the chance to interview him about his career, the data analysis methods behind it and the prize.

What are the key focuses of the Human Rights Data Analysis Group?

In our international work, estimating total mortality in conflict is the task we have spent the most time on. The problem with mortality in conflict is that we generally have only a fraction of the data. Often we don’t even know what that fraction is, or whether it differs across periods, regions, perpetrators, or victims of the conflict, which means we have to estimate the total number of deaths. Over the last 30 years, we’ve brought a variety of methods to bear on that problem, developing it into a primary specialization.

The interesting questions, though, are not just about magnitude, but rather about pattern: Where did the violence occur? When did it occur? In which military zones? Which perpetrators committed it against which victims? Were they men or women? Old or young? Ethnicity A or ethnicity B? Those are interesting, valuable questions, and we correlate those with larger political questions. This leaves a social science problem to solve after the statistical one: how can we put together a story, or test a hypothesis, about what the causes of the violence may have been?

What was the state of quantitative analysis in human rights violations and war crimes when you started your career almost 35 years ago?

Honestly, lawyers had already been doing this for quite a while. Journalists had been telling stories, and area experts had been writing in-depth analyses of locations and of the military and irregular groups committing violations. This is a well-developed field in a qualitative sense. What was new in 1991 was cheap computing, which enabled us to build databases from all this qualitative material that people had collected.

Our first project was figuring out how to transform qualitative material into rigorously countable units. A unit of violence that is definitively one unit of violence, not two units of violence. That may be obvious in a homicide, where each person can only die once. It’s less obvious in something like torture, forced displacement or even the recruitment of child soldiers. Working through those cases, we were able to arrive at a ‘violation’ as the unit of violence.

How has the field developed during the course of your career?

From database work, we moved on to inferential statistics that required us to do population estimations, which is about the data you don’t have. Even now, we continue to work on the complexity of inferential statistics and estimates, population estimates and the many confounding factors that can complicate things. Missing data remains the most interesting technical piece of our problem. In order to address it, we spend a lot of time with other machine learning tools to figure out pieces of the problem. The improvement of these machine learning tools has been a useful development in recent decades.

For instance, we work a lot on a problem called record linkage, which is identifying the same person in multiple databases when none of those records has a unique identifier. There’s no government ID number or Social Security number or even telephone number. Rather, we have weirdly spelled names and dates that are inaccurate and locations that may be imprecise and so forth. To address this we have to create a probability model to link all those records together, which is a really intricate problem to solve.
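
One way to picture the kind of probability model described above is a Fellegi-Sunter-style match weight, a standard textbook approach to record linkage. The Python sketch below is illustrative only and should not be read as HRDAG's method: the field names, the agreement rules, the m/u probabilities and the toy records are all invented for the example.

```python
from difflib import SequenceMatcher
from math import log2

# Hypothetical (m, u) values per field, invented for illustration:
#   m = P(field agrees | records are the same person)
#   u = P(field agrees | records are different people)
FIELD_PARAMS = {
    "name": (0.90, 0.05),
    "year": (0.80, 0.10),
    "place": (0.85, 0.20),
}


def field_agrees(field, rec_a, rec_b):
    """Decide agreement on one field, tolerating small recording errors."""
    if field == "name":
        # Fuzzy comparison so that spelling variants can still agree.
        ratio = SequenceMatcher(
            None, rec_a["name"].lower(), rec_b["name"].lower()
        ).ratio()
        return ratio >= 0.8
    if field == "year":
        # Allow dates that are off by one year.
        return abs(rec_a["year"] - rec_b["year"]) <= 1
    return rec_a[field].lower() == rec_b[field].lower()


def match_weight(rec_a, rec_b):
    """Sum of log-likelihood ratios; higher means more likely the same person."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if field_agrees(field, rec_a, rec_b):
            total += log2(m / u)
        else:
            total += log2((1 - m) / (1 - u))
    return total


# Toy records (made up): two spellings of one name, plus an unrelated record.
rec_a = {"name": "Jose Hernandez", "year": 1982, "place": "Quiche"}
rec_b = {"name": "Joze Hernandes", "year": 1983, "place": "quiche"}
rec_c = {"name": "Maria Lopez", "year": 1975, "place": "Coban"}

print(match_weight(rec_a, rec_b))  # positive: evidence for a match
print(match_weight(rec_a, rec_c))  # negative: evidence against a match
```

Pairs scoring above a chosen threshold would be treated as links. In practice the m and u values are usually estimated from the data rather than set by hand, and the comparison rules are far richer than these three.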

Now, we do a lot of work with large language models (LLMs), which people call AI rather ambitiously in my opinion. They’re good at extracting structured data from qualitative sources, which makes them very exciting to us as we had to do this by hand back in 1991. To be honest, I’m not sure LLMs are all that good at actually generating summaries and a lot of the other processes that people use them for, but I am very confident in their ability to extract entities and acts from qualitative sources. This could transform our work because we need structured data that uniquely define specific events to do statistics. That’s possible now with LLMs at scale, and we’re really cranking into it.

What have been the biggest challenges you faced during your career?

At every stage, from making arguments with raw data to making arguments that take into account the data we don’t have, there are difficult criticisms to address. In the first scenario, somebody is likely to say, “Well, you’ve misunderstood the problem.” When you misunderstand the problem, it’s usually because you’ve used raw data, and those statistics only tell the story of the people you talked to, not the story of all the people you didn’t talk to, which is a valid point to make.

The answer to that is, as I have said, to do inferential statistics, make estimations of the entire population so that you do take into consideration people you didn’t talk to. But then your critics say, “Well, you’ve just made those numbers up. You’ve played some sort of little mathematical game and made a guess.” My answer to that is that an estimate is not a guess, and the primary difference between an estimate and a guess is that an estimate comes with an estimate of the error, so you know how wrong you’re likely to be. An estimate is as likely to be right as the available data and current scientific techniques enable us to be. The raw data will only get the patterns right by random luck. To statisticians, that calculation of the error is at least as important as finding that point in the middle.
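
To make the distinction concrete, here is a minimal Python sketch of an estimate that carries its own error. It assumes a simple random sample and uses the normal approximation for a proportion; the function name and the example counts are hypothetical, and real conflict data rarely satisfy the random-sampling assumption.

```python
from math import sqrt


def proportion_with_interval(successes, n, z=1.96):
    """Return (estimate, low, high) for a proportion.

    Assumes a simple random sample; z=1.96 gives an approximate 95% interval
    under the normal approximation.
    """
    p = successes / n
    se = sqrt(p * (1 - p) / n)  # standard error of the estimated proportion
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


# Hypothetical counts: 40 of 100 sampled records share some attribute.
estimate, low, high = proportion_with_interval(40, 100)
print(f"estimate={estimate:.3f}, interval=({low:.3f}, {high:.3f})")
```

A guess would report only the first number. The estimate reports all three, so a reader can see how far off it is likely to be.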

So, dealing with that continual pushback – you’ve made this up, it’s not relevant – is difficult. Of course, you also get apologists or defenders of perpetrators simply saying it’s fake news and they refuse to engage. That’s more a political problem than a technical one, but it is often the way people push back.

You’ve given evidence as an expert witness in the trials of several dictators and military leaders accused of genocide. That is something that takes a huge amount of courage. Has it been particularly scary or intimidating?

Usually, the thing that is really intimidating is the awesome (and I mean that in the formal, more antiquated sense of the word) responsibility to speak on behalf of the victims. Because if you screw it up, then you’ve harmed them. It takes, I think, a lot of boldness to be willing to take on that responsibility, that obligation to the victims to tell their story correctly and not to be rebutted. Because what if I went to court and lost? I could discredit some of their stories. That would be the worst possible outcome. That’s really intimidating.

Now in a couple of cases, I’ve been intimidated by the people, or not intimidated so much as just made nervous by them, but none of the three former heads of state that I’ve testified against were intimidating. One of the colonels was very intimidating though.

During the course of his leadership of the National Police of Guatemala, he had overseen the disappearance of hundreds of students and union leaders. When I testified against him, he gazed across the room at me with a look that seemed cold and quizzical, like he was wondering, “How did I miss you? How did you slip through my net? I should have killed you in the 1980s.”

Now, point of fact, I was not in Guatemala when he was the leader of the National Police. But he was still very powerful in the military when I was traveling there later in the 1980s. But, you know, a heavy stare from a scary guy is only so scary when he’s sitting in the defense chair in handcuffs.

What can people do to help support this work?

I don’t specifically need help for this work in human rights. What I need help with, and I think we all need help with, is promoting the vision of science. What does it mean to argue with data? What does it mean to understand things with statistical methods? And I think that the most important thing I need help with is explaining to everyone who will listen that a statistic is not a number; it’s a range, an interval. If we have all the data, we don’t really need statistics: in the vanishingly rare case of perfect data, the naïve numbers tell their own story.

If you don’t have all the data, if you’re missing a great deal of the data, then you need to know what you can say even in the absence of that data, and say it with respect for what you’re missing, and there are rigorous methods for this. There are any number of methods for population estimation and missing data management that are respectful of the missing data, but that are foundationally scientific methods.
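
As one textbook illustration of population estimation (not necessarily the approach used in any particular project), the Python sketch below applies the two-list Chapman capture-recapture estimator. It assumes the two lists are independent and that every individual is equally likely to be recorded, assumptions that real documentation efforts often violate; the list sizes are hypothetical.

```python
from math import sqrt


def chapman_estimate(n1, n2, m):
    """Two-list capture-recapture (Chapman estimator).

    n1, n2: number of people recorded on each list
    m:      number of people found on both lists (after record linkage)
    Returns (estimated total population, standard error).
    """
    n_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    variance = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / (
        (m + 1) ** 2 * (m + 2)
    )
    return n_hat, sqrt(variance)


# Hypothetical lists: 200 and 150 documented victims, 50 appearing on both.
n1, n2, m = 200, 150, 50
documented = n1 + n2 - m  # unique people actually observed
n_hat, se = chapman_estimate(n1, n2, m)
print(f"documented={documented}, estimated total={n_hat:.0f} (SE {se:.0f})")
```

The gap between the documented count and the estimated total is the part of the population that neither list saw, and the standard error says how uncertain that total is.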

It’s important for us to begin to understand that when these methods are applied, the conclusions drawn can often be a long way from what someone thought the raw data was telling them, because there is all the support from other data and the rules and mechanisms of science that help us get from one observation to another. In statistical reasoning today, we are often confused by the apparent successes of technologists who have a tremendous amount of data from an entirely unrepresentative part of the world. They produce results that seem compelling, and might well be compelling, but we should not be fooled. Nearly all data that we have about the world is a sample, and we have to ask, “In what sense does that data actually tell us about the world and in what sense does that data (like the critics I told you about a minute ago) just tell us about the process that created the data?”

Science gives us tools to go beyond sampled data and talk more accurately about the world. But if we just skip that step and we let the technologist give us a dashboard with a blinky graph, we’re not doing ourselves any favors. The world is much more complex than gathering a bunch of data and making a graph out of it. We have to think about what we didn’t see and keep that in the graph too, so that our conclusions are not a trivial artifact of our collection process.

If you had any advice for people starting out in your field, what would it be?

Honestly? Don’t even undertake the field unless you really love all the fundamentals. If someone says, “Well, you really need to rewrite that whole piece of code and it’s going to take you three days,” that needs to seem exciting because you get to sit with yourself and rewrite the code and think through exactly what it’s doing. If you don’t want to understand Bayesian statistics or linear algebra or fundamental algorithms in computer science, then maybe this isn’t the field for you.

You should embrace and be excited about the challenges of both software and mathematical statistics. To be an applied statistician, you don’t have to be a brilliant mathematician. I’m certainly not. You don’t have to be a Google-level programmer. I’m certainly not. But you have to really enjoy both of them, and you have to be enthusiastic enough to find your way through learning to be a good coder and to be a more than adequate reader of mathematical statistics so you can be a competent user of the techniques that the mathematical statisticians bring to us.

What platform do you use for your analytical technology?

We use Linux servers and a huge suite of open-source software. We also build some open-source software, and we have a few R packages that we publish ourselves, but we write things in a combination of Linux shell scripts, makefiles, Python, Bash, R, and occasionally C++ and Java, although that’s not so common anymore. We have several routines written in Julia. We both benefit from and, as much as we can, contribute to the world of open-source software. LLMs are beginning to play a role, but we prefer open-source models and AI applications in which we can measure model fit. Our work would not be possible without the huge contribution of tens of thousands of programmers around the world.

Our code lives on GitHub, although in private repositories. But we spend most of our time, and I teach all my students to be at their happiest and most comfortable, in a Linux or a Unix terminal. It’s a beautiful place.

What does winning this award mean for you?

I’m very, very honored and grateful to Sense About Science and Nature for giving me this prize, but I’m accepting it in the name of my partners. I would not have been doing this for 35 years without partners in many, many countries who have really been the ones who stood up for science. They’re the ones who really wanted to know the answers and were willing to take immense risks with their organizations and with their individual security to capture the data, to work with me, to shape the data into some kind of usable scientific analysis, and then to take the reputational risk of defending that analysis even though they often are not, and have never really been, scientists themselves.

They’re lawyers and journalists and advocates. They’re not scientists. So, it’s been risky for them, and I admire and am grateful to my partners in so many countries. I’m going to mention Mexico, Guatemala, and Colombia as my most recent projects. The tremendous bravery of the people that I work with absolutely boggles me, day in and day out. The consequences for them are not the possibility of an unpleasant academic spat with someone, but potentially physical harm. They constantly inspire me.