Misinformation and Dataset Biases

Cassandra Roundtable 3

Cornell Tech. Misinformation and dataset biases. In association with TADA 2022.
October 7, 2022, 9am EDT.

Seven people gathered at a table for the Cassandra at TADA roundtable

Link to video through Cornell Video on Demand




Kathy McKeown

On misinformation

There is a distinction to be made between misinformation and disinformation: misinformation refers to claims that are verifiably false, while disinformation involves the intent to deceive. In the field of natural language processing, there has been a lot of work on fact-checking, which would fall under misinformation.

The state of the art is a very big question. I think it varies a lot depending on what domain you are in. So, we have done work on checking misinformation about Covid, and there the question is “how do you know what is true?” and “what corpus do you use to determine what is true?”, and “what is true today may not be true tomorrow.”

On bias

In our field, I think of three different kinds of bias. (…) One is in machine learning algorithms, where the algorithm is attempting to predict a certain factor, but it is using information that has a spurious correlation with that factor. Then, there is bias in the data, and more recently we have realized that machine learning algorithms work from labeled data where there is not only bias in the data but also bias in the people who annotate the data.

Sarah Shugars

On misinformation

In the political realm there is increasing dissent around what is actually true, and sometimes particular word choices can have implications that are not necessarily false but can lead things towards misinformation, and that is where things start to get really challenging. It is not only distinguishing misinformation from disinformation but also distinguishing satire from bias and so on.

In terms of the state of the art, I think one of the things that is really interesting around this challenge is trying to figure out how to scale up some of the misinformation identification approaches, because there are NLP approaches for doing this that can be fairly successful. But any time you’re applying a machine learning algorithm to something, there are going to be false positives and false negatives. So, thinking about how important it is to get that classification right, how to get it right quickly (…) I think that there are a lot of interesting challenges in this space.

On bias

At a very broad level, bias represents a disconnect between the conceptualization of something and the operationalization of that thing. Sometimes it is a bias that you know about: you want to measure something, but you are actually measuring something else.

The problem, I think, particularly comes when you do not know about the bias. So, you think that your operationalization is measuring something, but it is not actually measuring that thing. It could be due to missing data, to being unaware of something in the data, or to how the data was collected, and oftentimes these are situations where you just do not know what things to be looking for.

Arthur Spirling

On misinformation

My observation about how CS treatments versus, say, social science treatments differ is that folks in social science spend a lot of time, maybe too much time in a certain way, thinking about these measurement problems, of what is “misinformation” versus what is “disinformation”, and, in fact, it is very easy to give what is a seemingly clinical difference. (…) But in practice these are very, very hard to differentiate, I mean almost impossible.

What I have seen over the years, the last ten years, even in social science, which I think is in general much more sensitive to measurement issues, is this outsourcing of it, saying “well, this information is very tricky for us to measure so we are going to actually measure something called fake news. And actually, we are not going to measure it. We are going to have this other outfit, that just does collation of news, decide how fake news-y a particular thing is.” It is this very weird belief that the measurement problem is hard, and then solving it by outsourcing it to what is actually a black box algorithm, because it is a proprietary firm. I found it quite concerning.

On bias

I have this sociological observation that bias has obviously normative overtones (…) I think the problem is as soon as something has some sort of normative implication, it makes it very hard for us to discuss, because we are introducing different objective functions for our normative understanding of the world.

People love consuming biased news. They love it, right? I know I do. My parents do. I love finding TV shows that sort of tell me things I already think about the world, and so it increases my utility in a certain sense. But still, there is this idea that bias is bad, and the fact that we disagree over exactly what we mean by the normative objective function, I think, makes it very hard to make progress on that.

David Mimno

On misinformation

My training is in the humanities to some extent, but mostly in quantitative, mathematical, natural language processing and text processing, and at no point in my PhD did I expect to even talk about normative issues.

I think that in this discussion about misinformation and disinformation what is really hard is that we have to, at some point, have an opinion and say ‘this is what I think the machine should know or say.’ Sometimes that is sort of factually grounded, and sometimes it is purely ‘this is what we want the system to do.’ I feel that the CS community reacts to this like soap and water. I think that there is a big gap in culture that we have to talk about before we can even start to think about solutions.

On bias

One thing that characterizes social science and humanities work is that you start with a collection, or some collection that approximates a social process that you are interested in, and really try to get representative documents. My work in natural language processing is much more like ‘Okay, what can we get access to?’ So, an enormous amount of work happens on Wikipedia, because you can just hit and download.

The result of that is that before we can even talk about the normative biases, there is a large category of natural language processing work that does not even think about what is in the collection. It is (I am exaggerating) purely driven by ‘what can I get?’, which is a fine thing for certain types of work, but we have to recognize that this does not even attempt to get unbiased data, or even to think about what bias would mean for data.