Mismatches between computational tools and social science needs
Cassandra Roundtable 4
In association with the Center for Political Studies.
- Chris Fariss, UMich
- Andy Halterman, MSU
- Walter Mebane, UMich
- Julia Mendelsohn, UMich
- Arya D. McCarthy
- Giovanna Maria Dora Dore
- Eva M.E. Klaus
I was struck by the difference in how the two groups approach asking questions…it’s a different way of approaching the topic. I don’t want to paint with too broad of a brush because there are a lot of computer scientists who are really good at moving back and forth; but, it’s hard to switch the level of questioning from ‘I want to understand this tool and make it better’ to ‘I have a substantive question about real variables in the world that are hard to measure and I need some computational tool to help me measure one of these variables.’
What methodologists do is try to pioneer new techniques, but the vast majority of political scientists want to use a tool that works, that other people have used before them, and where there is a nice paper [to justify their methods]…So if you’re not a methodologist you don’t necessarily want to be on the absolute bleeding edge of new methods because it takes a lot of work to justify the use of a new technique in political science.
Most political scientists aren’t interested in building tools because that isn’t what they do. So, the role of the methodologist is to talk to our substantive colleagues, we figure out the problems that they have, we have a sense of what’s out there – ‘Can I build a tool that solves this data problem’ – and then we build the tool and work with them, or they use the tool. It takes a lot of work to identify the need or problem – which is often quite specific or niche – and then build the tool…and this is where a lot of fruitful collaboration happens.
One thing that I have seen [that seems of particular use to] people who firmly consider themselves within social sciences is having methods that can help with automated coding of data, so more of the data labeling side…I do think that being able to get labels for a lot more data in a much cheaper way has been something that I have seen a lot more.
Very concretely, based on the communities that these [computational] tools are designed for, it could be a pretty high barrier to entry for social scientists who have not taken many computer science courses. For example, a lot of the tools are in Python and a lot of social scientists use R or other software…so it may be hard to get started with [such tools]. It can be too much at once, which stops people.
Another mismatch [between computational tools and social science research], perhaps more foundationally, may be issues with validity. I think this has to do with different goals in different fields. In computer science, it is totally fine if your model has 60% accuracy, so long as the previous-best model had below 60% accuracy. But this may not actually be sufficient, especially because that accuracy might be distributed across classes in some weird, imbalanced way that biases all the results that you would use to answer social science research questions.
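The point about accuracy being "distributed across classes" can be illustrated with a small sketch. The labels, class sizes, and the always-predict-the-majority classifier below are hypothetical and chosen only for illustration; they do not come from the roundtable. The sketch shows how a model can reach 60% overall accuracy while never detecting the minority class, so that any downstream estimate of that class's prevalence is biased.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the share of its items that were labeled correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy data: 100 documents, 60 "neutral" and 40 "protest".
y_true = ["neutral"] * 60 + ["protest"] * 40

# Hypothetical classifier that always predicts the majority class.
y_pred = ["neutral"] * 100

overall = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)

# Downstream quantity a social scientist might care about:
# the estimated share of "protest" documents versus the true share.
estimated_share = y_pred.count("protest") / len(y_pred)
true_share = y_true.count("protest") / len(y_true)

print(f"overall accuracy: {overall:.2f}")
print(f"per-class recall: {recall}")
print(f"estimated protest share: {estimated_share:.2f} (true: {true_share:.2f})")
```

Reporting per-class recall alongside overall accuracy is one simple way to make this kind of imbalance visible before the labels are used to answer a substantive question.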
[Political science] deals with serious measurement issues, and it is not at all something where you can [intuit] the right answers…For deeper measurement issues [in labeling using computational tools] it would be really productive to have a collaboration, conversation involving social science. Political science is mainly about measurement, methodologically…we know a lot about the nuances there.
Machine learning tools are helpful for incorporating more data, but we need a collaborative effort to see what the techniques do and how we deal with prediction versus inference.
You can’t expect to point to the black box and have it solve all your problems. It’s not automatic in that sense. But you can bring to the device other infrastructure, other work, other knowledge, and do things to exploit what the black boxes – which shouldn’t be treated as black boxes, these predictive algorithms – can do.
One of the problems with social science, at least political science, is that we are not even close to where tiny refinements are the thing that is going on. We are just trying to learn what’s going on at all, and trying to find pots of data that can be applied to understand these things, either through description – which I still think is worthwhile – or through causal inference opportunities.
Complexity, rapidity, diversity, uninterpretability – [Big Data] has all of these challenges.
Learn to program, program to learn.
What I think most people mean when they talk about Big Data is a large, unstructured dataset that we can use computational tools to explore and maybe find insights from. But we know that it is easy to find patterns in large-scale datasets that may or may not be real, and that’s a key challenge. In political science, at least, some people are uncomfortable with the use of those large-scale datasets because there is a big concern about finding patterns that aren’t meaningful patterns when we explore these datasets.
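The concern about patterns "that may or may not be real" can be made concrete with a short simulation. This is a minimal sketch under stated assumptions: the data are pure random noise, the sample size and number of predictors are arbitrary, and the cutoff of 1.96 divided by the square root of the sample size is only a rough approximation to a 5% significance threshold for a correlation. The function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_obs=100, n_predictors=1000, seed=0):
    """Count noise predictors whose correlation with a noise outcome
    exceeds a rough 5% significance cutoff.

    Every variable is independent random noise, so each "hit" is a
    pattern that is not real.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    threshold = 1.96 / math.sqrt(n_obs)  # approximate cutoff, an assumption
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(predictor, outcome)) > threshold:
            hits += 1
    return hits


spurious = count_spurious()
print(f"{spurious} of 1000 pure-noise predictors look 'significant'")
```

With a cutoff near the 5% level, roughly one in twenty unrelated variables would be expected to pass by chance, which is why exploratory searches over many variables call for corrections for multiple comparisons or validation on held-out data.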
It seems to me now that the mismatches [between computational and political science] may be – in the moment – cross-sectionally a challenge or a problem, but temporally it is a feature of the interrelationship between these two fields in academia and elsewhere. And there are opportunities that come from these mismatches that we can take advantage of.
Hopefully the boundaries [between computational and social science] are continuing to loosen. That’s probably the biggest challenge…those opportunities [to talk between the two fields] are still less frequent than might be best for addressing this mismatch. So, the more conversations like this, probably the better.