On Tech Ethics Podcast – Navigating Big Data and Data Science Research Ethics

Season 1 – Episode 7 – Navigating Big Data and Data Science Research Ethics

Examines important factors that data science researchers need to take into account when it comes to ethical data and participant awareness.


Episode Transcript


Daniel Smith: Welcome to On Tech Ethics with CITI Program. Our guest today is Katie Shilton, who’s an associate professor in the College of Information Studies at the University of Maryland, College Park. Among other roles, Katie is also a principal investigator of the PERVADE project, which is a multi-campus collaboration focused on big data research ethics. Today, we are going to discuss some key considerations for data science researchers when it comes to ethical data use and participant awareness. Before we get started, I want to quickly note that this podcast is for educational purposes only. It is not designed to provide legal advice or legal guidance. You should consult with your organization’s attorneys if you have questions or concerns about the relevant laws and regulations that may be discussed in this podcast. In addition, the views expressed in this podcast are solely those of our guests. And on that note, I want to welcome Katie to the podcast.

Katie Shilton: Thank you so much for having me. I’m really glad to be part of this.

Daniel Smith: Absolutely. So I provided just a brief introduction to you, but can you tell us a bit more about yourself and what you currently focus on?

Katie Shilton: Yeah, so I describe myself as a tech ethics and data ethics researcher, and specifically I’m really interested in what people do in terms of ethical decision making in their everyday work lives. So I have studied software engineers to understand how they think about ethics in their work. I have studied data scientists to see how data scientists are thinking about ethical decision making, which is a lot of what we’ll talk about today. And I’m really interested in how we bridge the gap between big ethical principles. Listeners to this podcast might be familiar with the Belmont principles or maybe some of the AI ethics principles that are starting to be put out into the world. And those are great and well and good, we need them, but how do people translate those into their daily work? How do they make technical or research decisions using those broad guidelines? That’s kind of where I focused my efforts. And the PERVADE project, which is my most recent big project in the data ethics space, has really been focused on those questions.

Daniel Smith: Thank you, Katie. So can you tell us a bit more about some of PERVADE’s key findings, particularly around data collection and use?

Katie Shilton: Yeah, so the PERVADE project is a big collaboration. There are six campuses and seven researchers involved as well as numerous grad students. And we have been trying to triangulate between different stakeholders in data ethics to see if we can make some recommendations based on the expectations and the practices of all of the different people involved in big data research. So we’re focusing on human data or what we call pervasive data, which is data about people that is generated through digital interactions. So you can think about, of course, your internet search history or social media posts, things like that, but also smart home device data, your smart refrigerator or your smart TV may be recording data about you and about your habits and routines. That kind of data can be really interesting to researchers. And so we’re interested in any form of digital interaction data about people, and what we’re asking is how do the users of devices that are documented by that data feel about their data being reused for research?

What are their expectations around research reuses of that data? What are data scientists doing on the ground to try to navigate ethics in this kind of undefined space of when it is okay to use that kind of data about people? And then what are regulators recommending? In the US, these regulators tend to be university institutional review boards, or IRBs, who are the university staff and faculty who oversee research ethics on campuses. And so we’ve studied all three stakeholder groups, and what we’re finding is a pretty big gap between user expectations for data reuse and the guidance given by IRBs and the actual practices of data scientists. Predominantly what we have seen in our studies of users, and we’ve studied users on multiple different online platforms. So colleagues of mine have studied Twitter users and Reddit users. With my colleagues Jessica Vitak and Sarah Gilbert, we have done surveys of dating app users, Facebook users, Instagram users, Reddit users, and we have asked them about their expectations for data use using little vignettes or stories that can help you get at the factors that matter to people.

But what we’re finding is the number one factor that matters to people in their data reuse is whether or not they were asked for consent first, which makes a lot of sense. If you think about the history of research ethics in the United States, we have 40 years of research ethics practice and law, which says that if you are involved in a laboratory study, you will be asked for consent. But people have taken that, I think, to mean that if you were involved in academic research, you will be asked for consent. And so people expect to be asked for consent for uses of their online data. Broadly, that is not at all what actually happens when it comes to research practice. First, it is not what IRBs recommend. In many, many cases, if data is publicly available, then most of the time, unless it is identifiable to a particular individual, IRBs will probably not tell you that you need IRB approval, or even consent, to collect it.

And researchers generally, in many cases, depending on the kind of data, but in many cases again, when we’re talking about sort of publicly available data, Reddit data or Twitter data, they’re not asking for consent and they may not even be able to at scale. If you are doing a study of 100,000 or 500,000 Twitter users, it may be physically impossible. You can’t spam them all to ask for consent. And so there are some real tensions here around what researchers are doing, what is considered ethical under our current IRB guidance and what users expect. So PERVADE’s recommendations have really been focused on trying to narrow this gap and help give researchers some tools for thinking through when it’s acceptable to use data without consent, when they might do other forms of public awareness around the use of that data that aren’t a traditional informed consent and when they really do and should still secure forms of informed consent. So that’s really what we’ve tried to do is provide some recommendations for researchers.

Daniel Smith: That’s really interesting. And here in a moment we’ll get some more into those recommendations. But just to touch on one thing for a moment, you mentioned other forms of awareness other than traditional informed consent. Can you talk a bit more about what those are and how folks can think about those?

Katie Shilton: So there’s some room for creativity here. PERVADE hasn’t thought of all of the ways that this could be done. And I want to encourage researchers to be reflective and think about what might make sense for the communities that they’re studying. But we are pretty inspired by public forms of scholarship on the large scale. So researchers who share back their findings with communities either before or after publication to say like, “Hey, I did this study on Twitter and here’s what I found. And if you have tweeted between these dates, you might have been part of my study,” to let people know because there’s a real lack of awareness among Twitter users broadly or Reddit users broadly or Facebook users broadly, that they might have been part of studies. And I think increasing that awareness is important. Another way of thinking about this is to think about the community you’re studying and whether there are gatekeepers to that community who it might make sense to talk to first.

So if you are studying a health forum on Reddit, Reddit forums have moderators, and those moderators are generally experts in that forum. They may even have rules about what can and can’t be said in those forums, what can and can’t be done in them. Those folks are good folks to reach out to before you harvest data from a Reddit thread. Does it fit the norms of this community? Are there people I can talk to? If it’s not everybody in the study, maybe I can’t contact everyone in the study, but I can contact leaders in the space and try and understand if I’m violating any norms of the space by studying it.

Daniel Smith: So I guess just kind of adding to that a bit, what are the best practices for researchers when it comes to ethical data use? And is there kind of a process that they should be thinking through as they navigate these different platforms and user expectations and community norms and so on?

Katie Shilton: Yeah, so PERVADE has written a paper that recommends that data scientists and computational social scientists and other folks who are sort of harvesting or using online data, or digital data, I should say, for research think about two axes. The first is awareness, which we’ve talked a little bit about: how likely are the people you’re studying to be aware that you are able to study them without consent? Are there ways that you can increase awareness? So that’s sort of the first one. The second one that I think it’s worth talking about, and that researchers can walk through, is thinking about the power relationship between you as a researcher and the community that you’re studying. And this comes from a history of data ethics in which most of this data that we’re talking about was not created to be research data. It was created for surveillance capitalism.

It was created by platforms to understand what you are doing on those platforms so that they can sell you more stuff. Not entirely, but many of these forms of data were and or are used that way. And as researchers, I think we have to be reflective about the fact that this data was not research data first, it was commercial data and that we need to think about the potential harms to communities that can happen through reuses of data. So an example of this, and my colleague Casey Fiesler has written about this with her graduate students, are dangers of amplifying content that was meant for one context into other contexts. So I’m going to return to the example of mental health research on Twitter. There are Twitter communities that have very frank conversations about mental health. It’s one thing for those conversations to happen on a platform within a smaller sort of community, even if it’s public.

And it’s a different thing for the New York Times to publish those tweets or for a researcher to publish them in Nature and say, “Look at what’s happening here.” And so that sort of amplification can be a real harm. It’s something that researchers should be conscious of and should be thinking about, their power to amplify content beyond its original context. There’s this wonderful book, Data Feminism, that I highly recommend that talks about the fact that too frequently big data has been used to oppress rather than to make people’s lives fundamentally better. And so we have to counter that as researchers. And so there are ways of thinking through our own power relationships, and they’re like, “Are we the right person to be doing this research? Should we be partnering with community members to do this kind of research? Should we leave that kind of research to the community itself? Should it be completely participatory or guided by a community?”

And so these are all good things for researchers to ask themselves. Now, we realized this is eight or nine or 10 things you sort of have to ask yourself as a researcher: who am I studying? How aware are they of what I’m doing? What are the power relationships here? This is a pretty complex set of questions. And the answer is, it depends. People ask me all the time, “Okay, well I’m doing this project.” I’m like, “Oh, okay, okay, okay.” We have to go through so many questions before I’m going to be able to say, “Yeah, that sounds good,” or, “Have you thought about doing it this way?” When the answer is always that it depends, that’s a big ask for researchers, so we created a tool to help guide them through this process. It’s a quiz, and I joke it’s like a BuzzFeed quiz, but for research ethics, that asks people about where they’re getting their data from, what guidelines exist in their field for uses of that data, who’s documenting that data, who might be left out of that data, all of these kinds of questions.

And then we give people resources. The decision support tool is what we call it. So it’s there to sort of guide researchers to various resources to think through hard problems of awareness and power in their research so that hopefully we can make this process a little bit more intuitive for researchers, a little bit easier for researchers as they’re designing their projects.

Daniel Smith: That’s excellent. Now I will certainly include a link to that tool in our show notes so that our listeners can check it out and kind of explore these different considerations more. I guess also on that note, since you had mentioned that throughout this conversation a bit about the regulatory or oversight side and institutional review boards, do you have any kind of best practices or even resources for folks working in that area as they navigate these issues as well?

Katie Shilton: Yeah, this is something… this is a great question. I’ve done a number of talks for IRB staff and regulators, and IRB staff are… they’re so thoughtful about this kind of data because they’re seeing it increasingly. They’re seeing questions of, “Should I be studying Twitter?” or “Can I do this research with smart devices in people’s homes?” And they want, I think, to be able to provide good guidance to these studies. They may or may not be able to, depending on… So right now in the US, the way the Common Rule is written, the Common Rule is the piece of legislation that says that if you are at a federally funded university or a university that receives any federal funding, you need to have all human subjects research reviewed by an institutional review board. But the Common Rule carves out publicly available data.

And because of that carve-out, right now a lot of pervasive data research doesn’t go up for IRB review. And I don’t think we’re going to be revising the Common Rule again anytime soon. It was actually revised relatively recently, but right before the big data research era really took off. And so these questions haven’t been directly addressed at the sort of legislation phase. That said, IRB staff are frequently, not in every case at every institution, but frequently, pretty well versed in data ethics and big data ethics and are good resources for advice, even, I think, for researchers who don’t need to apply to the IRB because their data is public and will be exempted. They don’t need to get that approval. They might still ask, right? And one of the things we do in our tool is say, “It might be worth talking to an IRB in these particular cases,” even though you might have an exempt project, because IRB staff have practical research ethics knowledge just from their daily work.

They’re constantly sort of thinking about research ethics, usually they’re thinking about power in research, they’re thinking about awareness of various forms. And yes, we might not need the form of awareness that is full informed consent, but IRB staff may be able to talk to you about other forms of awareness that might work for your project. So thinking of them as research ethics consultants in these cases, I think can be useful. And the research ethics staff or the IRB staff I’ve talked to have been excited about that idea. We hope that the decision support tool might also be useful to IRB staff who want to send it to their PIs and say, “If you want to walk through this, it’s nothing we’re requiring, but might be useful to you.” So that’s something we’re hoping to pursue as well.

Daniel Smith: Wonderful. So like I said, I’ll certainly include a link to that data ethics tool because I think it’ll be of benefit to everybody to check it out and kind of explore these issues some more. We did cover a lot of different considerations in this brief time, but you also mentioned some other resources that I think would be helpful. So I also encourage our listeners to check out the publications on PERVADE’s website as there are a lot of great resources there too. And I invite you to visit to learn more about our new Big Data and Data Science Research Ethics course, which was authored by Katie and her colleague Emily Dacquisto.

The course covers the unique issues associated with big data and data science research ethics, including privacy and data protection, which we didn’t touch on as much today, but there is a more in-depth exploration in this course, participant awareness, which we talked a bit about today, but you can learn more in this course as well. And then finally, power and the different frameworks in which you can think about your power in relation to the folks that you are working with in a research capacity. And if your work involves big data and data science, I really think that you’ll find this course extremely helpful.


How to Listen and Subscribe to the Podcast

You can find On Tech Ethics with CITI Program available from several of the most popular podcast services. Subscribe on your favorite platform to receive updates when episodes are newly released. You can also subscribe to this podcast by pasting “” into your podcast apps.


Meet the Guest

Katie Shilton, PhD – University of Maryland, College Park

Katie Shilton is an associate professor in the College of Information Studies at the University of Maryland, College Park. Her research focuses on technology and data ethics.


Meet the Host

Daniel Smith, Associate Director of Content and Education and Host of On Tech Ethics Podcast – CITI Program

As Associate Director of Content and Education at CITI Program, Daniel focuses on developing educational content in areas such as the responsible use of technologies, humane care and use of animals, and environmental health and safety. He received a BA in journalism and technical communication from Colorado State University.