Season 1 – Episode 9 – Impact of Recent Social Media Platform Changes
This episode discusses how the recent changes to the social media landscape may affect researchers and society.
Daniel Smith: Welcome to On Tech Ethics with CITI Program. Our guest today is Nick Proferes, who is an assistant professor at Arizona State University’s School of Social and Behavioral Sciences. His research interests include users’ understandings of socio-technical systems such as social media, societal discourse about technology, and issues of power and ethics in the digital landscape. Today we are going to discuss how recent changes to the social media landscape may affect researchers and society.
Before we get started, I want to quickly note that this podcast is for educational purposes only. It is not designed to provide legal advice or legal guidance. You should consult with your organization’s attorneys if you have questions or concerns about the relevant laws and regulations that may be discussed in this podcast. In addition, the views expressed in this podcast are solely those of our guest. And on that note, welcome to the podcast, Nick.
Nick Proferes: Hey, thanks so much for having me.
Daniel Smith: It’s great to have you. So I gave you just a very brief introduction. Can you tell us more about yourself and what you currently focus on at Arizona State?
Nick Proferes: Sure. So really broadly speaking, I do a lot of research into how people think about technology and its place in their lives. In particular, I focus a lot on social media, how users develop expectations and beliefs about how social media works, and this is based not only on things like interface design, but also their lived experiences, and how companies talk to users about their systems working. So I’ll give you a couple examples.
For example, along with Anna Lauren Hoffmann at UW and Michael Zimmer at Marquette, I did an analysis of every single thing that Mark Zuckerberg has ever said in public about Facebook to understand how his language use has evolved over time, from sort of the Harvard College-only Facebook days to the IPO, to congressional hearings. More recently, I’ve also been looking at Reddit’s upcoming IPO and how it’s been discussed by Reddit users, what sort of changes they imagine happening on the platform and the impact that it could have on their communities.
I also do a lot of work on how users’ expectations about information flows on social media sometimes conflict with the actual on-the-ground flows. In particular, I’ve really looked a lot at how users think about and perceive scientific research that uses publicly available social media data. For example, along with Casey Fiesler at UC Boulder, we actually surveyed Twitter users about their contextual perceptions of the use of tweets for research. And we found some really interesting things, for example, that a majority of participants actually thought that researchers always have to ask permission to use social media data, even if it’s public, which listeners of this podcast may know is not actually always the case. But we also found that these folks had really mixed feelings about the idea of their tweets being used or quoted in studies.
And lastly, not too long ago, along with a team of other researchers, I completed a big analysis of how scientists are actually using and gathering data from Reddit as part of scientific research. And so we looked at over 700 papers that actually had been published using Reddit data, and mapped the domains this research was occurring in, the Subreddits that were being commonly studied, and the kinds of ethical issues that researchers are encountering and discussing in their publications.
Daniel Smith: Thanks, Nick. So you touched on quite a few things that I think could be affected by a lot of the recent changes that have gone on in the social media landscape. But before we get into that, can you provide us just with a quick overview of some of those recent major changes and other events that are taking place?
Nick Proferes: Yeah, so it’s been a wild year in social media. So this has included things like Elon Musk purchasing Twitter and very recently rebranding it to X. And I’ll probably continue to refer to it as Twitter just because it’s an ingrained habit. Meta launching its own Twitter competitor called Threads, the launch of Bluesky, which is currently in sort of a closed beta, and which is another Twitter competitor backed by some former Twitter folks, including Jack Dorsey. Some big shifts towards federated social media platforms like Mastodon, the giant rise of TikTok, as well as the subsequent banning of it from government devices in certain US states. And that’s impacted researchers.
Reddit gearing up for an initial public offering of stock. Some interesting lawsuits that have happened actually with Elon Musk and a hate speech watch group called the Center for Countering Digital Hate. The overall rise of generative AI systems like ChatGPT and Midjourney, as well as some really interesting policy changes happening in the EU with the introduction of the Digital Services Act, which would actually put many of these social media companies under stricter online speech rules, but would also give researchers access to specific kinds of data from the social media platforms themselves.
Against this sort of backdrop, we’ve also had a number of changes in how these platforms themselves operate. And most notably for researchers, there’s been some big changes to Twitter and Reddit’s APIs, or application programming interfaces, which researchers had sort of previously used to be able to collect really large datasets for either low cost or in many cases free. That’s now essentially sort of been cut off.
Daniel Smith: There really is a lot going on. So I think it might be helpful to break it down one by one per platform. You mentioned some of the recent changes to Twitter, now known as X, but we’ll probably refer to it as Twitter during this conversation. To get more specific, what are some of the implications there that researchers should be aware of?
Nick Proferes: Yeah, so the first thing I want to say is that Twitter had been a really prominent source for researchers to gather data on really a huge variety of topics, everything from natural disasters to predicting flu trends, to trying to predict stock market movement, to understanding sentiment, to understanding politics and culture. There’s been essentially thousands of scientific research papers that have been published at this point using Twitter data. So it really had become this major, major source and site essentially for researchers to gather data. And a lot of that was because they had relatively open APIs that really allowed you to gather this data in bulk. There’s a researcher, Zeynep Tufekci, who actually went as far as calling Twitter “the model organism” for understanding certain different social phenomena online. And that was because of this sort of openness.
So in February of this year, Twitter actually announced that there would be some pretty big changes to its API. And this included removing most free access and introducing paid tiers. And as I understand it today, the cheapest version of the API, which costs $100 a month, only allows you to grab about 10,000 tweets a month, which is less than 1% of what researchers could actually previously get for free. And while that still may sound like a lot of tweets to some folks, it’s really just a drop in the bucket when you consider the millions of tweets that are essentially sent each day. And many researchers were actually pulling on the order of millions and in some cases billions of tweets. So they could really understand sort of broad scale trends in conversations and scientific phenomena on this platform.
Now today, if you want to gather a large amount of data, an enterprise account on Twitter’s API can cost anywhere from $42,000 to $210,000 a month. That’s not something that’s really in the budget of most researchers. Particularly if you consider needing to pull or wanting to pull data across a much larger timescale, the costs just kind of keep adding up. So this has really killed a lot of ongoing research projects. It’s certainly thrown major wrenches into work that students were doing for master’s theses or dissertations. And it just generally has made the process and cost of doing research on Twitter much more onerous.
And in addition to this, Twitter has actually told some researchers that they would actually need to go back and delete data that they had previously collected under prior APIs, in particular its Decahose, which was an API that provided a random sample of 10% of all the content on the platform. This is, of course, unless they pay for an enterprise account. So this has some pretty tough implications for not only those projects that relied on this data, but also for the prospects of sort of open science, which often encourage dataset disclosure to further reproducibility of that research.
Now, I will say in April of this year, a group of academics, journalists, and other researchers called the Coalition for Independent Technology Research sent a letter to Twitter actually asking them to help it maintain access. And that group listed over 250 projects that would actually be jeopardized by ending free and low-cost API services, including research into things like harmful content, information flows, crises, news consumption, public health, elections, and political behavior. And while I think this group has done a great job in calling attention to the problems that researchers in this space face, ultimately Twitter has not backed off their plans.
Daniel Smith: So those are definitely some serious changes. And I think that Twitter is obviously one of the more prominent platforms, especially for researchers, but it’s only one piece of the puzzle. So you mentioned earlier that Reddit is preparing for an upcoming IPO and there’s some other things going on possibly related to that. So can you tell us some more about that and the implications for researchers as well?
Nick Proferes: Yeah. Well, so Twitter wasn’t the only social media platform to make some significant change to data access through its APIs. Reddit’s also made a number of pretty big changes which have impacted both academic researchers and even Reddit’s own moderation team, which actually resulted in this thing called the Reddit blackout. So maybe for listeners who might not be familiar with it, Reddit is this social media platform that’s made up of these individual spaces called Subreddits. Essentially they’re individual communities that focus on things like history, fashion, gaming, funny pictures of cats, interviews, stock market trading, really just about anything that you’re interested in. And it’s just really actually composed of nothing but hundreds and thousands of these smaller communities. And the thing I want to note is that these communities are actually moderated by volunteer moderators who most often come from within that community.
Now in mid-April, Reddit announced that it was going to be changing their APIs, which were previously free, and it would implement a system that charges for access. Reddit CEO Steve Huffman actually said something along the lines of, “The Reddit corpus of data,” so all the content that users are posting, “is really valuable, but we don’t need to give away all that value to some of the largest companies in the world for free.” So they’re really seeing the economic incentive as a reason to sort of close down these APIs. And in particular, Reddit, I think much like Twitter, has been worried about generative AI and people developing these massively profitable systems based on Reddit content. And these changes to the APIs, though, had the effect of killing a lot of third-party programs and systems that developers had made to do things like offer alternative ways to access Reddit.
For example, there was a popular browser called Apollo that let you go through and browse Reddit. It also killed a service known as Pushshift, which was a sort of data repository that was very popular among academics and moderators that allowed them to really easily grab really large amounts of Reddit data. I actually think that there’s something on the order of about 1,000 papers that cite Pushshift data in some capacity. And it allowed historical Reddit data querying, something that Reddit’s own APIs actually don’t allow for. And it also killed a lot of the tools that moderators were using to actually help manage their Subreddits.
So there was some dialogue between moderators and app developers and the folks that run Reddit on Reddit itself. But the negotiations kind of fell apart and the changes that were promised were not really enough to satisfy many of the moderators. So in response, many moderators began what’s now known as the blackout. Essentially they turned their Subreddits private, which kept out anyone who hadn’t already been part of the community. And many of these Subreddits that went dark actually had millions of subscribers. Just as a general impact on the internet more broadly, this actually had some really interesting impacts on Google Search results. For example, if you searched on something in Google and the top result was a Reddit link, often during the blackout when you clicked on that link, you’d be informed that the community had gone private and you couldn’t access that information.
Now ultimately, many of the moderators eventually turned their Subreddits back to public, and this was in part due to pressure from Reddit. They had actually gone through the process of pushing out and replacing some of the moderators that had turned the subs private with new moderators that actually sort of toed the party line, so to speak. And so this is an interesting moment in which you see researchers and moderators having this shared vested interest in access to Reddit data. And I’ll note, once again, that the Coalition for Independent Technology Research put out a letter highlighting the problems for both these communities, which again brought some visibility to the issue. Now, the big implication for researchers today is that Pushshift essentially is not what it was. Historical Reddit data is much harder to get access to. But maybe what gives us a little bit of hope for researchers at least, is that Reddit has sort of promised that researchers who are engaged in not-for-profit research will continue to be able to access Reddit’s APIs for free, though there’s now a process by which you have to seek approval.
One other change to Reddit’s APIs that I want to note that happened during this time is that they changed their APIs so that content from Subreddits that had been marked NSFW, or not safe for work, couldn’t be made available through the APIs. Now, this may sound sort of silly, like, okay, who cares about that? But when you look at Reddit, many of the communities that are about drug dependency or drug abuse actually are marked as not safe for work. They’re discussing adult topics. And a lot of health researchers actually really use these spaces to understand what’s happening in these communities, to understand how people seek support, how novel drugs are producing particular effects and reported effects. So even though there’s this carve out for Reddit’s APIs, there’s still a lot of questions up in the air about how all this is actually going to come to pass.
Daniel Smith: Before we hear more from Nick, I want to tell you all about CITI Program’s webinars and courses that explore topics across professional areas that are meaningful to both early career and experienced researchers. Some of our newest offerings include a comprehensive course on qualitative data analysis, a webinar on how to meaningfully engage communities in research, and more. Visit citiprogram.org for more information. And now back to our conversation.
You touched on a few things that I think point out larger trends going on that are leading to some of these changes, and I want to get to those. But before we do that, I think it’d be interesting to hear some more about some of these newer social networks and how they’ve come about maybe as a result of some of those larger trends. Can you tell us a bit more about Mastodon and Bluesky and Threads and also how those are already being used or could be used in the research space?
Nick Proferes: This is a really sort of nascent area right now, but one that’s absolutely worth watching as it develops. I’ll start with Bluesky and Threads because they sort of more closely fit the model of centralized social media that many of us are used to. In a lot of ways these are kind of Twitter clones. Bluesky was actually created by some former Twitter folks, Jack Dorsey’s involved, and it’s currently in a closed beta, so you actually have to get an invite to be on it. Now, it uses a backend that purports to offer a lot of connectivity and data portability, which is certainly a positive. That’s actually been a big critique that’s been made of, for example, Facebook. But so far there’s only a few research papers or notes that are out there on Bluesky that I’ve managed to come across, and they’re mostly just sort of tracing what the platform is and what affordances it offers. So that’s really a developing space. Folks are still kind of figuring out what its social utility is going to be.
Threads is a social media platform that’s now being offered by Meta, who owns Facebook and Instagram, where you post shorter text or image updates or video updates. But it’s sort of built on that model where text is very central and it actually got 10 million signups pretty quickly. But I think that space is still kind of figuring out what it wants to be as well. So again, there’s not necessarily a lot in the way of research quite yet, though I expect we’ll start seeing early papers pretty soon. But data collection is going to be a big question in these spaces. Facebook and Instagram very famously don’t offer these open APIs, and so it’s very difficult or there’s certainly at least challenges to collecting data in that space as opposed to the older version of Twitter or Reddit.
Now, Mastodon is really interesting because it functions very differently than Bluesky and Threads. Mastodon is this sort of decentralized social network that’s actually been around since 2016, but what makes it really different is that it’s made up of these independent servers that are actually organized around specific themes, topics, or interests. So I’ll give you an example. The Association of Internet Researchers, a professional organization of which I’m a part, actually has its own Mastodon instance, and instances can have different interoperability and discoverability and even their own rules.
But the nice part about Mastodon generally speaking is that you could also follow people on many other Mastodon servers. And I think that Mastodon’s really interesting because it points to this model of moving away from sort of centralized, centrally-controlled social media platforms to the situation where everyone’s kind of running their own servers and the servers are kind of aggregated. For researchers, though, there’s definitely some challenges here. So, for example, there’s, at least with Mastodon, no singular, all-encompassing API that you can just go and hit up and collect a huge number of posts that contain one particular keyword. Additionally, there’s not as many people. So while you might have a server or instance that’s really active around one particular topic, it’s a much more scoped community. So you don’t necessarily get those big broad scale trends that a lot of researchers are often interested in.
Daniel Smith: So in terms of moving from more of that centralized to decentralized environment, are there implications there that researchers should be aware of in terms of the effects that has on user perceptions of things like privacy and so on?
Nick Proferes: Yeah, there’s definitely a lot of potential for that, not just on Mastodon, but other sort of disaggregated forms of social media. Discord is one that doesn’t often get talked about, but it’s a really interesting space where people set up, it’s kind of like Slack, their own internal slacks. And one of the challenges there is both for researchers sort of discovering this space. So the nice thing about Reddit and Twitter is that you can just kind of go to one space and you can find communities very easily. In this decentralized model, it’s a lot harder to essentially get access to these spaces or even discover them.
In addition, there’s going to be differential views about this space and the degree to which it’s actually public. Part of what made data collection and analysis on Twitter and Reddit so interesting and so popular was its public-facing nature. This was sort of seen as the town square, or it has been positioned that way, at least. In these other spaces, people may have very different perceptions of privacy. And it’s going to be an interesting question of, okay, well, how public is this space? How do we understand participants’ views on privacy in this space? How is an IRB going to treat this space versus another space where previously I didn’t even have to have a Twitter login to go and look at content that was on Twitter?
Daniel Smith: Definitely really interesting, and it’ll be even more interesting to see how it evolves over time. We’ve talked about Twitter, we’ve talked about Reddit, Mastodon, Bluesky, and Threads. But you also briefly touched on at the start, and I think that it would be helpful to go a little bit deeper on it, what should we know about the changes to TikTok, maybe particularly the talk around bans at the state level and so on?
Nick Proferes: So that’s been really fascinating to watch. I mean, TikTok has had a, I’m not sure controversial is the right word, but from a policy perspective, has not necessarily been greeted with open arms. And one of the issues that folks are worried about is data collection, where the data is being stored about TikTok users. And in response, different states have actually banned putting TikTok or allowing TikTok on government-owned devices. Now, what this actually means though for researchers is that if you’re at a state institution that has this ban, you can’t put TikTok on, for example, a government-owned phone. Now, one thing that has come up is that there’s now a challenge to this, I believe in Texas, where Texas researchers are actually trying to sue to get this ruling overturned because they are doing research on TikTok and they need to be able to access it to understand these sort of big things that are going on on TikTok.
Daniel Smith: Throughout this conversation, you’ve touched on a few kind of larger trends that are influencing all of these changes. So you’ve talked about the shift from centralization to decentralization, changes in API access and so on. But I think it would be helpful if you could just quickly summarize for us all, what are some of these larger trends that are influencing these changes and what should we know about them?
Nick Proferes: There’s a lot of dynamics at play here that are sort of shaping these changes that I think researchers should be aware of. Certainly one of the big things that gets focused on is the economic incentive structure for these companies. So Twitter and Reddit both have a really strong incentive to protect what they see as the value of their data. And that was actually stated outright by Reddit’s CEO. Companies that develop large language models and AI systems have really been scraping up lots of data, not just on social media, but across the web broadly in order to build these really robust systems that mimic natural conversation. And these businesses are essentially succeeding on the backs of open data, in many cases from Twitter and Reddit, and also potentially chewing up huge amounts of bandwidth in the process.
Now, Elon Musk has actually expressed an interest in developing his own AI system and certainly having exclusivity on all of the content that’s ever been sent through Twitter would give him a really unique competitive advantage in that space. So there’s a lot of economic incentives that are going into sort of shutting down these points for data to flow.
I think more generally, we seem to be in a moment also where we’re starting to see this move away from the ideals of the open web, of information being free, particularly as we watch this gold rush happening in the AI space. And another trend that I think folks should be aware of is that it really does kind of feel like a fracturing of the centralized web. There’s a lot more communities moving into smaller spaces, spaces they maybe control; a good example is Mastodon. But even ones that are a little bit more centrally controlled like Discord channels or Telegram channels, these are spaces that aren’t quite public in the same way that Twitter or even Reddit are. And like I said, this can make it harder for researchers to find these spaces, to find data, to find the communities that they’re interested in, where those folks are congregating.
Daniel Smith: Certainly. So we’ve touched on a lot of different changes in those trends too. Are there any additional resources that you would suggest for our listeners to check out to learn more about these issues?
Nick Proferes: Yeah, certainly. So I’ve mentioned them twice already, but the Coalition for Independent Technology Research’s letters are a great starting point for getting a snapshot of some of the issues that are specific to API changes. I want to mention a couple other groups. The Citizens and Technology Lab at Cornell has really done some great work that highlights how dependent we are on these technical systems and how enmeshed they are with our social systems and how when we have changes in one, it can impact these others.
The Knight First Amendment Institute at Columbia University has some really amazing work on how these state level TikTok bans are impacting researchers. Certainly even huge publications right now, like Science and Nature, have actually recently had articles that focus on the impact that API changes will have on the research community. Finally, I’ll also say that professional organizations like the Association of Internet Researchers have been really actively involved in trying to understand these changes and figure out what they might mean for researchers.
Daniel Smith: Wonderful. And I’ll certainly include links to all of those resources that you mentioned in our show notes so our listeners can learn more. On that note, do you have any final thoughts you would like to share that we did not already touch on?
Nick Proferes: I think that overall, researchers are very often sort of siloed into their particular subjects, their particular research interests, but I do think there’s a lot of value in being broadly aware of changes that are going on in this large ecosystem. And that way we don’t stay wedded to one particular platform, and we develop sort of a plurality of approaches and understandings of these different spaces.
Daniel Smith: Thank you again, Nick. And before we go, I invite all of our listeners to visit citiprogram.org to learn more about our courses and webinars on research ethics and compliance. And with that, I look forward to bringing you all more conversations on all things tech ethics.
How to Listen and Subscribe to the Podcast
You can find On Tech Ethics with CITI Program available from several of the most popular podcast services. Subscribe on your favorite platform to receive updates when episodes are newly released. You can also subscribe to this podcast by pasting “https://feeds.buzzsprout.com/2120643.rss” into your podcast app.
- Season 1 – Episode 8: Understanding Big Health Data Research’s Unique Issues
- Season 1 – Episode 7: Navigating Big Data and Data Science Research Ethics
- Season 1 – Episode 6: Bots in Survey Research
- Season 1 – Episode 5: Technology Transfer and Commercialization
Meet the Guest
Nick Proferes – Arizona State University
Nicholas Proferes is an Assistant Professor at Arizona State University’s School of Social and Behavioral Sciences. His research interests include users’ understanding of socio-technical systems such as social media, societal discourse about technology, and issues of power and ethics in the digital landscape.
Meet the Host
Daniel Smith, Associate Director of Content and Education and Host of On Tech Ethics Podcast – CITI Program
As Associate Director of Content and Education at CITI Program, Daniel focuses on developing educational content in areas such as the responsible use of technologies, humane care and use of animals, and environmental health and safety. He received a BA in journalism and technical communication from Colorado State University.