
On Tech Ethics Podcast – Synthetic Data in Research and Healthcare

Season 1 – Episode 26 – Synthetic Data in Research and Healthcare

Discusses the use of synthetic data in research and healthcare.


Podcast Chapters


To easily navigate through our podcast, simply click on the ☰ icon on the player. This will take you straight to the chapter timestamps, allowing you to jump to specific segments and enjoy the parts you’re most interested in.

  1. Dennis Shung’s Background (00:01:06) Dennis shares his expertise in medicine, data science, and the intersection of healthcare and technology.
  2. Understanding Synthetic Data (00:03:44) Discussion on what synthetic data is, how it is generated, and its various types.
  3. Applications of Synthetic Data (00:05:57) Exploration of how synthetic data is used in research and healthcare, including real-world examples.
  4. Digital Twins Concept (00:10:15) Explanation of digital twins and their potential benefits in personalized medicine and research.
  5. Using Synthetic Data to Train AI Models (00:13:02) Overview of how synthetic data can augment machine learning models and its implications.
  6. Ethical Issues with Synthetic Data (00:17:12) Discussion on privacy concerns and ethical considerations surrounding the use of synthetic data.
  7. Regulatory Challenges (00:19:21) Examination of the current state of regulations regarding synthetic data and AI in healthcare.
  8. Auditing the Synthetic Data Process (00:21:53) Discussion on the importance of auditing data processes and understanding data integration in synthetic data models.
  9. Data Chain of Custody (00:22:53) Exploration of maintaining data origins and transformations to prevent errors and biases in synthetic data.
  10. Regulatory Recommendations (00:24:32) Suggestions for regulations focusing on data origins and auditing processes in synthetic data usage.
  11. Learning Resources for Synthetic Data (00:24:52) Recommendations for review articles and funding announcements to understand synthetic data in medical research.
  12. The Exciting Future of Data Science (00:27:19) Reflection on the evolving landscape of data science in healthcare and the role of synthetic data.
  13. Quality Over Quantity in Data (00:28:57) Emphasis on prioritizing data quality and thoughtful processing over merely increasing data volume.
  14. Synergy of Synthetic Data and Digital Twins (00:30:23) Discussion on leveraging synthetic data and digital twins for improved treatment quantification and patient care.
  15. Conclusion of the Conversation (00:32:02) Wrap-up of the discussion with an invitation to explore further resources on tech ethics.



Episode Transcript


Daniel Smith: Welcome to On Tech Ethics with CITI Program. Our guest today is Dennis Shung, who’s an assistant professor of medicine and director of digital health in the section of digestive diseases at Yale School of Medicine. He’s a physician, data scientist and gastroenterologist working at the intersection of translational informatics, algorithmic development and implementation science with a special focus on the management of acute gastrointestinal bleeding. Today we are going to discuss the use of synthetic data in healthcare and research. Before we get started, I want to quickly note that this podcast is for educational purposes only. It is not designed to provide legal advice or legal guidance. You should consult with your organization’s attorneys if you have questions or concerns about the relevant laws and regulations that may be discussed in this podcast. In addition, the views expressed in this podcast are solely those of the guests. And on that note, welcome to the podcast Dennis.

Dennis Shung: Hi. Thanks for having me, Daniel.

Daniel Smith: I’m really looking forward to our conversation today and to learning more about synthetic data. But first, can you tell us more about yourself and your work at Yale School of Medicine?

Dennis Shung: Yes, I’m a board-certified gastroenterologist and also have board certification in clinical informatics. I discovered data science a little bit later in life; I did a PhD in investigative medicine after fellowship that focused on unsupervised machine learning and deep learning. Where I work is at the intersection where we take data that’s being generated from everyday healthcare processes, from being in the hospital and having an electronic health record instance written about the patient, all the way to other sources of data the patient might be interested in, such as wearable data or data generated during their encounter with the healthcare system, and then make it work for the patient, the provider, and the system. Data right now just stays in repositories, and the work that I do tries to take that data and make it such that it can hopefully benefit the patient, while also minimizing the potential harm and risk that may occur from that data being used and processed by computational methods.

My work spans everything from taking existing algorithms and validating them to developing new algorithms. Not every task is already solved in data science; you sometimes have to generate new algorithmic approaches to suit the data. In this case, healthcare has data that is very sparse, very heterogeneous, and very messy and noisy, and those are all things that algorithms currently don’t necessarily handle very well. We work in that area. And then I also work on the other side, where we deploy generative artificial intelligence and other machine learning algorithms to actual providers within a simulated setting that mimics their real-life practice, to see how they respond to these systems.

Daniel Smith: Thank you, Dennis. As I mentioned, we’re going to talk about synthetic data today, and I think that this is a concept that people are becoming increasingly familiar with, particularly in the context of AI and machine learning. But can you start off by telling us what synthetic data is, including how it is generated and the different types of synthetic data?

Dennis Shung: Synthetic data is a new phenomenon, and it is something that is still very ill-defined. When we think about synthetic data, you can think about it as data that’s generated using either a statistical model or an algorithm, usually with the idea of solving some sort of task. You have data that is real, that is measured from a patient’s encounter, for example, from going to the hospital: they draw blood tests, they run the blood tests, and you have numbers that come out of the blood tests. And then you have synthetic data, where you say, “This is a very specific set of numbers related to this specific patient, but can we generate numbers that are similar enough to this patient that they can be used to train an algorithm?” For example, if I only have one or two patients with a certain condition, synthetic data allows us to generate ten, a hundred, a thousand patients that are similar to the patient we see in front of us.

The different types of synthetic data generally follow a Goldilocks idea: is it hot, cold, or in the middle? If it’s hot, you’re using fully synthetic data; nothing is real, and everything is generated through an algorithm. If it’s cold, you have mostly elements from real data and you’re just adding some synthetic elements to it. And then you have partially synthetic data, where you mix them together: some elements are real, some are synthetic, and you use the synthetic data to create some sort of additional capability from the real data set that you have.
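
To make the hot/cold/in-the-middle distinction concrete, here is a minimal Python sketch using hypothetical lab values: fully synthetic records sampled from a fitted statistical model, versus partially synthetic records where only one column is replaced. It is an illustration of the idea, not a production pipeline.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "real" data: hemoglobin (g/dL) and creatinine (mg/dL)
# for a handful of patients with a rare condition.
real = np.array([
    [7.2, 1.4],
    [8.1, 1.1],
    [6.9, 1.8],
])

# Fully synthetic ("hot"): fit a simple statistical model to the real
# data and sample entirely new records from it.
mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)
fully_synthetic = rng.multivariate_normal(mean, cov, size=100)

# Partially synthetic ("in the middle"): keep one real column and
# replace the other with model-generated values.
partially_synthetic = real.copy()
partially_synthetic[:, 1] = rng.normal(mean[1], np.sqrt(cov[1, 1]), size=len(real))

print(fully_synthetic[:3])
print(partially_synthetic)
```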

Daniel Smith: I think you touched on this a little bit when describing what synthetic data is and giving the example of the patient blood draws. But can you talk some more about how synthetic data is currently being used in research and healthcare and provide a couple of real-world examples?

Dennis Shung: One thing I want to step back to: I’ve talked about synthetic data and the algorithms that generate it, but I haven’t really talked about how it’s generated. The majority of synthetic data is generated through neural networks, which are now in the public consciousness in many different ways. Specifically, generative adversarial networks are the type of network able to take real data, like the blood draws from the patient, and generate something that looks similar to it. In the case of generative artificial intelligence, these neural network architectures also offer opportunities to generate synthetic data using whatever template or conditions you are interested in. The basis is usually neural networks: either generative adversarial networks or generative artificial intelligence in any of the architectures that exist right now.
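
As a rough, hypothetical illustration of the generative adversarial idea, the PyTorch sketch below trains a tiny generator to produce two-dimensional records that a discriminator cannot distinguish from “real” ones. Actual healthcare synthetic-data GANs are far larger and carefully validated; the data and dimensions here are invented.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "real" data: 2-dimensional lab-value records (standardized).
real_data = torch.randn(256, 2) * 0.5 + torch.tensor([1.0, -1.0])

# Generator maps noise -> synthetic record; discriminator scores real vs. fake.
G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Train the discriminator to separate real from generated records.
    fake = G(torch.randn(256, 4)).detach()
    d_loss = bce(D(real_data), torch.ones(256, 1)) + \
             bce(D(fake), torch.zeros(256, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator to fool the discriminator.
    fake = G(torch.randn(256, 4))
    g_loss = bce(D(fake), torch.ones(256, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# Sample "synthetic patients" from the trained generator.
synthetic = G(torch.randn(1000, 4)).detach()
print(synthetic.mean(dim=0))  # should approach the real data's mean
```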

But then the idea here is that once you generate the synthetic data, there has to be some sort of downstream task you want to use it for. You have a variety of use cases. In research, we can use it to generate data that makes our machine learning models more robust. For example, I have a group of patients from one area of the country, and I say, “If I train a machine learning model on the real data, it’s only going to be really useful for that segment of patients. Can I get a more representative patient population?” Yes, I can get another patient cohort from a different part of the country, but sometimes it’s hard to share the real data. One of the things you can do in research is take the real data from the other part of the country, let’s say I’m in Connecticut, and this is in Texas.

In Texas, I can take the real data from those patients, create and train a model that can generate similar synthetic data, generate that synthetic data, put it together with my patients in Connecticut, and then train a machine learning model that now hopefully will perform well both in Connecticut and in Texas. In research, that’s one thing. The other thing you could do is create what we call digital twins, and I think we’ll talk about that a little bit later. But there’s an idea that you can create a digital representation of patients, and specifically this will be important when we think about evaluation in randomized controlled trials. In most randomized controlled trials, patients have been randomized to one arm or the other, but you can only observe them under one condition: either they got the treatment or they didn’t get the treatment.

With synthetic data, you can synthesize a version of that patient: maybe they got the treatment in the trial, but you can synthesize a version that didn’t get the treatment. By doing that, you can start quantifying what we call treatment effects at the individual patient level. You can start saying, “Given this digital representation of the real patient, we can estimate how effective that drug was for that patient.” In industry, and specifically in the current era of generative AI, synthetic data is used for different administrative things you might want to do across large patient populations, for different population health simulations, and also to train models at much greater scale than trying to identify and clean up all of the real data. A lot of companies that have run out of real data are now turning to synthetic data to augment the performance of their models.
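
The Connecticut/Texas example can be sketched in a few lines: a generator fitted at the remote site shares only synthetic records, which are pooled with the local real data for training. Everything below is a hypothetical stand-in (the “remote sampler” is just a random draw), but the workflow mirrors what Dennis describes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Local (Connecticut) real data: features plus an outcome label.
X_ct = rng.normal(0.0, 1.0, size=(200, 2))
y_ct = (X_ct[:, 0] + X_ct[:, 1] > 0).astype(int)

# At the remote (Texas) site, a generative model is fit to the real data
# and ONLY synthetic samples are shared -- stand-in sampler shown here.
X_tx_syn = rng.normal(0.5, 1.0, size=(200, 2))
y_tx_syn = (X_tx_syn[:, 0] + X_tx_syn[:, 1] > 0.5).astype(int)

# Pool local real data with remote synthetic data and train one model.
X_train = np.vstack([X_ct, X_tx_syn])
y_train = np.concatenate([y_ct, y_tx_syn])
model = LogisticRegression().fit(X_train, y_train)

print(model.score(X_tx_syn, y_tx_syn))  # sanity check on the synthetic cohort
```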

Daniel Smith: You’ve mentioned two things that I want to focus on, the digital twins in research and healthcare and the use of synthetic data to train generative AI models. First, going back to what you were just talking about with digital twins, can you talk some more about some of the benefits of that concept and why it would be useful in both research and healthcare?

Dennis Shung: Digital twins are a digital representation of, in this case, a person or a patient, though it could be an object or a process, contextualized in a digital version of its environment but with a link to the actual physical thing. When we talk about digital twins for patients in healthcare, we usually think about a patient who is actually, let’s say, in the hospital and receiving treatment, and then the digital representation of that patient, which is basically their labs, their demographics, and what we know about their conditions as written in the electronic health record.

The key thing for digital twins is that you have to have some link between the physical patient and the digital representation of the patient. This is an active area of cutting-edge research across the National Science Foundation and the National Institutes of Health, because we think it has a lot of value in the things I mentioned before, where hopefully you can now start not just saying, “I’m going to try this for my patient and see if it works,” but asking, can I test the digital version of that patient with five different treatments without exposing the actual patient to any of them?

See which one might be the best across multiple simulations, and then say, “Now I’m going to try that one for this patient.” There is a big promise in terms of personalized medicine, where you can use digital twins to try out and manipulate many different conditions without exposing the patient to harm, and then try to identify the best route for the patient. In this case, you’re creating a digital representation of the patient using real data. The synthetic part comes in when you’re trying to create different versions of that patient. Instead of saying, “I want this exact patient with all of their exact laboratory tests,” et cetera, we want to say, “What if these conditions happened for this patient? Can we simulate and generate synthetic versions of the patient across all these different conditions?” Synthetic data is then used to augment the actual digital twin in order to create different versions of the patient that can be subjected to different medication treatments, to evaluate whether that actually improves their outcome.
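
A deliberately simplified sketch of “test the twin first”: perturb the patient’s recorded state into many synthetic variants, score each variant under each candidate treatment with an outcome model, and compare. The risk model and treatment effects below are invented stand-ins, not validated clinical quantities.

```python
import numpy as np

rng = np.random.default_rng(2)

# Digital twin: the patient's current state (hypothetical features).
patient_state = np.array([7.2, 1.4, 88.0])  # hgb, creatinine, heart rate

# Synthetic variants of the twin: plausible perturbations of the state.
twins = patient_state + rng.normal(0, [0.3, 0.1, 5.0], size=(500, 3))

# Stand-in outcome model: each treatment scales a simulated baseline risk.
def simulated_risk(states, treatment_effect):
    baseline = 1 / (1 + np.exp(-(0.5 * states[:, 1] - 0.2 * states[:, 0])))
    return baseline * treatment_effect

treatments = {"A": 0.9, "B": 0.7, "C": 1.1}  # hypothetical risk multipliers
mean_risk = {name: simulated_risk(twins, eff).mean()
             for name, eff in treatments.items()}

best = min(mean_risk, key=mean_risk.get)
print(mean_risk, "-> try treatment", best)
```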

Daniel Smith: That’s really interesting, and I want to ask about some of the ethical issues involved in that. But first, before we get to that, I also mentioned the use of synthetic data to train generative AI models. How could this be useful for models in general? And then more specifically, how could it be useful in the context of research and healthcare?

Dennis Shung: For models in general, synthetic data offers the ability to augment the existing data you’re training the models on, so that you can make the models more robust without necessarily spending so much money trying to acquire and preprocess clean data. For my group, we’ve worked on looking at how you can create synthetic versions of patients with certain conditions, mix that in with real patient data, and then see if you can improve the performance of those machine learning models. With generative artificial intelligence, they’re running out of data. They’ve already ingested the entire internet. In order to get more data, there are certain ideas where you use generative AI to generate data to then train generative AI, or you use other neural network models to generate the data, and then you use that to train these large language models or large multimodal models. The big problem here is something that has recently been described, called model collapse.

Model collapse is basically a process where you keep generating data and then keep using it to train this language model or large multimodal model, and it starts amplifying the biases or amplifying polluted data. There are different errors that can happen as you’re generating data. Data is generated using neural network architectures, particularly generative adversarial networks or generative artificial intelligence more broadly; you can use decoder-only language models or transformer-based language models. When you have these models, they make errors as they generate. Obviously the outputs are not exactly the same as the data they’re trying to generate from; they make some perturbations in the output. By doing that, you get a cascading set of errors, and there are three types: statistical approximation error, functional expressivity error, and functional approximation error. These are basically due to the way we’re generating the data: you have finite samples that you’re using, and as you try to go to infinite samples, information starts getting lost as you keep resampling from the finite sample.

You may think that you’re generating infinite amounts of data, but really you’re just sampling from a finite group of data and trying to generate all these different types. The other errors are due to the function: neural networks are basically a big mathematical function. When you use it over and over again, and use the data it generates to train another neural network, you just amplify some of those errors. This can lead to the entire model becoming poisoned by the synthetic data, because you’ve amplified the errors to such an extent that it no longer has the right information; it’s learning the wrong information. I think these are ways in which we have to understand where synthetic data is helpful. In some ways it can actually improve the performance of existing machine learning models to a certain extent, but there are clear downsides now being seen, especially in these huge black boxes, phenomena that demonstrate that synthetic data is not a panacea and you can’t use it endlessly.
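
The finite-sample resampling error is easy to see in a toy experiment: repeatedly fit a simple model to data, replace the data with samples from the fit, and watch the spread of the distribution collapse over generations. This is a caricature of model collapse with made-up numbers, not the full neural-network phenomenon.

```python
import numpy as np

rng = np.random.default_rng(3)

# Generation 0: a small finite sample from the "true" distribution.
data = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(1, 101):
    # Fit a simple model (here just the mean and std) to the data...
    mu, sigma = data.mean(), data.std()
    # ...then throw the data away and resample from the fitted model.
    data = rng.normal(mu, sigma, size=20)
    if generation % 20 == 0:
        print(f"generation {generation}: std = {data.std():.3f}")

# The spread shrinks generation over generation: tail information is
# progressively lost -- a toy version of model collapse.
```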

Ed Butch: I hope you’re enjoying this episode of On Tech Ethics. If you’re interested in important and diverse topics and the latest trends in the ever-changing landscape of universities, join me, Ed Butch, for CITI Program’s original podcast, On Campus. New episodes released monthly. Now, back to your episode.

Daniel Smith: Those are some of the issues surrounding the use of synthetic data to train models. Then going back to the concept of digital twins, can you talk some more about the ethical and even regulatory issues that our listeners should be aware of when it comes to using synthetic data in that way?

Dennis Shung: The primary issue is privacy. When you create synthetic data, you assume that because it’s synthetic, it’s not the data it was before, and therefore it should be perfectly fine to use for anything you want. That’s not true. When you have these neural networks generating data from real data, they may actually memorize parts of it that can be re-identified, especially in healthcare; it doesn’t take that many data elements to re-identify somebody. You have to think about privacy from the very beginning, and you need to think about privacy in the generation of that data. There are different ways of maintaining privacy. There are differentially private generative adversarial networks, and there are also ways to mathematically enforce a differential privacy constraint for each of the generated subjects versus what’s in the original data set. That is, I think, one of the primary things we care about in healthcare and that we need to be very aware of, because it affects patients.
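
One deliberately simplified way to picture a mathematical privacy constraint: release only noisy summary statistics via the Laplace mechanism, and generate synthetic records from those privatized statistics rather than from the raw records. Real differentially private generators (for example, DP-trained GANs) are far more involved; the clipping bounds and privacy budget below are hypothetical, and for brevity only the mean is privatized here.

```python
import numpy as np

rng = np.random.default_rng(4)

# Real records: one lab value per patient, clipped to a known range so
# each patient's influence on the mean is bounded.
lo, hi = 4.0, 12.0                       # assumed clipping bounds (g/dL)
real = np.clip(rng.normal(8.0, 1.5, size=300), lo, hi)

epsilon = 1.0                            # hypothetical privacy budget
sensitivity = (hi - lo) / len(real)      # max influence of one record on the mean

# Laplace mechanism: release a noisy mean instead of the exact mean.
noisy_mean = real.mean() + rng.laplace(scale=sensitivity / epsilon)

# Generate synthetic records from the privatized statistic only; a full
# pipeline would privatize every released statistic, not just the mean.
synthetic = rng.normal(noisy_mean, 1.5, size=1000)
print(f"true mean {real.mean():.2f}, noisy mean {noisy_mean:.2f}")
```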

We’re not being beneficent, working for the patient’s good, if we’re exposing their data to re-identification by whoever has the synthetic version, and we’re not being non-maleficent: we don’t want to do harm by having many versions of this patient floating around the internet that can be re-identified or linked to them. And then finally, there’s justice. Are patients having adequate representation in such a way that it will lead to benefit across the spectrum, or are they being either exploited digitally or excluded from the digital front door? There are so many things to consider when we think about synthetic data in order to make sure that we protect patient privacy while also making sure patients are adequately represented in the data sets that are then used downstream to train these larger models.

The other thing I would say about regulation is that it is still trying to catch up, specifically with how generative artificial intelligence is being thought about. Right now, the current FDA guidance is focused on clinical decision support software versus software as a medical device. Synthetic data is in a weird place because it’s not necessarily an algorithm, but it can be used to train algorithms, to test algorithms, and to enhance algorithmic capabilities. I don’t know of any clear regulatory framework that has thought about this very thoroughly in this context, but I would assume regulators would have some processes to regulate the data being used to train algorithms that can be used downstream as software as a medical device. If you’re using synthetic data, you should probably have to disclose that, and probably have to demonstrate with certain tests that there is no harm, or at least mitigated harm, from using that data versus just using real data.

I don’t think that currently exists. If you use synthetic data to update your algorithm, I’m not sure there’s currently any regulatory guidance as to how much you should use, or what tests you should run before you use synthetic data versus real data. And when we think about generative artificial intelligence, there’s really not a clear framework out there, at least in the US. The things we care about are obviously accuracy and safety or reliability, but all of those are difficult to regulate in the absence of standardized tools to look at these black boxes. From the regulatory standpoint, I think one of the big challenges is: how do you enforce any sort of regulation when you don’t know the inner workings of these large black boxes? And if they are susceptible to model collapse because they’ve been using a bunch of synthetic data to train themselves, how would you be able to suss that out?

How would you be able to not just suss that out from the beginning, but at the end figure out whether this is actually affecting safety for patients or affecting performance in downstream tasks? One of the things I’ve been talking through as part of the working group for generative AI for the Coalition for Health AI is that we focus so much on the outcomes and how to measure the outcomes with benchmarks and data sets, but what we might really want to look at is auditing the process: “What is the process they’re using? What are the data sets they’re using to train? How are they integrating these data sets into whatever black box they’re building, transformer-based, encoder-decoder, or decoder-only?” To me, that is probably where regulators will focus more, because that’s something you can audit.

You can have some idea of how that works even if you can’t necessarily understand the model itself; and after you put all this in, you probably still have to have some mechanism to look at the outcome. The final thing I want to talk about is thinking through what you were saying about regulation for synthetic data specifically. One of the ideas I wrote about recently in Nature Digital Medicine was this idea of a data chain of custody. With synthetic data, you can have it floating around everywhere under the banner of “It’s completely de-identified.” And yes, if you’ve done differential privacy, you have some mathematical guarantee that there is privacy protection. That’s all well and good, but ultimately, when you get so far downstream with synthetic data generated from synthetic data generated from synthetic data, you start losing sight of where the data actually comes from.

And then, in light of potential model collapse, it’s very hard to figure out what went wrong if something were to go wrong. Was it because you used synthetic data that was generated from synthetic data three steps down, or was it because certain things done at the first step messed things up and amplified biases in a way that led to model collapse? That’s a very hard question to answer, but with a chain-of-custody idea, you can at least say, “This is five steps away from the original data set. The original data set was run through these types of architectures with these parameters and this hyperparameter tuning.” You get an idea, at least, of where this data has been, how far away it is from the original data, and what transformations have been done to it.

I think that would also be something helpful in regulation, to at least look at the process. Not just what I mentioned before, which is what data you are using to train and in what proportions, but where is the data coming from, and do we have some guarantee that it is not so far away from the original data that it may be injecting the errors and biases that can lead to downstream effects?
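
The chain-of-custody idea can be encoded as a provenance record that travels with every derived data set. The sketch below is one hypothetical encoding: each generation step logs the architecture and parameters used, and the number of steps from the original data is always recoverable.

```python
from dataclasses import dataclass, field

@dataclass
class GenerationStep:
    """One synthetic-generation step in a data chain of custody."""
    architecture: str    # e.g., "GAN", "decoder-only transformer"
    parameters: dict     # hyperparameters used for this generation step

@dataclass
class DataCustody:
    origin: str                                   # the original real dataset
    chain: list[GenerationStep] = field(default_factory=list)

    def derive(self, architecture: str, parameters: dict) -> "DataCustody":
        """Return custody for a new dataset generated from this one."""
        step = GenerationStep(architecture, parameters)
        return DataCustody(self.origin, [*self.chain, step])

    @property
    def steps_from_original(self) -> int:
        return len(self.chain)

# Example: synthetic data generated from synthetic data, three steps down.
custody = DataCustody(origin="registry_v1 (real, de-identified)")
custody = custody.derive("GAN", {"epochs": 100})
custody = custody.derive("decoder-only transformer", {"temperature": 0.8})
custody = custody.derive("GAN", {"epochs": 50})
print(custody.origin, "->", custody.steps_from_original, "steps away")
```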

Daniel Smith: You mentioned a lot of things that could be done there to further the responsible use of synthetic data, and how clinicians, researchers, and developers can address the issues that we’ve discussed so far. With that in mind, do you have any recommendations for additional resources where our listeners can learn more?

Dennis Shung: Yeah, there are multiple review articles out there, particularly in Nature and Nature Digital Medicine, that give you a sense of what synthetic data is and how it’s being used in medical research. This is a very rapidly evolving field, and people are really just throwing things out there. And when you hear synthetic data, digital twins are not far behind. These two things are being used in conjunction, almost in tandem, with one another, because ultimately you want synthetic data for some sort of downstream task. If you are a researcher, I would recommend looking at the different RFAs and funding announcements being put out. There is one that was put out by the National Science Foundation, and I think the National Institutes of Health is also thinking through different digital twin proposals.

As you read these proposals and the review articles, hopefully you can get a better sense of where this field is going. This field is still evolving and changing on a constant basis because the underlying technology is changing. I mentioned generative adversarial networks and all of those things; those are now being eclipsed by generative AI transformer-based models because they’re so much more powerful. What I would say is: keep an eye on what the big companies in generative AI are doing in the synthetic data space, because they are much more able to evolve very quickly with their sophisticated models than a lot of other groups are. Everything is moving in that direction, with larger and larger over-parameterized models with many capabilities, foundation models, being used to generate the synthetic data.

OpenAI, Anthropic, Google, and Meta are the four players that are really innovating in this space. If you go on arXiv, I would look for synthetic data as it pertains to those big four, because they will really be on the edge of proposing different uses of transformer-based large multimodal models or diffusion models for synthetic data generation.

Daniel Smith: Absolutely, and I’ll certainly include links to the proposals and the review articles that you mentioned in our show notes so that our listeners can check those out. On that note, do you have any final thoughts that we have not already touched on?

Dennis Shung: It’s a very exciting time for medical research, particularly data science in healthcare, because before, we were just generating a bunch of data and thought that was enough, that we’d generated enough data. We’ve now realized that yes, we can generate a lot of data, and that data may or may not be useful even with really powerful models; there are limitations to that data, and the data sometimes needs a little bit of balancing to make it actually useful for the patient, the provider, and the healthcare system. Synthetic data is really filling that need, but I would say we have to figure out how much we need it, and more is not always better. As we’ve talked about, model collapse is real, and specifically in healthcare, it could be disastrous if you were to feed a model so much poisoned data that it starts spitting out things that could, in downstream tasks, eventually harm patients.

I think especially in the era of generative AI, it’s going to be important for researchers, users, providers, and healthcare systems to really look at this critically and say, “Where can this add value for what we need in our mission? And where does this go too far?” We need to get away from the ethos of more is better all the time, which unfortunately I think tech is prone to, “We just need all the data. We want all the data. We want as much data as humanly possible,” and really think about the quality of data. What is the quality of the data? If our data quality can be enhanced with synthetic data, great; what are the ways we should do it?

We should think about how this data is being processed, how we can preserve privacy, and how we can maintain a chain of custody so we know, “This data has been processed in these different ways.” Then we should have auditing to look at the processes of using that data to train whatever model is downstream, and benchmarks to make sure that performance is not negatively affected in any way. There has to be a very thoughtful way of approaching it, which I think might be difficult for people who are just trying to move fast and break things. But I have hope that this is going to open up a new frontier in terms of how we can use the data we have to actually bring benefit back to patients. And I’ll end with this one thing: I think the golden fleece we really want to go after is asking, can we quantify treatments before we actually give treatment?

This is where synthetic data and digital twins marry and have a really nice synergy, because if we can generate multiple realistic representations of this patient under different conditions, then we can hopefully leverage information that is already there but maybe locked away somewhere. Maybe we can use five different randomized trials to construct different versions of the patient in front of me, and then, using the treatment effects I estimate from each of those randomized trials, say, “This patient may do best with this treatment, because we’re able to leverage the data that’s already been collected and we’ve already looked at the treatment effect.” By doing that, you can shorten the time to adequate treatment. You can think about diagnosis the same way. Now we have language models and multimodal models for diagnostic tasks. Imagine if you could create multiple versions across the natural history of a disease or across different manifestations of disease.

Then you could say, “This patient came in with this set of symptoms. According to all of these generated digital representations across different disease spectra, I think it’s more likely to be this family of diagnoses. We should start with this test rather than starting with a shotgun approach.” That will lead to shorter time to diagnosis, and hopefully better efficiency and better satisfaction for the patient, because they’ll get the right test instead of having to go through test after test and then figure out, “I should have done this test all along.” I think we’re at a cusp where the computational abilities we have really demand that we consider these approaches in order to enhance the data science tasks we want to achieve for patient care.

Daniel Smith: And I think that’s a wonderful place to leave our conversation for today. Thank you again, Dennis.

Dennis Shung: Thank you.

Daniel Smith: And I also invite everyone to visit CITIProgram.org to learn more about our courses, webinars, and other podcasts. Of note, you may be interested in our Essentials of Responsible AI and Big Data and Data Science Research Ethics courses. And with that, I look forward to bringing you all more conversations on all things tech ethics.



How to Listen and Subscribe to the Podcast

You can find On Tech Ethics with CITI Program available from several of the most popular podcast services. Subscribe on your favorite platform to receive updates when new episodes are released. You can also subscribe to this podcast by pasting “https://feeds.buzzsprout.com/2120643.rss” into your podcast app.





Meet the Guest


Dennis Shung, MD, MHS, PhD – Yale School of Medicine

Dennis L. Shung, MD, MHS, PhD, is an Assistant Professor of Medicine at Yale School of Medicine and Director of Digital Health in Digestive Diseases. He leads the Human+Artificial Intelligence in Medicine lab, which focuses on enhancing human presence with AI. Shung is also involved in multiple gastroenterology AI initiatives and research.


Meet the Host


Daniel Smith, Associate Director of Content and Education and Host of On Tech Ethics Podcast – CITI Program

As Associate Director of Content and Education at CITI Program, Daniel focuses on developing educational content in areas such as the responsible use of technologies, humane care and use of animals, and environmental health and safety. He received a BA in journalism and technical communication from Colorado State University.