Overview
As digital technologies become cheaper and more powerful, storing, analyzing, and transferring the raw output of scientific inquiry become increasingly feasible [1]. Given advances in machine learning and analytical algorithms, data are no longer just a means of drawing findings restricted to the original study; they can be aggregated and analyzed in novel ways that generate new hypotheses and findings [2]. In this article, we examine the benefits and challenges associated with sharing raw data.
Introduction to Data Sharing
For the purposes of this article, data are “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings” [3]. Given the challenges associated with the preservation and transport of non-digital samples (e.g., tissue specimens), we will focus on data that are either natively digital (e.g., an online survey) or digitized after collection (e.g., digital photographs of slide samples) [4]. In this regard, data sharing involves the processes and infrastructure necessary to facilitate access to data, as well as the materials needed to analyze or make sense of them (e.g., statistical code and metadata).
Benefits of Data Sharing
Data sharing can be advantageous in several ways. Benefits include:
- Increased transparency and reproducibility
- Greater potential for collaboration and cross-pollination of ideas
- The possibility of accelerating discovery
- The conferral of credibility and validity to scientific findings
Increased access to data can help address the replication crises recently reported across disciplines by making research more transparent and open to reexamination [5]. With access to raw data, researchers can both provide greater “checks” on the interpretation of data and, with constructive responses and communication, help to resolve misunderstandings or differing perspectives related to the use of inferential statistics and other analytical techniques. Such collaboration is not limited to analyses: major endeavors, such as the Human Genome Project (completed in 2003) and the ongoing Human Pangenome Project, would not be feasible without the open sharing of data and the collaboration of numerous laboratories spanning multiple continents [6]. Journals that request the uploading of raw data for later sharing also provide safeguards that protect their own credibility and that of the articles they publish.
Challenges and Considerations
Though desirable, the sharing of data is not without its concerns. For example, data may be viewed as proprietary. In the “publish or perish” environment of academe [4], it should be no surprise that more researchers agree with data sharing when it takes place after publication rather than before [2]. Furthermore, a lack of clarity and understanding of copyright considerations can contribute to the reluctance to share data.
In addition, depending on the type and nature of the data, access may need to be safeguarded to prevent unauthorized usage or corruption [4]. In a survey of 600 psychological researchers, Houtkoop and colleagues found that over a quarter of respondents did not share data due to Institutional Review Board (IRB), legal, or other concerns related to the protection of research participants [7]. The 2018 revised Common Rule (45 CFR 46, Subpart A) may assuage some of those concerns. The Common Rule now explicitly allows for “broad consent,” which enables the secondary usage of data without seeking additional consent from research participants [8]. This should provide some researchers with the necessary assurance that they are able to share data, provided such broad consent was obtained initially.
Just as with the original collection, storage, and summary of data, the paramount consideration when sharing data must be to avoid (or minimize) the likelihood of disclosing personally identifiable information (PII) and the harm that such disclosure could cause research participants. To this end, one should examine the handling and sharing of information by considering whether the answer to each of the following “safe” guidelines is “yes” [9, 10]:
- Safe data: Has PII been removed and have measures been enacted to obfuscate the identity of individuals?
- Safe projects: Were the methods via which the data were collected vetted by an IRB or equivalent ethics panel?
- Safe places: Are the locations where the data are stored safe, encrypted, and, if need be, offline?
- Safe people: Will the individuals accessing the data be familiar with usage agreements, restrictions, and considerations? Will they follow appropriate frameworks for statistical analysis and interpretation, as well as protocols for storage and any subsequent sharing?
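As an illustration of the “safe data” guideline, the following is a minimal Python sketch, not a complete de-identification procedure. It assumes a small tabular dataset with hypothetical column names (participant_id, name, email, date_of_birth, score); the function names and the secret key are likewise illustrative. The sketch drops direct identifiers and replaces the participant ID with a keyed hash. Note that this is pseudonymization only: indirect identifiers (e.g., rare combinations of demographic values) can still allow re-identification and need separate review.

```python
import csv
import hashlib
import hmac
import io

# Hypothetical column names; adapt these to the actual dataset.
DIRECT_IDENTIFIERS = {"name", "email", "date_of_birth"}
ID_COLUMN = "participant_id"


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a keyed hash (HMAC-SHA256, truncated to 16 hex characters)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(rows, secret_key: bytes):
    """Drop direct identifiers and replace the ID column with a pseudonym."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        kept[ID_COLUMN] = pseudonymize(row[ID_COLUMN], secret_key)
        cleaned.append(kept)
    return cleaned


# Made-up example records standing in for a raw data file.
raw_csv = """participant_id,name,email,date_of_birth,score
P001,Ada Example,ada@example.org,1990-01-01,42
P002,Bo Example,bo@example.org,1985-05-05,37
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))
# The key must be stored separately from the shared files.
shared_rows = deidentify(rows, secret_key=b"illustrative-key-only")
print(list(shared_rows[0].keys()))
```

Using a keyed hash rather than a plain hash means that someone who knows or guesses the original IDs cannot recompute the pseudonyms without the key, while the same participant still maps to the same pseudonym across files.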
Finally, the format of the data to be shared can be limiting, particularly if they are made accessible in formats that are proprietary or become outdated [4]. Thus, whenever possible, data should be stored in formats that are universally transferable and readable (e.g., plain-text .csv rather than SPSS's proprietary .sav).
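One way to act on this recommendation is sketched below in Python, under stated assumptions: the records, variable labels, file names, and the export_open_formats helper are all hypothetical. Proprietary formats often embed variable and value labels that a bare .csv file cannot hold, so the sketch writes the data to UTF-8 CSV and the labels to a separate JSON codebook. (For data already held in a .sav file, pandas provides read_spss, which relies on the pyreadstat package; that step is omitted here so the example stays self-contained.)

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical records and variable labels, standing in for a real dataset.
records = [
    {"participant": "a1", "age_group": 2, "score": 42},
    {"participant": "b2", "age_group": 3, "score": 37},
]
codebook = {
    "participant": {"label": "Pseudonymous participant code"},
    "age_group": {
        "label": "Age group",
        "values": {"1": "18-29", "2": "30-49", "3": "50+"},
    },
    "score": {"label": "Total scale score"},
}


def export_open_formats(records, codebook, out_dir: Path):
    """Write records to UTF-8 CSV and the codebook to JSON; return both paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    data_path = out_dir / "data.csv"
    codebook_path = out_dir / "codebook.json"
    with data_path.open("w", newline="", encoding="utf-8") as f:
        # Column order follows the codebook so the two files stay aligned.
        writer = csv.DictWriter(f, fieldnames=list(codebook))
        writer.writeheader()
        writer.writerows(records)
    with codebook_path.open("w", encoding="utf-8") as f:
        json.dump(codebook, f, indent=2, ensure_ascii=False)
    return data_path, codebook_path


out_dir = Path(tempfile.mkdtemp())
data_path, codebook_path = export_open_formats(records, codebook, out_dir)
print(data_path.read_text(encoding="utf-8"))
```

Keeping the codebook in a plain-text format alongside the data means that the metadata needed to interpret the values remains readable without any particular statistical package.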
Summary
Striking a balance between researchers’ intellectual property rights and the equitable sharing of data can be challenging, but data sharing is generally accepted as having the potential to contribute to the common good by increasing transparency, advancing the speed of discovery, and enhancing the credibility of research. As data sharing requirements become criteria for the awarding of federal funds [11], it is hoped that this process will usher in a new era of discovery that continues to safeguard the information of the individuals from whom it was collected.
References
[1] Interagency Working Group on Digital Data. 2009. “Harnessing the Power of Digital Data for Science and Society.” Accessed August 30, 2024.
[2] Tenopir, Carol, Suzie Allard, Kimberly Douglass, Arsev Umur Aydinoglu, Lei Wu, Eleanor Read, Maribeth Manoff, and Mike Frame. 2011. “Data Sharing by Scientists: Practices and Perceptions.” PLoS ONE 6(6):e21101.
[3] U.S. National Science Foundation (NSF). 2018. “Data Management Guidance for SBE Directorate Proposals and Awards.” Accessed August 30, 2024.
[4] Kowalczyk, Stacy, and Kalpana Shankar. 2011. “Data Sharing in the Sciences.” Annual Review of Information Science and Technology 45(1):247-94.
[5] Spellman, Barbara A., Elizabeth A. Gilbert, and Katherine S. Corker. 2018. “Open Science.” In Stevens’ Handbook of Experimental Psychology and Cognitive Neuroscience, Volume 5, Methodology, 4th Edition. New York, NY: Wiley.
[6] Wang, Ting, Lucinda Antonacci-Fulton, Kerstin Howe, Heather A. Lawson, Julian K. Lucas, Adam M. Phillippy, Alice B. Popejoy, Mobin Asri, Caryn Carson, Mark J. P. Chaisson, et al. 2022. “The Human Pangenome Project: a global resource to map genomic diversity.” Nature 604(7906):437-46.
[7] Houtkoop, Bobby Lee, Chris Chambers, Malcolm Macleod, Dorothy V.M. Bishop, Thomas E. Nichols, and Eric-Jan Wagenmakers. 2018. “Data Sharing in Psychology: A Survey on Barriers and Preconditions.” Advances in Methods and Practices in Psychological Science 1(1):70-85.
[8] Protection of Human Subjects, 45 CFR § 46 (2018).
[9] Alter, George, and Richard Gonzalez. 2018. “Responsible Practices for Data Sharing.” American Psychologist 73(2):146-56.
[10] Ritchie, Felix. 2005. “Access to business microdata in the UK: Dealing with the irreducible risks.” Monographs of official statistics, 239.
[11] Kaiser, Jocelyn, and Jeffrey Brainard. 2023. “Ready, set, share!” Science 379(6630):322-5.