
Public data is the lifeblood of open research and scientific inquiry. But the possibility of losing public datasets, including academic, government, and scientific data generated as part of research, is now spurring several groups to take action to save them.
In early February, the New York Times reported that more than 8,000 Web pages had been taken down across more than a dozen websites as part of President Trump’s orders to eliminate controversial diversity, equity, and inclusion (DEI) programs.
Unfortunately, the cuts have gone deeper than content tied to gender and racial ideology. Per the Times, they spanned 3,000 pages from CDC websites, including 1,000 research articles on everything from chronic disease prevention to the warning signs of Alzheimer’s disease.
One of the groups racing to document the data before it disappears is the End of Term Web Archive, which is dedicated to documenting government websites every four years when the reins of power are handed to the next president. The group has worked to document every transition since 2008.
Another group working to save data is the Environmental Data & Governance Initiative, which bills itself as a research collaborative and network of professionals working to promote scientific data. The group formed following President Trump’s first election in 2016, and it says it helped save 200 terabytes of data from government websites running under the Obama Administration.
A new group working to save data is called the Data Rescue Project. Founded by members of the International Association for Social Science Information Service & Technology (IASSIST), the Research Data Access & Preservation (RDAP) Association, and the Data Curation Network, the Data Rescue Project bills itself as “a clearinghouse for data rescue-related efforts and data access points for public US governmental data that are currently at risk.”
The Data Rescue Project encourages volunteers to document at-risk datasets using DataLumos, a crowdsourced repository for government data created by the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan.
Harvard University’s Library Innovation Lab is also working to help protect data. Last month, the group launched a new project called the Data.gov Archive that’s designed to preserve datasets listed on Data.gov, the federal government’s home for open data. The university group says it has “harvested” more than 310,000 datasets linked through Data.gov, for a total of 15 terabytes of data.
“We’ve built this project on our long-standing commitment to preserving government records and making public information available to everyone. Libraries play an essential role in safeguarding the integrity of digital information,” the group says. “By preserving detailed metadata and establishing digital signatures for authenticity and provenance, we make it easier for researchers and the public to cite and access the information they need over time.”
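To make that idea concrete, here is a minimal, hypothetical sketch of the kind of provenance record the group describes: it computes a SHA-256 digest of a harvested file and stores it alongside basic metadata so a later copy can be checked against the original. The file name, field names, and URL below are illustrative assumptions, and this is not the Library Innovation Lab’s actual pipeline, which also relies on cryptographic signatures.

```python
# Hypothetical sketch: fingerprint a harvested dataset and record metadata
# so its integrity and provenance can be verified later.
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def provenance_record(path: str, source_url: str) -> dict:
    """Hash a harvested file and bundle the digest with basic metadata."""
    digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
    return {
        "file": path,
        "source_url": source_url,
        "sha256": digest,
        "harvested_at": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    # Illustrative inputs only; any real archive would record many more fields.
    record = provenance_record("dataset.csv", "https://catalog.data.gov/dataset/example")
    print(json.dumps(record, indent=2))
```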
It’s not uncommon for data to get lost in the normal course of business. Any large organization with a sizable website is going to have missing documents and broken URLs to deal with. What’s currently happening under the Trump Administration is different, according to Lynda Kellam of the Data Rescue Project.
“The difference is that we are seeing data being removed from studies that don’t match up with the ideology of the administration,” Kellam told the Columbia Journalism Review. “This pace of takedown has been much quicker than it’s been in the past.”
When the National Institutes of Health’s popular PubMed website went down over a weekend in early March, many researchers and scientists feared the worst. The repository of more than 37 million articles, which is maintained by the NIH’s National Center for Biotechnology Information (NCBI), is a vital source of data for biomedical research.
The worst-case scenario suddenly seemed possible. “Omg did Pubmed go dark,” wrote UCLA Health researcher Thanh Neville on Bluesky, as documented in a Nature article. Luckily, it was just an IT glitch, and PubMed was soon back up and running, prompting a collective sigh of relief across the biomedical research community.
But the PubMed episode is a reminder that future access to data is not guaranteed. For Philip Bourne, the dean of the School of Data Science at the University of Virginia, PubMed’s brief time offline sent “a worrying signal.”
“As deans and university leaders, we need to make clear to governments that to be a public university means public accessibility to all the scholarship we produce, including the data from which that scholarship is derived,” Bourne wrote in a blog post.
Senior scientists, mentors, and students can also play a role in reminding others of the importance of data, the UVA Stephenson Dean wrote, and in encouraging all stakeholders to take the necessary steps to guarantee access.
“In the case of my own university, the University of Virginia, this is particularly poignant as its founder, Thomas Jefferson, one of this nation’s original founding fathers, said, ‘The most important bill in our whole code is the diffusion of knowledge among the people.’”