Much of the current advancement in applied AI is the result of work done over the last decade to collect, curate, and make available the training data that unlocks such capabilities in the latest deep learning models. However, that effort was largely aimed at “general-purpose” AI. To build ‘AI for Science’, a similar effort to collect and curate data across a wide expanse of scientific domains will be needed.
Below, I describe three categories of data that could be useful for training AI models for science. These models range from assistive agents to prediction models and are applicable across any stage of the research workflow (as covered in Edition 1).
Data That’s Available and Accessible
Think of open-access scientific journals, public research datasets hosted by organizations like the National Center for Biotechnology Information (NCBI) in the US or the Australian Research Data Commons (ARDC) in Australia, and code repositories published by individual labs on platforms like GitHub and Kaggle. A lot of scientific data is already openly shared, and there is a growing trend of inter-institutional collaboration in academic research driving the availability of more open-access datasets. Platforms like Google Cloud’s Public Datasets program and Data Commons also play important roles in making vast amounts of scientific data easily accessible and usable for AI applications.
However, more work is needed to make this available data easily deployable in different computational environments. Building the right pipelines to move data from public portals into institutional environments isn’t always trivial. The challenges are further exacerbated by the multimodality of scientific data, a hurdle that must be overcome if we are to take full advantage of current advances in multimodal foundation models. Fortunately, much can be accomplished by developing simple abstractions and API endpoints on top of this data, which both academia and commercial providers are already working on (CERN, OpenNeuro).
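To make that concrete, here is a minimal sketch of the kind of thin abstraction I have in mind, using NCBI’s public E-utilities API as the example portal. The helper function and its parameters are my own illustration, not part of any existing library; a real pipeline would add caching, retries, and authentication where needed.

```python
# A minimal sketch of wrapping a public data portal behind a simple function,
# using NCBI's E-utilities search endpoint as an example. The helper name,
# query term, and defaults below are illustrative, not a standard interface.
import requests

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(term: str, max_results: int = 5) -> list[str]:
    """Return PubMed IDs matching a free-text query."""
    resp = requests.get(
        f"{EUTILS_BASE}/esearch.fcgi",
        params={"db": "pubmed", "term": term, "retmax": max_results, "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

if __name__ == "__main__":
    # Example query; swap in whatever topic your pipeline needs.
    print(search_pubmed("CRISPR gene editing"))
```

Even a small wrapper like this lets downstream tooling treat a public portal as just another function call, which is most of what “easy abstractions on top of the data” amounts to in practice.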
Data That’s Available but Not Accessible
Then there’s the data that exists but isn’t accessible. This includes scientific papers locked behind paywalls, non-digitized textbooks, proprietary internal code bases, and experimental data stuck in institutional silos. Making this data available could open up new avenues for research. While intellectual property (IP) and proprietary considerations might prevent this data from being fully open-access, the right incentives and policy frameworks could encourage institutions and labs to share what currently languishes in archival storage. Partnerships between tech companies and research institutions, such as IBM’s collaboration with NASA for the Open Science Data Initiative, demonstrate how private sector involvement can help unlock valuable data assets.
Many policy frameworks being considered by governments worldwide, such as the National Artificial Intelligence Research Resource (NAIRR) in the US, recognize the importance of this data asset class and are thinking about building platforms and incentives to encourage more open sharing.
Data That’s Not (Yet) Available
The third category is perhaps the most intriguing: data that doesn’t exist yet and needs to be created with the advent of AI for science. This includes lab methodologies (e.g., “How is a specific research workflow conducted?”), synthetic data generated by ML models to supplement real data, curated Q&A examples to better align large language models (LLMs), and benchmark data for evaluating AI models on different scientific tasks.
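As a rough illustration of the last two items, here is a hypothetical schema for a single curated Q&A / benchmark record. The field names are my own and do not follow any established standard; they simply show the kind of structure such datasets would need.

```python
# A hypothetical schema for a curated Q&A / benchmark record used to align or
# evaluate LLMs on scientific tasks. All field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ScienceQARecord:
    question: str                  # the prompt posed to the model
    reference_answer: str          # expert-written answer used for grading
    domain: str                    # e.g. "genomics", "materials science"
    source_protocol: str           # lab methodology or paper the answer is grounded in
    reviewer_id: str               # who curated and verified the pair
    tags: list[str] = field(default_factory=list)

record = ScienceQARecord(
    question="<question grounded in a documented lab workflow>",
    reference_answer="<expert-verified answer>",
    domain="<scientific domain>",
    source_protocol="<protocol or paper identifier>",
    reviewer_id="<curator identifier>",
    tags=["lab-methodology", "benchmark"],
)
```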
Recent months have seen a significant acceleration in the creation of this kind of data, but it remains unorganized and lacks scientific rigor. While forums are emerging where AI models, their outputs, and datasets are openly shared, the necessary guardrails and governance for their safe use in science are still missing (an AI-for-Science version of the FAIR principles: Findable, Accessible, Interoperable, Reusable). Data and model provenance is another critical aspect, as a globally recognized system for tracking and citing these artifacts will be essential for building trust. The scientific community must come together to address these questions and find viable solutions.
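For provenance specifically, even a lightweight, citable record would go a long way. Below is a sketch of what such a record might contain, assuming a simple content hash for integrity checks; none of these fields follow an existing AI-for-Science standard.

```python
# A hypothetical provenance record for a dataset or model artifact. The fields
# illustrate the kind of metadata a recognized tracking system might carry.
import hashlib
from datetime import datetime, timezone

def make_provenance_record(artifact_path: str, source: str, license_id: str,
                           derived_from: list[str]) -> dict:
    """Build a citable provenance record with a content hash for integrity checks."""
    with open(artifact_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "artifact": artifact_path,
        "sha256": digest,              # ties the citation to exact file contents
        "source": source,              # originating lab, portal, or model run
        "license": license_id,         # e.g. "CC-BY-4.0"
        "derived_from": derived_from,  # upstream datasets or models (lineage)
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```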
Moving Forward Thoughtfully
Making AI successful in scientific research is not just about leveraging the data we have but also about unlocking inaccessible data and creating new data responsibly. Technical improvements, supportive policies, and collaborative efforts from the scientific community are crucial. By addressing these needs thoughtfully, we can pave the way for a more robust, data-driven approach to scientific research, ensuring that AI becomes an invaluable tool for scientists worldwide.
Some things that I am keeping an eye on related to this:
Recommendations from PCAST (President’s Council of Advisors on Science and Technology) in the US: Link
The work of the NAIC (National AI Centre) in Australia: Link
Communities like Kaggle where data scientists and researchers are sharing these artifacts: Link
Initiatives like Data Commons, the Earth Engine Data Catalog, and the All of Us Research Hub as models of new data-sharing approaches.