REFORMS: Reporting Standards for ML-based Science

ML methods are often applied and fail in similar ways across different scientific fields. We aim to provide clear reporting standards for ML-based science across disciplines.

Context. ML methods are being widely adopted for scientific research. Compared to older statistical methods, they promise increased predictive accuracy, the ability to process large amounts of data, and the ability to use different types of data, such as text, images, and video.
Problem. The rapid uptake of ML methods has been accompanied by concerns about validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science.
Our intervention. The key observation that motivates this project is that ML methods are often applied and fail in similar ways across disciplines. Based on an extensive review of best practices and common pitfalls in reporting ML results, we provide clear reporting standards for ML-based science, which take the form of a checklist and a paired set of guidelines.

The REFORMS checklist consists of 32 items across 8 sections, grounded in an extensive review of the pitfalls and best practices in adopting ML methods. For each item, the accompanying guidelines set out what it means to address the item sufficiently, and, to aid researchers new to ML-based science, they identify resources and relevant past literature.

The REFORMS checklist differs from the large body of past work on checklists in two crucial ways. First, we aimed to make our reporting standards field-agnostic, so the items broadly apply across fields that use ML methods. Second, past checklists for ML methods research focus on reproducibility issues that commonly arise when developing ML methods, and these differ from the issues that arise in scientific research that uses ML. Still, past work on checklists in both scientific research and ML methods research has helped inform our checklist.

We developed the REFORMS checklist based on a consensus of 19 researchers who work on reproducibility across computer science, mathematics, social sciences, and health research. The authors convened after a workshop on the reproducibility crisis in ML-based science. A majority of the authors were speakers or organizers of the workshop.

Below, we briefly discuss each section of the checklist.

Study design

ML-based science involves many researcher degrees of freedom, and each design decision can significantly affect the conclusions drawn from a study. This section focuses on clearly and precisely stating the decisions taken during a study's design, motivated by recent research showing that reporting these decisions with adequate depth and clarity is neither trivial nor common. For instance, Lundberg et al. find that none of the quantitative papers published in a top sociology journal in 2018 report their estimands in sufficient detail.

Computational reproducibility

Computational reproducibility refers to the ability of an independent researcher to obtain the same results that are reported in a paper or manuscript. It allows errors to be uncovered quickly and lets independent researchers verify and build on a study's results. Despite being a core tenet of computational research, computational reproducibility is hard to achieve in practice.
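As a small illustration (not part of the checklist itself), the Python sketch below shows two habits that help with computational reproducibility: fixing random seeds and saving a record of the software environment alongside the results. The seed value and output file name are placeholders.

```python
# Illustrative reproducibility aids: fixed seeds and an environment record.
import json
import platform
import random
import sys

import numpy as np

SEED = 20240101  # placeholder; report whichever value the study actually used


def set_seeds(seed: int = SEED) -> None:
    """Seed the random number generators used by the analysis."""
    random.seed(seed)
    np.random.seed(seed)


def record_environment(path: str = "environment.json") -> None:
    """Save Python, platform, and key package versions next to the results."""
    info = {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
        "seed": SEED,
    }
    with open(path, "w") as f:
        json.dump(info, f, indent=2)


set_seeds()
record_environment()
```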

Data quality

This section helps readers and referees understand and evaluate the quality of the data used in the study. Using poor-quality data, or data that is not suitable for answering the research question, can lead to results that are meaningless or misleading.

Data preprocessing

Different preprocessing steps can lead to vastly different outcomes from the modeling process. Even small changes, such as changing the order in which the data preparation steps take place, can lead to large differences in outcomes. For example, Vandewiele et al. find that oversampling before splitting the data into training and evaluation sets led to widespread errors in pre-term birth detection.
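The following is a minimal synthetic sketch of this pitfall in Python with scikit-learn (the data, model, and metric are illustrative, not taken from the Vandewiele et al. study). Oversampling before the split places copies of the same minority-class rows in both the training and evaluation sets, which tends to inflate the reported score; oversampling only the training split avoids this.

```python
# Synthetic illustration: oversampling before vs. after the train/test split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.1).astype(int)  # imbalanced, noise-only target


def oversample(X, y, rng):
    """Duplicate minority-class rows until the classes are balanced."""
    minority = np.flatnonzero(y == 1)
    extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]


# Wrong: oversample first, then split. Copies of the same minority rows end
# up in both the training and the evaluation set.
X_os, y_os = oversample(X, y, rng)
Xtr, Xte, ytr, yte = train_test_split(X_os, y_os, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
auc_leaky = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])

# Right: split first, then oversample only the training split.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
Xtr_os, ytr_os = oversample(Xtr, ytr, rng)
clf = RandomForestClassifier(random_state=0).fit(Xtr_os, ytr_os)
auc_clean = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])

print(f"AUC with leakage: {auc_leaky:.2f}; AUC without leakage: {auc_clean:.2f}")
```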

Modeling

Developing an ML model involves many steps. This makes it hard to report exact details about the model, and can hinder replication by independent researchers. For example, Raff found that ML results often cannot be replicated using a paper's text alone (i.e., without the code accompanying the paper). We ask authors to specify the main steps in the modeling process, including feature selection, the types of models considered, and evaluation.
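As one hypothetical way to keep these steps reportable (a sketch, not a workflow the checklist prescribes), the main modeling decisions can be encoded in a single scikit-learn pipeline together with the evaluation protocol, so the exact sequence can be shared and re-run.

```python
# Illustrative pipeline bundling feature selection, the model, and evaluation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),                   # preprocessing
    ("select", SelectKBest(f_classif, k=10)),      # feature selection
    ("model", LogisticRegression(max_iter=1000)),  # type of model considered
])

# Evaluation protocol: 5-fold cross-validated AUC, reported with the pipeline.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Mean AUC: {scores.mean():.3f} (std {scores.std():.3f})")
```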

Data leakage

Leakage is a spurious relationship between the independent variables and the target variable that arises as an artifact of the data collection, sampling, preprocessing, or modeling steps. For example, normalizing features in the training and test data together leads to leakage since information about the test data features is included in the training data. Leakage is a major source of reproducibility failures: it affects dozens of fields and hundreds of published papers and can lead to vastly overoptimistic results. We ask authors to justify that their study does not suffer from the main sources of leakage.
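A minimal sketch of the normalization example above, assuming a scikit-learn workflow on synthetic data: the scaler's statistics should come from the training split only and then be applied, unchanged, to the test split.

```python
# Synthetic illustration of leaky vs. leak-free feature normalization.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_train, X_test = train_test_split(X, random_state=0)

# Leaky: the mean and standard deviation are computed on training and test
# rows together, so information about the test data reaches the training data.
X_all_scaled = StandardScaler().fit_transform(np.vstack([X_train, X_test]))

# Leak-free: fit the scaler on the training split only, then apply it to both.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Wrapping preprocessing and the model in a single pipeline, as in the modeling sketch above, makes this ordering automatic during cross-validation.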

Metrics and uncertainty quantification

The performance of ML models is key to the scientific claims of interest. Since authors can make many possible choices when selecting metrics, it is important to justify why the metrics used are appropriate for the task at hand. Communicating and reasoning about uncertainty is also critical; yet Simmonds et al. find that studies often do not report the various kinds of uncertainty that arise in the modeling process. We ask authors to report the metrics and uncertainty estimates used in enough detail to enable a judgment about whether they made valid choices for evaluating the model's performance and drawing scientific inferences.
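As one hedged illustration of such a choice (the metric, resample count, and interval level below are illustrative, on synthetic data), a bootstrap over the test set gives a simple uncertainty estimate to report alongside the point value of the metric.

```python
# Bootstrap confidence interval for a test-set metric (synthetic data).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)     # stand-in test labels
y_score = y_true * 0.3 + rng.random(500)  # stand-in model scores

point_estimate = roc_auc_score(y_true, y_score)

# Resample the test set with replacement and recompute the metric each time.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    if len(np.unique(y_true[idx])) < 2:   # skip resamples with a single class
        continue
    boot.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC = {point_estimate:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```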

Generalizability

ML-based science faces a number of threats to external validity. Since studies that use ML methods are often unaccompanied by external (i.e., out-of-distribution) validation, it is important to reason about these threats. Additionally, authors are best positioned to identify the boundaries of applicability of their claims, which helps prevent misunderstandings about what their study shows.
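The sketch below shows one way such threats are sometimes probed, assuming the data come from distinct groups (e.g., sites or time periods): train on all but one group, evaluate on the held-out group, and report the spread across groups. The grouping variable and model here are synthetic and purely illustrative.

```python
# Held-out-group evaluation as a rough probe of out-of-distribution performance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)
groups = np.repeat(np.arange(3), 200)  # e.g., three data-collection sites

# Train on two groups, evaluate on the third, and report the per-group spread.
scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("Per-group accuracy:", [f"{s:.2f}" for s in scores])
```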

Using reporting standards

Authors can self-regulate by using the reporting standards to identify errors and preemptively address concerns about the use of ML methods in their paper. This can also help increase the credibility of their paper, especially in fields that are newly adopting ML methods. We expect the reporting standards to be useful to authors throughout the study—during conceptualization, implementation, and communication of the results. The checklist can be included as part of the supplementary materials released alongside a paper, such as the code and data. The guidelines can help authors learn how to correctly apply our reporting standards in their own work and introduce them to underlying theories of evidence.

Referees can use the reporting standards to determine whether a study they are reviewing falls short. If they have concerns about a study, they can ask researchers to include the filled-out checklist in a revised version. For example, Roberts et al. use the CLAIM checklist to filter papers for a systematic review based on compliance with the checklist.

Journals can require authors to submit a checklist along with their papers to set reporting standards for ML-based science. Similar checklists are in place in a number of journals; however, they are usually used for specific disciplines rather than for methods that are prevalent across disciplines. Since ML-based science is proliferating across disciplines, our reporting standards offer a method-specific (rather than discipline-specific) intervention.

The reporting standards were developed based on a consensus of 19 authors across disciplines (computer science, mathematics, social science, and health research):

Name · Affiliation · Field · Email
Sayash Kapoor · Princeton University · Computer Science · sayashk@princeton.edu
Emily Cantrell · Princeton University · Sociology and Social Policy · e.cantrell@princeton.edu
Kenny Peng · Cornell University · Computer Science · klp98@cornell.edu
Thanh Hien (Hien) Pham · Princeton University · Computer Science · th.pham@princeton.edu
Christopher A. Bail · Duke University · Sociology, Political Science, and Public Policy · christopher.bail@duke.edu
Odd Erik Gundersen · Norwegian University of Science and Technology · Computer Science · odderik@ntnu.no
Jake M. Hofman · Microsoft Research · Computational Social Science · jhofman@gmail.com
Jessica Hullman · Northwestern University · Computer Science · jhullman@northwestern.edu
Michael A. Lones · Heriot-Watt University · Computer Science · m.lones@hw.ac.uk
Momin M. Malik · Mayo Clinic’s Center for Digital Health · Data Science · momin.malik@gmail.com
Priyanka Nanayakkara · Northwestern University · Computer Science and Communication · priyankan@u.northwestern.edu
Russell A. Poldrack · Stanford University · Psychology and Cognitive Neuroscience · poldrack@stanford.edu
Inioluwa Deborah Raji · UC Berkeley · Computer Science · rajiinio@berkeley.edu
Michael Roberts · University of Cambridge · Mathematics · mr808@cam.ac.uk
Matthew J. Salganik · Princeton University · Sociology · mjs3@princeton.edu
Marta Serra-Garcia · UC San Diego · Behavioral Economics · mserragarcia@ucsd.edu
Brandon M. Stewart · Princeton University · Sociology · bms4@princeton.edu
Gilles Vandewiele · Ghent University · Computer Science Engineering · vandewielegilles@gmail.com
Arvind Narayanan · Princeton University · Computer Science · arvindn@cs.princeton.edu