ML methods are often applied, and often fail, in similar ways across different scientific fields. We aim to provide clear reporting standards for ML-based science across disciplines.
The REFORMS checklist consists of 32 items across 8 sections. It is based on an extensive review of the pitfalls and best practices in adopting ML methods. We created an accompanying set of guidelines for each item in the checklist. We include expectations about what it means to address the item sufficiently. To aid researchers new to ML-based science, we identify resources and relevant past literature.
The REFORMS checklist differs from the large body of past work on checklists in two crucial ways. First, we aimed to make our reporting standards field-agnostic, so that they can be used by researchers across fields. To that end, the items in our checklist broadly apply across fields that use ML methods. Second, past checklists for ML methods research focus on reproducibility issues that arise commonly when developing ML methods. But these issues differ from the ones that arise in scientific research. Still, past work on checklists in both scientific research and ML methods research has helped inform our checklist.
We developed the REFORMS checklist based on a consensus of 19 researchers who work on reproducibility across computer science, mathematics, social sciences, and health research. The authors convened after a workshop on the reproducibility crisis in ML-based science. A majority of the authors were speakers or organizers of the workshop.
Below, we briefly discuss each section of the checklist.
ML-based science has many researcher degrees of freedom. Each design decision can significantly impact the conclusions drawn from a study. This section focuses on clearly and precisely stating the decisions taken during a study's design. It is motivated by recent research showing that reporting these decisions with adequate depth and clarity is neither trivial nor common. For instance, Lundberg et al. find that none of the quantitative papers published in a top sociology journal in 2018 report their estimands in sufficient detail.
Computational reproducibility refers to the ability of an independent researcher to obtain the same results as those reported in a paper or manuscript. Computational reproducibility allows errors to be uncovered quickly, and it lets independent researchers verify and build on a study's results. Despite being a core tenet of computational research, computational reproducibility is hard to achieve in practice.
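As one hedged illustration (not an item from the checklist itself), the Python sketch below shows two simple practices that support computational reproducibility: fixing random seeds and recording the software environment. The file name and seed value are arbitrary assumptions on our part.

```python
import json
import random
import sys

import numpy as np
import sklearn

SEED = 0  # arbitrary seed, chosen for illustration

# Fix the seeds of the random number generators used in the analysis so
# that an independent researcher can rerun it and obtain identical results.
random.seed(SEED)
np.random.seed(SEED)

# Record the software environment alongside the results.
environment = {
    "python": sys.version,
    "numpy": np.__version__,
    "scikit-learn": sklearn.__version__,
    "seed": SEED,
}
with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)
```

Fuller solutions, such as containerized environments or pinned dependency files, give stronger guarantees; the sketch only illustrates the minimum information worth reporting.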
The section on data quality helps readers and referees understand and evaluate the quality of the data used in the study. Using poor-quality data, or data that is not suitable for answering the research question, can lead to results that are meaningless or misleading.
Different preprocessing steps can lead to vastly different outcomes from the modeling process. Even small changes, such as reordering the data preparation steps, can lead to large differences in outcomes. For example, Vandewiele et al. find that oversampling before splitting the data into training and evaluation sets led to widespread errors in pre-term birth detection.
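To make the ordering issue concrete, here is a minimal sketch contrasting the flawed and the correct order of oversampling and splitting. The use of scikit-learn, imbalanced-learn, and synthetic data is our illustrative assumption, not part of the Vandewiele et al. study.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, standing in for a study's real dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Flawed order: oversampling before splitting. Synthetic minority examples
# end up in both the training and evaluation sets, inflating reported scores.
#   X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
#   X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, random_state=0)

# Correct order: split first, then oversample the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
# The model is fit on (X_train_res, y_train_res) and evaluated on the untouched test set.
```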
There are many steps involved in developing an ML model. This makes it hard to report exact details about the model and can hinder replication by independent researchers. For example, Raff found that ML results often cannot be replicated from a paper's text alone (i.e., without the code accompanying the paper). We ask authors to specify the main steps in the modeling process, including feature selection, the types of models considered, and evaluation.
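As a hedged sketch of how these steps can be made explicit and reportable, the pipeline below (scikit-learn, with arbitrary choices of feature selector, model, and evaluation scheme assumed by us) names the feature selection, the model considered, and the evaluation in one place.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data standing in for a study's real dataset.
X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# Each modeling step is an explicit, named stage, which makes the full
# modeling process easier to report and to replicate.
model = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),  # feature selection
    ("clf", LogisticRegression(max_iter=1000)),           # model considered
])

# Evaluation: 5-fold cross-validated AUC.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Mean AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
```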
Leakage is a spurious relationship between the independent variables and the target variable that arises as an artifact of the data collection, sampling, preprocessing, or modeling steps. For example, normalizing features in the training and test data together leads to leakage since information about the test data features is included in the training data. Leakage is a major source of reproducibility failures: it affects dozens of fields and hundreds of published papers and can lead to vastly overoptimistic results. We ask authors to justify that their study does not suffer from the main sources of leakage.
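A minimal sketch of the normalization example, assuming scikit-learn and synthetic data: the scaler must be fit on the training split only, never on the combined data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Leaky: fitting the scaler on all the data lets test-set statistics
# (means and variances) influence the training features.
#   X_scaled = StandardScaler().fit_transform(X)  # then split -> leakage

# Leakage-free: fit the scaler on the training data only, then apply the
# same frozen transformation to the test data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```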
The performance of ML models is key to the scientific claims of interest. Since authors face many possible choices when selecting metrics, it is important to justify why the metrics used are appropriate for the task at hand. Communicating and reasoning about uncertainty is also critical; yet Simmonds et al. find that studies often do not report the various kinds of uncertainty that arise in the modeling process. We ask authors to report the metrics and uncertainty estimates used in enough detail to enable a judgment about whether they made valid choices for evaluating the performance of the model and drawing scientific inferences.
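As one way (among many) to report uncertainty, the sketch below bootstraps the test set to obtain a confidence interval for a chosen metric. The metric (AUC), number of resamples, interval level, and toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# y_true, y_score: test-set labels and model scores (toy values here).
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(y_true * 0.4 + rng.normal(0.3, 0.25, size=200), 0, 1)

# Point estimate of the chosen metric.
auc = roc_auc_score(y_true, y_score)

# Bootstrap the test set to quantify uncertainty in the estimate.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    if len(np.unique(y_true[idx])) < 2:  # AUC needs both classes present
        continue
    boot.append(roc_auc_score(y_true[idx], y_score[idx]))
low, high = np.percentile(boot, [2.5, 97.5])
print(f"AUC = {auc:.3f}, 95% bootstrap CI: [{low:.3f}, {high:.3f}]")
```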
ML-based science faces a number of threats to external validity. Since studies that use ML methods are often unaccompanied by external (i.e., out-of-distribution) validation, it is important to reason about these threats. Additionally, authors are best positioned to identify the boundaries within which their claims apply, preventing misunderstandings about the scope of their study.
Authors can self-regulate by using the reporting standards to identify errors and preemptively address concerns about the use of ML methods in their paper. This can also help increase the credibility of their paper, especially in fields that are newly adopting ML methods. We expect the reporting standards to be useful to authors throughout the study—during conceptualization, implementation, and communication of the results. The checklist can be included as part of the supplementary materials released alongside a paper, such as the code and data. The guidelines can help authors learn how to correctly apply our reporting standards in their own work and introduce them to underlying theories of evidence.
Referees can use the reporting standards to determine whether a study they are reviewing falls short of them. If they have concerns about a study, they can ask researchers to include the filled-out checklist in a revised version. For example, Roberts et al. use the CLAIM checklist to filter papers for a systematic review based on compliance.
Journals can require authors to submit a checklist along with their papers to set reporting standards for ML-based science. Similar checklists are in place in a number of journals; however, they are usually used for specific disciplines rather than for methods that are prevalent across disciplines. Since ML-based science is proliferating across disciplines, our reporting standards offer a method-specific (rather than discipline-specific) intervention.
Name | Affiliation | Field | Email |
---|---|---|---|
Sayash Kapoor | Princeton University | Computer Science | sayashk@princeton.edu |
Emily Cantrell | Princeton University | Sociology and Social Policy | e.cantrell@princeton.edu |
Kenny Peng | Cornell University | Computer Science | klp98@cornell.edu |
Thanh Hien (Hien) Pham | Princeton University | Computer Science | th.pham@princeton.edu |
Christopher A. Bail | Duke University | Sociology, Political Science, and Public Policy | christopher.bail@duke.edu |
Odd Erik Gundersen | Norwegian University of Science and Technology | Computer Science | odderik@ntnu.no |
Jake M. Hofman | Microsoft Research | Computational Social Science | jhofman@gmail.com |
Jessica Hullman | Northwestern University | Computer Science | jhullman@northwestern.edu |
Michael A. Lones | Heriot-Watt University | Computer Science | m.lones@hw.ac.uk |
Momin M. Malik | Mayo Clinic’s Center for Digital Health | Data Science | momin.malik@gmail.com |
Priyanka Nanayakkara | Northwestern University | Computer Science and Communication | priyankan@u.northwestern.edu |
Russell A. Poldrack | Stanford University | Psychology and Cognitive Neuroscience | poldrack@stanford.edu |
Inioluwa Deborah Raji | UC Berkeley | Computer Science | rajiinio@berkeley.edu |
Michael Roberts | University of Cambridge | Mathematics | mr808@cam.ac.uk |
Matthew J. Salganik | Princeton University | Sociology | mjs3@princeton.edu |
Marta Serra-Garcia | UC San Diego | Behavioral Economics | mserragarcia@ucsd.edu |
Brandon M. Stewart | Princeton University | Sociology | bms4@princeton.edu |
Gilles Vandewiele | Ghent University | Computer Science Engineering | vandewielegilles@gmail.com |
Arvind Narayanan | Princeton University | Computer Science | arvindn@cs.princeton.edu |