REFORMS: Consensus-based Recommendations for
Machine-learning-based Science


Appendix 3

This appendix provides additional details on some of the citations from the main text. We include references from the main text that address: (1) the quality of reporting in past scientific literature, or (2) examples of problems that occurred in past scientific literature. This appendix does not constitute a comprehensive list of all published references on these topics. The table has 44 entries with details about their relevance to our review.

The citations are listed in order of appearance in the main text, with section headings corresponding to the headings from the text. Some sections from the main text are omitted because they do not contain references that match our criteria for inclusion in the table. Some citations are included in the table more than once because they appear in multiple sections. Many of the references focus specifically on machine learning (ML)-based science, but we also include references about science with traditional statistical methods because some of the best practices and shortcomings are shared between ML-based science and other quantitative sciences.

Reference | Findings about reporting quality in past literature or problems in past literature | Discipline | Literature examined | ML-Focused?
MODULE 1: STUDY GOALS
Introduction
Hofman et al., 2017, “Prediction and explanation in social systems” The authors re-evaluate data from a prior paper to demonstrate how different (but equally reasonable) choices in research design can lead to different results from the same data. This includes an example of how slight differences in the definition of a research question can lead to substantially different results. Computational social science Re-evaluation of data from 1 prior paper on prediction of information cascade size on Twitter Yes
1a) Population or distribution about which the scientific claim is made
Lundberg et al., 2021, “What Is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory” Only 9 out of 32 papers (28%) provided sufficient information for a reader to “confidently” identify the target population about which the scientific claim is made (p. 553). Sociology 32 quantitative papers in 2018 volume of a top sociology journal No
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” 33 out of 49 papers (67%) define a target population. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
MODULE 2: COMPUTATIONAL REPRODUCIBILITY
Introduction
Verstynen and Kording, 2023, “Overfitting to ‘predict’ suicidal ideation” The code for the feature selection step in a flawed prior paper was not released, so Verstynen and Kording could not pinpoint the exact source of errors. Psychology, neuroscience, and biomedical engineering 1 paper on prediction of suicidal ideation Yes
Current computational reproducibility standards fall short
Stodden et al., 2018, “An empirical analysis of journal policy effectiveness for computational reproducibility” Stodden et al. attempted to contact the authors of 204 papers published in the journal Science to obtain reproducibility materials. Only 44% of authors responded. Multi-disciplinary 204 quantitative papers in Science No
Gabelica et al., 2022, “Many researchers were not compliant with their published data sharing statement: A mixed-methods study” Gabelica et al. examined 333 open-access journals indexed on BioMed Central in January 2019 and found that out of the 1,792 papers that pledged to share data upon request, 1,669 did not do so, resulting in a 93% data unavailability rate. Biology, health sciences and medicine 1,792 papers published in 333 BioMed Central open-access journals in January 2019 No
Vasilevsky et al., 2017, “Reproducible and reusable research: Are journal data sharing policies meeting the mark?” Vasilevsky et al. examined the data-sharing policies of 318 biomedical journals and discovered that almost one-third lacked any such policies, and those that did often lacked clear guidelines for author compliance. Biology, health sciences and medicine 318 biomedical journals (Biochemistry and Molecular Biology, Biology, Cell Biology, Crystallography, Developmental Biology, Biomedical Engineering, Immunology, Medical Informatics, Microbiology, Microscopy, Multidisciplinary Sciences, and Neurosciences) No
Computational reproducibility allows independent researchers to find errors in original papers
Hofman et al., 2021, “Expanding the scope of reproducibility research through data analysis replications” Hofman et al. analyze 11 papers and find various shortcomings in this body of literature. Multi-disciplinary 11 computational social science papers No
Vandewiele et al., 2021, “Overly optimistic prediction results on imbalanced data: A case study of flaws and benefits when applying over-sampling” Vandewiele et al. analyze 24 papers on pre-term birth prediction and find 21 of these papers suffer from leakage. Medicine 24 papers on pre-term risk prediction Yes
MODULE 3: DATA QUALITY
3a) Data source(s)
Navarro et al., 2022, “Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review” 98% of articles adhered to the guidelines for reporting data source from the TRIPOD statement. Epidemiology & medicine 152 articles on diagnostic or prognostic prediction models across medical fields, published 2018-2019 Yes
Yusuf et al., 2020, “Reporting quality of studies using machine learning models for medical diagnosis: a systematic review” 24 out of 28 papers (86%) reported information about their data source, defined as “Where and when potentially eligible participants were identified (setting, location and dates)” (p. 3). Medicine 28 “medical research studies that used ML methods to aid clinical diagnosis,” published July 2015-July 2018 Yes
Kim et al., 2016, “Garbage in, Garbage Out: Data Collection, Quality Assessment and Reporting Standards for Social Media Data Use in Health Research, Infodemiology and Digital Disease Detection” Studies that utilize social media data frequently omit important information about their data collection process, such as details about the development and assessment of search filters. This paper provides a framework for reporting this information. Health media Studies that use social media data (this is not a formal review paper, but it provides several examples) No
Geiger et al., 2020, “Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?” There was “wide divergence” in whether papers followed best practices for reporting the data annotation process, such as reporting: “who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers is disclosed, and if the training data is publicly available” (p. 325). Multi-disciplinary: “the papers represented political science, public health, NLP, sentiment analysis, cybersecurity, content moderation, hate speech, information quality, demographic profiling, and more” (p. 328) 164 “machine learning application papers... that classified tweets from Twitter” (p. 326) Yes
3b) Sampling frame
Navarro et al., 2022, “Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review” 105 out of 152 studies (69%) reported their eligibility criteria. Epidemiology & medicine 152 articles on diagnostic or prognostic prediction models across medical fields, published 2018-2019 Yes
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” 41 out of 49 papers (84%) reported their sampling frame, and 32 out of 49 papers (65%) reported their eligibility criteria. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
Porzsolt et al., 2019, “Inclusion and exclusion criteria and the problem of describing homogeneity of study populations in clinical trials” 75 out of 100 studies (75%) reported inclusion criteria. 6 of those 75 studies (8%) also reported exclusion criteria. Medicine 100 publications on “quality of life” assessments No
3d) Outcome variable
Credé and Harms, 2021, “Three cheers for descriptive statistics—and five more reasons why they matter” In a review of literature that was still a work-in-progress at the time Credé and Harms published this commentary, “Among the articles coded to date, less than half report the ethnicity of the participants or the types of jobs held by the participants and only 56% report data on the industry in which the data were collected. Other interesting—and to meta-analysts potentially important—information is also remarkably often unreported” (p. 486). (Note: This commentary discusses descriptive statistics broadly, not just descriptive statistics for outcome variables.) Industrial and organizational psychology Articles from four top journals in industrial and organizational psychology (number of articles is not reported) No
Larson-Hall and Plonsky, 2015, “Reporting and interpreting quantitative research findings: What gets reported and recommendations for the field” Meta-analyses frequently had to omit large numbers of primary articles from their analyses due to insufficient descriptive statistics in the primary articles. (Note: This article discusses descriptive statistics broadly, not just descriptive statistics for outcome variables.) Second language acquisition Approximately 90 meta-analyses in second language acquisition No
3e) Sample size
Plonsky, 2013, “Study Quality in SLA: An Assessment of Designs, Analyses, and Reporting Practices in Quantitative L2 Research” 99% of studies reported sample size. Second language acquisition 606 studies in second language acquisition journals, published 1990-2010 No
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” 100% of 49 longitudinal studies reported the total number of participants from the first wave of their study. However, only 25 out of 49 (51%) reported the number of participants after attrition at each subsequent wave. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
3f) Missingness
McKnight et al., 2007, “Missing Data: A Gentle Introduction” Around 90% of articles had missing data, and the average amount of missing data per study was over 30%. Furthermore, “few of the articles included explicit mention of missing data, and even fewer indicated that the authors attended to missing data, either by performing statistical procedures or by making disclaimers regarding the studies in the results and conclusions” (p. 3). Psychology Over 300 publications from a prominent psychology journal No
Peugh and Enders, 2004, “Missing Data in Educational Research: A Review of Reporting Practices and Suggestions for Improvement” Among the articles Peugh and Enders reviewed, “[d]etails concerning missing data were seldom reported” and “[t]he methods used to handle missing data were, in many cases, difficult to ascertain because explicit descriptions of missing-data procedures were rare” (p. 537). However, Peugh and Enders were able to infer the amount of missingness in some studies by examining the “discrepancy between the reported degrees of freedom for a given analysis and the degrees of freedom that one would expect on the basis of the stated sample size and design characteristics” (p. 537). In articles published in 1999, they detected missing data in 16% of studies, but they write that this is likely a “gross underestimate” of the actual prevalence of missing data. Among articles published in 2003, they were able to detect missing data in 42% of articles, which is higher than in 1999 due to changes in reporting practices following a recommendation by an American Psychological Association task force. Educational research 989 studies published in 1999 and 545 studies published in 2003 in 23 applied educational research journals No
Salganik et al., 2020, Supplementary information for “Measuring the predictability of life outcomes using a scientific mass collaboration” There are many reasons for missing data in survey data, including a respondent not participating in a given wave of a longitudinal survey, respondents refusing to answer some questions, skip patterns in the survey design, and redaction for privacy. In a modified version of a well-known, high-quality social survey dataset, 73% of possible data entries were missing, and the largest source of missingness was survey skip patterns. This high level of missingness emphasizes the importance of careful attention to handling missing data. Sociology 1 study with a well-known social survey data set Yes
Nijman et al., 2022, “Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review” “A total of 56 (37%) prediction model studies did not report on missing data and could not be analyzed further. We included 96 (63%) studies which reported on the handling of missing data. Across the 96 studies, 46 (48%) did not include information on the amount or nature of the missing data” (p. 220). Medicine 152 ML-based clinical prediction model studies, published 2018-2019 Yes
Navarro et al., 2022, “Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review” “Forty-four studies reported how missing data were handled (28.9%, 95% CI 22.3 to 36.6). The missing data item consists of four sub-items of which three were rarely addressed in included studies. Within 28 studies that reported handling of missing data: three studies reported the software used (10.7%, CI 3.7 to 27.2), four studies reported the variables included in the procedure (14.3%, CI 5.7 to 31.5) and no study reported the number of imputations (0%, CI 0.0 to 39.0)” (pp. 6-7). Epidemiology & medicine 152 articles on diagnostic or prognostic prediction models across medical fields, published 2018-2019 Yes
Little et al., 2013, “On the Joys of Missing Data” “Among the 80 reviewed studies, only 45 (56.25%) mentioned missing data explicitly in the text or a table of descriptive statistics. Of those 45, only three mentioned testing whether the missingness was related to other variables, justifying their [missingness at random] assumption” (p. 156). Pediatric psychology 80 empirical studies in the 2012 issues of a pediatric psychology journal No
Nicholson et al., 2016, “Attrition in developmental psychology” Among 541 longitudinal studies, only 253 (47%) discussed missingness due to attrition, and only 99 (18%) explicitly discussed whether missingness due to attrition was “missing at random,” “missing completely at random,” or “missing not at random.” Developmental psychology 541 longitudinal studies in major developmental journals, published 2009 and 2012 No
Sterner, 2011, “What Is Missing in Counseling Research? Reporting Missing Data” In the first journal, “14 of 66 (21%) articles referenced missing data on some level. Of these 14 articles, 11 mentioned missing data specifically... In the remaining 52 JCD articles, no information was provided on whether missing data existed.” In the second journal, “one of 28 (4%) empirically based research articles made reference to screening for missing data; however, no mention was made of missing data in the remaining articles” (p. 56). Counseling 94 empirical research articles in two top counseling journals, published 2004 to 2008 No
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” Only 19 out of 49 articles (39%) reported on missing data items at each longitudinal wave, and only 2 out of 42 articles (5%) that had missing data in their analyses described imputation, weighting, or sensitivity analyses for handling missing data. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
Hussain et al., 2017, “Quality of missing data reporting and handling in palliative care trials demonstrates that further development of the CONSORT statement is required: a systematic review” 101 out of 108 studies (94%) reported the number of participants who were missing in the primary outcome analysis; however, reporting rates were lower for other details about missing data and for methods of handling missing data. Epidemiology 108 articles on palliative care randomized controlled trials, published 2009-2014 No
3g) Dataset for evaluation is representative
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” Among several reporting criteria this review examined, “the criteria in the checklist representing selection bias were the least frequently reported overall” (p. 285). Specifically, selection-in biases were discussed in 14 out of 49 articles (28%), comparison of consenters with non-consenters was discussed in 1 out of 47 applicable articles (2%), and loss to follow-up was accounted for in the analyses of 1/41 applicable articles (5%). Additionally, 37 out of 49 articles (75%) discuss how their results relate to the target population. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
MODULE 4: DATA PREPROCESSING
4c) Data transformations
Vandewiele et al., 2021, “Overly optimistic prediction results on imbalanced data: a case study of flaws and benefits when applying over-sampling” Vandewiele et al. analyze 24 papers on pre-term birth prediction and find 11 of these papers improperly transform data (by oversampling before splitting into train and test sets). Medicine 24 papers on pre-term risk prediction Yes
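The oversampling pitfall documented by Vandewiele et al. is mechanical enough that a brief illustration may help. The following Python sketch uses synthetic data and is not code from any of the reviewed papers; the helper naive_oversample and all parameter choices are ours. It contrasts oversampling the full dataset before the train/test split, which places exact copies of minority-class records in both sets and inflates the test score, with oversampling applied only to the training set after the split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # features carry no real signal
y = (rng.random(1000) < 0.05).astype(int)  # ~5% positive (imbalanced) labels

def naive_oversample(X, y, rng):
    """Balance classes by duplicating randomly chosen minority-class rows."""
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    return X[idx], y[idx]

# Leaky pipeline: oversample the full dataset, then split. Exact copies of
# minority-class rows end up in both train and test, so the model can simply
# memorize them and the test score is inflated.
X_os, y_os = naive_oversample(X, y, rng)
X_tr, X_te, y_tr, y_te = train_test_split(X_os, y_os, random_state=0)
leaky = roc_auc_score(
    y_te, RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
)

# Non-leaky pipeline: split first, oversample only the training set. The test
# set now contains no copies of training rows, and the score stays near chance
# (as it should, since the features here are pure noise).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_tr, y_tr = naive_oversample(X_tr, y_tr, rng)
clean = roc_auc_score(
    y_te, RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
)

print(f"AUC with oversampling before the split: {leaky:.2f}")
print(f"AUC with oversampling after the split:  {clean:.2f}")
```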
MODULE 5: MODELING
5d) Model selection method
Neunhoeffer and Sternberg, 2019, “How Cross-Validation Can Go Wrong and What to Do About It” Neunhoeffer and Sternberg demonstrate that the main findings of a prominent political science paper fail to reproduce due to improper model selection. In particular, model selection was done on the same data that was used for evaluation. Political science 1 prominent political science paper Yes
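As a concrete illustration of this distinction, the Python sketch below uses synthetic data; the model family, hyperparameter grid, and variable names are our own and are not taken from the paper that Neunhoeffer and Sternberg analyze. It contrasts scoring every candidate model on the evaluation set and reporting the best of those scores with selecting a model by cross-validation inside the training data and scoring the single selected model once on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
param_grid = {"max_depth": [2, 4, 8, None], "n_estimators": [50, 200]}

# Problematic workflow: every candidate is scored on the evaluation set and the
# best of those scores is reported. The evaluation data has then been used for
# model selection, so the reported number is an optimistic, selected-on-test estimate.
best_on_test = max(
    accuracy_score(
        y_te,
        RandomForestClassifier(max_depth=d, n_estimators=n, random_state=0)
        .fit(X_tr, y_tr)
        .predict(X_te),
    )
    for d in param_grid["max_depth"]
    for n in param_grid["n_estimators"]
)

# Preferable workflow: candidates are compared by cross-validation within the
# training data only; the single selected model is then scored once on the
# held-out evaluation set.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_tr, y_tr)
held_out = accuracy_score(y_te, search.predict(X_te))

print(f"best score when candidates are selected on the test set: {best_on_test:.3f}")
print(f"held-out score of the cross-validation-selected model:   {held_out:.3f}")
```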
5e) Hyper-parameter selection
Dodge et al., 2019, “Show Your Work: Improved Reporting of Experimental Results” Among 50 randomly sampled papers from a prominent natural language processing conference, Dodge et al. find that while 74% of papers reported at least some information about the best-performing hyperparameters, 10% or fewer reported more specific details about the hyperparameter search or the effect of hyperparameters on performance. Natural language processing 50 random papers from a prominent natural language processing conference in 2018 Yes
5f) Appropriate baselines
Sculley et al., 2018, “Winner’s curse? On pace, progress, and empirical rigor” Sculley et al. discuss five papers that provide evidence of improper comparison with baselines in different areas of ML, suggesting that empirical progress in the field can be misleading. ML 5 papers identifying poor performance compared to baselines in different areas of ML Yes
MODULE 6: DATA LEAKAGE
Introduction
Kapoor and Narayanan, 2022, “Leakage and the reproducibility crisis in ML-based science” Kapoor and Narayanan found that leakage affects hundreds of papers across 17 fields. Multi-disciplinary A survey of leakage issues across 17 fields Yes
Train-test separation is maintained
Poldrack et al., 2020, “Establishment of best practices for evidence for prediction: A review” Poldrack et al. find that, of 100 neuropsychiatry studies claiming to predict patient outcomes, 45 reported only in-sample statistical fit as evidence of predictive accuracy. Neuropsychiatry 100 studies published between December 24, 2017 and October 30, 2018, identified in PubMed using the search terms “fMRI prediction” and “fMRI predict” Yes
Dependencies or duplicates between datasets
Roberts et al., 2021, “Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans” Roberts et al. discuss the issue of “Frankenstein” datasets: datasets assembled from multiple other sources of data that can end up using the same data twice, for instance when two datasets that rely on the same underlying data source are combined into a larger dataset. Medicine 62 studies that claimed to diagnose or prognosticate COVID-19 using chest radiographs or CT scans Yes
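One simple guard against pooling the same underlying records twice is to deduplicate by content before any train/test split. The sketch below is our own illustration with hypothetical directory names and file types, not a procedure from Roberts et al.; it removes only byte-identical copies, so re-encoded or cropped versions of the same image would still require near-duplicate (e.g., perceptual-hash) detection.

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """Hash a file's raw bytes, so identical content collides regardless of filename."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def pool_without_exact_duplicates(source_dirs: list[Path], pattern: str = "*.png") -> list[Path]:
    """Pool files from several sources, keeping one file per distinct content hash."""
    seen: dict[str, Path] = {}
    for source in source_dirs:
        for path in sorted(source.rglob(pattern)):
            seen.setdefault(content_hash(path), path)
    return list(seen.values())

# Hypothetical usage: two public imaging collections that may share underlying
# images are pooled only after byte-identical duplicates have been dropped,
# and only then is the pooled set split into training and test data.
pooled = pool_without_exact_duplicates([Path("dataset_a"), Path("dataset_b")])
print(f"{len(pooled)} unique files retained after exact-duplicate removal")
```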
MODULE 7: METRICS AND UNCERTAINTY
7b) Uncertainty estimates
Simmonds et al., 2022, “How is model-related uncertainty quantified and reported in different disciplines?” Simmonds et al. show that, across seven fields, none consistently reported complete model uncertainty, and that the types of uncertainty reported varied by field. Multi-disciplinary 496 studies across 7 fields that included statistical models No
MODULE 8: GENERALIZABILITY AND LIMITATIONS
Introduction
Raji et al., 2022, “The Fallacy of AI Functionality” Raji et al. review real-world applications of technologies that claim to use ML and categorize several ways in which such technologies frequently failed, including “lack of robustness to changing external conditions” (p. 9). Computer science and law (real-world ML applications) 283 cases of failures of technology that claimed to be AI, ML, or data-driven, between 2012 and 2021 Yes
Liao et al., 2021, “Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning” Liao et al. find that the same types of evaluation failures occur across a wide range of ML tasks and algorithms. They provide a taxonomy of common internal and external validity failures. Computer science 107 “survey papers from computer vision, natural language processing, recommender systems, reinforcement learning, graph processing, metric learning, and more” Yes
Reporting on external validity falls short in past literature
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” 37 out of 49 papers (75%) discuss how the findings from their sample generalize to their target population, and 26 out of 49 papers (53%) discuss generalizability beyond the target population. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
Bozkurt et al., 2020, “Reporting of demographic data and representativeness in machine learning models using electronic health records” The authors argue that descriptive statistics about the study sample should be reported so that readers can assess how representative the sample is of the target population. They find that of 164 studies that trained ML models with electronic health records data, “Race/ethnicity was not reported in 64%; gender and age were not reported in 24% and 21% of studies, respectively. Socioeconomic status of the population was not reported in 92% of studies.” They also find, “Few models (12%) were validated using external populations” (p. 1878). Medicine 164 studies that trained ML models with electronic health records data Yes
Navarro et al., 2023, “Systematic review finds ‘spin’ practices and poor reporting standards in studies on machine learning-based prediction models” “In the main text, 86/152 (56.6% [95% CI 48.6 - 64.2]) studies made recommendations to use the model in clinical practice, however, 74/86 (86% [95% CI 77.2 - 91.8]) lacked external validation in the same article. Out of the 13/152 (8.6% [95% CI 5.1 - 14.1]) studies that recommended the use of the model in a different setting or population, 11/13 (84.6% [95% CI 57.8 - 95.7]) studies lacked external validation” (p. 104). Epidemiology & medicine 152 articles on diagnostic or prognostic prediction models across medical fields, published 2018-2019 Yes