REFORMS: Consensus-based Recommendations for
Machine-learning-based Science


Appendix 3

This appendix provides additional details on some of the citations from the main text. We include references from the main text that address: (1) the quality of reporting in past scientific literature, or (2) examples of problems that occurred in past scientific literature. This appendix does not constitute a comprehensive list of all published references on these topics. The table has 44 entries with details about their relevance to our review.

The citations are listed in order of appearance in the main text, with section headings corresponding to the headings from the text. Some sections from the main text are omitted because they do not contain references that match our criteria for inclusion in the table. Some citations are included in the table more than once because they appear in multiple sections. Many of the references focus specifically on machine learning (ML)-based science, but we also include references about science with traditional statistical methods because some of the best practices and shortcomings are shared between ML-based science and other quantitative sciences.

Reference | Findings about reporting quality in past literature or problems in past literature | Discipline | Literature examined | ML-Focused?
MODULE 1: STUDY GOALS
Introduction
Hofman et al., 2017, “Prediction and explanation in social systems” The authors re-evaluate data from a prior paper to demonstrate how different (but equally reasonable) choices in research design can lead to different results from the same data. This includes an example of how slight differences in the definition of a research question can lead to substantially different results. Computational social science Re-evaluation of data from 1 prior paper on prediction of information cascade size on Twitter Yes
1a) Population or distribution about which the scientific claim is made
Lundberg et al., 2021, “What Is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory” Only 9 out of 32 papers (28%) provided sufficient information for a reader to “confidently” identify the target population about which the scientific claim is made (p. 553). Sociology 32 quantitative papers in 2018 volume of a top sociology journal No
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” 33 out of 49 papers (67%) define a target population. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
MODULE 2: COMPUTATIONAL REPRODUCIBILITY
Introduction
Verstynen and Kording, 2023, “Overfitting to ‘predict’ suicidal ideation” The code for the feature selection step in a flawed prior paper was not released, so Verstynen and Kording could not pinpoint the exact source of errors. Psychology, neuroscience, and biomedical engineering 1 paper on prediction of suicidal ideation Yes
Current computational reproducibility standards fall short
Stodden et al., 2018, “An empirical analysis of journal policy effectiveness for computational reproducibility” Stodden et al. attempted to contact the authors of 204 papers published in the journal Science to obtain reproducibility materials. Only 44% of authors responded. Multi-disciplinary 204 quantitative papers in Science No
Gabelica et al., 2022, “Many researchers were not compliant with their published data sharing statement: A mixed-methods study” Gabelica et al. examined 333 open-access journals indexed on BioMed Central in January 2019 and found that out of the 1,792 papers that pledged to share data upon request, 1,669 did not do so, resulting in a 93% data unavailability rate. Biology, health sciences and medicine 1,792 papers published in 333 BioMed Central open-access journals in January 2019 No
Vasilevsky et al., 2017, “Reproducible and reusable research: Are journal data sharing policies meeting the mark?” Vasilevsky et al. examined the data-sharing policies of 318 biomedical journals and discovered that almost one-third lacked any such policies, and those that did often lacked clear guidelines for author compliance. Biology, health sciences and medicine 318 biomedical journals (Biochemistry and Molecular Biology, Biology, Cell Biology, Crystallography, Developmental Biology, Biomedical Engineering, Immunology, Medical Informatics, Microbiology, Microscopy, Multidisciplinary Sciences, and Neurosciences) No
Computational reproducibility allows independent researchers to find errors in original papers
Hofman et al., 2021, “Expanding the scope of reproducibility research through data analysis replications” Hofman et al. analyze 11 papers and find various shortcomings in this body of literature. Multi-disciplinary 11 computational social science papers No
Vandewiele et al., 2021, “Overly optimistic prediction results on imbalanced data: A case study of flaws and benefits when applying over-sampling” Vandewiele et al. analyze 24 papers on pre-term birth prediction and find 21 of these papers suffer from leakage. Medicine 24 papers on pre-term risk prediction Yes
MODULE 3: DATA QUALITY
3a) Data source(s)
Navarro et al., 2022, “Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review” 98% of articles adhered to the guidelines for reporting data source from the TRIPOD statement. Epidemiology & medicine 152 articles on diagnostic or prognostic prediction models across medical fields, published 2018-2019 Yes
Yusuf et al., 2020, “Reporting quality of studies using machine learning models for medical diagnosis: a systematic review” 24 out of 28 papers (86%) reported information about their data source, defined as “Where and when potentially eligible participants were identified (setting, location and dates)” (p. 3). Medicine 28 “medical research studies that used ML methods to aid clinical diagnosis,” published July 2015-July 2018 Yes
Kim et al., 2016, “Garbage in, Garbage Out: Data Collection, Quality Assessment and Reporting Standards for Social Media Data Use in Health Research, Infodemiology and Digital Disease Detection” Studies that utilize social media data frequently omit important information about their data collection process, such as details about the development and assessment of search filters. This paper provides a framework for reporting this information. Health media Studies that use social media data (this is not a formal review paper, but it provides several examples) No
Geiger et al., 2020, “Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?” There was “wide divergence” in whether papers followed best practices for reporting the data annotation process, such as reporting: “who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers is disclosed, and if the training data is publicly available” (p. 325). Multi-disciplinary: “the papers represented political science, public health, NLP, sentiment analysis, cybersecurity, content moderation, hate speech, information quality, demographic profiling, and more” (p. 328) 164 “machine learning application papers... that classified tweets from Twitter” (p. 326) Yes
3b) Sampling frame
Navarro et al., 2022, “Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review” 105 out of 152 studies (69%) reported their eligibility criteria. Epidemiology & medicine 152 articles on diagnostic or prognostic prediction models across medical fields, published 2018-2019 Yes
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” 41 out of 49 papers (84%) reported their sampling frame, and 32 out of 49 papers (65%) reported their eligibility criteria. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
Porzsolt et al., 2019, “Inclusion and exclusion criteria and the problem of describing homogeneity of study populations in clinical trials” 75 out of 100 studies (75%) reported inclusion criteria. 6 of those 75 studies (8%) also reported exclusion criteria. Medicine 100 publications on “quality of life” assessments No
3d) Outcome variable
Credé and Harms, 2021, “Three cheers for descriptive statistics—and five more reasons why they matter” In a review of literature that was still a work-in-progress at the time Credé and Harms published this commentary, “Among the articles coded to date, less than half report the ethnicity of the participants or the types of jobs held by the participants and only 56% report data on the industry in which the data were collected. Other interesting—and to meta-analysts potentially important—information is also remarkably often unreported” (p. 486). (Note: This commentary discusses descriptive statistics broadly, not just descriptive statistics for outcome variables.) Industrial and organizational psychology Articles from four top journals in industrial and organizational psychology (number of articles is not reported) No
Larson-Hall and Plonsky, 2015, “Reporting and interpreting quantitative research findings: What gets reported and recommendations for the field” Meta-analyses frequently had to omit large numbers of primary articles from their analyses due to insufficient descriptive statistics in the primary articles. (Note: This article discusses descriptive statistics broadly, not just descriptive statistics for outcome variables.) Second language acquisition Approximately 90 meta-analyses in second language acquisition No
3e) Sample size
Plonsky, 2013, “Study Quality in SLA: An Assessment of Designs, Analyses, and Reporting Practices in Quantitative L2 Research” 99% of studies reported sample size. Second language acquisition 606 studies in second language acquisition journals, published 1990-2010 No
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” 100% of 49 longitudinal studies reported the total number of participants from the first wave of their study. However, only 25 out of 49 (51%) reported the number of participants after attrition at each subsequent wave. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
3f) Missingness
McKnight et al., 2007, “Missing Data: A Gentle Introduction” Around 90% of articles had missing data, and the average amount of missing data per study was over 30%. Furthermore, “few of the articles included explicit mention of missing data, and even fewer indicated that the authors attended to missing data, either by performing statistical procedures or by making disclaimers regarding the studies in the results and conclusions” (p. 3). Psychology Over 300 publications from a prominent psychology journal No
Peugh and Enders, 2004, “Missing Data in Educational Research: A Review of Reporting Practices and Suggestions for Improvement” Among the articles Peugh and Enders reviewed, “[d]etails concerning missing data were seldom reported” and “[t]he methods used to handle missing data were, in many cases, difficult to ascertain because explicit descriptions of missing-data procedures were rare” (p. 537). However, Peugh and Enders were able to infer the amount of missingness in some studies by examining the “discrepancy between the reported degrees of freedom for a given analysis and the degrees of freedom that one would expect on the basis of the stated sample size and design characteristics” (p. 537). In articles published in 1999, they detected missing data in 16% of studies, but they write that this is likely a “gross underestimate” of the actual prevalence of missing data. Among articles published in 2003, they were able to detect missing data in 42% of articles, which is higher than in 1999 due to changes in reporting practices following a recommendation by an American Psychological Association task force. Educational research 989 studies published in 1999 and 545 studies published in 2003 in 23 applied educational research journals No
Salganik et al., 2020, Supplementary information for “Measuring the predictability of life outcomes using a scientific mass collaboration” There are many reasons for missing data in survey data, including a respondent not participating in a given wave of a longitudinal survey, respondents refusing to answer some questions, skip patterns in the survey design, and redaction for privacy. In a modified version of a well-known, high-quality social survey dataset, 73% of possible data entries were missing, and the largest source of missingness was survey skip patterns. This high level of missingness emphasizes the importance of careful attention to handling missing data. Sociology 1 study with a well-known social survey data set Yes
Nijman et al., 2022, “Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review” “A total of 56 (37%) prediction model studies did not report on missing data and could not be analyzed further. We included 96 (63%) studies which reported on the handling of missing data. Across the 96 studies, 46 (48%) did not include information on the amount or nature of the missing data” (p. 220). Medicine 152 ML-based clinical prediction model studies, published 2018-2019 Yes
Navarro et al., 2022, “Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review” “Forty-four studies reported how missing data were handled (28.9%, 95% CI 22.3 to 36.6). The missing data item consists of four sub-items of which three were rarely addressed in included studies. Within 28 studies that reported handling of missing data: three studies reported the software used (10.7%, CI 3.7 to 27.2), four studies reported the variables included in the procedure (14.3%, CI 5.7 to 31.5) and no study reported the number of imputations (0%, CI 0.0 to 39.0)” (pp. 6-7). Epidemiology & medicine 152 articles on diagnostic or prognostic prediction models across medical fields, published 2018-2019 Yes
Little et al., 2013, “On the Joys of Missing Data” “Among the 80 reviewed studies, only 45 (56.25%) mentioned missing data explicitly in the text or a table of descriptive statistics. Of those 45, only three mentioned testing whether the missingness was related to other variables, justifying their [missingness at random] assumption” (p. 156). Pediatric psychology 80 empirical studies in the 2012 issues of a pediatric psychology journal No
Nicholson et al., 2016, “Attrition in developmental psychology” Among 541 longitudinal studies, only 253 (47%) discussed missingness due to attrition, and only 99 (18%) explicitly discussed whether missingness due to attrition was “missing at random,” “missing completely at random,” or “missing not at random.” Developmental psychology 541 longitudinal studies in major developmental journals, published 2009 and 2012 No
Sterner, 2011, “What Is Missing in Counseling Research? Reporting Missing Data” In the first journal, “14 of 66 (21%) articles referenced missing data on some level. Of these 14 articles, 11 mentioned missing data specifically... In the remaining 52 JCD articles, no information was provided on whether missing data existed.” In the second journal, “one of 28 (4%) empirically based research articles made reference to screening for missing data; however, no mention was made of missing data in the remaining articles” (p. 56). Counseling 94 empirical research articles in two top counseling journals, published 2004 to 2008 No
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” Only 19 out of 49 articles (39%) reported on missing data items at each longitudinal wave, and only 2 out of 42 articles (5%) that had missing data in their analyses described imputation, weighting, or sensitivity analyses for handling missing data. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
Hussain et al., 2017, “Quality of missing data reporting and handling in palliative care trials demonstrates that further development of the CONSORT statement is required: a systematic review” 101 out of 108 studies (94%) reported the number of participants who were missing in the primary outcome analysis; however, reporting rates were lower for other details about missing data and for methods of handling missing data. Epidemiology 108 articles on palliative care randomized controlled trials, published 2009-2014 No
3g) Dataset for evaluation is representative
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” Among several reporting criteria this review examined, “the criteria in the checklist representing selection bias were the least frequently reported overall” (p. 285). Specifically, selection-in biases were discussed in 14 out of 49 articles (28%), comparison of consenters with non-consenters was discussed in 1 out of 47 applicable articles (2%), and loss to follow-up was accounted for in the analyses of 1/41 applicable articles (5%). Additionally, 37 out of 49 articles (75%) discuss how their results relate to the target population. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
MODULE 4: DATA PREPROCESSING
4c) Data transformations
Vandewiele et al., 2021, “Overly optimistic prediction results on imbalanced data: a case study of flaws and benefits when applying over-sampling” Vandewiele et al. analyze 24 papers on pre-term birth prediction and find 11 of these papers improperly transform data (by oversampling before splitting into train and test sets). Medicine 24 papers on pre-term risk prediction Yes
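The oversampling pitfall documented by Vandewiele et al. is mechanical enough that a brief illustration may help. The following Python sketch uses synthetic data and is not code from any of the reviewed papers; the helper naive_oversample and all parameter choices are ours. It contrasts oversampling the full dataset before the train/test split, which places exact copies of minority-class records in both sets and inflates the test score, with oversampling applied only to the training set after the split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # features carry no real signal
y = (rng.random(1000) < 0.05).astype(int)  # ~5% positive (imbalanced) labels

def naive_oversample(X, y, rng):
    """Balance classes by duplicating randomly chosen minority-class rows."""
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    return X[idx], y[idx]

# Leaky pipeline: oversample the full dataset, then split. Exact copies of
# minority-class rows end up in both train and test, so the model can simply
# memorize them and the test score is inflated.
X_os, y_os = naive_oversample(X, y, rng)
X_tr, X_te, y_tr, y_te = train_test_split(X_os, y_os, random_state=0)
leaky = roc_auc_score(
    y_te, RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
)

# Non-leaky pipeline: split first, oversample only the training set. The test
# set now contains no copies of training rows, and the score stays near chance
# (as it should, since the features here are pure noise).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_tr, y_tr = naive_oversample(X_tr, y_tr, rng)
clean = roc_auc_score(
    y_te, RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
)

print(f"AUC with oversampling before the split: {leaky:.2f}")
print(f"AUC with oversampling after the split:  {clean:.2f}")
```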
MODULE 5: MODELING
5d) Model selection method
Neunhoeffer and Sternberg, 2019, “How Cross-Validation Can Go Wrong and What to Do About It” Neunhoeffer and Sternberg demonstrate that the main findings of a prominent political science paper fail to reproduce due to improper model selection. In particular, model selection was done on the same data that was used for evaluation. Political science 1 prominent political science paper Yes
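As a concrete illustration of this distinction, the Python sketch below uses synthetic data; the model family, hyperparameter grid, and variable names are our own and are not taken from the paper that Neunhoeffer and Sternberg analyze. It contrasts scoring every candidate model on the evaluation set and reporting the best of those scores with selecting a model by cross-validation inside the training data and scoring the single selected model once on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
param_grid = {"max_depth": [2, 4, 8, None], "n_estimators": [50, 200]}

# Problematic workflow: every candidate is scored on the evaluation set and the
# best of those scores is reported. The evaluation data has then been used for
# model selection, so the reported number is an optimistic, selected-on-test estimate.
best_on_test = max(
    accuracy_score(
        y_te,
        RandomForestClassifier(max_depth=d, n_estimators=n, random_state=0)
        .fit(X_tr, y_tr)
        .predict(X_te),
    )
    for d in param_grid["max_depth"]
    for n in param_grid["n_estimators"]
)

# Preferable workflow: candidates are compared by cross-validation within the
# training data only; the single selected model is then scored once on the
# held-out evaluation set.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_tr, y_tr)
held_out = accuracy_score(y_te, search.predict(X_te))

print(f"best score when candidates are selected on the test set: {best_on_test:.3f}")
print(f"held-out score of the cross-validation-selected model:   {held_out:.3f}")
```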
5e) Hyper-parameter selection
Dodge et al., 2019, “Show Your Work: Improved Reporting of Experimental Results” Among 50 randomly sampled papers from a prominent natural language processing conference, Dodge et al. find that while 74% of papers reported at least some information about the best-performing hyperparameters, 10% or fewer reported more specific details about the hyperparameter search or the effect of hyperparameters on performance. Natural language processing 50 random papers from a prominent natural language processing conference in 2018 Yes
5f) Appropriate baselines
Sculley et al., 2018, “Winner’s curse? On pace, progress, and empirical rigor” Sculley et al. discuss five papers that provide evidence of improper comparison with baselines in different areas of ML, suggesting that empirical progress in the field can be misleading. ML 5 papers identifying poor performance compared to baselines in different areas of ML Yes
MODULE 6: DATA LEAKAGE
Introduction
Kapoor and Narayanan, 2022, “Leakage and the reproducibility crisis in ML-based science” Kapoor and Narayanan found that leakage affects hundreds of papers across 17 fields. Multi-disciplinary A survey of leakage issues across 17 fields Yes
Train-test separation is maintained
Poldrack et al., 2020, “Establishment of best practices for evidence for prediction: A review” Poldrack et al. find that, of 100 neuropsychiatry studies claiming to predict patient outcomes, 45 reported only in-sample statistical fit as evidence of predictive accuracy. Neuropsychiatry 100 studies published between December 24, 2017 and October 30, 2018, identified in PubMed using the search terms “fMRI prediction” and “fMRI predict” Yes
Dependencies or duplicates between datasets
Roberts et al., 2021, “Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans” Roberts et al. discuss the issue of “Frankenstein” datasets: datasets assembled from multiple other sources of data that can end up using the same data twice, for instance when two datasets that rely on the same underlying data source are combined into a larger dataset. Medicine 62 studies that claimed to diagnose or prognosticate COVID-19 using chest radiographs or CT scans Yes
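One simple guard against pooling the same underlying records twice is to deduplicate by content before any train/test split. The sketch below is our own illustration with hypothetical directory names and file types, not a procedure from Roberts et al.; it removes only byte-identical copies, so re-encoded or cropped versions of the same image would still require near-duplicate (e.g., perceptual-hash) detection.

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """Hash a file's raw bytes, so identical content collides regardless of filename."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def pool_without_exact_duplicates(source_dirs: list[Path], pattern: str = "*.png") -> list[Path]:
    """Pool files from several sources, keeping one file per distinct content hash."""
    seen: dict[str, Path] = {}
    for source in source_dirs:
        for path in sorted(source.rglob(pattern)):
            seen.setdefault(content_hash(path), path)
    return list(seen.values())

# Hypothetical usage: two public imaging collections that may share underlying
# images are pooled only after byte-identical duplicates have been dropped,
# and only then is the pooled set split into training and test data.
pooled = pool_without_exact_duplicates([Path("dataset_a"), Path("dataset_b")])
print(f"{len(pooled)} unique files retained after exact-duplicate removal")
```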
MODULE 7: METRICS AND UNCERTAINTY
7b) Uncertainty estimates
Simmonds et al., 2022, “How is model-related uncertainty quantified and reported in different disciplines?” Simmonds et al. show that, across seven fields, none consistently reported complete model uncertainty, and that the types of uncertainty reported varied by field. Multi-disciplinary 496 studies across 7 fields that included statistical models No
MODULE 8: GENERALIZABILITY AND LIMITATIONS
Introduction
Raji et al., 2022, “The Fallacy of AI Functionality” Raji et al. review real-world applications of technologies that claim to use ML and categorize several ways in which such technologies frequently failed, including “lack of robustness to changing external conditions” (p. 9). Computer science and law (real-world ML applications) 283 cases of failures of technology that claimed to be AI, ML, or data-driven, between 2012 and 2021 Yes
Liao et al., 2021, “Are We Learning Yet? A Meta-Review of Evaluation Failures Across Machine Learning” Liao et al. find that the same types of evaluation failures occur across a wide range of ML tasks and algorithms. They provide a taxonomy of common internal and external validity failures. Computer science 107 “survey papers from computer vision, natural language processing, recommender systems, reinforcement learning, graph processing, metric learning, and more” Yes
Reporting on external validity falls short in past literature
Tooth et al., 2005, “Quality of Reporting of Observational Longitudinal Research” 37 out of 49 papers (75%) discuss how the findings from their sample generalize to their target population, and 26 out of 49 papers (53%) discuss generalizability beyond the target population. Epidemiology & medicine 49 longitudinal studies on strokes in six journals, 1999-2003 No
Bozkurt et al., 2020, “Reporting of demographic data and representativeness in machine learning models using electronic health records” The authors argue that descriptive statistics about the study sample should be reported so that readers can assess how representative the sample is of the target population. They find that of 164 studies that trained ML models with electronic health records data, “Race/ethnicity was not reported in 64%; gender and age were not reported in 24% and 21% of studies, respectively. Socioeconomic status of the population was not reported in 92% of studies.” They also find, “Few models (12%) were validated using external populations” (p. 1878). Medicine 164 studies that trained ML models with electronic health records data Yes
Navarro et al., 2023, “Systematic review finds ‘spin’ practices and poor reporting standards in studies on machine learning-based prediction models” “In the main text, 86/152 (56.6% [95% CI 48.6 - 64.2]) studies made recommendations to use the model in clinical practice, however, 74/86 (86% [95% CI 77.2 - 91.8]) lacked external validation in the same article. Out of the 13/152 (8.6% [95% CI 5.1 - 14.1]) studies that recommended the use of the model in a different setting or population, 11/13 (84.6% [95% CI 57.8 - 95.7]) studies lacked external validation” (p. 104). Epidemiology & medicine 152 articles on diagnostic or prognostic prediction models across medical fields, published 2018-2019 Yes