Friday, April 29, 2016

How many papers said "All data are within the paper"?

When PLoS announced its data policy requiring that all data be made publicly available, everyone applauded. It was a big step toward open science and data sharing. Then I kept seeing PLoS papers saying: "Data Availability: All relevant data are within the paper." What this statement is supposed to mean has been explained in a PLOS blog:
All data must be in one of three places:
the body of the manuscript; this may be appropriate for studies where the dataset is small enough to be presented in a table
in the supporting information; this may be appropriate for moderately-sized datasets that can be reported in large tables or as compressed files, which can then be downloaded
in a stable, public repository that provides an accession number or digital object identifier (DOI) for each dataset; there are many repositories that specialize in specific data types, and these are particularly suitable for very large datasets

So "All relevant data are within the paper" only applies when "the dataset is small enough to be presented in a table". But I have seen neuroimaging studies that also said "All relevant data are within the paper", even though their raw data could be as large as several hundred MB or even several GB. It is simply not possible to include data of that size in the paper or in the supporting files. Just to give an example, this one.

Then the question becomes: how many papers published in PLoS ONE said "All data are within the paper"? First, I searched PubMed for all the PLoS ONE papers published since the start of 2014, and downloaded the results as a CSV file. From the CSV file, I obtained the DOI (digital object identifier) of every paper. A paper's URL is dx.doi.org/ followed by its DOI. For example, if a paper's DOI is 10.1371/journal.pone.0143126, then the URL for this paper is http://dx.doi.org/10.1371/journal.pone.0143126. With these URLs, I wrote a web crawler to download all the papers published in PLoS ONE since 2014. The crawler is a simple MATLAB script that uses the urlread command to download each paper. Since 2014, PLoS ONE has published more than 60,000 papers, so it took several nights to download them all.
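The DOI-to-URL step can be sketched in a few lines. My crawler was written in MATLAB with urlread; the version below is only a rough Python equivalent for illustration, not the script I actually ran. The column name "DOI" and the function names are assumptions, so adjust them to whatever the exported CSV really contains.

```python
import csv
import io
import urllib.request

DOI_RESOLVER = "http://dx.doi.org/"  # resolver prefix described in the post


def doi_to_url(doi):
    """Build the paper URL by appending the DOI to the resolver prefix."""
    return DOI_RESOLVER + doi.strip()


def urls_from_csv(csv_text, doi_column="DOI"):
    """Collect paper URLs from CSV text; 'DOI' as column name is an assumption."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the DOI field is missing or empty.
    return [doi_to_url(row[doi_column]) for row in reader if row.get(doi_column)]


def download(url, timeout=30):
    """Fetch one page as text (rough Python counterpart of MATLAB's urlread)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, doi_to_url("10.1371/journal.pone.0143126") gives http://dx.doi.org/10.1371/journal.pone.0143126. With more than 60,000 papers, it is polite to pause between requests, which is part of why the download takes several nights.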

There are two types of data statements that I think are questionable. Some papers said "All relevant data are within the paper", and some said "All relevant data are within the paper and its Supporting Information files". I searched for these exact sentences in all the downloaded papers. The two categories are mutually exclusive, so each paper was counted in only one of them. The figure below shows the number of papers in every month since the start of 2014 that said "All relevant data are within the paper" (blue bars), and "All relevant data are within the paper and its Supporting Information files" (red bars). The green bars show the total number of publications each month. Only research articles were counted.
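One detail matters when searching for the two sentences: the first is a prefix of the second, so the longer one has to be checked first to keep the categories exclusive. Here is a minimal Python sketch of that logic. It is an illustration, not my original MATLAB code, and the category labels and function names are made up for the example.

```python
from collections import Counter

PAPER_ONLY = "All relevant data are within the paper"
PAPER_AND_SI = (
    "All relevant data are within the paper and its Supporting Information files"
)


def classify(text):
    """Assign a paper to exactly one category based on its data statement."""
    # Check the longer sentence first, because PAPER_ONLY is a prefix of it.
    if PAPER_AND_SI in text:
        return "paper_and_si"
    if PAPER_ONLY in text:
        return "paper_only"
    return "other"


def count_by_month(papers):
    """Tally categories per month from (month, text) pairs, e.g. ('2015-03', html)."""
    counts = {}
    for month, text in papers:
        counts.setdefault(month, Counter())[classify(text)] += 1
    return counts
```

The per-month tallies returned by count_by_month are what the bar charts are built from.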


The figure below shows the proportions of papers saying "All relevant data are within the paper" (blue), "All relevant data are within the paper and its Supporting Information files" (red), and the total proportion of these two categories together (green). I previously posted a figure on Twitter with only the green bars, in which I did not distinguish these two categories. It can be seen that authors quickly learned to use these statements, and that the combined proportion of such papers is quite stable (over 60%).


Since the data look stable in 2015, I will just give some numbers for that year. In 2015 alone, PLoS ONE published 28,104 research articles. 7,073 papers (25.17%) said "All relevant data are within the paper", and 11,506 papers (40.94%) said "All relevant data are within the paper and its Supporting Information files". Together, 66.11% of the research articles published in 2015 said the data are either "within the paper" or "within the paper and its Supporting Information files". I have compiled the results into an Excel file, and uploaded it here.
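For anyone who wants to check the percentages, the arithmetic is simple enough to reproduce. The short Python snippet below uses only the 2015 counts given above; the variable and function names are just for illustration.

```python
TOTAL_2015 = 28104    # research articles published in PLoS ONE in 2015
PAPER_ONLY = 7073     # "within the paper"
PAPER_AND_SI = 11506  # "within the paper and its Supporting Information files"


def percent(count, total):
    """Share of total as a percentage, rounded to two decimals."""
    return round(100.0 * count / total, 2)


share_paper_only = percent(PAPER_ONLY, TOTAL_2015)
share_paper_and_si = percent(PAPER_AND_SI, TOTAL_2015)
share_combined = percent(PAPER_ONLY + PAPER_AND_SI, TOTAL_2015)

print(share_paper_only, share_paper_and_si, share_combined)
```

The combined count is 7,073 + 11,506 = 18,579 papers out of 28,104.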

I have to say that not all the papers saying "All relevant data are within the paper" are questionable. Certain types of research, e.g. meta-analyses, do not use large datasets. And some studies analyze publicly available datasets, so no original data were generated. But it is likely that a large number of these papers should have made their data available and did not.

PLoS ONE has a good data policy. But it is not clear how the policy is enforced. I know that some editors ask the authors to share their data before the paper is accepted. But do all the editors do this? I have also publicly asked the authors of a paper for their data, and neither the authors nor the PLoS ONE editors replied. So, is this data policy a joke?

Data Availability: All relevant data have been uploaded to figshare.