ScreenIT Quality Assurance Results

Author

Vladislav Nachev

Comparison of COVID preprint analyses with ScreenIT before and after the update

The original ScreenIT pipeline continuously screened COVID preprints from January 2020 to June 2022. In March 2021, a major change to the pipeline code occurred, introducing new categories and changing the output schema. The data set therefore needs to be rescreened using the same version of the pipeline. This report performs quality assurance on the updated version of the pipeline before the whole data set is screened with it. Data prior to the update were taken from the latest database. Screenings with the updated version were done via a different API that only sees the PDFs and no metadata from the preprint servers. Two hundred preprints were randomly selected for the comparison, but due to bugs in the pipeline, four preprints could not be screened with the updated version, resulting in a total of 196 screened preprints.
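The comparison throughout this report is based on matching the outputs of the two pipeline versions per preprint and per metric, and manually validating only the discrepant cases. A minimal sketch of that matching step in Python, assuming hypothetical CSV exports old_results.csv and new_results.csv with columns doi, tool, metric, and value (not the actual pipeline output format), could look like this:

    import pandas as pd

    # Hypothetical exports of the two pipeline runs; file and column names are assumptions.
    old = pd.read_csv("old_results.csv")   # columns: doi, tool, metric, value
    new = pd.read_csv("new_results.csv")

    # Align the two runs per preprint, tool, and metric.
    merged = old.merge(new, on=["doi", "tool", "metric"], suffixes=("_old", "_new"))

    # Keep only the discrepant cases; these go to manual validation.
    discrepant = merged[merged["value_old"] != merged["value_new"]]
    print(discrepant.groupby(["tool", "metric"]).size())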

Sciscore Results

Compared to the previous version, the updated version more often returned “not required” for the ethics statement, for example for modeling papers. However, this also affected all downstream analyses, even in cases where statements related to, e.g., randomization or attrition were actually present (Figure 1). In addition, a couple of funding statements were incorrectly detected as ethics statements; attrition had a few false positives, blinding a few false negatives, and power analysis a couple of false positives and one false negative.

Figure 1: Comparison between Sciscore results in the previous and updated pipeline versions. Only cases where the two versions yielded discrepant results are shown. Cases are also split by the result of the manual validation, with true positives shown as solid bars and true negatives shown as striped bars.

rtransparent Results

All (100%) of the preprints in the data set had conflict of interest statements and funding statements; however, only some included these in the PDF of the manuscript (Figure 2). As the updated version screened only the PDF input, the manual assessment was also based only on the text in the PDF. The updated version of the pipeline missed COI statements if they had non-standard headings (e.g. “conflicts:” or no section title at all) or if they appeared on the first page of the manuscript. A check of the extracted text in the pipeline container under /temp/all_text/ showed that some papers were missing the first page of text. This omission also affected other metrics where the relevant text was on the first page.

The updated version of the pipeline also missed funding statements if they had non-standard headings (e.g. “financial disclosure”, “financing”, “funding/support”, etc.) or if the funding information was in the acknowledgements or on the first page of the PDF (see previous paragraph).
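A quick way to check for this kind of first-page truncation is to compare a snippet of each PDF’s first page against the text dump in the container. The sketch below uses PyMuPDF and assumes a one-text-file-per-preprint naming scheme under /temp/all_text/ (the actual file layout may differ):

    import pathlib
    import fitz  # PyMuPDF

    PDF_DIR = pathlib.Path("pdfs")                 # local copies of the screened preprints
    TEXT_DIR = pathlib.Path("/temp/all_text")      # extracted text inside the pipeline container

    def normalize(s: str) -> str:
        return " ".join(s.split())

    for pdf_path in PDF_DIR.glob("*.pdf"):
        with fitz.open(pdf_path) as doc:
            # Short snippet from the first page of the original PDF.
            snippet = normalize(doc[0].get_text())[:200]

        # Assumed naming convention: same file stem, .txt extension.
        txt_path = TEXT_DIR / (pdf_path.stem + ".txt")
        extracted = normalize(txt_path.read_text(errors="ignore")) if txt_path.exists() else ""

        if snippet and snippet not in extracted:
            print(f"{pdf_path.name}: first page appears to be missing from the extracted text")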

The updated version of the pipeline did not detect registration numbers in 16 cases where the previous version did. Most of these (14) were correct calls; in the remaining two cases a PROSPERO registration number was cited but missed by the updated pipeline version.
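PROSPERO identifiers follow a fixed pattern (“CRD” followed by eleven digits, e.g. CRD42020123456), so missed citations of this kind are easy to flag with a plain text search. The following is only a rough check of that kind, not the detection logic used by rtransparent:

    import re

    # Rough PROSPERO pattern: "CRD" followed by eleven digits (e.g. CRD42020123456).
    # Illustration only; this is not the rtransparent detection logic.
    PROSPERO_RE = re.compile(r"\bCRD\d{11}\b")

    text = "The review protocol was registered with PROSPERO (CRD42020123456)."
    print(PROSPERO_RE.findall(text))  # ['CRD42020123456']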

Figure 2: Comparison between rtransparent results in the previous and updated pipeline versions. Only cases where the two versions yielded discrepant results are shown. Cases are also split by the result of the manual validation, with true positives shown as solid bars and true negatives shown as striped bars.

limitation-recognizer Results

There were only three discrepancies between the previous and updated pipeline versions (Figure 3). In all three cases, the updated version caught limitations that the previous version did not.

Figure 3: Comparison between limitation-recognizer results in the previous and updated pipeline versions. Only cases where the two versions yielded discrepant results are shown. Cases are also split by the result of the manual validation, with true positives shown as solid bars and true negatives shown as striped bars.

TrialIdentifier Results

The updated version of TrialIdentifier yielded several false positives (Figure 4). Some of these were grant numbers or accession numbers given in supplemental tables, while the majority were EUDRA numbers falsely detected from DOIs in the reference section (Figure 5).

Figure 4: Comparison between TrialIdentifier results in the previous and updated pipeline versions. Only cases where the two versions yielded discrepant results are shown.

Figure 5: False positive detection by TrialIdentifier for EUDRA 201501087122.
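False positives of the kind shown in Figure 5 are plausible because EudraCT numbers are, in essence, a year followed by a run of digits (the format is YYYY-NNNNNN-CC), and a sufficiently permissive matcher will also pick up digit runs inside DOIs. The sketch below illustrates the effect with a deliberately naive pattern; it is not the regular expression used by TrialIdentifier:

    import re

    def eudract_like_hits(text: str):
        # Deliberately naive: strip all non-digits, then look for twelve digits
        # starting with a plausible year (YYYY NNNNNN CC). Illustration only.
        digits = re.sub(r"\D", "", text)
        return re.findall(r"(?:19|20)\d{10}", digits)

    doi_like = "doi: 10.1101/2020.05.11.087122"   # made-up, DOI-like reference string
    print(eudract_like_hits(doi_like))            # ['202005110871'] - a spurious hit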

JetFighter Results

The updated JetFighter version detected seven papers that the previous version did not (Figure 6). In addition, it produced a false positive on the fluorescence microscopy image shown in Figure 7.

Figure 6: Comparison between JetFighter results in the previous and updated pipeline versions. Only cases where the two versions yielded discrepant results are shown.

Figure 7: False positive detection by JetFighter.

Barzooka Results

In addition to the comparison of the previous and updated versions of the pipeline (for all tools listed above), we also compared the performance of Barzooka on two different types of input: the image files extracted individually during pipeline processing vs. a folder of PDFs of the same preprints. Thus, the main difference was the level of analysis, either image-based (Barzooka in the pipeline) or page-based (stand-alone Barzooka). Two hundred papers were screened with both Barzooka versions (pipeline and stand-alone), and the cases where the two versions disagreed were manually validated.

Discrepancies between the two Barzooka versions regarding the presence or absence of a figure type were detected in 103 out of 200 papers, and they occurred for all figure types (Figure 8).

Figure 8: Comparison between Barzooka results from the stand-alone (yellow) and pipeline (purple) versions. Only cases where the two versions yielded discrepant results are shown. Cases are also split by the result of the manual validation, with true positives shown as solid bars and true negatives shown as striped bars.

For most categories, especially “approp”, “bardot”, “dot”, and “pie”, the stand-alone version generally delivered better results (Figure 8). This is therefore the recommended way to use the tool, and applying it to separately extracted image files should be avoided. For the stand-alone version, the occasional errors in the “bar” and “approp” categories were due to proportional data not being recognized as such, or to histograms or bardots being misclassified. The stand-alone version also detected “hist” images more readily than the pipeline version, although some of these detections were false positives. There were no false negatives and only a few false positives for the “dot” and “bardot” categories; commonly misidentified were dot plots with whiskers, scatter plots, or bardots with barely any bars visible. Similarly, many densely packed dot plots or box plots were mistaken for “violin” plots. Finally, several gene structure schematics and symbol-and-whisker plots with large squares were mistakenly classified as “box”.

A closer look at some of the images extracted in the pipeline container under /temp/images revealed a number of issues that would explain the discrepant results in the above comparison (see also the sketch after Figure 10). First, some figures were not extracted at all and were therefore never screened. Second, images were frequently extracted without the text layer (Figure 9), which may contain crucial information such as whether the y-axis displays counts or proportions. Third, images were sometimes extracted from individual layers or subsections (e.g. only the legend) of the original figure, resulting in several incomplete pieces of the figure (Figure 10). Finally, some extraneous non-figure images, such as logos, were extracted as well.

Figure 9: Image extracted with the text layer stripped.

(a) Original figure

(b) Layer 1

(c) Layer 2

Figure 10: Pipeline image extraction may result in multiple image files from layers of the same original figure. Note that the text layer is completely stripped and some colors have been inverted.
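These artefacts are consistent with extracting embedded raster objects from the PDF rather than rendering whole pages: axis labels, legends, and other annotations are typically drawn as separate text and vector operators on top of the raster, so the raster object alone comes out without them, and figures assembled from several objects come out in pieces. The sketch below contrasts the two approaches using PyMuPDF on a hypothetical local file preprint.pdf; the pipeline’s actual extraction code may differ:

    import fitz  # PyMuPDF

    doc = fitz.open("preprint.pdf")  # hypothetical local file

    # Approach 1: extract embedded raster objects (image-based input).
    # Text and vector overlays are not part of the raster, so the saved files
    # lack axis labels, legends, etc., and one figure may yield several files.
    for page in doc:
        for img in page.get_images(full=True):
            xref = img[0]
            raster = doc.extract_image(xref)
            with open(f"page{page.number}_xref{xref}.{raster['ext']}", "wb") as fh:
                fh.write(raster["image"])

    # Approach 2: render each full page to an image (page-based input, analogous
    # to the stand-alone Barzooka run). The rendered page keeps the text layer.
    for page in doc:
        pix = page.get_pixmap(dpi=150)
        pix.save(f"page{page.number}.png")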

Acknowledgements

Nico Riedel

Peter Eckmann

Anita Bandrowski

Robert Schulz

Parya Abbasi