University of Cambridge scientists have created two new databases, one for dye-sensitized solar cells and one for perovskite solar cells. They used the ChemDataExtractor text-mining toolkit to collect the data.
Researchers from the University of Cambridge have created two automatically generated databases presenting photovoltaic properties and device material data for dye-sensitized solar cells (DSCs) and perovskite solar cells (PSCs).
The scientists used the ChemDataExtractor text-mining toolkit, which they described as a “chemistry-aware” natural-language-processing (NLP) tool. Applied to 25,720 scientific articles, it produced 660,881 data entries representing 57,678 photovoltaic devices. The database for the dye-sensitized devices included 475,045 entries organized into 41,680 records. The one for perovskite cells included 185,836 entries organized into 15,818 records.
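For a sense of scale, the reported counts can be turned into a rough entries-per-record figure with a few lines of Python. This is only an illustrative sketch based on the numbers quoted above; the dictionary layout and variable names are assumptions made here and are not part of ChemDataExtractor or the published databases.

```python
# Entry and record counts as reported in the article.
# The structure and names below are illustrative only.
databases = {
    "DSC": {"entries": 475_045, "records": 41_680},
    "PSC": {"entries": 185_836, "records": 15_818},
}

# Combined number of data entries across both databases.
total_entries = sum(db["entries"] for db in databases.values())

# Average number of data entries attached to each device record.
entries_per_record = {
    name: db["entries"] / db["records"] for name, db in databases.items()
}

for name, ratio in entries_per_record.items():
    print(f"{name}: {ratio:.1f} entries per record")
print(f"Total entries: {total_entries:,}")
```

On these figures, both databases average between 11 and 12 data entries per device record.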
“Such a database could also reveal information about the variation observed in popular structures that have been synthesized multiple times in different studies, to glean information on the underlying variation for that particular architecture,” they said.
Extraction pipeline to create database records from a research article
Image: University of Cambridge, Scientific Data, Creative Commons License CC BY 4.0
The researchers claim their multifaceted evaluation approach ensured data quality, with precision metrics ranging from 73.1% to 95.8%.
“It is interesting to note that the accuracy of data extracted for the PSC database exceeded that of the DSC database on both metrics,” they said. “This is surprising since the parsers for the photovoltaic properties and the logic for calculating derived properties are the same in both cases.”
The research team described their findings in “Perovskite- and Dye-Sensitized Solar-Cell Device Databases Auto-generated Using ChemDataExtractor,” which was recently published in Scientific Data.