Multi-Label Topic Classification for COVID-19 Literature

Citation Information

Shuo Xu, Yuefu Zhang, Liang Chen, and Xin An, 2024. Is Metadata of Articles about COVID-19 enough for Multi-Label Topic Classification Task? Database, Vol. 2024, pp. baae106. Dataset
Qingyu Chen, Alexis Allot, Robert Leaman, Rezarta Islamaj, Jingcheng Du, Li Fang, Kai Wang, Shuo Xu, Yuefu Zhang, Parsa Bagherzadeh, Sabine Bergler, Aakash Bhatnagar, Nidhir Bhavsar, Yung-Chun Chang, Sheng-Jie Lin, Wentai Tang, Hongtong Zhang, Ilija Tavchioski, Senja Pollak, Shubo Tian, Jinfeng Zhang, Yulia Otmakhova, Antonio Jimeno Yepes, Hang Dong, Honghan Wu, Richard Dufour, Yanis Labrak, Niladri Chatterjee, Kushagri Tandon, Fréjus A. A. Laleye, Loïc Rakotoson, Emmanuele Chersoni, Jinghang Gu, Annemarie Friedrich, Subhash Chandra Pujari, Mariia Chizhikova, Naveen Sivadasan, Saipradeep VG, and Zhiyong Lu, 2022. Multi-Label Classification for Biomedical Literature: An Overview of the BioCreative VII LitCovid Track for COVID-19 Literature Topic Annotations. Database, Vol. 2022, pp. baac069.
Shuo Xu, Yuefu Zhang, and Xin An, 2021. Team BJUT-BJFU at BioCreative VII LitCovid Track: A Deep Learning based Method for Multi-label Topic Classification in COVID-19 Literature. Proceedings of the BioCreative VII Challenge Evaluation Workshop, pp. 275-277.

Requirements

Dataset

BC7-LitCovid, from the LitCovid track: Multi-label topic classification for COVID-19 literature annotation
LitCovid (FTP), LitCovid (Web Site)
BioC-PMC (FTP), BioC-PMC (Web Site)

Create Database

The database SQL file: bc7_lit_covid.sql.

Import Metadata

Import the training, development, and test sets of the BC7-LitCovid dataset into the database by running MetadataImporter.java in the package cn.edu.bjut.ui.

Update DOI

> UPDATE article SET doi = "10.3760/CMA.J.CN112138-20200221-00114" WHERE id = 353; 
> UPDATE article SET doi = "10.5830/CVJA-2020-016" WHERE id = 21359;

Merge Articles

One can run ArticleDoiChecker.java in the package cn.edu.bjut.ui to check whether multiple articles share the same DOI. In our case, 26 pairs of articles were found to share a DOI. One can merge these articles by running ArticleDoiMerger.java in the package cn.edu.bjut.ui.
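
The duplicate-DOI check can be illustrated with a minimal Python sketch. It is not the logic of ArticleDoiChecker.java; the function name `find_shared_dois` and the case-insensitive comparison are assumptions made for illustration.

```python
from collections import defaultdict


def find_shared_dois(articles):
    """Group article ids by normalized DOI and keep DOIs used by 2+ articles.

    `articles` is an iterable of (article_id, doi) pairs; doi may be None.
    """
    by_doi = defaultdict(list)
    for article_id, doi in articles:
        if not doi:
            continue  # articles without a DOI cannot collide
        # Assumption: DOIs are compared case-insensitively, ignoring padding.
        by_doi[doi.strip().upper()].append(article_id)
    return {doi: sorted(ids) for doi, ids in by_doi.items() if len(ids) > 1}


# Hypothetical rows, as they might come from: SELECT id, doi FROM article;
rows = [(1, "10.1000/abc"), (2, "10.1000/ABC "), (3, "10.1000/xyz"), (4, None)]
shared = find_shared_dois(rows)
print(shared)
```

Each key of the result is a DOI attached to more than one article, and each value lists the candidate article ids to merge.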

Update PMC ID

Download PMC-ids-csv.gz and save it in the directory resource.

Then run ArticlePmcIdUpdater.java in the package cn.edu.bjut.ui.
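
As a rough sketch of what such an update needs, the Python snippet below builds a PMID-to-PMCID lookup from the mapping file. The column names `PMID` and `PMCID` are an assumption about the CSV header, and `read_pmc_ids` is a hypothetical helper, not part of this repository.

```python
import csv
import io


def read_pmc_ids(lines):
    """Build {pmid: pmc_id} from CSV text lines that start with a header row."""
    mapping = {}
    for row in csv.DictReader(lines):
        pmid = (row.get("PMID") or "").strip()
        pmcid = (row.get("PMCID") or "").strip()
        if pmid and pmcid:  # skip records lacking either identifier
            mapping[pmid] = pmcid
    return mapping


# Made-up sample rows standing in for the real file.
sample = io.StringIO(
    "DOI,PMCID,PMID\n"
    "10.1000/abc,PMC0000001,90000001\n"
    "10.1000/xyz,PMC0000002,\n"
)
pmid_to_pmcid = read_pmc_ids(sample)
print(pmid_to_pmcid)
```

For the real file, the same function can be given a handle opened with `gzip.open(path, "rt", encoding="utf-8", newline="")` from the standard library.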

Update Other Information

Download bibliographic data in batches in BibTeX format from the Web of Science, using the DOI names extracted with the following SQL statement:

> SELECT DISTINCT doi FROM article WHERE doi IS NOT NULL;

Save the downloaded data in the directory data/WoS. The following information can then be updated from that directory by running ArticleBibTexUpdater.java in the package cn.edu.bjut.ui: WOS ID, language, publisher, journal (full_name, Print ISSN, and Online ISSN), keyword-plus, category, research area, and so on.

> UPDATE article SET language_code = "zh" WHERE doi LIKE "10.3760/CMA%";

Import Entity Annotations

From LitCovid

The biological entity annotations in the BC7-LitCovid dataset can be imported to the database by running LitCovidAnnotationImporter.java in the package cn.edu.bjut.ui.

Optionally, these full texts can be converted to BioC-XML format by running LitCovidToBioCConventor.java in the package cn.edu.bjut.ui. The resulting XML files are saved in the directory data/bioc-litcovid.

From PubTator Annotations on BioC-PMC Fulltexts

Extract XML files by running ArticleBioCPMCExtractor.java in the package cn.edu.bjut.pubtator, which uses the BioC-PMC API, and save them in the directory data/bioc-pmc.

Not all articles have full texts available through the BioC-PMC API. To eliminate these articles, the extracted XML files smaller than 10 KB are checked manually one by one, and the corresponding PMIDs are saved in the file data/pmid-removed.list. These files are then deleted, and the field full_text_source in the database is updated accordingly by running BioCPMCRemover.java in the package cn.edu.bjut.pubtator.

The requests are submitted to the PubTator server by running PmcTextSubmitRequester.java in the package cn.edu.bjut.pubtator. At this step, a session number is obtained for each XML file.
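
Listing the files that need manual inspection can be sketched in Python as follows. Reading 10 KB as 10 * 1024 bytes and the `.xml` file naming are assumptions, and `small_xml_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

SIZE_THRESHOLD = 10 * 1024  # assumption: "10KB" read as 10 * 1024 bytes


def small_xml_files(directory, threshold=SIZE_THRESHOLD):
    """Return the XML files in `directory` smaller than `threshold` bytes."""
    return sorted(
        path
        for path in Path(directory).glob("*.xml")
        if path.is_file() and path.stat().st_size < threshold
    )


# Demonstration on a temporary directory with one tiny and one large file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "111.xml").write_text("<collection/>")
    (Path(tmp) / "222.xml").write_text("x" * 20000)
    candidates = [path.name for path in small_xml_files(tmp)]
print(candidates)
```

Calling `small_xml_files("data/bioc-pmc")` would list the candidates for the manual check; deciding which of them really lack a full text remains a manual step.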

The annotated entities can be retrieved by running PmcTextSubmitRetriever.java in the package cn.edu.bjut.pubtator. The results are stored in the directory data/bioc-pmc-pubtator.

The annotated entities can be imported to the database by running BioCPmcAnnotationImporter.java in the package cn.edu.bjut.ui from the directory data/bioc-pmc-pubtator.

From PubTator Annotations on Manual Fulltexts

As mentioned in the previous section, not all articles have full texts available through the BioC-PMC API. These articles can be retrieved by running the following SQL statement:

> SELECT id, pmid, title, abstract, doi FROM article WHERE full_text_source = "MANUAL";

In our case, there are 7,028 such articles in total. To obtain their full texts, each passage can be copied manually into an Excel file. Then, one can run FullTextWithExcelImporter.java in the package cn.edu.bjut.ui to import these full texts into the database.

These full texts are converted to BioC-XML format by running CustomTextToBioCConventor.java in the package cn.edu.bjut.ui. The resulting XML files are saved in the directory data/bioc-custom.

The requests are submitted to the PubTator server by running CustomTextSubmitRequester.java in the package cn.edu.bjut.pubtator. At this step, a session number is obtained for each XML file.

The annotated entities can be retrieved by running CustomTextSubmitRetriever.java in the package cn.edu.bjut.pubtator. The results are stored in the directory data/bioc-custom-pubtator.

The annotated entities can be imported to the database by running BioCCustomAnnotationImporter.java in the package cn.edu.bjut.ui from the directory data/bioc-custom-pubtator.

Data Cleaning

> UPDATE article_annotation SET text = REGEXP_REPLACE(text, "[ | | | | | |\n]", " ");
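
The same kind of clean-up can be sketched in Python. The exact characters in the SQL character class above are not recoverable here, so the set below (no-break, en, em, thin, hair, narrow no-break, and ideographic spaces, plus tab and line breaks) is an assumption. Inside a bracket expression `|` is a literal character, so the sketch lists the characters without separators.

```python
import re

# Assumption: the clean-up targets Unicode space variants and line breaks.
_SPACE_LIKE = re.compile(r"[\u00a0\u2002\u2003\u2009\u200a\u202f\u3000\t\r\n]")


def normalize_spaces(text):
    """Replace each space-like character with one ASCII space."""
    return _SPACE_LIKE.sub(" ", text)


cleaned = normalize_spaces("ACE2\u00a0receptor\nbinding")
print(cleaned)
```

Because every match is replaced by exactly one character, the text length is unchanged, so character offsets stored alongside the annotations would still line up.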

Update Author

> SELECT article.id AS article_id, pmid, pmc_id, title, doi, author.id AS author_id, seq_no, full_name, last_name, first_name FROM article, article_author, author WHERE article_author.article_id = article.id AND article_author.author_id = author.id AND pmid IN ("32105052", "32133830", "32141280", "32149484", "32216961", "32265220", "32311431", "32319971", "32327229", "32337192", "32362243", "32365221", "32369656", "32373991", "32376398", "32392129", "32422410", "32463434", "32495923", "32511704", "32512291", "32524843", "32530033", "32530813", "32531110", "32531138", "32532430", "32541232", "32544034", "32549072", "32558644", "32574896", "32593742", "32641989", "32731151", "32732190", "32770466", "32804103", "32804122", "32820721", "32829601", "32831176", "32835573", "32837678", "32848097", "32865183", "32865184", "32870139", "32873575", "32876113", "32876697", "32887691", "32893646", "32901732", "32915172", "32917566", "32918858", "32921703", "32949380", "32949881", "32978251", "32986819", "33014380", "33032267", "33042553", "33045362", "33071427", "33074221", "33076590", "34211521", "34225090", "34226470", "34340970", "34348116", "34375647", "34376927")

Update MeSH Headings

Download the MeSH data in XML format, and then import it into the database by running MeshHeadingImporter.java in the package cn.edu.bjut.ui.

Extract XML files by running ArticleEFetchExtractor.java in the package cn.edu.bjut.ui, which uses the E-Fetch API, and save them in the directory data/mesh.

Import the MeSH heading information into the database from the directory data/mesh by running MeshHeadingUpdater.java in the package cn.edu.bjut.ui. Note that the corresponding publication years and the XML fragment for each author are also updated at this step.
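
To show the kind of information read from the E-Fetch output, the Python sketch below pulls MeSH descriptors out of a PubMed XML record. It is not the logic of MeshHeadingUpdater.java; the sample record and its PMID are made up, and `extract_mesh_headings` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def extract_mesh_headings(xml_text):
    """Return {pmid: [(descriptor_ui, descriptor_name, is_major), ...]}."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        headings = []
        path = "MedlineCitation/MeshHeadingList/MeshHeading"
        for heading in article.iterfind(path):
            descriptor = heading.find("DescriptorName")
            if descriptor is None:
                continue  # tolerate malformed headings
            is_major = descriptor.get("MajorTopicYN") == "Y"
            headings.append((descriptor.get("UI"), descriptor.text, is_major))
        result[pmid] = headings
    return result


# Made-up record following the PubMed XML layout.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">99999999</PMID>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName UI="D000086382" MajorTopicYN="Y">COVID-19</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

mesh = extract_mesh_headings(SAMPLE)
print(mesh)
```

Articles that have not been indexed yet carry no MeshHeadingList element and simply yield an empty list.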

Update Publication Year

> SELECT id, pmid, title, doi, publication_year FROM article WHERE publication_year = 0;

Export the above records to article_publication_year.xlsx in the directory data, and then correct them manually one by one.

Once the corrections are done, run ArticlePublicationYearUpdater.java in the package cn.edu.bjut.ui to import the corrected years from the file data/article_publication_year.xlsx into the MySQL database.

Update Affiliation

The affiliation information can be updated from the XML fragment by running ArticleAffiliationRawImporter.java in the package cn.edu.bjut.ui. Note that the ORCID and email of each author are also updated at this step.

LabelSet Statistics

One can obtain the labelset statistics by running LabelSetSummary.java in the package cn.edu.bjut.ui.
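
The statistics usually reported for a multi-label dataset can be sketched in Python as below. Which figures LabelSetSummary.java actually prints is not stated here, so the selection (distinct label sets, label cardinality, label density, most frequent label set) is an assumption, and `labelset_summary` is a hypothetical helper.

```python
from collections import Counter


def labelset_summary(label_lists):
    """Summarize a multi-label dataset given one list of labels per document."""
    labelsets = [frozenset(labels) for labels in label_lists]
    n_docs = len(labelsets)
    all_labels = set().union(*labelsets) if labelsets else set()
    counts = Counter(labelsets)
    # Label cardinality: average number of labels per document.
    cardinality = sum(len(s) for s in labelsets) / n_docs if n_docs else 0.0
    # Label density: cardinality divided by the number of distinct labels.
    density = cardinality / len(all_labels) if all_labels else 0.0
    return {
        "documents": n_docs,
        "labels": len(all_labels),
        "distinct_labelsets": len(counts),
        "cardinality": cardinality,
        "density": density,
        "most_common": counts.most_common(1),
    }


# Hypothetical documents using LitCovid-style topic labels.
docs = [
    ["Treatment", "Mechanism"],
    ["Prevention"],
    ["Mechanism", "Treatment"],
    ["Case Report", "Diagnosis"],
]
summary = labelset_summary(docs)
print(summary)
```

Label sets are stored as frozensets, so the order in which the labels of a document are listed does not matter.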

Export for Deep Learning

One can run CSVExporter.java in the package cn.edu.bjut.ui to export all related information for deep learning.

In addition, the entities and MeSH headings can be exported separately by running EntityExporter.java and MeshHeadingExporter.java, respectively, in the package cn.edu.bjut.ui.

Direct and Indirect Citations

> SELECT DISTINCT doi FROM article WHERE doi IS NOT NULL ORDER BY doi ASC;

With the help of OpenCitations, the DOIs of citing and cited articles can be retrieved from a DOI list file by running the following command.

> python retrieve.py bc7_lit_covid.csv bc7_lit_covid

The forward and backward citations can be imported to the database by running DirectCitationImporter.java in the package cn.edu.bjut.ui. Then, one can obtain indirect citations (co-citation and bibliographic coupling) by running IndirectCitationUpdater.java in the package cn.edu.bjut.ui.
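
How indirect citations follow from direct ones can be sketched in Python. This is a minimal illustration and not the logic of IndirectCitationUpdater.java; `indirect_citations` is a hypothetical helper, and the strengths are plain counts of shared citing or cited articles.

```python
from collections import defaultdict
from itertools import combinations


def indirect_citations(edges):
    """Derive co-citation and bibliographic coupling counts.

    `edges` is an iterable of (citing, cited) pairs. Both results map a
    sorted pair of articles to the number of articles linking them.
    """
    references = defaultdict(set)  # citing article -> articles it cites
    cited_by = defaultdict(set)    # cited article -> articles citing it
    for citing, cited in edges:
        references[citing].add(cited)
        cited_by[cited].add(citing)

    def pair_counts(index):
        counts = defaultdict(int)
        for members in index.values():
            for first, second in combinations(sorted(members), 2):
                counts[(first, second)] += 1
        return dict(counts)

    # Two articles are co-cited when a third article cites both of them.
    co_citation = pair_counts(references)
    # Two articles are coupled when both cite the same third article.
    coupling = pair_counts(cited_by)
    return co_citation, coupling


edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("E", "C")]
co_citation, coupling = indirect_citations(edges)
print(co_citation)
print(coupling)
```

In this toy example, C and D are co-cited twice (by A and by B), and A and B are coupled with strength 2 because they share both references.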
