The database SQL file: bc7_lit_covid.sql.
Import the training, development, and test sets of the BC7-LitCovid dataset into the database by running MetadataImporter.java in the package cn.edu.bjut.ui.
> UPDATE article SET doi = "10.3760/CMA.J.CN112138-20200221-00114" WHERE id = 353;
> UPDATE article SET doi = "10.5830/CVJA-2020-016" WHERE id = 21359;
One can run ArticleDoiChecker.java in the package cn.edu.bjut.ui to check whether multiple articles share the same DOI. In our case, 26 pairs of articles were found to share a DOI. These articles can be merged by running ArticleDoiMerger.java in the package cn.edu.bjut.ui.
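The duplicate check can be illustrated with a minimal Python sketch. It is not the logic of ArticleDoiChecker.java, which is not shown here; the function name `find_shared_dois` and the input shape (pairs of article id and DOI) are assumptions made for illustration. DOIs are compared case-insensitively, since DOI names are case-insensitive.

```python
from collections import defaultdict


def find_shared_dois(articles):
    """Group article ids by DOI and keep only DOIs attached to 2+ articles.

    `articles` is an iterable of (article_id, doi) pairs (assumed shape).
    Rows without a DOI are ignored.
    """
    groups = defaultdict(list)
    for article_id, doi in articles:
        if doi:
            # DOI names are case-insensitive, so normalize before grouping.
            groups[doi.strip().lower()].append(article_id)
    return {doi: sorted(ids) for doi, ids in groups.items() if len(ids) > 1}
```

Each key of the returned dictionary is a DOI that would need merging, and the value lists the ids of the articles that share it.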
Download PMC-ids-csv.gz and save it in the directory resource.
Then run ArticlePmcIdUpdater.java in the package cn.edu.bjut.ui.
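As a rough illustration of what this step involves, the following Python sketch reads a gzip-compressed CSV and builds a PMID-to-PMCID lookup. The column names `PMID` and `PMCID` are assumptions about the header of the downloaded file, and `load_pmid_to_pmcid` is a hypothetical helper, not part of ArticlePmcIdUpdater.java.

```python
import csv
import gzip


def load_pmid_to_pmcid(source):
    """Map PMID -> PMCID from a gzip-compressed CSV.

    `source` may be a file path or a binary file object.
    The header names "PMID" and "PMCID" are assumed, not confirmed.
    """
    mapping = {}
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            pmid = (row.get("PMID") or "").strip()
            pmcid = (row.get("PMCID") or "").strip()
            # Keep only rows where both identifiers are present.
            if pmid and pmcid:
                mapping[pmid] = pmcid
    return mapping
```

The resulting dictionary could then be used to fill the PMC ID of each article whose PMID appears in the file.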
Download bibliographic data in batches in BibTeX format from the Web of Science, using the DOI names extracted with the following SQL statement:
> SELECT DISTINCT doi FROM article WHERE doi IS NOT NULL;
Save the downloaded data in the directory data/WoS. The following information can then be updated from that directory by running ArticleBibTexUpdater.java in the package cn.edu.bjut.ui: WoS ID, language, publisher, journal (full_name, print ISSN, and online ISSN), Keywords Plus, category, research area, and so on.
> UPDATE article SET language_code = "zh" WHERE doi LIKE "10.3760/CMA%";
The biological entity annotations in the BC7-LitCovid dataset can be imported to the database by running LitCovidAnnotationImporter.java in the package cn.edu.bjut.ui.
Optionally, these full texts can be converted to BioC-XML format by running LitCovidToBioCConventor.java in the package cn.edu.bjut.ui. The resulting XML files are saved in the directory data/bioc-litcovid.
Extract XML files with ArticleBioCPMCExtractor.java in the package cn.edu.bjut.pubtator through the BioC-PMC API, and save them in the directory data/bioc-pmc.
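For orientation, a request to the BioC-PMC API can be sketched in Python as follows. The URL pattern reflects the public BioC-PMC REST service as far as we know it; the helper names `bioc_pmc_url` and `fetch_bioc_xml` are hypothetical and do not correspond to ArticleBioCPMCExtractor.java.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL of the BioC-PMC REST service (assumed from its public documentation).
BIOC_PMC_BASE = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi"


def bioc_pmc_url(article_id, fmt="BioC_xml", encoding="unicode"):
    """Build the request URL for one article (a PMID or a PMC ID)."""
    return f"{BIOC_PMC_BASE}/{fmt}/{quote(str(article_id))}/{encoding}"


def fetch_bioc_xml(article_id, timeout=30):
    """Download the BioC XML for one article and return the raw bytes."""
    with urlopen(bioc_pmc_url(article_id), timeout=timeout) as response:
        return response.read()
```

The returned bytes could be written to a file such as data/bioc-pmc/<PMID>.xml; that file naming is an assumption, not something the source states.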
Not all articles in the BioC-PMC API come with full text. To eliminate such articles, the extracted XML files smaller than 10 KB are checked manually one by one, and the corresponding PMIDs are saved in the file data/pmid-removed.list. These files are then deleted, and the field full_text_source in the database is updated accordingly by running BioCPMCRemover.java in the package cn.edu.bjut.pubtator.
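To shortlist the files for manual checking, a small Python sketch such as the following could list every XML file below the size threshold. The function name `small_xml_files` is hypothetical, and the `*.xml` file pattern is an assumption about how the extracted files are named.

```python
from pathlib import Path


def small_xml_files(directory, max_bytes=10 * 1024):
    """Return the XML files in `directory` that are smaller than `max_bytes`.

    The default threshold of 10 KB matches the size mentioned above.
    """
    return sorted(
        path
        for path in Path(directory).glob("*.xml")
        if path.stat().st_size < max_bytes
    )
```

Calling `small_xml_files("data/bioc-pmc")` would yield the candidates to inspect by hand; the sketch only lists files and deletes nothing.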
The requests are submitted to the PubTator server by running PmcTextSubmitRequester.java in the package cn.edu.bjut.pubtator. At this point, a session number is obtained for each XML file.
The annotated entities can be retrieved by running PmcTextSubmitRetriever.java in the package cn.edu.bjut.pubtator. The results are stored in the directory data/bioc-pmc-pubtator.
The annotated entities can be imported to the database by running BioCPmcAnnotationImporter.java in the package cn.edu.bjut.ui from the directory data/bioc-pmc-pubtator.
As mentioned in the previous section, not all articles in the BioC-PMC API come with full text. One can retrieve those articles by running the following SQL statement:
> SELECT id, pmid, title, abstract, doi FROM article WHERE full_text_source = "MANUAL";
In our case, there are 7,028 such articles in total. To obtain the full text, each passage can be copied manually into an Excel file. Then, one can run FullTextWithExcelImporter.java in the package cn.edu.bjut.ui to import these full texts into the database.
These full texts are converted to BioC-XML format by running CustomTextToBioCConventor.java in the package cn.edu.bjut.ui. The resulting XML files are saved in the directory data/bioc-custom.
The requests are submitted to the PubTator server by running CustomTextSubmitRequester.java in the package cn.edu.bjut.pubtator. At this point, a session number is obtained for each XML file.
The annotated entities can be retrieved by running CustomTextSubmitRetriever.java in the package cn.edu.bjut.pubtator. The results are stored in the directory data/bioc-custom-pubtator.
The annotated entities can be imported to the database by running BioCCustomAnnotationImporter.java in the package cn.edu.bjut.ui from the directory data/bioc-custom-pubtator.
> UPDATE article_annotation SET text = REGEXP_REPLACE(text, "[ | | | | | |\n]", " ");
> SELECT article.id AS article_id, pmid, pmc_id, title, doi, author.id AS author_id, seq_no, full_name, last_name, first_name FROM article, article_author, author WHERE article_author.article_id = article.id AND article_author.author_id = author.id AND pmid IN ("32105052", "32133830", "32141280", "32149484", "32216961", "32265220", "32311431", "32319971", "32327229", "32337192", "32362243", "32365221", "32369656", "32373991", "32376398", "32392129", "32422410", "32463434", "32495923", "32511704", "32512291", "32524843", "32530033", "32530813", "32531110", "32531138", "32532430", "32541232", "32544034", "32549072", "32558644", "32574896", "32593742", "32641989", "32731151", "32732190", "32770466", "32804103", "32804122", "32820721", "32829601", "32831176", "32835573", "32837678", "32848097", "32865183", "32865184", "32870139", "32873575", "32876113", "32876697", "32887691", "32893646", "32901732", "32915172", "32917566", "32918858", "32921703", "32949380", "32949881", "32978251", "32986819", "33014380", "33032267", "33042553", "33045362", "33071427", "33074221", "33076590", "34211521", "34225090", "34226470", "34340970", "34348116", "34375647", "34376927")
Download the MeSH data in XML format, and then import it into the database by running MeshHeadingImporter.java in the package cn.edu.bjut.ui.
Extract XML files with ArticleEFetchExtractor.java in the package cn.edu.bjut.ui through the E-Fetch API, and save them in the directory data/mesh.
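As an illustration of the E-Fetch step, the Python sketch below builds batched request URLs for the NCBI E-utilities `efetch` endpoint. The batch size and the helper name `efetch_urls` are assumptions; how ArticleEFetchExtractor.java actually batches its requests is not stated here.

```python
from urllib.parse import urlencode

# NCBI E-utilities EFetch endpoint.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_urls(pmids, batch_size=200):
    """Build one EFetch URL per batch of PMIDs (batch size is an assumption)."""
    urls = []
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        # Several PMIDs are passed as a comma-separated list in `id`.
        query = urlencode({"db": "pubmed", "id": ",".join(batch), "retmode": "xml"})
        urls.append(f"{EFETCH_BASE}?{query}")
    return urls
```

Each URL returns PubMed XML for the PMIDs in its batch, which can then be saved under data/mesh.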
Import the MeSH heading information into the database from the directory data/mesh by running MeshHeadingUpdater.java in the package cn.edu.bjut.ui. Note that the publication years and the XML fragment for each author are also updated at this time.
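To make the structure of the E-Fetch output concrete, the following Python sketch pulls the MeSH descriptors out of PubMed XML. It assumes the standard PubMed element layout (`PubmedArticle`, `MedlineCitation`, `MeshHeadingList`, `DescriptorName` with a `UI` attribute); the function name `mesh_headings_by_pmid` is hypothetical and is not taken from MeshHeadingUpdater.java.

```python
import xml.etree.ElementTree as ET


def mesh_headings_by_pmid(xml_text):
    """Return {pmid: [(descriptor_ui, descriptor_name), ...]} from PubMed XML."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        path = "MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName"
        # Articles without MeSH indexing simply get an empty list.
        result[pmid] = [
            (descriptor.get("UI"), descriptor.text)
            for descriptor in article.iterfind(path)
        ]
    return result
```

Articles that have not yet been indexed with MeSH terms map to an empty list, so they can be told apart from articles that are missing from the XML altogether.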
> SELECT id, pmid, title, doi, publication_year FROM article WHERE publication_year = 0;
Export the above records to article_publication_year.xlsx in the directory data, and then correct them manually one by one.
Once the correction is done, run ArticlePublicationYearUpdater.java in the package cn.edu.bjut.ui to import the corrected years from the file data/article_publication_year.xlsx into the MySQL database.
The affiliation information can be updated from the XML fragment by running ArticleAffiliationRawImporter.java in the package cn.edu.bjut.ui. Note that the ORCID and email of each author are also updated at this time.
One can obtain the label set statistics by running LabelSetSummary.java in the package cn.edu.bjut.ui.
One can run CSVExporter.java in the package cn.edu.bjut.ui to export all related information for deep learning.
In addition, the entities and MeSH headings can be exported separately by running EntityExporter.java and MeshHeadingExporter.java, respectively, in the package cn.edu.bjut.ui.
> SELECT DISTINCT doi FROM article WHERE doi IS NOT NULL ORDER BY doi ASC;
With the help of OpenCitations, the DOIs of citing and cited articles can be retrieved from a DOI list file by running the following command.
> python retrieve.py bc7_lit_covid.csv bc7_lit_covid
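The contents of retrieve.py are not reproduced here. As a rough sketch of how such a lookup might work, the Python code below builds request URLs for the OpenCitations COCI REST API and extracts DOI pairs from a decoded JSON response. The endpoint version (`coci/api/v1`) and the record keys `citing` and `cited` are assumptions about the API, and both helper names are hypothetical.

```python
from urllib.parse import quote

# Assumed base URL of the OpenCitations COCI REST API (version 1).
OPENCITATIONS_BASE = "https://opencitations.net/index/coci/api/v1"


def opencitations_url(operation, doi):
    """Build a URL for `citations` (incoming) or `references` (outgoing)."""
    if operation not in ("citations", "references"):
        raise ValueError("operation must be 'citations' or 'references'")
    return f"{OPENCITATIONS_BASE}/{operation}/{quote(doi)}"


def citation_pairs(records):
    """Extract (citing DOI, cited DOI) pairs from decoded JSON records.

    Each record is assumed to be a dict with "citing" and "cited" keys.
    """
    return [
        (record["citing"], record["cited"])
        for record in records
        if record.get("citing") and record.get("cited")
    ]
```

Requesting `citations` for a DOI gives the articles that cite it (forward citations), while `references` gives the articles it cites (backward citations).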
The forward and backward citations can be imported to the database by running DirectCitationImporter.java in the package cn.edu.bjut.ui. Then, one can obtain indirect citations (co-citation and bibliographic coupling) by running IndirectCitationUpdater.java in the package cn.edu.bjut.ui.
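The two kinds of indirect citation can be derived from the direct citation pairs, as in the minimal Python sketch below. Two articles are co-cited when the same article cites both, and two articles are bibliographically coupled when both cite the same article. The function name `indirect_citations` and the input shape are assumptions for illustration; the sketch does not reproduce IndirectCitationUpdater.java.

```python
from collections import defaultdict
from itertools import combinations


def indirect_citations(direct):
    """Count co-citation and bibliographic coupling from (citing, cited) pairs.

    Returns two dicts keyed by a sorted pair of article identifiers:
    co-citation counts and bibliographic coupling strengths.
    """
    references = defaultdict(set)  # citing article -> articles it cites
    cited_by = defaultdict(set)    # cited article -> articles citing it
    for citing, cited in direct:
        references[citing].add(cited)
        cited_by[cited].add(citing)

    # Co-citation: every pair in one reference list is co-cited once.
    co_citation = defaultdict(int)
    for refs in references.values():
        for pair in combinations(sorted(refs), 2):
            co_citation[pair] += 1

    # Bibliographic coupling: every pair of citers shares this reference.
    coupling = defaultdict(int)
    for citers in cited_by.values():
        for pair in combinations(sorted(citers), 2):
            coupling[pair] += 1

    return dict(co_citation), dict(coupling)
```

Using sets means a repeated (citing, cited) pair is counted only once, and sorting each pair stores every relation under a single key regardless of order.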