Scientific Data Construction

Citation Information

To add later.

Create Database

The database SQL file: scientific_data.sql.

PLoS

Download Bibliographic Data

First, the DOI names are extracted from the downloaded file allofplos.zip by running DoiExtractor in the package cn.edu.bjut.download. Then, the bibliographic data can be downloaded from Web of Science (WoS) with the following settings:

  • Database: Web of Science
  • Export Format: BibTeX
  • Record Content: Full Record and Cited References
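
The DOI extraction step can be sketched as follows. This is a minimal Python sketch, not the DoiExtractor implementation. It assumes that the members of allofplos.zip are named after the DOI suffix (for example journal.pone.0270273.xml) and that prepending 10.1371/ yields the DOI name. The names extract_dois and build_demo_archive are hypothetical.

```python
import io
import re
import zipfile

# Assumed naming scheme of archive members, e.g. "journal.pone.0270273.xml".
MEMBER_PATTERN = re.compile(r"^(journal\.[a-z]+\.\d+)\.xml$")
# Assumed DOI prefix for all matching members.
PLOS_PREFIX = "10.1371/"


def extract_dois(zip_file):
    """Return the sorted DOI names derived from the member file names."""
    dois = set()
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            base = name.rsplit("/", 1)[-1]  # ignore any directory part
            match = MEMBER_PATTERN.match(base)
            if match:
                dois.add(PLOS_PREFIX + match.group(1))
    return sorted(dois)


def build_demo_archive():
    """Build a small in-memory archive so the sketch runs without the corpus."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("journal.pone.0270273.xml", "<article/>")
        archive.writestr("journal.pcbi.0030191.xml", "<article/>")
        archive.writestr("README.txt", "not an article")
    buffer.seek(0)
    return buffer
```

Calling extract_dois on the path of the real archive would work the same way, provided the naming assumption holds. Members that do not match the pattern are skipped silently.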

Preprocess Bibliographic Data

One can preprocess all files in BibTeX format by running BibTeXPreprocessor.java in the package cn.edu.bjut.ui of the project WoSImporter.

Import Bibliographic Data

The bibliographic data can be imported into the database by running ArticleBibTexImporter.java in the package cn.edu.bjut.ui of the project WoSImporter. Note that the parameters checkFlag and citedArticleFlag should be set to false. During the import, the DOI names of cited articles are pre-processed with the cleaning method in Xu et al. (2019).

Afterwards, an exploratory analysis of the data quality can be conducted, especially of the journal information. Once incorrect information is found, one can correct it manually. One can remove the articles that do not belong to PLoS, together with their related records, and correct malformed DOI names by running the following statements:

> DELETE FROM article_affiliation WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886); 
> DELETE FROM article_author WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_category WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_cited_article WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_country WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_funding WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_keyword_plus WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_research_area WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article WHERE id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM journal WHERE id > 2159;
> ALTER TABLE journal AUTO_INCREMENT=2160;
> SELECT id, wos_id, title, publication_year, doi FROM article WHERE doi LIKE "%;%" OR doi LIKE "%,%" ORDER BY id ASC; 
> UPDATE article SET doi = UPPER("10.1371/journal.pone.0270273") WHERE id = 755519;
> UPDATE article SET doi = UPPER("10.1371/journal.pone.0271835") WHERE id = 834969;
> UPDATE article SET doi = UPPER("10.1371/journal.pone.0274263") WHERE id = 759937;
> UPDATE article SET doi = UPPER("10.1371/journal.pcbi.0030191") WHERE id = 813782;
> UPDATE article SET doi = UPPER("10.1371/journal.pcbi.0030083") WHERE id = 741355;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0050036") WHERE id = 666819;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0040236") WHERE id = 813787;
> UPDATE article SET doi = UPPER("10.1371/journal.pgen.0030122") WHERE id = 663902;
> UPDATE article SET doi = UPPER("10.1371/journal.pbio.0050003") WHERE id = 663493;
> UPDATE article SET doi = UPPER("10.1371/journal.pbio.0040154") WHERE id = 787002;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0020299") WHERE id = 840538;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0020173") WHERE id = 710321;
> UPDATE article SET doi = UPPER("10.1007/s11033-022-08097-3") WHERE id = 14630;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0050247") WHERE id = 639110;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0050246") WHERE id = 666780;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0050070") WHERE id = 855183;
> UPDATE article SET doi = UPPER("10.1371/journal.pcbi.0030040") WHERE id = 642571;
> UPDATE article SET doi = UPPER("10.1371/journal.pcbi.0030047") WHERE id = 642574;
> UPDATE article SET doi = UPPER("10.1186/s12943-022-01693-8") WHERE id = 410654;
> UPDATE article SET doi = UPPER("10.1371/journal.pbio.0050291") WHERE id = 874050;
> UPDATE article SET doi = UPPER("10.1371/journal.pone.0032405") WHERE id = 655689;
> UPDATE article SET doi = UPPER("10.1371/journal.ppat.0020064") WHERE id = 659150;
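
The DELETE statements above repeat the same ID list for every table and remove dependent rows before the article rows themselves. The following Python sketch shows how that order could be generated from a single ID list. It uses an in-memory SQLite database only to stay self-contained: the real database is a different system, the column layout in build_demo_db is an assumption, and the placeholder syntax depends on the database driver. The function names are hypothetical.

```python
import sqlite3

# Tables that reference article.id through an article_id column
# (names taken from the statements above).
CHILD_TABLES = [
    "article_affiliation",
    "article_author",
    "article_category",
    "article_cited_article",
    "article_country",
    "article_funding",
    "article_keyword_plus",
    "article_research_area",
]


def delete_articles(conn, article_ids):
    """Delete the given articles and their dependent rows in one transaction."""
    ids = list(article_ids)
    if not ids:
        return
    placeholders = ", ".join("?" for _ in ids)
    with conn:  # commits on success, rolls back on error
        # Dependent rows first, the article rows last.
        for table in CHILD_TABLES:
            conn.execute(
                f"DELETE FROM {table} WHERE article_id IN ({placeholders})", ids
            )
        conn.execute(f"DELETE FROM article WHERE id IN ({placeholders})", ids)


def build_demo_db():
    """Create a tiny in-memory stand-in for the real schema (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, doi TEXT)")
    for table in CHILD_TABLES:
        conn.execute(f"CREATE TABLE {table} (article_id INTEGER)")
    for article_id in (642855, 703432, 755519):
        conn.execute("INSERT INTO article (id) VALUES (?)", (article_id,))
        conn.execute(
            "INSERT INTO article_author (article_id) VALUES (?)", (article_id,)
        )
    conn.commit()
    return conn
```

Keeping the ID list in one place avoids the risk that one of the nine statements is run with a stale copy of the list.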

Construct Linkages between Articles

> SELECT id, preferred_id, title, doi FROM article WHERE journal_id >= 2152 AND TYPE IN ("Correction", "Expression of Concern", "Retraction") AND preferred_id IS NULL; 

Import Passages

Multiple passages (such as data availability, acknowledgement and supplementary material) are imported into the database by running PassageImporter.java in the package cn.edu.bjut.ui.

Filter Passages

supplementary material:

> UPDATE article_supplementary_material SET flag = 0 WHERE mimetype IN (???);
> UPDATE article_supplementary_material SET flag = 1 WHERE mimetype IN (???);
> UPDATE article_supplementary_material SET flag = 0 WHERE label LIKE "%code%" OR label LIKE "%script%" OR label LIKE "%program%" OR label LIKE "%command%" OR label LIKE "%fig%" OR label LIKE "%image%" OR label LIKE "%movie%" OR label LIKE "%video%" OR label LIKE "%audio%" OR label LIKE "%language%" OR label LIKE "%document%" OR label LIKE "%software%" OR label LIKE "%tool%";
> UPDATE article_supplementary_material SET flag = 0 WHERE label LIKE "%data%" OR label LIKE "%sequence%" OR label LIKE "%chromatogram%" OR href LIKE "%fasta";
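
Before running the label-based statements, it can help to see which keyword group a given label falls into. The Python sketch below mirrors the two label rules under two assumptions: matching is case-insensitive, and the second statement takes precedence because it runs last. It deliberately returns a group name rather than a flag value. The function name match_label_rules and the group names are hypothetical.

```python
# Keyword groups copied from the two label-based UPDATE statements above.
NON_DATA_KEYWORDS = (
    "code", "script", "program", "command", "fig", "image", "movie",
    "video", "audio", "language", "document", "software", "tool",
)
DATA_KEYWORDS = ("data", "sequence", "chromatogram")


def match_label_rules(label, href=""):
    """Return the name of the last keyword group matching the row, or None."""
    label = label.lower()
    matched = None
    if any(keyword in label for keyword in NON_DATA_KEYWORDS):
        matched = "non_data_keywords"
    # Applied second, so it overrides the first rule when both match.
    if any(keyword in label for keyword in DATA_KEYWORDS) or href.lower().endswith("fasta"):
        matched = "data_keywords"
    return matched
```

A label such as "R script for data analysis" matches both groups, which shows that the order of the two statements matters.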

acknowledgement:

> UPDATE article_passage SET flag = 1 WHERE passage_type = "ACKNOWLEDGEMENT" AND NOT (xml_fragment LIKE "%avail%" AND xml_fragment LIKE "%xlink:href%");
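
The condition in the acknowledgement statement selects every passage that does not contain both the substring "avail" and an xlink:href attribute. A minimal Python sketch of the same predicate, assuming case-insensitive matching, is given below. The function name is hypothetical.

```python
def lacks_availability_link(xml_fragment):
    """True when the fragment does not contain both "avail" and "xlink:href"."""
    text = xml_fragment.lower()
    return not ("avail" in text and "xlink:href" in text)
```

A passage that mentions availability but has no link, or has a link but no availability wording, is still selected by this condition.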

data_availability:

> UPDATE article_passage SET flag = 0 WHERE passage_type = "DATA_AVAILABILITY" AND xml_fragment IN (SELECT DISTINCT xml_fragment FROM article_passage WHERE passage_type = "DATA_AVAILABILITY" AND xml_fragment NOT LIKE '%"%' AND flag = 0);
> UPDATE article_passage SET flag = 0 WHERE passage_type = "DATA_AVAILABILITY" AND xml_fragment REGEXP ???;

Annotate Entity and Relation Mentions

To reduce the workload, one can automatically annotate entity and relation mentions on the basis of curated rules by running PassageExporterFromScratch.java in the package cn.edu.bjut.brat. This generates two sub-directories (data_availability and acknowledgement) in the directory data/brat. These two sub-directories are then uploaded to the directory data of the BRAT software. In our case, one can check the entities in each article by visiting our local BRAT. Finally, the entity and relation mentions are annotated manually.

To improve the annotation quality, one can run NestedEntityDetector.java in the package cn.edu.bjut.brat to detect nested entities, and run AnnotationChecker.java in the same package to detect entity candidates that the annotators should then check.
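
Nested-entity detection can be illustrated with a small Python sketch. It is not the logic of NestedEntityDetector.java, which is not shown here. It assumes that each entity is given as a tuple of identifier, start offset and exclusive end offset, and it reports an entity as nested when its span lies inside a different, larger span. The function name is hypothetical.

```python
def find_nested_entities(entities):
    """Return (inner_id, outer_id) pairs for spans contained in a larger span.

    entities: list of (entity_id, start, end) tuples, end exclusive.
    """
    nested = []
    for inner_id, inner_start, inner_end in entities:
        for outer_id, outer_start, outer_end in entities:
            if inner_id == outer_id:
                continue
            contained = outer_start <= inner_start and inner_end <= outer_end
            same_span = (inner_start, inner_end) == (outer_start, outer_end)
            if contained and not same_span:
                nested.append((inner_id, outer_id))
    return nested
```

The pairwise comparison is quadratic in the number of entities, which is acceptable for the short passages annotated here.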

Import Annotations

The annotated entities and relations can be imported into the database by running AnnotationImporter.java in the package cn.edu.bjut.brat. It is worth noting that entity mentions crossing multiple fragments will not be imported into the database, such as T1 in the article with id = 825858, and T13 and T14 in the article with id = 889757.
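
Entity mentions that cross multiple fragments can be located in advance. In the BRAT standoff format, a text-bound annotation with several fragments lists its offset pairs separated by semicolons in the second tab-separated field. The Python sketch below relies on that convention. The function name and the entity and relation type names in the usage example are hypothetical.

```python
def find_discontinuous_entities(ann_text):
    """Return the IDs of text-bound annotations that have more than one fragment."""
    ids = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, notes and other annotation kinds
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        # fields[1] looks like "Type start end" or "Type start end;start end".
        if ";" in fields[1]:
            ids.append(fields[0])
    return ids
```

For example, given the three lines "T1<TAB>Repository 0 5;16 23<TAB>North America", "T2<TAB>Repository 30 36<TAB>Dryad" and "R1<TAB>StoredIn Arg1:T2 Arg2:T1", only T1 is reported.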

To construct the linkages between articles and repository mentions, one can run RepositoryEnricher.java in the package cn.edu.bjut.ui.

Export Annotations

If needed, the annotations in the database can be exported in the BRAT format by running PassageExporterFromDatabase.java in the package cn.edu.bjut.brat.

Extract and Normalize Repositories

One can extract unique repositories and hrefs from annotations by running RepositoryExtractor.java in the package cn.edu.bjut.ui. Then, the resulting repositories are normalized by running RepositoryNormalizer.java in the package cn.edu.bjut.ui.
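
What normalizing hrefs could look like is sketched below in Python. This is an illustration under stated assumptions, not the logic of RepositoryExtractor.java or RepositoryNormalizer.java: it reduces each href to a lower-case host name without a leading "www." and collects the unique hosts. The function names and the example hosts are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_host(href):
    """Return the lower-case host of an href without a leading "www."."""
    href = href.strip()
    if "://" not in href:
        href = "//" + href  # lets urlsplit treat the first part as the host
    host = urlsplit(href).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


def unique_hosts(hrefs):
    """Return the sorted set of non-empty normalized hosts."""
    hosts = {normalize_host(href) for href in hrefs}
    hosts.discard("")
    return sorted(hosts)
```

Grouping by host is only a first approximation, since one host can serve several repositories and one repository can be reachable under several hosts.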

Build Linkages between Articles and Repositories

The linkages between articles and repositories can be built by running ArticleRepositoryBuilder.java in the package cn.edu.bjut.ui. After this operation, a summary of the linkages between articles and repositories can be obtained by running RepositorySummary.java in the package cn.edu.bjut.ui.

Springer Nature

Download Bibliographic Data

First, the WoS IDs are extracted by running WosIdExtractor in the package cn.edu.bjut.download. Then, the bibliographic data can be downloaded from Web of Science (WoS) with the following settings:

  • Database: Web of Science
  • Export Format: BibTeX
  • Record Content: Full Record and Cited References

Download Full Texts in XML Format

The full texts can be downloaded by running SpringerNatureDownloader.java in the package cn.edu.bjut.download.

zh/notes/scientific_data.txt · Last modified: 2024/01/07 17:19 by pzczxs