zh:notes:scientific_data [2024/09/18 11:30] pzczxs [Download Full Texts in XML format]
====== Scientific Data Construction ======
===== Citation Information =====
To add later.

===== Requirements =====
  * [[https://www.re3data.org/|re3data]]

===== Create Database =====
The database is created from the SQL file <color red>scientific_data.sql</color>.

===== PLoS =====
Source: https://plos.org/text-and-data-mining/

==== Download Bibliographic Data ====
First, extract the DOI names from the downloaded file [[https://allof.plos.org/allofplos.zip|allofplos.zip]] by running <color red>DoiExtractor</color> in the package <color red>cn.edu.bjut.download</color>. Then, download the bibliographic data from Web of Science (WoS) with the following settings:
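
The DOI extraction step can be sketched as follows. This is not the code of <color red>DoiExtractor</color>; it is a minimal illustration that assumes the archive entries are named like //journal.pone.0012345.xml// and that the DOI is the file name without its extension, prefixed with the PLoS registrant code //10.1371//. The class name ''DoiFromZipSketch'' and its methods are hypothetical. If the file names in the archive differ, reading the DOI from the ''article-id'' element of each XML file would be the safer route.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper; not part of the cn.edu.bjut.download package.
class DoiFromZipSketch {
    // Assumption: all PLoS DOIs share the registrant prefix 10.1371.
    static final String PLOS_PREFIX = "10.1371/";

    /** Maps an entry name such as "journal.pone.0012345.xml" to a DOI, or returns null. */
    static String toDoi(String entryName) {
        // Drop any directory part of the entry name.
        String name = entryName.substring(entryName.lastIndexOf('/') + 1);
        if (!name.startsWith("journal.") || !name.endsWith(".xml")) {
            return null; // not an article file under the assumed naming scheme
        }
        return PLOS_PREFIX + name.substring(0, name.length() - ".xml".length());
    }

    /** Collects the DOIs of all matching entries in the given zip archive. */
    static List<String> extract(String zipPath) throws IOException {
        List<String> dois = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                String doi = toDoi(entry.getName());
                if (doi != null) {
                    dois.add(doi);
                }
            }
        }
        return dois;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DoiFromZipSketch <path to allofplos.zip>");
            return;
        }
        for (String doi : extract(args[0])) {
            System.out.println(doi);
        }
    }
}
```

The printed list, one DOI per line, could then serve as input for the WoS search.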

  * Database: Web of Science
  * Export Format: BibTeX
  * Record Content: Full Record and Cited References

==== Preprocess Bibliographic Data ====
All downloaded BibTeX files can be preprocessed by running <color red>BibTeXPreprocessor.java</color> in the package <color red>cn.edu.bjut.ui</color> of the project <color red>WoSImporter</color>.

==== Import Bibliographic Data ====
The bibliographic data can be imported into the database by running <color red>ArticleBibTexImporter.java</color> in the package <color red>cn.edu.bjut.ui</color> of the project <color red>WoSImporter</color>. Note that the parameters //checkFlag// and //citedArticleFlag// should both be set to //false//. During the import, the DOI names of cited articles are pre-processed with the cleaning method in [[https://doi.org/10.1007/s11192-019-03162-4|Xu et al. (2019)]].
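
To give an idea of what DOI cleaning involves, here is a small sketch. It is **not** the method of Xu et al. (2019) and not the code used by the importer; it only assumes a few common kinds of noise (a resolver URL or a "doi:" label in front of the name, trailing punctuation) and stores the result in upper case, as the UPDATE statements on this page do. The class name ''DoiCleanerSketch'' is hypothetical.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; a simplified illustration, not the published cleaning method.
class DoiCleanerSketch {
    // Resolver URLs or "doi"/"doi:" labels that sometimes precede the DOI name.
    private static final Pattern LEADING_NOISE =
            Pattern.compile("^(?:https?://(?:dx\\.)?doi\\.org/|doi:?\\s*)+", Pattern.CASE_INSENSITIVE);
    // Punctuation or whitespace left over at the end of the field.
    private static final Pattern TRAILING_NOISE = Pattern.compile("[.,;\\s]+$");

    /** Returns the cleaned, upper-cased DOI name, or null if the input does not look like a DOI. */
    static String clean(String raw) {
        if (raw == null) {
            return null;
        }
        String doi = raw.trim();
        doi = LEADING_NOISE.matcher(doi).replaceFirst("");
        doi = TRAILING_NOISE.matcher(doi).replaceFirst("");
        // Every DOI name starts with the directory indicator "10.".
        return doi.startsWith("10.") ? doi.toUpperCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(clean("https://doi.org/10.1371/journal.pone.0270273."));
    }
}
```

Values for which ''clean'' returns //null// would still need manual inspection.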

Afterwards, one can carry out an exploratory analysis of the data quality, especially of the journal information, and manually correct any errors found. Articles that do not belong to PLoS journals, together with their related records, can be removed by running the following statements:
<code sql>
> DELETE FROM article_affiliation WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_author WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_category WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_cited_article WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_country WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_funding WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_keyword_plus WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article_research_area WHERE article_id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM article WHERE id IN (642855, 703432, 676918, 812157, 818637, 646786, 873495, 649006, 612453, 889538, 796784, 682058, 900841, 849351, 879235, 856701, 856703, 766864, 723449, 797595, 804317, 822065, 712978, 869023, 818634, 894816, 794687, 635135, 875880, 622562, 609391, 742727, 831367, 693218, 783718, 892274, 686552, 787182, 712590, 659059, 863275, 878725, 754534, 682924, 890398, 826473, 848687, 856737, 812432, 812886);
> DELETE FROM journal WHERE id > 2159;
> ALTER TABLE journal AUTO_INCREMENT=2160;
</code>
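
Since the same ID list is repeated in every statement, the DELETE statements could also be generated instead of typed by hand. The sketch below is one possible way to do that; the class ''ArticleDeleteSketch'' is hypothetical and not part of <color red>WoSImporter</color>. It only builds the SQL text, using the table names from the statements above, and deletes from the child tables before the //article// table. The clean-up of the //journal// table is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that only builds SQL text; it does not connect to a database.
class ArticleDeleteSketch {
    // Child tables taken from the statements above; each refers to an article via article_id.
    static final List<String> CHILD_TABLES = List.of(
            "article_affiliation", "article_author", "article_category",
            "article_cited_article", "article_country", "article_funding",
            "article_keyword_plus", "article_research_area");

    /** Builds one DELETE per child table, followed by the DELETE on the article table itself. */
    static List<String> deleteStatements(List<Integer> articleIds) {
        if (articleIds.isEmpty()) {
            throw new IllegalArgumentException("at least one article id is required");
        }
        String idList = articleIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(", "));
        List<String> statements = new ArrayList<>();
        for (String table : CHILD_TABLES) {
            statements.add("DELETE FROM " + table + " WHERE article_id IN (" + idList + ");");
        }
        // The article rows go last, after everything that refers to them.
        statements.add("DELETE FROM article WHERE id IN (" + idList + ");");
        return statements;
    }

    public static void main(String[] args) {
        deleteStatements(List.of(642855, 703432)).forEach(System.out::println);
    }
}
```

Because the IDs are integers, joining them into the statement text is safe here; for free-text values a prepared statement would be the better choice.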

The following query lists the articles whose DOI field contains a semicolon or a comma, i.e. more than one DOI name:
<code sql>
> SELECT id, wos_id, title, publication_year, doi FROM article WHERE doi LIKE "%;%" OR doi LIKE "%,%" ORDER BY id ASC;
</code>

These DOI names are then corrected one by one:
<code sql>
> UPDATE article SET doi = UPPER("10.1371/journal.pone.0270273") WHERE id = 755519;
> UPDATE article SET doi = UPPER("10.1371/journal.pone.0271835") WHERE id = 834969;
> UPDATE article SET doi = UPPER("10.1371/journal.pone.0274263") WHERE id = 759937;
> UPDATE article SET doi = UPPER("10.1371/journal.pcbi.0030191") WHERE id = 813782;
> UPDATE article SET doi = UPPER("10.1371/journal.pcbi.0030083") WHERE id = 741355;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0050036") WHERE id = 666819;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0040236") WHERE id = 813787;
> UPDATE article SET doi = UPPER("10.1371/journal.pgen.0030122") WHERE id = 663902;
> UPDATE article SET doi = UPPER("10.1371/journal.pbio.0050003") WHERE id = 663493;
> UPDATE article SET doi = UPPER("10.1371/journal.pbio.0040154") WHERE id = 787002;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0020299") WHERE id = 840538;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0020173") WHERE id = 710321;
> UPDATE article SET doi = UPPER("10.1007/s11033-022-08097-3") WHERE id = 14630;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0050247") WHERE id = 639110;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0050246") WHERE id = 666780;
> UPDATE article SET doi = UPPER("10.1371/journal.pmed.0050070") WHERE id = 855183;
> UPDATE article SET doi = UPPER("10.1371/journal.pcbi.0030040") WHERE id = 642571;
> UPDATE article SET doi = UPPER("10.1371/journal.pcbi.0030047") WHERE id = 642574;
> UPDATE article SET doi = UPPER("10.1186/s12943-022-01693-8") WHERE id = 410654;
> UPDATE article SET doi = UPPER("10.1371/journal.pbio.0050291") WHERE id = 874050;
> UPDATE article SET doi = UPPER("10.1371/journal.pone.0032405") WHERE id = 655689;
> UPDATE article SET doi = UPPER("10.1371/journal.ppat.0020064") WHERE id = 659150;
</code>

==== Construct Linkages between Articles ====
The following query lists the corrections, expressions of concern and retractions (with //journal_id// >= 2152) whose //preferred_id// is still empty:
<code sql>
> SELECT id, preferred_id, title, doi FROM article WHERE journal_id >= 2152 AND TYPE IN ("Correction", "Expression of Concern", "Retraction") AND preferred_id IS NULL;
</code>

==== Import Passages ====
Several kinds of passages (such as //data availability//, //acknowledgement// and //supplementary material//) are imported into the database by running <color red>PassageImporter.java</color> in the package <color red>cn.edu.bjut.ui</color>.

==== Filter Passages ====
//supplementary material//:
<code sql>
> UPDATE article_supplementary_material SET flag = 0 WHERE mimetype IN (???);
> UPDATE article_supplementary_material SET flag = 1 WHERE mimetype IN (???);
> UPDATE article_supplementary_material SET flag = 0 WHERE label LIKE "%code%" OR label LIKE "%script%" OR label LIKE "%program%" OR label LIKE "%command%" OR label LIKE "%fig%" OR label LIKE "%image%" OR label LIKE "%movie%" OR label LIKE "%video%" OR label LIKE "%audio%" OR label LIKE "%language%" OR label LIKE "%document%" OR label LIKE "%software%" OR label LIKE "%tool%";
> UPDATE article_supplementary_material SET flag = 0 WHERE label LIKE "%data%" OR label LIKE "%sequence%" OR label LIKE "%chromatogram%" OR href LIKE "%fasta";
</code>

//acknowledgement//:
<code sql>
> UPDATE article_passage SET flag = 1 WHERE passage_type = "ACKNOWLEDGEMENT" AND NOT (xml_fragment LIKE "%avail%" AND xml_fragment LIKE "%xlink:href%");
</code>

//data_availability//:
<code sql>
> UPDATE article_passage SET flag = 0 WHERE passage_type = "DATA_AVAILABILITY" AND xml_fragment IN (SELECT DISTINCT xml_fragment FROM article_passage WHERE passage_type = "DATA_AVAILABILITY" AND xml_fragment NOT LIKE '%"%' AND flag = 0);
> UPDATE article_passage SET flag = 0 WHERE passage_type = "DATA_AVAILABILITY" AND xml_fragment REGEXP ???;
</code>

==== Annotate Entity and Relation Mentions ====
To reduce the workload, one can automatically pre-annotate entity and relation mentions on the basis of curated rules by running <color red>PassageExporterFromScratch.java</color> in the package <color red>cn.edu.bjut.brat</color>. This generates two sub-directories (//data_availability// and //acknowledgement//) in the directory <color red>data/brat</color>. These two sub-directories are then uploaded to the directory <color red>data</color> of the [[http://brat.nlplab.org/|BRAT]] software. In our case, one can check the entities in each article by visiting our [[http://54xushuo.net:8001/index.xhtml#/scientific_data/|local BRAT]]. Finally, the entity and relation mentions are annotated manually.

To improve the annotation quality, one can run <color red>NestedEntityDetector.java</color> in the package <color red>cn.edu.bjut.brat</color> to detect nested entities, and run <color red>AnnotationChecker.java</color> in the package <color red>cn.edu.bjut.brat</color> to list candidate entities for the annotators to check.
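
As an illustration of what detecting nested entities means, the sketch below reports every entity span that lies completely inside a longer one. It is not the code of <color red>NestedEntityDetector.java</color>; the class ''NestedSpanSketch'', its ''Span'' type and the output format are assumptions made for this example, and offsets are taken to be character positions with an exclusive end, as in BRAT.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; illustrates span containment only.
class NestedSpanSketch {
    /** A text-bound annotation: identifier plus [start, end) character offsets. */
    static final class Span {
        final String id;
        final int start;
        final int end;

        Span(String id, int start, int end) {
            this.id = id;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns one "inner in outer" string for each span fully contained in a longer span. */
    static List<String> findNested(List<Span> spans) {
        List<String> nested = new ArrayList<>();
        for (Span inner : spans) {
            for (Span outer : spans) {
                boolean contained = outer.start <= inner.start && inner.end <= outer.end;
                boolean strictlyLonger = (outer.end - outer.start) > (inner.end - inner.start);
                if (inner != outer && contained && strictlyLonger) {
                    nested.add(inner.id + " in " + outer.id);
                }
            }
        }
        return nested;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(new Span("T1", 0, 20), new Span("T2", 5, 10), new Span("T3", 25, 30));
        findNested(spans).forEach(System.out::println);
    }
}
```

The pairwise comparison is quadratic in the number of entities, which should be acceptable for the short passages annotated here.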

==== Import Annotations ====
The annotated entities and relations can be imported into the database by running <color red>AnnotationImporter.java</color> in the package <color red>cn.edu.bjut.brat</color>. Note that entity mentions spanning multiple fragments will not be imported into the database, such as T1 in the article with id = 825858, and T13 and T14 in the article with id = 889757.
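
In the BRAT standoff format, an entity mention with several fragments lists its offset pairs separated by semicolons, e.g. ''0 5;10 15''. The sketch below counts the fragments of one line of an //.ann// file, so that such mentions can be found before the import. It is not taken from <color red>AnnotationImporter.java</color>; the class ''BratFragmentSketch'' and the entity type //Repository// in the usage example are hypothetical.

```java
// Hypothetical helper; works on a single line of a BRAT .ann file.
class BratFragmentSketch {
    /**
     * Returns the number of fragments of a text-bound annotation line,
     * or 0 if the line is not a text-bound ("T") annotation.
     */
    static int fragmentCount(String annLine) {
        // Fields are tab-separated: id, "<type> <offsets>", covered text.
        String[] fields = annLine.split("\t", 3);
        if (fields.length < 2 || !fields[0].startsWith("T")) {
            return 0;
        }
        int firstSpace = fields[1].indexOf(' ');
        if (firstSpace < 0) {
            return 0; // no offsets present
        }
        // Offsets look like "0 5" or, for several fragments, "0 5;10 15".
        String offsets = fields[1].substring(firstSpace + 1);
        return offsets.split(";").length;
    }

    public static void main(String[] args) {
        System.out.println(fragmentCount("T1\tRepository 0 5;10 15\tGene Omnibus"));
    }
}
```

A mention for which ''fragmentCount'' returns more than 1 is one that spans multiple fragments.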

To construct the linkages between articles and repository mentions, one can run <color red>RepositoryEnricher.java</color> in the package <color red>cn.edu.bjut.ui</color>.

==== Export Annotations ====
If needed, the annotations in the database can be exported in the BRAT format by running <color red>PassageExporterFromDatabase.java</color> in the package <color red>cn.edu.bjut.brat</color>.

==== Extract and Normalize Repositories ====
One can extract the unique repositories and hrefs from the annotations by running <color red>RepositoryExtractor.java</color> in the package <color red>cn.edu.bjut.ui</color>. The resulting repositories are then normalized by running <color red>RepositoryNormalizer.java</color> in the package <color red>cn.edu.bjut.ui</color>.

==== Build Linkages between Articles and Repositories ====
The linkages between articles and repositories can be built by running <color red>ArticleRepositoryBuilder.java</color> in the package <color red>cn.edu.bjut.ui</color>. Afterwards, a summary of the linkages between articles and repositories can be obtained by running <color red>RepositorySummary.java</color> in the package <color red>cn.edu.bjut.ui</color>.

===== Springer Nature =====
Source: https://dev.springernature.com

==== Download Bibliographic Data ====
First, extract the WoS IDs by running <color red>WosIdExtractor</color> in the package <color red>cn.edu.bjut.download</color>. Then, download the bibliographic data from Web of Science (WoS) with the following settings:

  * Database: Web of Science
  * Export Format: BibTeX
  * Record Content: Full Record and Cited References

==== Download Full Texts in XML format ====
The full texts can be downloaded by running <color red>SpringerNatureDownloader.java</color> in the package <color red>cn.edu.bjut.download</color>.

===== re3data Repository =====
==== Download Repository ====

~~DISCUSSION:closed~~