
CORD-19-Related Entity Annotation

Citation Information

  1. 徐硕,张萌萌,枊力元,王聪聪,孙睿,李怡琳,徐金楠,安欣,2023. 新冠领域溯源类论文筛选及全文实体标注研究. 农业图书情报学报.

Dataset

Create Database

Create the database from the SQL file cord19_entity.sql.

Import Metadata

Import all_sources_metadata_2020-03-13.csv to the database by running MetadataImporter.java in the package cn.edu.bjut.ui.

Update the resulting identifier by running TargetArticleIdentifierUpdater.java in the package cn.edu.bjut.ui.

Import Full Texts

Import biorxiv_medrxiv.tar.gz, comm_use_subset.tar.gz, noncomm_use_subset.tar.gz, and pmc_custom_license.tar.gz to the database by running FullTextImporter.java in the package cn.edu.bjut.ui.

Import Entities

Import CORD-NER-full.json to the database by running EntityImporter.java in the package cn.edu.bjut.ui.

Deduplicate Target Articles

Export duplicate records, matched on sha, doi, pubmed_id, and pmc_id, by running DuplicativeTargetArticleExporter.java in the package cn.edu.bjut.ui. Four files, sha.csv, doi.csv, pubmed_id.csv, and pmc_id.csv, are then generated in the directory data.

These CSV files can be converted to Excel files and checked one by one. Once the checking is done, the checked data can be imported into the database by running TargerArticleDeduplicator.java in the package cn.edu.bjut.ui.
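The duplicate export can be pictured as grouping articles by one identifier column at a time. The sketch below is only an illustration under that assumption; the class DuplicateFinder and its method are hypothetical and are not taken from DuplicativeTargetArticleExporter.java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: group article ids that share the same non-empty key value. */
final class DuplicateFinder {
    /**
     * @param keyById maps an article id to the value of one key column (e.g. its doi);
     *                null or blank values are ignored
     * @return key value -> ids sharing it, restricted to values held by two or more articles
     */
    static Map<String, List<Integer>> findDuplicates(Map<Integer, String> keyById) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : keyById.entrySet()) {
            String key = e.getValue();
            if (key == null || key.trim().isEmpty()) {
                continue; // a missing identifier cannot indicate a duplicate
            }
            groups.computeIfAbsent(key.trim(), k -> new ArrayList<>()).add(e.getKey());
        }
        // Keep only the values that occur more than once.
        groups.values().removeIf(ids -> ids.size() < 2);
        return groups;
    }
}
```

Running it once per column (sha, doi, pubmed_id, pmc_id) would give one candidate list per CSV file; the manual check described here is still needed, because a shared identifier is only a hint that two records are the same article.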

Update Span Information

Check and update the span information of entities by running EntitySpanChecker.java in the package cn.edu.bjut.ui.

Check the span information of cites and references by running CiteRefSpanChecker.java in the package cn.edu.bjut.ui.
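As an illustration of what a span check involves, the sketch below tests whether a recorded offset pair really covers the mention text and, if not, looks for the nearest occurrence of the mention. It assumes half-open character offsets [start, end); the class SpanCheck and the nearest-occurrence repair rule are hypothetical and may differ from what EntitySpanChecker.java and CiteRefSpanChecker.java do.

```java
/** Hypothetical sketch of a span consistency check. */
final class SpanCheck {
    /** Returns true when the half-open span [start, end) of text spells exactly the mention. */
    static boolean matches(String text, int start, int end, String mention) {
        return start >= 0 && start <= end && end <= text.length()
                && text.substring(start, end).equals(mention);
    }

    /**
     * Returns the start offset of the occurrence of mention closest to the recorded start,
     * or -1 when the mention is empty or does not occur in the text.
     */
    static int relocate(String text, int recordedStart, String mention) {
        if (mention.isEmpty()) {
            return -1; // an empty mention matches everywhere, so there is nothing to relocate
        }
        int best = -1;
        for (int i = text.indexOf(mention); i >= 0; i = text.indexOf(mention, i + 1)) {
            if (best < 0 || Math.abs(i - recordedStart) < Math.abs(best - recordedStart)) {
                best = i;
            }
        }
        return best;
    }
}
```

A span that fails matches and for which relocate returns -1 cannot be repaired automatically and would have to be inspected by hand.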

Many abstracts in the metadata are inconsistent with those in the full texts. We take the abstracts in the metadata as the gold standard. To align the cite and reference spans that appear in the full-text abstracts, export them by running CiteRefSpanExporter.java in the package cn.edu.bjut.ui. Four files, cite_span.csv, ref_span.csv, cite_span_abstract.txt, and ref_span_abstract.txt, are then generated in the directory data.

The two CSV files can be converted to Excel files and checked one by one with the help of the TXT files. Once the checking is done, the checked data can be imported into the database by running CiteRefSpanUpdater.java in the package cn.edu.bjut.ui.

Finally, calibrate the spans of cites and references by running CiteRefSpanCalibrater.java in the package cn.edu.bjut.ui.

Update License Information

> UPDATE target_article SET license = "biorxiv" WHERE license = "See https://www.biorxiv.org/about-biorxiv"; 
> UPDATE target_article SET license = "medrxiv" WHERE license = "See https://www.medrxiv.org/submit-a-manuscript";
> UPDATE target_article SET license = "no-cc" WHERE license = "pd";  
> UPDATE target_article SET license = "no-cc" WHERE license = "NO-CC CODE"; 

Filter Entities

UPDATE target_article_entity SET flag = 0 WHERE category_id IN (6, 9, 10, 11, 12, 13, 14, 15, 19, 21, 24, 25, 26, 28, 29, 30, 31, 33, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 52, 53, 54, 55, 56, 58, 60, 62);  
UPDATE target_article_entity SET flag = 0 WHERE mention IN ("human", "patient", "People", "Fig", "purchased", "control group", "extracts"); 
UPDATE target_article_entity SET flag = 0 WHERE category_id = 16 AND mention IN ("China", "Brazil"); 

Convert the Dataset to the Brat Format

To correct the entities with the brat software, the two corpora need to be converted to the brat format by running BratExporter.java in the package cn.edu.bjut.brat. Two files, TARGET_ARTICLE_ID.txt and TARGET_ARTICLE_ID.ann, are then generated for each target article in the directory data/cord19.
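In brat's standoff format, each text-bound annotation in the .ann file is one line: an ID such as T1, a tab, the type followed by the start and end character offsets, a tab, and the covered text. The sketch below builds such lines for one article. The class BratAnn, the Entity holder, and the label VIRUS are hypothetical and only serve the illustration; it also assumes that no entity crosses a line break in the text.

```java
import java.util.List;

/** Hypothetical sketch: serialize entities as brat text-bound annotation lines. */
final class BratAnn {
    /** One entity: category label and half-open character span [start, end). */
    static final class Entity {
        final String category;
        final int start;
        final int end;

        Entity(String category, int start, int end) {
            this.category = category;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the content of the .ann file that accompanies the given .txt content. */
    static String toAnn(String text, List<Entity> entities) {
        StringBuilder sb = new StringBuilder();
        int id = 1;
        for (Entity e : entities) {
            sb.append('T').append(id++).append('\t')
              .append(e.category).append(' ').append(e.start).append(' ').append(e.end)
              .append('\t').append(text.substring(e.start, e.end)).append('\n');
        }
        return sb.toString();
    }
}
```

For the text "Bats carry coronaviruses." with an EUKARYOTE entity at 0-4 and a VIRUS entity at 11-24, toAnn returns the two lines "T1	EUKARYOTE 0 4	Bats" and "T2	VIRUS 11 24	coronaviruses", with tabs separating the three fields.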

To speed up the annotation procedure, the target articles are randomly divided into six groups of nearly equal size by running SharedSubsetSampler.java in the package cn.edu.bjut.brat. In addition, a shared subset of 99 articles is extracted to ensure annotation consistency.
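One simple way to obtain such a split is to shuffle the article ids with a fixed seed, take the first 99 as the shared subset, and deal the rest out round-robin. The sketch below does exactly that. It is an assumption about the procedure, not the code of SharedSubsetSampler.java: in particular, it treats the shared subset as disjoint from the six groups, which the text does not state, and the class GroupSampler and the seed parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: shared subset plus near-equal random groups. */
final class GroupSampler {
    /**
     * @return a list whose element 0 is the shared subset and whose elements 1..groups
     *         are the per-annotator groups; group sizes differ by at most one
     */
    static List<List<Integer>> split(List<Integer> ids, int sharedSize, int groups, long seed) {
        List<Integer> shuffled = new ArrayList<>(ids);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split reproducible

        List<List<Integer>> result = new ArrayList<>();
        int cut = Math.min(sharedSize, shuffled.size());
        result.add(new ArrayList<>(shuffled.subList(0, cut)));
        for (int g = 0; g < groups; g++) {
            result.add(new ArrayList<>());
        }
        // Deal the remaining articles out round-robin.
        for (int i = cut; i < shuffled.size(); i++) {
            result.get(1 + (i - cut) % groups).add(shuffled.get(i));
        }
        return result;
    }
}
```

Calling split(ids, 99, 6, seed) with the same seed always returns the same partition, which makes it possible to regenerate the groups later.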

These files can be imported into the directory data of the brat software. In our case, the entities in each target article can be checked by visiting our local brat instance.

Entity Annotation

Following guideline V1.0, six annotators annotate the entities in the shared subset of 99 articles. Then, run NestedEntityDetector.java in the package cn.edu.bjut.iaa to detect nested entities, and AnormalEntityDetector.java in the same package to detect long entities. The annotators concerned can then correct these annotation anomalies.
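The two checks can be sketched as simple span tests: an entity is nested when its span lies inside another, non-identical span, and it is long when its length exceeds a threshold. The sketch below assumes half-open character spans stored as {start, end} arrays. The class NestedSpans and the maxLength threshold are hypothetical, and the detectors named here may use different criteria.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: detect nested and overly long entity spans. */
final class NestedSpans {
    /** Returns index pairs {outer, inner} where span inner lies inside span outer. */
    static List<int[]> findNested(int[][] spans) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < spans.length; i++) {
            for (int j = 0; j < spans.length; j++) {
                if (i == j) {
                    continue;
                }
                boolean contains = spans[i][0] <= spans[j][0] && spans[j][1] <= spans[i][1];
                boolean identical = spans[i][0] == spans[j][0] && spans[i][1] == spans[j][1];
                if (contains && !identical) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    /** Returns true when the span covers more than maxLength characters. */
    static boolean isLong(int[] span, int maxLength) {
        return span[1] - span[0] > maxLength;
    }
}
```

The pairwise loop is quadratic in the number of entities per article, which should be acceptable at the scale of a single document; sorting spans by start offset would be the usual refinement if it is not.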

The agreement among the six annotators is measured by running AgreementMeasure.java in the package cn.edu.bjut.iaa.
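The page does not say which agreement measure AgreementMeasure.java computes. As one common choice for entity annotation, the sketch below computes the exact-match F1 between two annotators, F1 = 2|A ∩ B| / (|A| + |B|), where each entity is encoded as a string key. The class PairwiseF1 and the key format "start:end:category" are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: exact-match F1 between two annotators' entity sets. */
final class PairwiseF1 {
    /**
     * @param a entities of the first annotator, each encoded as e.g. "start:end:category"
     * @param b entities of the second annotator, encoded the same way
     * @return 2|A ∩ B| / (|A| + |B|), or 1.0 when both sets are empty
     */
    static double f1(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // nothing to disagree about
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 2.0 * common.size() / (a.size() + b.size());
    }
}
```

With six annotators, averaging this value over all 15 annotator pairs would give a single summary figure; whether the project does so is not stated here.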

To reduce ambiguity, the guideline was revised to version V1.5, which has 21 categories in total. Annotations made under the previous guideline can be converted to the latest guideline by running SchemaAdjustment.java in the package cn.edu.bjut.iaa.

The categories EUKARYOTE and ORGANISM are kept temporarily in this version, but entities in these categories should be reassigned to more appropriate categories. Run OldSchemaEntityDetector.java in the package cn.edu.bjut.iaa to check whether all such entities have been adjusted.

By running CommonAndDifferentEntityExtractor.java in the package cn.edu.bjut.iaa, the annotated entities are divided into two groups, common and different, in the directory data/subset. The entities shared by all annotators are saved in the former, and the remaining entities in the latter. At this point, the annotators can work together to resolve the differing entities so that the agreement among annotators reaches 100%.
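The split into common and different entities amounts to a set intersection and its complement within the union. The sketch below shows this for entities encoded as string keys; the class CommonDifferent and the key encoding are hypothetical and are not taken from CommonAndDifferentEntityExtractor.java.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: split annotations into those shared by all annotators and the rest. */
final class CommonDifferent {
    /** Entities that every annotator produced. */
    static Set<String> common(List<Set<String>> perAnnotator) {
        if (perAnnotator.isEmpty()) {
            return new LinkedHashSet<>();
        }
        Set<String> result = new LinkedHashSet<>(perAnnotator.get(0));
        for (Set<String> s : perAnnotator) {
            result.retainAll(s);
        }
        return result;
    }

    /** Entities produced by at least one annotator but not by all of them. */
    static Set<String> different(List<Set<String>> perAnnotator) {
        Set<String> union = new LinkedHashSet<>();
        for (Set<String> s : perAnnotator) {
            union.addAll(s);
        }
        union.removeAll(common(perAnnotator));
        return union;
    }
}
```

Once the annotators have resolved every entity in the different set, all annotators hold the same entity set and the exact-match agreement is 100% by construction.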

zh/notes/cord19_entity.txt · Last modified: 2023/03/14 17:27 by pzczxs