CHEMDNER-patents CER subtask sample text data (Version 28th March 2015)
------------------------------------------------------------------------

This directory contains the sample set text for the CHEMDNER-patents CER subtask.

1) chemdner_patents_sample_200.txt : Sample set

This file contains plain-text, UTF-8-encoded patent titles and abstracts in a
tab-separated format with the following three columns:

1- Patent identifier
2- Title of the patent
3- Abstract of the patent

In total, this sample set provides 200 patent records (200 titles and 200 abstracts).
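
The three-column layout described above could be read with a few lines of Python.
This is a minimal sketch, not an official loader: the names PatentRecord,
parse_patent_lines and load_patents are hypothetical, and it assumes that each
line holds exactly one record and that the file has no header row.

```python
from typing import Dict, Iterable, NamedTuple


class PatentRecord(NamedTuple):
    patent_id: str
    title: str
    abstract: str


def parse_patent_lines(lines: Iterable[str]) -> Dict[str, PatentRecord]:
    """Map each patent identifier to its title and abstract."""
    records = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines (assumption: they carry no data)
        # Split on the first two tabs only; a line with fewer than
        # three columns raises ValueError here.
        patent_id, title, abstract = line.split("\t", 2)
        records[patent_id] = PatentRecord(patent_id, title, abstract)
    return records


def load_patents(path: str) -> Dict[str, PatentRecord]:
    """Read a UTF-8 encoded, tab-separated sample file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_patent_lines(handle)
```

For example, load_patents("chemdner_patents_sample_200.txt") would be expected
to return 200 records if the assumptions above hold.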


2) chemdner_cer_gold_standard_sample.tsv

For the CER (chemical entity relation) task we distribute manually tagged patents (titles and abstracts) with
annotations of mentions of gene and protein related objects (referred to as GPROs).

The CER annotations consist of tab-separated fields containing:

1- Patent identifier
2- Type of text from which the annotation was derived (T: Title, A: Abstract)
3- Start offset
4- End offset
5- Text string of the entity mention
6- Type of GPRO entity mention (NO CLASS, NESTED MENTIONS, IDENTIFIER, SEQUENCE, FULL NAME, ABBREVIATION, FAMILY, MULTIPLE)
7- Database identifier for type 1 GPRO mentions; for all other mentions, the tag 'GPRO_TYPE_2', if provided.

Example annotations from the sample set are shown below:

CN102657655A	A	86	106	acetylcholinesterase	TRIVIAL	P22303
CN102657655A	T	54	74	acetylcholinesterase	TRIVIAL	P22303
CN103159754A	A	316	320	CCR5	ABBREVIATION	P51681
CN103159754A	A	426	430	CCR5	ABBREVIATION	P51681
CN103492390A	A	134	140	IGF-1R	ABBREVIATION	P08069
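
In the example rows above, the end offset minus the start offset equals the
length of the mention string (for instance 320 - 316 = 4 for "CCR5"), which
suggests that the end offset is exclusive. The sketch below parses one
annotation line and checks it against the text. It assumes zero-based offsets,
which the rows alone cannot confirm; the names Annotation, parse_annotation and
span_matches are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    patent_id: str
    section: str        # "T" (title) or "A" (abstract)
    start: int
    end: int
    mention: str
    mention_class: str
    identifier: str


def parse_annotation(line: str) -> Annotation:
    """Parse one tab-separated annotation line into its seven fields."""
    fields = line.rstrip("\r\n").split("\t")
    # Pad to seven fields in case the last column is absent.
    fields += [""] * (7 - len(fields))
    pid, section, start, end, mention, mention_class, identifier = fields[:7]
    return Annotation(pid, section, int(start), int(end),
                      mention, mention_class, identifier)


def span_matches(ann: Annotation, title: str, abstract: str) -> bool:
    """Check that the offsets select the mention string.

    Assumes zero-based start and exclusive end offsets.
    """
    text = title if ann.section == "T" else abstract
    return text[ann.start:ann.end] == ann.mention
```

Running span_matches over every annotation and the matching title or abstract
is a quick way to find out whether the offset assumption holds for the data.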


The definition of the GPRO entity mentions annotated for the CHEMDNER-patents task was primarily concerned with capturing those types of mentions that are of practical relevance, both for end users of the extracted data and for named entity recognition systems. Therefore the covered GPRO entities had to be annotated at a sufficient level of granularity to determine whether the labeled mention can or cannot be linked to a specific gene or gene product (represented by an entry in a biological annotation database). The annotation carried out for the CHEMDNER GPRO task was exhaustive for the types of GPRO mentions that were previously specified. This implies that mentions of other entities, such as chemicals or substances, should not be labeled as GPROs.
 
We distinguish two types of GPRO entity mentions:
 
(1) GPRO entity mention type 1: covering those GPRO mentions that can be normalized to a bio-entity database record. GPRO type 1 includes the following classes: NESTED MENTIONS, IDENTIFIER, FULL NAME and ABBREVIATION.
 
(2) GPRO entity mention type 2: covering those GPRO mentions that in principle cannot be normalized to a unique bio-entity database record. GPRO type 2 includes the following classes: NO CLASS, SEQUENCE, FAMILY and MULTIPLE.
 
Additional details will be provided in the annotation guidelines that will be distributed together with the training set.

Important: For the CER task we will only use GPRO entity mentions of type 1 for evaluation purposes. 
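
Since only type 1 mentions are evaluated, it may help to filter the annotation
rows before training or scoring. The following is a rough sketch under stated
assumptions: a row is treated as type 2 if its class is one of the four type 2
classes listed above or if its last column carries the 'GPRO_TYPE_2' tag, and
as type 1 otherwise. The helper names is_type_1 and filter_type_1 are
hypothetical, and the last row in the usage example is invented for
illustration.

```python
# Type 2 classes as listed in this README.
TYPE_2_CLASSES = {"NO CLASS", "SEQUENCE", "FAMILY", "MULTIPLE"}


def is_type_1(mention_class: str, identifier: str) -> bool:
    """Return True unless the row is marked as a type 2 mention."""
    if identifier.strip() == "GPRO_TYPE_2":
        return False
    return mention_class.strip().upper() not in TYPE_2_CLASSES


def filter_type_1(rows):
    """Keep rows (lists of annotation fields) that look like type 1.

    Field 6 is the mention class and field 7 the identifier; a missing
    seventh field is treated as an empty identifier.
    """
    kept = []
    for row in rows:
        identifier = row[6] if len(row) > 6 else ""
        if is_type_1(row[5], identifier):
            kept.append(row)
    return kept


rows = [
    ["CN103159754A", "A", "316", "320", "CCR5", "ABBREVIATION", "P51681"],
    # Hypothetical type 2 row, not taken from the sample set.
    ["XX0000000A", "A", "10", "17", "kinases", "FAMILY", "GPRO_TYPE_2"],
]
type_1_rows = filter_type_1(rows)
```

Excluding the known type 2 classes, rather than requiring membership in the
type 1 list, keeps rows whose class label is not among the eight listed
classes; whether that is the desired behavior depends on the final guidelines.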


3) chemdner_cer_gold_standard_sample_eval.tsv

Gold standard evaluation format to be used for assessment with the BioCreative evaluation script.


It consists of tab-separated columns containing:

1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character: the text type (T: Title, A: Abstract), the start offset and the end offset.


Example:

CA2605854C	A:550:583
CA2605854C	A:585:589
CA2804173A1	A:245:252
CA2804173A1	A:41:47
CA2804173A1	T:12:18
CN101843632A	A:171:178
CN101843632A	A:698:705
CN101843632A	T:15:22
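
Annotation rows can be converted into this two-column evaluation format with a
small helper. This is a sketch only: to_eval_lines is a hypothetical name, and
the removal of duplicates and the lexicographic sorting are assumptions (the
sorting happens to reproduce the order of the example above, but nothing here
states that the evaluation script requires it).

```python
def to_eval_lines(mentions):
    """Build evaluation lines from (patent_id, section, start, end) tuples.

    Each line has the form '<patent_id>\t<section>:<start>:<end>'.
    Duplicate mentions are collapsed and the lines are sorted as strings.
    """
    lines = {
        f"{patent_id}\t{section}:{start}:{end}"
        for patent_id, section, start, end in mentions
    }
    return sorted(lines)


mentions = [
    ("CA2804173A1", "T", 12, 18),
    ("CA2804173A1", "A", 41, 47),
    ("CA2804173A1", "A", 245, 252),
    ("CA2804173A1", "A", 41, 47),   # repeated on purpose
]
eval_lines = to_eval_lines(mentions)
```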


4) CER task prediction format.

For the CER task we will only request the prediction of the GPRO mention offsets, following
a setting similar to that of the BioCreative IV CHEMDNER task on PubMed abstracts. Given a set
of patent abstracts, the participants have to return the start and end indices corresponding to
all the GPRO type 1 entities mentioned in each document.

It consists of tab-separated columns containing:

1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character: the text type (T: Title, A: Abstract), the start offset and the end offset.
3- The rank of the entity mention returned for this document
4- A confidence score


An example illustrating the prediction format is shown below:


CA2755954A1	A:139:172	1	0.97
CA2804173A1	A:245:252	1	0.90
CA2804173A1	A:41:47	2	0.80
CA2804173A1	T:12:18	3	0.70
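
One way to produce lines in this four-column format is sketched below. It
assumes that ranks restart at 1 for each patent and follow descending
confidence, as the example above suggests, and that two decimal places are
acceptable for the score; neither is stated as a requirement here. The function
name format_predictions is hypothetical.

```python
from collections import defaultdict


def format_predictions(predictions):
    """Turn (patent_id, section, start, end, score) tuples into output lines.

    Mentions are ranked per patent by descending confidence score,
    starting at rank 1 for each patent.
    """
    by_patent = defaultdict(list)
    for patent_id, section, start, end, score in predictions:
        by_patent[patent_id].append((score, section, start, end))

    lines = []
    for patent_id in sorted(by_patent):
        # sorted() is stable, so ties keep their input order.
        ranked = sorted(by_patent[patent_id], key=lambda item: -item[0])
        for rank, (score, section, start, end) in enumerate(ranked, start=1):
            lines.append(
                f"{patent_id}\t{section}:{start}:{end}\t{rank}\t{score:.2f}"
            )
    return lines


predictions = [
    ("CA2804173A1", "T", 12, 18, 0.70),
    ("CA2755954A1", "A", 139, 172, 0.97),
    ("CA2804173A1", "A", 41, 47, 0.80),
    ("CA2804173A1", "A", 245, 252, 0.90),
]
prediction_lines = format_predictions(predictions)
```

With the input above, the resulting lines match the four example rows shown in
this section.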




