CHEMDNER-patents CEMP subtask sample text data (Version 28th March 2015)
------------------------------------------------------------------------

This directory contains the sample set text for the CHEMDNER-patents CEMP subtask.

1) chemdner_patents_sample_200.txt : Sample set

This file contains plain-text, UTF-8 encoded patent titles and abstracts in a
tab-separated format with the following three columns:

1- Patent identifier
2- Title of the patent
3- Abstract of the patent

In total, this sample set provides 200 patent records (200 titles and 200 abstracts).
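
The three-column layout described above can be read with a few lines of Python.
This is only a minimal sketch: the function name read_patents and the one-record
input are made up for illustration and are not part of the distributed data.

```python
import io

def read_patents(lines):
    """Yield (patent_id, title, abstract) from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        # Split on the first two tabs only, so the abstract stays in one piece.
        patent_id, title, abstract = line.split("\t", 2)
        yield patent_id, title, abstract

# Hypothetical record, invented for illustration only.
sample = io.StringIO("XX0000000A\tExample title\tExample abstract text.\n")
records = list(read_patents(sample))
```

For the real file, the same function could be fed a handle opened with
open("chemdner_patents_sample_200.txt", encoding="utf-8").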


2) chemdner_cemp_gold_standard_sample.tsv

For the CEMP (chemical entity mention in patents) task we distribute manually tagged patents (titles and abstracts) with
structure-associated chemical entity mentions (SACEMs). The CEMP annotations consist of tab-separated fields containing:

1- Patent identifier
2- Type of text from which the annotation was derived (T: Title, A: Abstract)
3- Start offset
4- End offset
5- Text string of the entity mention
6- Type of chemical entity mention (ABBREVIATION, FAMILY, FORMULA, IDENTIFIERS, MULTIPLE, SYSTEMATIC, TRIVIAL)

Example annotations for patent 'CA2131495C' from the sample set are shown below:

CA2131495C	A	263	267	iron	SYSTEMATIC
CA2131495C	A	72	85	oxcarbazepine	TRIVIAL
CA2131495C	T	15	28	oxcarbazepine	TRIVIAL
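
The six fields can be parsed as sketched below. In the three example rows the
end offset minus the start offset equals the length of the mention string,
which suggests (but does not prove) that the end offset is exclusive. The names
Mention and parse_annotation are illustrative assumptions.

```python
from typing import NamedTuple

class Mention(NamedTuple):
    patent_id: str
    text_type: str   # "T" for title, "A" for abstract
    start: int
    end: int
    text: str
    cem_type: str

def parse_annotation(line):
    """Parse one tab-separated annotation row into a Mention."""
    patent_id, text_type, start, end, text, cem_type = line.rstrip("\r\n").split("\t")
    return Mention(patent_id, text_type, int(start), int(end), text, cem_type)

# The three example rows shown above.
rows = [
    "CA2131495C\tA\t263\t267\tiron\tSYSTEMATIC",
    "CA2131495C\tA\t72\t85\toxcarbazepine\tTRIVIAL",
    "CA2131495C\tT\t15\t28\toxcarbazepine\tTRIVIAL",
]
mentions = [parse_annotation(row) for row in rows]

# True when every span length equals the length of its mention string.
lengths_match = all(m.end - m.start == len(m.text) for m in mentions)
```

If the exclusive-end reading holds, title[m.start:m.end] or
abstract[m.start:m.end] should reproduce the mention string.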


3) chemdner_cemp_gold_standard_sample_eval.tsv

Gold standard evaluation format to be used for assessment with the BioCreative evaluation script.

It consists of tab-separated columns containing:

1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character: the text type (T: Title, A: Abstract), the start offset and the end offset.
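
As a sketch of how the two columns relate to the annotation fields, the helpers
below build and split the offset string. The function names are hypothetical,
and the BioCreative evaluation script remains the reference for the exact format.

```python
def to_eval_line(patent_id, text_type, start, end):
    """Build one evaluation row: patent identifier, tab, TYPE:START:END."""
    return f"{patent_id}\t{text_type}:{start}:{end}"

def parse_offset(offset):
    """Split an offset string such as 'A:263:267' into (type, start, end)."""
    text_type, start, end = offset.split(":")
    return text_type, int(start), int(end)

# Values taken from the example annotation of patent CA2131495C.
line = to_eval_line("CA2131495C", "A", 263, 267)
triplet = parse_offset("A:263:267")
```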



4) CEMP task prediction format.

For the CEMP task we will only request the prediction of the chemical mention offsets, following
a setting similar to the one used for the BioCreative IV CHEMDNER task on PubMed abstracts. Given a set
of patent abstracts, the participants have to return the start and end indices corresponding to
all the chemical entities mentioned in each document.

It consists of tab-separated columns containing:

1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character: the text type (T: Title, A: Abstract), the start offset and the end offset.
3- The rank of the chemical entity returned for this document
4- A confidence score


An example illustrating the prediction format is shown below:

CA2131495C	A:263:267	1	0.89
CA2131495C	T:15:28	2	0.78
CA2131495C	A:72:85	3	0.76
CA2166003C	A:312:324	1	0.99
CA2166003C	A:193:205	2	0.99
CA2180008C	A:86:163	1	0.76
CA2180008C	A:25:59	2	0.66
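
Prediction rows of this shape could be produced as sketched below. The sketch
assumes that ranks start at 1 within each patent and follow descending
confidence, as the example above suggests; the function name and the two-decimal
score formatting are illustrative choices, not requirements stated here.

```python
from collections import defaultdict

def format_predictions(predictions):
    """Turn (patent_id, text_type, start, end, score) tuples into output rows."""
    by_patent = defaultdict(list)
    for patent_id, text_type, start, end, score in predictions:
        by_patent[patent_id].append((score, text_type, start, end))

    lines = []
    for patent_id, items in by_patent.items():
        # Highest confidence first; ties keep their input order.
        items.sort(key=lambda item: item[0], reverse=True)
        for rank, (score, text_type, start, end) in enumerate(items, start=1):
            lines.append(f"{patent_id}\t{text_type}:{start}:{end}\t{rank}\t{score:.2f}")
    return lines

# Unordered input built from the first three example rows above.
predictions = [
    ("CA2131495C", "A", 72, 85, 0.76),
    ("CA2131495C", "A", 263, 267, 0.89),
    ("CA2131495C", "T", 15, 28, 0.78),
]
output_lines = format_predictions(predictions)
```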


