CHEMDNER-patents task sample text data (Version 2nd March 2015)
------------------------------------------------------------------------

This directory contains the sample set text for the CHEMDNER-patents task.

Important note: The corresponding text annotations (see below) for each subtask will be released in mid-March (tentatively on the 15th).

- The annotation guidelines will be distributed together with the training set.

1) chemdner_patents_sample_200.txt : Sample set

This file contains plain-text, UTF-8-encoded patent abstracts in a
tab-separated format with the following three columns:

1- Patent identifier
2- Title of the patent
3- Abstract of the patent

In total, 200 abstracts are provided in this sample set.
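
The three-column layout above can be read with a few lines of Python. This is
only a minimal sketch: the function name read_patents and the two example
records (IDs, titles and abstracts) are invented for illustration and are not
taken from the sample set.

```python
def read_patents(lines):
    """Parse tab-separated lines into (patent_id, title, abstract) tuples."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate empty lines, e.g. at the end of the file
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 columns, got {len(fields)}"
            )
        records.append(tuple(fields))
    return records


# Invented records, used only to exercise the parser.
sample_lines = [
    "ID-0001\tExample title\tExample abstract mentioning aspirin.\n",
    "ID-0002\tAnother title\tAnother abstract.\n",
]
patents = read_patents(sample_lines)
for patent_id, title, abstract in patents:
    print(patent_id, len(title), len(abstract))
```

To read the real file, pass an open file handle instead of the list, for
example open("chemdner_patents_sample_200.txt", encoding="utf-8"). The result
should then contain 200 records.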



2) CEMP subtask comments:
Note that the annotations of chemical entities for the CEMP (chemical entity mention in patents, main task) subtask will have the same format as that used for the previous CHEMDNER task on PubMed abstracts. This means that the manually generated annotations of chemical entities will consist of tab-separated fields containing:

1- Article identifier (patent identifier)
2- Type of text from which the annotation was derived (T: Title, A: Abstract)
3- Start offset
4- End offset
5- Text string of the entity mention
6- Type of chemical entity mention (ABBREVIATION, FAMILY, FORMULA, IDENTIFIERS, MULTIPLE, SYSTEMATIC, TRIVIAL)
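
As a rough illustration of the six fields listed above, the following sketch
parses one annotation line and checks it against the text it refers to. It
assumes that offsets are 0-based character positions with an exclusive end
offset, which this file does not state; please verify this against the
released annotations. The names Mention, parse_mention and offsets_match, as
well as the example title and annotation line, are invented for illustration.

```python
from typing import NamedTuple

CEMP_TYPES = {
    "ABBREVIATION", "FAMILY", "FORMULA", "IDENTIFIERS",
    "MULTIPLE", "SYSTEMATIC", "TRIVIAL",
}


class Mention(NamedTuple):
    patent_id: str
    section: str       # "T" (title) or "A" (abstract)
    start: int
    end: int
    text: str
    mention_type: str


def parse_mention(line):
    """Parse one tab-separated annotation line into a Mention."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    patent_id, section, start, end, text, mention_type = fields
    if section not in ("T", "A"):
        raise ValueError(f"unknown text type: {section!r}")
    if mention_type not in CEMP_TYPES:
        raise ValueError(f"unknown mention type: {mention_type!r}")
    return Mention(patent_id, section, int(start), int(end), text, mention_type)


def offsets_match(mention, title, abstract):
    """Check the mention string against the source text.

    Assumption: 0-based start offset, exclusive end offset.
    """
    source = title if mention.section == "T" else abstract
    return source[mention.start:mention.end] == mention.text


# Invented example, not taken from the data set.
title = "Process for preparing aspirin"
abstract = "A process is described."
mention = parse_mention("ID-0001\tT\t22\t29\taspirin\tTRIVIAL\n")
print(mention.text, offsets_match(mention, title, abstract))
```

If the check fails for every line of a real annotation file, the offset
convention assumed here is probably wrong (for example 1-based or inclusive
end offsets).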

3) CPD subtask comments:
For the CPD (chemical passage detection, text classification task) subtask, we will distribute patents (titles and abstracts) manually classified into those that mention chemical entities and those that do not. The CPD annotations will consist of tab-separated fields containing:

1- Article identifier (patent identifier)
2- Manual classification (1: does contain chemical entities, 0: does not contain chemical entities)
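
The two-column CPD format above maps naturally onto a dictionary from patent
identifier to a boolean label. This is a minimal sketch under that reading;
the function name parse_cpd and the example identifiers are invented for
illustration.

```python
def parse_cpd(lines):
    """Map each patent identifier to True (1) or False (0)."""
    labels = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 fields, got {len(fields)}"
            )
        patent_id, label = fields
        if label not in ("0", "1"):
            raise ValueError(f"line {line_number}: unexpected label {label!r}")
        labels[patent_id] = label == "1"
    return labels


# Invented example lines, not taken from the data set.
cpd_labels = parse_cpd(["ID-0001\t1\n", "ID-0002\t0\n"])
with_chemicals = sorted(pid for pid, has_chem in cpd_labels.items() if has_chem)
print(with_chemicals)
```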

4) CER subtask comments:
For the CER (chemical entity relation) subtask, we will ask participants to return the
mention offsets of the biologically relevant entities co-occurring in the same patent
(either title or abstract). These bio-entities will include gene and protein mentions.
The bio-entities will be provided in a format similar to that of the chemical entity mentions, that is:

1- Article identifier (patent identifier)
2- Type of text from which the annotation was derived (T: Title, A: Abstract)
3- Start offset
4- End offset
5- Text string of the entity mention
6- Type of biological entity mention



