Teams interested in the CHEMDNER-patents task should register for track 2 of BioCreative V. Important: first register, then go to the 'Team page', complete the team information, and select the task(s) in which you intend to participate.
If you have additional questions, send an e-mail to: Martin Krallinger
This task will address the automatic extraction of chemical and biological data from medicinal chemistry patents. The identification and integration of all information contained in these patents (e.g., chemical structures, their synthesis and associated biological data) is currently a very hard task, not only for database curators but also for life sciences researchers and biomedical text mining experts. Despite the valuable characterizations of biomedically relevant entities such as chemical compounds, genes and proteins contained in patents, academic research in the area of text mining and information extraction using patent data has been minimal. Pharmaceutical patents covering chemical compounds provide information on their therapeutic applications and, in most cases, on their primary biological targets.
This would be the first time that a biomedical text mining community challenge handles noisy text data (patents), and it could result in software that helps to derive annotations from patents. The methods resulting from this task could also provide useful insights for extracting other kinds of information from patents, and could help to better understand how to detect such information in other text collections such as full-text articles or legacy reports.
This task would cover three essential steps for the identification of biomedically relevant descriptions of chemical compounds:
Participating teams do not need to send results for all three sub-tasks. They can also send results for individual sub-tasks only.
We will focus on the following patents: PCT applications with kind code A1, written in English and assigned to the International Patent Classification (IPC) code of “medicinal preparations containing organic active ingredients”. In order to deal with homogeneous/diverse documents, PCT sampling will take into account aspects such as the applicant (typically a pharma company), the filing country/patent office, the origin (country) of the applicant and the filing date. The test set collection will consist of recently published patents plus a background set in order to avoid manual correction of results. We will restrict the annotation to particular, well-defined sections of patents, with a special focus on patent abstracts. We plan to exhaustively annotate a minimum of 30,000 patent abstracts. The CHEMDNER-patents corpus will rely on a modified version of the annotation guidelines used for the BioCreative IV CHEMDNER task. These modifications are mainly intended to deal with spelling errors and spurious line breaks, as well as to incorporate guidelines for the annotation of biological targets (mainly gene products) and the therapeutic application. We plan to follow the same annotation strategy as for the CHEMDNER task in terms of annotation tools and manual annotation by domain experts, including an inter-annotator agreement study to determine the consistency of the annotations. The annotation guidelines, together with the entire CHEMDNER-patents corpus, will be publicly available after the competition.
We will use an adapted version of the BioCreative II.5 evaluation script to score the predictions. For the CPD task, the BioCreative II.5 ACT evaluation scores will be used (MCC, AUC PR, and accuracy). For the CEMP task, the exact mention evaluation strategy of the CHEMDNER task will be used, based on the balanced F-score. For the CER task, the same evaluation strategy as for the BioCreative II.5 IPT will be used (i.e., F-score). See http://www.biocreative.org/resources/biocreative-ii5/evaluation-library for more details. The evaluation software will also check that team predictions are compliant with the required submission format.
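As an informal illustration of two of the measures named above, the following minimal sketch computes an exact-match balanced F-score over mentions and the Matthews correlation coefficient (MCC) from binary classification counts. It is not the BioCreative evaluation library: the function names, the `Mention` tuple layout (document id, start offset, end offset) and the zero-denominator conventions are assumptions made here for illustration only.

```python
import math
from typing import Set, Tuple

# Hypothetical mention representation: (document id, start offset, end offset).
Mention = Tuple[str, int, int]


def exact_match_prf(gold: Set[Mention], predicted: Set[Mention]) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-score under exact matching.

    A prediction counts as a true positive only if the document id and
    both offsets are identical to those of a gold mention.
    """
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (an assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    gold = {("doc1", 0, 5), ("doc1", 10, 15), ("doc2", 3, 9)}
    pred = {("doc1", 0, 5), ("doc1", 10, 14), ("doc2", 3, 9), ("doc2", 20, 25)}
    p, r, f = exact_match_prf(gold, pred)
    print(f"precision={p:.3f} recall={r:.3f} F-score={f:.3f}")
    print(f"MCC={mcc(tp=5, fp=0, tn=5, fn=0):.3f}")
```

In this toy example the prediction `("doc1", 10, 14)` differs from the gold mention by one character at the end offset, so under exact matching it counts as both a false positive and a missed gold mention.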
At the BioCreative V Workshop, to be held in Seville (Spain) on September 9-11, 2015, there will be a session devoted to the CHEMDNER-patents task. This session will include an overview talk presenting the datasets used and the results obtained by the participating teams. A number of teams will also be invited to present their systems. We also plan to have a discussion session where teams, task organizers and domain experts will discuss the results obtained and future steps. Finally, during the poster session all teams will be able to present their participating strategies.
Participating teams will be invited to contribute to the Proceedings of the Fifth BioCreative Challenge Evaluation Workshop. A selected number of top performing teams will also be invited to contribute a system description paper to a special issue of a relevant journal in the field.
You can post questions related to the CHEMDNER task to the BioCreative mailing list. To register for the BioCreative mailing list, please visit the following page: http://biocreative.sourceforge.net/mailing.html
The CHEMDNER-BioCreative IV special issue was published in the Journal of Cheminformatics: Volume 7 Supplement 1, 'Text mining for chemistry and the CHEMDNER track'. It focused on the detection of chemical entities in PubMed abstracts. The entire supplement is available from J Cheminform.
The special issue includes an overview paper on the task, a paper on the CHEMDNER corpus and 13 selected system description papers. Top scoring teams obtained an F-score of 87.39% for the recognition of chemical entity mentions, a very competitive result already close to the human inter-annotator agreement (IAA). Some systems also showed further improvements over their original submissions.
In addition, participating teams provided a short system description paper for the BioCreative workshop proceedings; see: Proceedings of the Fourth BioCreative Challenge Evaluation Workshop vol. 2.
Additional details can be found at the BioCreative IV CHEMDNER task.