BioCreative V: Track 2-CHEMDNER-patents

Organizers

  • Martin Krallinger, Spanish National Cancer Research Centre
  • Florian Leitner, Universidad Politecnica de Madrid
  • Obdulia Rabal, Center for Applied Medical Research (CIMA), University of Navarra
  • Julen Oyarzabal, Center for Applied Medical Research (CIMA), University of Navarra
  • Alfonso Valencia, Spanish National Cancer Research Centre

CHEMDNER scientific advisory board

  • Peter Murray-Rust, Reader in Molecular Informatics, Unilever Centre, Dep. of Chemistry, University of Cambridge, UK
  • John P. Overington, EMBL-EBI, Wellcome Genome Campus, Hinxton, UK
  • Erik M. van Mulligen, Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
  • Christian Tyrchan, Computational Chemistry, AstraZeneca
  • Stephen K. Boyer, IBM Almaden Research Center
  • Markus Bundschus, Head Scientific & Business Information Services, Roche Diagnostics GmbH

Registration and participation

Teams interested in the CHEMDNER-patents task should register for track 2 of BioCreative V. Important: First you need to register, and then go to the 'Team page' and complete the team information and select the task/s in which you intend to participate.

Task contact

If you have additional questions, send an e-mail to Martin Krallinger.

Background

This task will address the automatic extraction of chemical and biological data from medicinal chemistry patents. Identifying and integrating all the information contained in these patents (e.g., chemical structures, their synthesis and associated biological data) is currently very hard, not only for database curators but also for life sciences researchers and biomedical text mining experts. Although patents contain valuable characterizations of biomedically relevant entities such as chemical compounds, genes and proteins, academic research on text mining and information extraction from patent data has been minimal. Pharmaceutical patents covering chemical compounds provide information on their therapeutic applications and, in most cases, on their primary biological targets.

This would be the first time that a biomedical text mining community challenge handles noisy text data (patents), and it could result in software that helps to derive annotations from patents. The methods resulting from this task could also provide useful insights for extracting other kinds of information from patents, and they could help to better understand how to detect such information in other text collections such as full text articles or legacy reports.

Tasks

This task would cover three essential steps for the identification of biomedically relevant descriptions of chemical compounds:

  • CEMP (chemical entity mention in patents): the actual detection of chemical entity mentions in text
  • CPD (chemical passage detection): the detection of sentences that mention chemical compounds
  • CER (chemical entity relation): the extraction of chemical compound relations; covering biologically relevant chemical relations (e.g. chemical-biological targets relations)

Participating teams do not need to send results for all three sub-tasks. They can also send results for individual sub-tasks only.
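To make the relation between the three sub-tasks concrete, the following minimal sketch represents their outputs as simple records. It is only an illustration: the class names, field names, document identifier and example sentence are all hypothetical, and this is not the official submission format, which is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A chemical entity mention (CEMP) as a character span; hypothetical layout."""
    doc_id: str
    start: int  # offset of the first character (inclusive)
    end: int    # offset after the last character (exclusive)
    text: str


@dataclass(frozen=True)
class Relation:
    """A chemical-target relation (CER); hypothetical layout."""
    doc_id: str
    chemical: str
    target: str


def sentence_mentions_chemical(sent_start, sent_end, mentions):
    """CPD-style label: True if any mention lies fully inside the sentence span."""
    return any(sent_start <= m.start and m.end <= sent_end for m in mentions)


# Invented two-sentence example text.
text = "Aspirin inhibits COX-1. The results are shown below."
first_end = text.index(".") + 1
sentences = [(0, first_end), (first_end + 1, len(text))]

mentions = [Mention("PATENT-1", 0, 7, text[0:7])]
relation = Relation("PATENT-1", "Aspirin", "COX-1")
labels = [sentence_mentions_chemical(s, e, mentions) for s, e in sentences]
```

In this sketch the sentence-level labels are derived from the mention spans, which is one simple way to see how CPD and CEMP relate; the actual annotations for each sub-task may be produced differently.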

Data

We will focus on the following patents: PCT applications with kind code A1, written in English and assigned to the International Patent Classification (IPC) code of “medicinal preparations containing organic active ingredients”. In order to deal with homogeneous/diverse documents, PCT sampling will take into account factors such as: applicant (typically a pharma company), filing country/patent office, origin (country) of the applicant and filing date. The test set collection will consist of recently published patents plus a background set in order to avoid manual correction of results. We will restrict the annotation to particular, well-defined sections of patents, with a special focus on patent abstracts. We plan to exhaustively annotate a minimum of 30,000 patent abstracts. The CHEMDNER-patents corpus will rely on a modified version of the annotation guidelines used for the BioCreative IV CHEMDNER task. These modifications are mainly intended to deal with spelling errors and spurious line breaks, and to incorporate guidelines for the annotation of biological targets (mainly gene products) and therapeutic applications. We plan to follow the same annotation strategy as for the CHEMDNER task in terms of annotation tools and manual annotation by domain experts, including an inter-annotator agreement study to determine the consistency of the annotations. The annotation guidelines together with the entire CHEMDNER-patents corpus will be publicly available after the competition.

Evaluation

We will use an adapted version of the BioCreative II.5 evaluation script to score the predictions. For the CPD task, the BioCreative II.5 ACT evaluation scores will be used (MCC, AUC PR, and accuracy). For the CEMP task, the exact mention evaluation strategy of the CHEMDNER task will be used, scored with the balanced F-score. For the CER task, the same evaluation strategy as for the BioCreative II.5 IPT will be used (i.e., F-score). See http://www.biocreative.org/resources/biocreative-ii5/evaluation-library for more details. The evaluation software will also check that team predictions are compliant with the required submission format.
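As a rough illustration of the scores named above, the sketch below computes exact-match precision, recall and balanced F-score over mention spans, and MCC and accuracy over binary sentence labels. It is not the BioCreative evaluation script: the function names and the (document id, start, end) tuple layout are assumptions made for this example, the data are invented, and AUC PR is left out because it requires ranked predictions.

```python
import math


def precision_recall_f1(gold, predicted):
    """Exact-match scores: a prediction counts only if it equals a gold item."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mcc_and_accuracy(gold_labels, pred_labels):
    """Matthews correlation coefficient and accuracy for 0/1 labels."""
    pairs = list(zip(gold_labels, pred_labels))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return mcc, accuracy


# Invented mention spans as (doc_id, start, end); hypothetical layout.
gold_mentions = {("P1", 0, 7), ("P1", 17, 22), ("P2", 5, 12)}
pred_mentions = {("P1", 0, 7), ("P1", 17, 21), ("P2", 5, 12), ("P2", 20, 25)}
precision, recall, f1 = precision_recall_f1(gold_mentions, pred_mentions)

# Invented sentence labels: 1 = sentence mentions a chemical, 0 = it does not.
mcc, accuracy = mcc_and_accuracy([1, 1, 0, 0], [1, 0, 0, 0])
```

Note that under exact matching the prediction ("P1", 17, 21) earns no credit even though it overlaps the gold span ("P1", 17, 22), so boundary errors count as both a false positive and a false negative.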

Timeline

  • January 2015: Task announcement & call for participation
  • March 2nd: Sample set patent abstract plain text release: CHEMDNER-patents sample text
  • March 27th: Sample set annotation release: CPD subtask, CEMP subtask (corrected, version 2), CER subtask
  • End April 2015: Release of training data with annotation guidelines
  • End May 2015: Release of development data
  • July 2015: Release of test data
  • August 2015: Team results returned
  • September 2015: Team systems description workshop paper due

CHEMDNER session at the BioCreative V workshop

At the BioCreative V Workshop, to be held in Seville (Spain) on September 9-11, 2015, there will be a session devoted to the CHEMDNER patents task. This session will include an overview talk presenting the datasets used and the results obtained by the participating teams. A number of teams will also be invited to present their systems. We also plan to have a discussion session where teams, task organizers and domain experts will discuss the results obtained and future steps. Finally, during the poster session all teams will be able to present their participating strategies.

CHEMDNER patents workshop proceedings and journal special issue

Participating teams will be invited to contribute to the Proceedings of the Fifth BioCreative Challenge Evaluation Workshop. A selected number of top performing teams will also be invited to contribute a system description paper to a special issue of a relevant journal in the field.

Mailing list

You can post questions related to the CHEMDNER task to the BioCreative mailing list. To register for the BioCreative mailing list, please visit the following page: http://biocreative.sourceforge.net/mailing.html

Previous CHEMDNER (BioCreative IV)

The CHEMDNER-BioCreative IV special issue was published in the Journal of Cheminformatics: Volume 7 Supplement 1, 'Text mining for chemistry and the CHEMDNER track'. It focused on the detection of chemical entities from PubMed abstracts. The entire supplement is available from the Journal of Cheminformatics.

The special issue includes an overview paper on the task, a paper on the CHEMDNER corpus and 13 selected systems description papers. Top scoring teams obtained an F-score of 87.39% for the recognition of chemical entity mentions, a very competitive result already close to the human inter-annotator agreement (IAA). Some systems also showed further improvements compared to their original submissions.

In addition, participating teams provided a short systems description paper for the BioCreative workshop proceedings; see: Proceedings of the Fourth BioCreative Challenge Evaluation Workshop vol. 2.

Additional details can be found at the BioCreative IV CHEMDNER task.

References

  1. Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzabal, and Alfonso Valencia, 2015. CHEMDNER: The Drugs and Chemical Names Extraction Challenge. Journal of Cheminformatics, Vol. 7, No. Suppl 1, pp. S1. PDF
  2. Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzabal, and Alfonso Valencia, 2013. Overview of the Chemical Compound and Drug Name Recognition (CHEMDNER) Task. In: Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzabal, and Alfonso Valencia (eds) Proceedings of the 4th BioCreative Challenge Evaluation Workshop, Vol. 2, pp. 2-33. PDF
  3. Tiago Grego, Piotr Pęzik, Francisco M. Couto, and Dietrich Rebholz-Schuhmann, 2009. Identification of chemical entities in patent documents. In: Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living, pp. 942-949, Springer Berlin Heidelberg. PDF
  4. David M. Jessop, Sam E. Adams, and Peter Murray-Rust, 2011. Mining Chemical Information from Open Patents. Journal of Cheminformatics, Vol. 3, No. 1, pp. 40. PDF
  5. Harsha Gurulingappa, Bernd Müller, Roman Klinger, Heinz-Theodor Mevissen, Martin Hofmann-Apitius, Christoph M. Friedrich, and Juliane Fluck, 2010. Prior Art Search in Chemistry Patents Based On Semantic Concepts and Co-Citation Analysis. In: Proceedings of the 19th Text REtrieval Conference (TREC). PDF
  6. Roman Klinger, Corinna Kolarik, Juliane Fluck, Martin Hofmann-Apitius, and Christoph M. Friedrich, 2008. Detection of IUPAC and IUPAC-Like Chemical Names. Bioinformatics, Vol. 24, No. 13, pp. i268-i276. PDF
  7. Peter Corbett and Ann Copestake, 2008. Cascaded Classifiers for Confidence-Based Chemical Named Entity Recognition. BMC Bioinformatics, Vol. 9, No. Suppl 11, pp. S4. PDF

Corpus

  1. Martin Krallinger, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman, Yanan Lu, Donghong Ji, Daniel M Lowe, Roger A Sayle, Riza Theresa Batista-Navarro, Rafal Rak, Torsten Huber, Tim Rocktäschel, Sergio Matos, David Campos, Buzhou Tang, Hua Xu, Tsendsuren Munkhdalai, Keun Ho Ryu, S. V Ramanan, Senthil Nathan, Slavko Zitnik, Marko Bajec, Lutz Weber, Matthias Irmer, Saber A Akhondi, Jan A Kors, Shuo Xu, Xin An, Utpal Kumar Sikdar, Asif Ekbal, Masaharu Yoshioka, Thaer M Dieb, Miji Choi, Karin Verspoor, Madian Khabsa, C. Lee Giles, Hongfang Liu, Komandur Elayavilli Ravikumar, Andre Lamurias, Francisco M Couto, Hong-Jie Dai, Richard Tzong-Han Tsai, Caglar Ata, Tolga Can, Anabel Usie, Joaquim Cruz, Isabel Segura-Bedmar, Paloma Martinez, Julen Oyarzabal, and Alfonso Valencia, 2015. The CHEMDNER Corpus of Chemicals and Drugs and its Annotation Principles. Journal of Cheminformatics, Vol. 7, No. Suppl 1, pp. S2. PDF
  2. Saber A. Akhondi, Alexander G. Klenner, Christian Tyrchan, Anil K. Manchala, Kiran Boppana, Daniel Lowe, Marc Zimmermann, Sarma A. R. P. Jagarlapudi, Roger Sayle, Jan A. Kors, and Sorel Muresan, 2014. Annotated Chemical Patent Corpus: A Gold Standard for Text Mining. PloS one, Vol. 9, No. 9, pp. e107477. PDF data
  3. Márton Kiss, Ágoston Nagy, Veronika Vincze, Attila Almási, Zoltán Alexin, and János Csirik, 2012. A Manually Annotated Corpus of Pharmaceutical Patents. Text, Speech and Dialogue. Springer Berlin Heidelberg. pp. 135–142. PDF

Domain Resource

  1. David S. Wishart, Craig Knox, An Chi Guo, Savita Shrivastava, Murtaza Hassanali, Paul Stothard, Zhan Chang, and Jennifer Woolsey, 2006. DrugBank: A Comprehensive Resource for in silico Drug Discovery and Exploration. Nucleic Acids Research, Vol. 34, No. Suppl 1, pp. D668-D672. PDF data
  2. Feng Zhu, BuCong Han, Pankaj Kumar, XiangHui Liu, XiaoHua Ma, XiaoNa Wei, Lu Huang, YangFan Guo, LianYi Han, ChanJuan Zheng, and YuZong Chen, 2010. Update of TTD: Therapeutic Target Database. Nucleic Acids Research, Vol. 38, No. Suppl 1, pp. D787-D791. PDF
  3. Kirill Degtyarenko, Paula de Matos, Marcus Ennis, Janna Hastings, Martin Zbinden, Alan McNaught, Rafael Alcantara, Michael Darsow, Mickael Guedj, and Michael Ashburner, 2007. ChEBI: A Database and Ontology for Chemical Entities of Biological Interest. Nucleic Acids Research, Vol. 36, No. Database, pp. D344-D350. PDF data
  4. Kristina M. Hettne, Rob H. Stierum, Martijn J. Schuemie, Peter J. M. Hendriksen, Bob J. A. Schijvenaars, Erik M. van Mulligen, Jos Kleinjans, and Jan A. Kors, 2009. A Dictionary to Identify Small Molecules and Drugs in Free Text. Bioinformatics, Vol. 25, No. 22, pp. 2983-2991. PDF data
  5. Drugs@FDA: a database provided by U.S. Food and Drug Administration. data

Downloads

en/news/challenges/biocreative-v/track-2-chemdner.txt · Last modified: 2017/04/20 12:27 by pzczxs