CALBC SILVER STANDARD CORPUS
Abstract
The CALBC initiative aims to provide a large-scale biomedical text corpus that contains semantic annotations for named entities of different kinds. Generating this corpus requires harmonizing the annotations produced by different automatic annotation systems. In the first phase, the annotation systems of five participants (EMBL-EBI, EMC Rotterdam, NLM, JULIE Lab Jena, and Linguamatics) were gathered. All annotations were delivered in a common annotation format that included concept identifiers in the boundary assignments and that enabled comparison and alignment of the results. During the harmonization phase, the results produced by the different systems were integrated into a single harmonized corpus (the "silver standard" corpus) by applying a voting scheme. We give an overview of the processed data and of the principles of harmonization: formal boundary reconciliation and semantic matching of named entities. Finally, all participant submissions were evaluated against the silver standard corpus. We found that species and disease annotations are better standardized among the partners than annotations of genes and proteins. The raw corpus is now available for additional named entity annotations, and parts of it will be made available later for a public challenge. We expect to improve corpus building activities both in the number of named entity classes covered and in the size of the corpus in terms of annotated documents.
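As an illustration of the voting idea and of evaluating a submission against the resulting silver standard, here is a minimal Python sketch. It is not the CALBC implementation: the `Annotation` record, the `harmonize` and `precision_recall` functions, the `min_votes` threshold, and the sample offsets are all hypothetical, and the sketch assumes that two annotations agree only when their boundaries and semantic type match exactly. The boundary reconciliation and semantic matching mentioned in the abstract are more permissive than this.

```python
from collections import Counter
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation record: character offsets plus a semantic type."""
    start: int
    end: int
    semantic_type: str


def harmonize(system_outputs, min_votes=2):
    """Keep every annotation proposed by at least `min_votes` systems.

    Assumption: agreement means identical boundaries and identical type.
    """
    votes = Counter()
    for annotations in system_outputs:
        # Converting to a set lets each system vote once per annotation.
        votes.update(set(annotations))
    return sorted(a for a, count in votes.items() if count >= min_votes)


def precision_recall(predicted, silver):
    """Score one system's annotations against the harmonized set."""
    predicted, silver = set(predicted), set(silver)
    true_positives = len(predicted & silver)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(silver) if silver else 0.0
    return precision, recall


# Made-up outputs of three systems for a single document.
system_a = [Annotation(0, 5, "disease"), Annotation(10, 14, "species")]
system_b = [Annotation(0, 5, "disease"), Annotation(20, 25, "gene")]
system_c = [Annotation(0, 5, "disease"), Annotation(10, 14, "species")]

silver = harmonize([system_a, system_b, system_c], min_votes=2)
precision_b, recall_b = precision_recall(system_b, silver)
```

With these made-up inputs, the gene annotation proposed only by `system_b` does not reach the threshold and is left out of `silver`, which lowers that system's precision when it is scored against the harmonized set.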