where * is a padding symbol that indicates the beginning of a sentence and STOP is a special HMM state indicating the end of a sentence. The task is to implement this probabilistic model and a decoder for finding the most likely tag sequence for new sentences.
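As a rough illustration of how the padding symbol and the STOP state frame a tag sequence, the sketch below pads a sentence's tags and enumerates the resulting tag n-grams. The function name `padded_ngrams` and the default order n = 3 (a trigram model) are assumptions made for this example and are not fixed by the description above.

```python
from typing import Iterator, List, Tuple

START = "*"      # padding symbol marking the beginning of a sentence
STOP = "STOP"    # special state marking the end of a sentence


def padded_ngrams(tags: List[str], n: int = 3) -> Iterator[Tuple[str, ...]]:
    """Yield tag n-grams over a sentence padded with n-1 START symbols and one STOP.

    The order n is an assumption for illustration (3 = trigram model).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    padded = [START] * (n - 1) + list(tags) + [STOP]
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])


# Tags of a three-word sentence, framed by the padding and STOP symbols.
trigrams = list(padded_ngrams(["O", "I-GENE", "O"]))
for gram in trigrams:
    print(gram)
```

Counting such n-grams over the training data is one common way to estimate transition probabilities; the same helper works for other orders by changing `n`.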

A labeled training dataset (gene.train), labeled and unlabeled versions of the development set (gene.key and gene.dev, respectively), and an unlabeled test set (gene.test) are provided. The labeled files contain one word per line, with the word and its tag separated by a space; a single blank line separates sentences, e.g.,

Comparison O
with O
alkaline I-GENE
phosphatases I-GENE
and O
5 I-GENE
– I-GENE
nucleotidase I-GENE

Pharmacologic O
aspects O
of O
neonatal O
hyperbilirubinemia O
. O
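The labeled file format can be read with a few lines of code. The following is a minimal sketch, assuming the files follow the format described above exactly; the name `read_labeled` and the `Sentence` alias are hypothetical helpers introduced for this example only.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (word, tag) pairs for one sentence


def read_labeled(lines: Iterable[str]) -> Iterator[Sentence]:
    """Group 'word tag' lines into sentences, splitting on blank lines."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Split on the last space so the tag is always the final field.
        # A line without a space raises ValueError, flagging malformed input.
        word, tag = line.rsplit(" ", 1)
        sentence.append((word, tag))
    if sentence:  # the input may not end with a blank line
        yield sentence


sample = """Comparison O
with O
alkaline I-GENE

Pharmacologic O
aspects O
"""
sentences = list(read_labeled(sample.splitlines()))
print(len(sentences), "sentences read")
```

To read a real file, an open file handle can be passed in place of `sample.splitlines()`, for example `with open("gene.train", encoding="utf-8") as f: sentences = list(read_labeled(f))`.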

The following shows a graphical model representation of another sentence to be tagged:

An unlabeled test dataset is also provided. It contains only the words of each sentence and will be used to evaluate the performance of the developed model.

There are 13796 sentences in the training dataset, whereas the dev dataset contains 509 sentences.

The task consists of identifying gene names within biological text. In this dataset there is one type of entity: gene (GENE). The dataset is adapted from the BioCreAtIvE II shared task (http://biocreative.sourceforge.net/biocreative_2.html).

The steps to follow are: