Prediction by Partial Matching for Identification of Biological Entities

dc.contributor.advisor: Mahoui, Malika
dc.contributor.author: Thirumalaiswamy Sekhar, Arvind Kumar
dc.date.accessioned: 2010-09-29T20:26:26Z
dc.date.available: 2010-09-29T20:26:26Z
dc.degree.date: 2008-05
dc.degree.discipline: School of Informatics
dc.degree.grantor: Indiana University
dc.degree.level: M.S.
dc.description.abstract: As biomedical research and advances in biotechnology generate expansive datasets, the need to process this data into information has grown with them. Specifically, recognizing and extracting the “key” phrases that make up named entities in this body of information promises many applications for scientists. Constructing interaction maps and identifying proteins as drug targets are two important examples. Because we can choose how to define what is “useful”, text mining can potentially be applied to this purpose. In a novel approach to this challenge, we apply information theory and text compression to the task. Prediction by partial matching is an adaptive text encoding scheme that blends together a set of finite-context Markov models to predict the probability of the next token in a given symbol stream. We observe that named entities such as gene names, protein names, gene functions, and protein-protein interactions all follow symbol statistics distinctly different from those of ordinary scientific text. By using well-defined training sets that allow us to differentiate selectively between named entities and the remaining symbols, we were able to extract the entities with good accuracy. We ran our tests with the Text Mining Toolkit on the identification of gene functions and protein-protein interactions, obtaining F-scores (based on precision and recall) of 0.9737 and 0.6865, respectively. Given these results, we foresee the application of such an approach to automated information retrieval in biology. [en]
dc.identifier.uri: https://hdl.handle.net/1805/2266
dc.identifier.uri: http://dx.doi.org/10.7912/C2/864
dc.language.iso: en_US [en]
dc.subject: Biological Entities [en]
dc.subject: Identification [en]
dc.subject: Partial Matching [en]
dc.subject: Prediction [en]
dc.title: Prediction by Partial Matching for Identification of Biological Entities [en]
dc.type: Thesis [en]
Files
Original bundle
Name: SekharThesis.pdf
Size: 709.74 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.96 KB
Format: Item-specific license agreed upon to submission