Informatics School Theses and Dissertations
Please go to "Informatics Graduate Theses and PhD Dissertations" to submit dissertations and theses for the School of Informatics and Computing, at: http://hdl.handle.net/1805/303.
Browsing Informatics School Theses and Dissertations by Issue Date
Item Application of Data Pipelining Technology in Cheminformatics and Bioinformatics (2002-12) Mao, Linyong; Perry, Douglas G.
Data pipelining is the processing, analysis, and mining of large volumes of data through a branching network of computational steps. A data pipelining system consists of a collection of modular computational components and a network for streaming data between them. By defining a logical path for data through a network of computational components and configuring each component accordingly, a user can create a protocol to perform virtually any desired function with data and extract knowledge from them. A set of data pipelines was constructed to explore the relationship between the biodegradability and structural properties of halogenated aliphatic compounds in a data set in which each compound has one degradation rate and nine structure-derived properties. After training, the data pipeline was able to calculate the degradation rates of new compounds with relatively good accuracy. A second set of data pipelines was generated to cluster new DNA sequences. The data pipelining technology was applied to identify a core sequence to represent a DNA cluster and construct the 95% confidence distance interval for the cluster. The result shows that 74% of the DNA sequences were correctly clustered and there was no false clustering.
Item Security of our Personal Genome (2003-08) Smith, Gregory H.
Our personal genome, which is the map of our DNA, is our ultimate source of identity, which should be given our highest concern for security. The primary approach used for securing any highly sensitive health care data such as our genome would be to guard against any personal identity information being associated with the data. The belief that nameless data records eliminate risk and would be a benefit to research is the common pretense for how we manage our health data systems.
However, the incredible advances that we are seeing with computational power and more affordable and sophisticated DNA sequencing software may be creating a problem greater than the benefit that they provide. Now we must be concerned about all data in the health care systems that could provide a link to accessible identity-free data. Old data records or samples that provide possibilities of DNA sequence matching to existing identity-free genomic data present a whole new problem. How might this change the face of health care? Will further advances in technology make it impossible for us to secure our personal health information? Solutions could lead to restricting our ability to improve health care, or they could force us to rely more heavily on ethical judgment to protect the rights of patients. The unprecedented rate of recent advances in information technologies, along with improved speed, economy and accuracy of mapping the human genome, has created serious concerns about the usage and security of this new highly sensitive genetic data. Our knowledge of DNA has come a long way in the 50 years since James Watson and Francis Crick first presented their discovery of the double helix. The discovery timeline has been crowded in recent years, starting with the U.S. Department of Energy’s Human Genome Initiative in 1986 and culminating in completion of the Human Genome Project in 2003. The exponential growth of genomic scientific accomplishment now forces us to assume new milestones will arrive sooner rather than later.
Item Toll Evolution: A Perspective from Regulatory Regions (2004-01) Sankula, Rajakumar; Perumal, Narayanan
Background: Toll and Toll-related proteins play an important role in antibacterial innate immunity and are widespread in insects, plants, and mammals. The completion of new genomes such as Anopheles gambiae has provided an avenue for a deeper understanding of Toll evolution.
While most evolutionary analyses are performed on protein sequences, here we present a unique phylogenetic analysis of Toll genes from the perspective of upstream regulatory regions so as to study the importance of evolutionary information inherited in such sequences. Results: In a comparative study, phylogeny on the protein products of Toll-like genes showed consistency with earlier literature except for the single point of divergence between insects and mammals. On the other hand, the phylogeny based on upstream regulatory sequences (-3000 to +10) showed a broader distinction between the plants and the rest, though the tree was not well resolved, probably due to poor alignment of these sequences. The phylogeny based on TFBs necessitated the development of a supervised statistical approach to determine their “evolutionary informativeness”. Employing the frequency of evolutionarily informative TFBs, a phylogeny was derived using pair-wise distances. It suggested a closer relationship of Anopheles to plants than to Drosophila and a significant homology among mammalian TLRs. Conclusions: A unique approach of using TFBs in studying evolution of Toll genes has been developed. Broadly, this approach showed results similar to the protein phylogeny. The inclusion of the evolutionary information from TFBs may be relevant to such analyses due to the selective pressure of conservation in upstream sequences.
Item Computational Mining and Survey of Simple Sequence Repeats (SSRs) in Expressed Sequence Tags (ESTs) of Dicotyledonous Plants (2004-07) Kumpatla, Siva Prasad; Mukhopadhyay, Snehasis
DNA markers have revolutionized the field of genetics by increasing the pace of genetic analysis. Simple sequence repeats (SSRs) are repetitions of nucleotide motifs of 1 to 5 bases and are currently the markers of choice in many plant and animal genomes due to their abundant distribution in the genomes, hypervariable nature and suitability for high-throughput analysis.
While SSRs, once developed, are extremely valuable, their development is time consuming, laborious and expensive. Sequences from many genomes are continuously made freely available in the public databases and mining of these sources using computational approaches permits rapid and economical marker development. Expressed sequence tags (ESTs) are ideal candidates for mining SSRs not only because of their availability in large numbers but also due to the fact that they represent expressed genes. Large scale SSR mining efforts in plants to date have focused on monocotyledonous plants. In this project, an efficient SSR identification tool was developed and used to mine SSRs from more than 53 dicotyledonous species. A total of 92,648 non-redundant ESTs, or 6.0% of the 1.54 million dicotyledonous ESTs investigated in this study, were found to contain SSRs. The frequency of non-redundant ESTs containing SSRs among the species investigated ranged from 2.65% to 16.82%. More than 80% of the non-redundant ESTs having SSRs contained a single SSR while others contained 2 or more SSRs. An extensive analysis of the occurrence and frequencies of various SSR types revealed that the A/T mononucleotide, AG/GA/CT/TC dinucleotide, AAG/AGA/GAA/CTT/TTC/TCT trinucleotide and TTTA and TTAA tetranucleotide repeats are the most abundant in dicotyledonous species. In addition, an analysis of the number of repeats across species revealed that the majority of the mononucleotide SSRs contained 15-25 repeats while the majority of the di- and tri-nucleotide SSRs contained 5-10 repeats.
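As a rough illustration of what computational SSR mining can look like, the sketch below scans a nucleotide sequence for perfect tandem repeats of motifs 1 to 5 bases long using regular-expression backreferences. It is not the tool developed in the thesis: the function name `find_ssrs` and the minimum repeat counts are hypothetical choices made only for this example.

```python
import re

def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    # Hypothetical thresholds (assumption): minimum copies per motif length.
    if min_repeats is None:
        min_repeats = {1: 15, 2: 5, 3: 5, 4: 4, 5: 4}
    seq = seq.upper()
    hits = []
    for motif_len, min_count in min_repeats.items():
        # ([ACGT]{n}) captures one motif; \1{k,} demands at least k further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_count - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit,
            # e.g. "AA" reported as a dinucleotide or "AGAG" as a tetranucleotide.
            is_degenerate = any(
                motif == motif[:k] * (motif_len // k)
                for k in range(1, motif_len)
                if motif_len % k == 0
            )
            if is_degenerate:
                continue
            repeat_count = len(match.group(0)) // motif_len
            hits.append((match.start(), motif, repeat_count))
    return sorted(hits)

# Toy sequence: six copies of AG followed later by a run of sixteen A's.
example = "GGC" + "AG" * 6 + "TTC" + "A" * 16 + "CG"
print(find_ssrs(example))
```

Start positions are 0-based. A production tool would also need to handle imperfect repeats, redundancy between ESTs, and ambiguous bases, none of which this sketch attempts.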
By providing valuable information on the abundance of SSRs in ESTs of a large number of dicotyledonous species, this study demonstrates the potential of the computational mining approach for rapid discovery of SSRs towards the development of markers for genetic analysis and related applications.
Item Genomics of Osteoporosis (2004-08) Krishnan, Subha; Econs, Michael J.
Osteoporosis is the most common bone disease in the United States and developed countries and a major public health threat for an estimated 44 million Americans. It is characterized by low bone mineral density and microarchitectural deterioration of bone tissue, with a consequent increase in bone fragility and susceptibility to fracture, especially of the hip, spine and wrist. Osteoporosis is a multifactorial disease influenced by a large number of environmental and genetic factors. Though a number of FDA-approved drugs are available for treating this complex disease, a medication that could specifically and effectively reverse its symptoms is lacking. As an initial step toward disease treatment, my current research focuses on locating candidate genes in linkage regions for bone mineral density (BMD) on human chromosomes, which can potentially be used for developing novel targets and strategies for therapeutic interventions. We will also define the mouse homologs in the syntenic regions as a basis for future studies involving animal models of disturbed BMD. An automated interface that reports human-mouse synteny for human marker intervals of interest was developed, which will expedite future synteny studies.
Item Electrostatic Modeling of Protein Aggregation (2004-12) Vanam, Ram; Dubin, Paul L.
Electrostatic modeling was done with Delphi of Insight II to explain and predict protein aggregation, measured here for β-lactoglobulin and insulin using turbidimetry and stopped flow spectrophotometry.
The initial rate of aggregation of β-lactoglobulin (BLG) was studied between pH 3.8 and 5.2 in 4.5 mM NaCl, and for ionic strengths from 4.5 to 500 mM NaCl at pH 5.0. The initial slope of the turbidity vs. time curve was used to define the initial rate of aggregation. The highest initial rate was observed at a pH below the pI, i.e., near pH 4.6 (pI 5.2). The decrease in aggregation rate when the pH was increased from 4.8 to 5.0 was large compared to its decrease when the pH was reduced from 4.4 to 4.2; i.e., the dependence of initial rate on pH was highly asymmetric. The initial rate of aggregation at pH 5.0 increased linearly with the reciprocal of ionic strength in the range I = 0.5 to 0.0045 M. Protein electrostatic potential distributions are used to understand the pH and ionic strength dependence of the initial rate of aggregation. Similar studies were done with insulin. In contrast to BLG, the highest initial aggregation rate for insulin was observed at pH = pI. Electrostatic computer modeling shows that these differences arise from the distinctly different surface charge distributions of insulin and BLG.
Item Automating Laboratory Operations by Integrating Laboratory Information Management Systems (LIMS) with Analytical Instruments and Scientific Data Management System (SDMS) (2005-06) Zhu, Jianyong; Merchant, Mahesh
The large volume of data generated by commercial and research laboratories, along with requirements mandated by regulatory agencies, has forced companies to use laboratory information management systems (LIMS) to improve efficiencies in tracking, managing samples, and precisely reporting test results. However, most general purpose LIMS do not provide an interface to automatically collect data from analytical instruments to store in a database.
A scientific data management system (SDMS) provides a “Print-to-Database” technology, which facilitates the entry of reports generated by instruments directly into the SDMS database as Windows enhanced metafiles, thereby minimizing data entry errors. Unfortunately, SDMS does not allow further analysis to be performed. Many LIMS vendors provide plug-ins for a single instrument, but none of them provides a general purpose interface to extract the data from SDMS and store it in LIMS. In this project, a general purpose middle layer named LabTechie is designed, built and tested for seamless integration between instruments, SDMS and LIMS. This project was conducted at American Institute of Technology (AIT) Laboratories, an analytical laboratory that specializes in trace chemical measurement of biological fluids. Data is generated from 20 analytical instruments, including gas chromatography/mass spectrometer (GC/MS), high performance liquid chromatography (HPLC), and liquid chromatography/mass spectrometer (LC/MS), and is currently stored in NuGenesis SDMS iv (Waters, Milford, MA). This approach can be easily expanded to include additional instruments.
Item Interactive Communication Technology and Processing of Behavioral Health Change Messages (2005-06) McCracken-Stratton, Renee Marie; McDaniel, Anna M.
Consumer processing of interactive communication technology (ICT) messages is an understudied area. It is incumbent upon the Informatics community to partner with various health content and population domain experts to design healthcare information products that increase reach, improve awareness, and meet consumer needs. This research is a secondary analysis of a larger study to develop and pilot test an interactive, multimedia computer program as an adjunct to usual clinical care in an effort to reduce smoking in low-income rural Indiana communities. The objective of this research was to measure the degree of consumer processing of health behavioral change messages delivered by ICT.
The sample size for this research was 30 subjects. Degree of consumer message processing was high (mean processing score=80.5, SD=6.837). Instruments to assess the number of actionable cessation responses (ACRs) and cognitive changes were completed at the 3-month follow-up. A relationship was observed between degree of message processing and making a quit attempt (rbis=.384, p=.044). Knowledge scores improved over baseline measures (t=3.123, p=.004). These results suggest that ICT is feasible for promoting the processing of cessation messages and increasing consideration of ACRs in low-income rural Indiana populations.
Item Conservation of Intrinsic Disorder in Protein Domains and Families (2005-08) Chen, Jessica Walton; Dunker, A. Keith
Protein regions which lack a fixed structure are called ‘disordered’. These intrinsically disordered regions are not only very common in many proteins, they are also crucial to the function of many proteins, especially proteins involved in signaling and regulation. The goal of this work was to identify the prevalence, characteristics, and functions of conserved disordered regions within protein domains and families. A database was created to store the amino acid sequences of nearly one million proteins and their domain matches from the InterPro database, a resource integrating eight different protein family and domain databases. Disorder prediction was performed on these protein sequences. Regions of sequence corresponding to domains were aligned using a multiple sequence alignment tool. From this initial information, regions of conserved predicted disorder were found within the domains. The methodology for this search consisted of finding regions of consecutive positions in the multiple sequence alignments in which 90% or more of the sequences were predicted to be disordered. This procedure was constrained to find such regions of conserved disorder prediction that were at least 20 amino acids in length.
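The search just described (runs of consecutive alignment columns in which at least 90% of sequences are predicted disordered, at least 20 positions long) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: the function name, the use of 'D' to mark a predicted-disordered residue, and the choice to count gap characters as not disordered are all hypothetical.

```python
def conserved_disorder_regions(aligned_preds, min_fraction=0.9, min_length=20):
    """Find runs of alignment columns with conserved predicted disorder.

    aligned_preds: equal-length strings, one per aligned sequence, where
    'D' marks a residue predicted disordered (assumed encoding); any other
    character, including gaps, counts as not disordered (assumption).
    Returns (start, end) column pairs, 0-based, end exclusive.
    """
    if not aligned_preds:
        return []
    n_seqs = len(aligned_preds)
    n_cols = len(aligned_preds[0])
    if any(len(row) != n_cols for row in aligned_preds):
        raise ValueError("all aligned prediction strings must have equal length")

    regions = []
    run_start = None
    # Iterate one column past the end so a run touching the last column is closed.
    for col in range(n_cols + 1):
        conserved = (
            col < n_cols
            and sum(row[col] == "D" for row in aligned_preds) / n_seqs >= min_fraction
        )
        if conserved and run_start is None:
            run_start = col
        elif not conserved and run_start is not None:
            if col - run_start >= min_length:
                regions.append((run_start, col))
            run_start = None
    return regions

# Toy alignment: 9 of 10 sequences are predicted disordered in columns 5-26.
rows = ["O" * 5 + "D" * 22 + "O" * 3] * 9 + ["O" * 30]
print(conserved_disorder_regions(rows))
```

How gaps are counted changes which columns pass the 90% threshold, so that choice would need to follow whatever convention the actual analysis used.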
The results of this work were 3,653 regions of conserved disorder prediction, found within 2,898 distinct InterPro entries. Most regions of conserved predicted disorder detected were short, with less than 10% of those found exceeding 30 residues in length. Regions of conserved disorder prediction were found in protein domains from all available InterPro member databases, although with varying frequency. Regions of conserved disorder prediction were found in proteins from all kingdoms of life, including viruses. However, domains found in eukaryotes and viruses contained a higher proportion of long regions of conserved disorder than did domains found in bacteria and archaea. In both this work and previous work, eukaryotes had on the order of ten times more proteins containing long disordered regions than did archaea and bacteria. Sequence conservation in regions of conserved disorder varied, but was on average slightly lower than in regions of conserved order. Both this work and previous work indicate that in some cases, disordered regions evolve faster, in others they evolve slower, and in the rest they evolve at roughly the same rate. A variety of functions were found to be associated with domains containing conserved disorder. The most common were DNA/RNA binding, and protein binding. Many ribosomal protein families also were found to contain conserved disordered regions. Other functions identified included membrane translocation and amino acid storage for germination. Due to limitations of current knowledge as well as the methodology used for this work, it was not determined whether or not these functions were directly associated with the predicted disordered region. However, the functions associated with conserved disorder in this work are in agreement with the functions found in other studies to correlate to disordered regions. 
This work has shown that intrinsic disorder may be more common in bacterial and archaeal proteins than previously thought, but this disorder is likely to be used for different purposes than in eukaryotic proteins, as well as occurring in shorter stretches of protein. Regions of predicted disorder were found to be conserved within a large number of protein families and domains. Although many think of such conserved domains as being ordered, in fact a significant number of them contain regions of disorder that are likely to be crucial to their function.
Item Eye-Movement Brain Potentials and Family History of Alcoholism: Alcoholism, brain potentials, saccades, antisaccades (2005-08) Vitvitskiy, Victor; O'Connor, Sean