Informatics Graduate Theses and PhD Dissertations

Permanent URI for this collection

https://hdl.handle.net/1805/303

Security of our Personal Genome
(2003-08) Smith, Gregory H.
Our personal genome, which is the map of our DNA, is our ultimate source of identity and should be given our highest concern for security. The primary approach used for securing any highly sensitive health care data such as our genome would be to guard against any personal identity information being associated with the data. The belief that nameless data records eliminate risk and would be a benefit to research is the common pretense for how we manage our health data systems. However, the incredible advances that we are seeing with computational power and more affordable and sophisticated DNA sequencing software may be creating a problem greater than the benefit that they are providing. Now we must be concerned about all data in the health care systems that could provide a link to accessible identity-free data. Old data records or samples that provide possibilities of DNA sequence matching to existing identity-free genomic data present a whole new problem. How might this change the face of health care? Will further advances in technology make it impossible for us to secure our personal health information? Solutions could lead to restricting our ability to improve health care, or they could force us to rely more heavily on ethical judgment to protect the rights of patients. The unprecedented rate of recent advances in information technologies, along with improved speed, economy and accuracy of mapping the human genome, has created serious concerns about the usage and security of this new, highly sensitive genetic data. Our knowledge of DNA has come a long way in the 50 years since James Watson and Francis Crick first presented their discovery of the double helix. The discovery timeline has been crowded in recent years, starting with the U.S. Department of Energy’s Human Genome Initiative in 1986 and culminating in completion of the Human Genome Project in 2003.
The exponential growth of genomic scientific accomplishment now forces us to assume new milestones will arrive sooner rather than later.
Web-based Email Management For Email Overload
(2005-08-08) Campiranon, Chatree
An email overload problem occurs when users try to utilize email service in a way it was not designed for. Moreover, many web-based email services provide large email storage space, and users tend to keep more unused emails. Issues that cause email overload are 1) keeping too many emails, 2) using email for conversational threads, and 3) using email as a task management tool. Forty-five participants were selected to participate in user study sessions that included a questionnaire, a time-on-task study, and an interview. Participants were divided into three groups of 15. Participants in the first group were assigned as Gmail users. Participants in the second group were assigned as Yahoo! Mail users. After the user study sessions for the first two groups were finished, the results were analyzed and a new web-based email prototype was designed as a suggestion of how web-based email could be developed to handle the email overload problem. Users in the third group then tested the new prototype in the same manner in which the research was conducted with the first two groups. Users in the third group were satisfied with the features and design of the new prototype. The design of the new prototype focused on four solutions to the email overload problem: 1) email categorizing, 2) email thread grouping, 3) email searching, and 4) email task management. This study illustrates how web-based email can be designed with features to handle email overload problems while keeping the interface usable for most users.
Construction of a Database of Secondary Structure Segments and Short Regions of Disorder and Analysis of Their Properties
Zang, Yizhi; Dunker, Keith
Prediction of the secondary structure of a protein from its amino acid sequence remains an important task. Not only has the growth of databases holding only protein sequences outpaced that of solved protein structures, but successful predictions can provide a starting point for direct tertiary structure modeling [1],[2], and they can also significantly improve sequence analysis and sequence-structure threading [3],[4] for aiding in structure and function determination. Previous works on predicting secondary structures of proteins have yielded best accuracies ranging from 63% to 71% [5]. These numbers, however, should be taken with caution, since the performance of a method based on one training set may vary when it is trained on a different training set. In order to improve predictions of secondary structure, there are three challenges. The first challenge is establishing an appropriate database. The next challenge is to represent the protein sequence appropriately. The third challenge is finding an appropriate method of classification. Thus, two of the three challenges are related to an appropriate database and characteristic features. Here, we report the development of a database of non-identical segments of secondary structure elements and fragments with missing electron densities (disordered fragments) extracted from the Protein Data Bank and categorized into groups of equal lengths, from 6 to 40. The number of residues corresponding to the above-mentioned categories is: 219,788 for α-helices, 82,070 for β-sheets, 179,388 for coils, and 74,724 for disorder. The total number of fragments in the database is 49,544, of which 17,794 are α-helices, 10,216 β-sheets, 16,318 coils, and 5,216 disordered regions. Across the whole range of lengths, α-helices were found to be enriched in L, A, E, I, and R, β-sheets were enriched in V, I, F, Y, and L, coils were enriched in P, G, N, D, and S, while disordered regions were enriched in S, G, P, H, and D.
In addition to the amino acid sequence, for each fragment of every structural type, we calculated the distance between the residues immediately flanking its termini. The observed distances range between 3 and 30 Å. We found that for the three secondary structure types the average distance between the bookending residues increases linearly with sequence length, while distances were more constant for disorder. For each length between 6 and 40, we compared amino acid compositions of all four structural types and found a strong compositional dependence on length only for the β-sheet fragments, while the other three types showed virtually no change with length. Using the Kullback-Leibler (KL) distance between amino acid compositions, we quantified the differences between the four categories. We found that the closest pair in terms of the KL-distance were coil and disorder (dKL = 0.06 bits), then α-helix and β-sheet (dKL = 0.14 bits), while all other pairs were almost equidistant from one another (dKL ≈ 0.25 bits). With increasing segment length we found a decreasing KL-distance between sheet and coil, sheet and disorder, and disorder and helix. Analyzing hierarchical clustering of lengths from 6 to 18 for sheet, coil, disorder, and helix, we found that the coil group had the closest proximity among lengths from 6 to 18. The next closest were helix and disorder. The sheet group showed the greatest differences among its lengths from 6 to 18. In the sheet and coil groups, fragments of length 17 had the longest distance, while fragments of length 6 had the longest distance in the disorder and helix groups.
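The KL-distance comparison of amino acid compositions described in this abstract can be illustrated with a minimal Python sketch. The abstract does not state which form of the KL distance was used, so the symmetrized average below is an assumption, as are the pseudocount and the function names `composition`, `kl_bits`, and `symmetric_kl_bits`.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(fragments, pseudocount=1.0):
    """Relative amino acid frequencies over a set of fragments.

    The pseudocount (an assumption, not from the abstract) keeps every
    frequency non-zero so the logarithms below are always defined.
    """
    counts = Counter()
    for fragment in fragments:
        counts.update(ch for ch in fragment.upper() if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def kl_bits(p, q):
    """Directed Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS)


def symmetric_kl_bits(p, q):
    """Symmetrized KL distance (assumed form): mean of both directions."""
    return 0.5 * (kl_bits(p, q) + kl_bits(q, p))


# Toy fragments for illustration only; these are not data from the thesis.
helix_like = composition(["LAEELLRRA", "AELLKKLAE"])
coil_like = composition(["PGNDSGPNG", "GSPDNGPS"])
distance = symmetric_kl_bits(helix_like, coil_like)
```

Computing this distance for every pair of structural classes gives a small distance matrix of the kind the abstract summarizes.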
Application of Data Pipelining Technology in Cheminformatics and Bioinformatics
(2002-12) Mao, Linyong; Perry, Douglas G.
Data pipelining is the processing, analysis, and mining of large volumes of data through a branching network of computational steps. A data pipelining system consists of a collection of modular computational components and a network for streaming data between them. By defining a logical path for data through a network of computational components and configuring each component accordingly, a user can create a protocol to perform virtually any desired function with data and extract knowledge from them. A set of data pipelines was constructed to explore the relationship between the biodegradability and structural properties of halogenated aliphatic compounds in a data set in which each compound has one degradation rate and nine structure-derived properties. After training, the data pipeline was able to calculate the degradation rates of new compounds with relatively high accuracy. A second set of data pipelines was generated to cluster new DNA sequences. The data pipelining technology was applied to identify a core sequence to represent a DNA cluster and to construct the 95% confidence distance interval for the cluster. The results show that 74% of the DNA sequences were correctly clustered and that there was no false clustering.
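The idea of modular components with data streamed between them can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the system used in the thesis: it covers a linear chain rather than a branching network, and the names `build_pipeline`, `drop_incomplete`, `add_log_rate`, and the record fields are hypothetical.

```python
import math
from functools import reduce


def build_pipeline(*components):
    """Chain components so each one consumes the stream the previous yields."""
    def run(records):
        return reduce(lambda stream, component: component(stream), components, records)
    return run


def drop_incomplete(records):
    """Component: pass through only records with no missing values."""
    for record in records:
        if all(value is not None for value in record.values()):
            yield record


def add_log_rate(records):
    """Component: derive a log10-transformed rate (hypothetical field names)."""
    for record in records:
        yield {**record, "log_rate": math.log10(record["rate"])}


# Made-up records for illustration; not data from the thesis.
compounds = [
    {"name": "a", "rate": 10.0},
    {"name": "b", "rate": None},
    {"name": "c", "rate": 100.0},
]
protocol = build_pipeline(drop_incomplete, add_log_rate)
processed = list(protocol(compounds))
```

Because every component is a generator, records flow through one at a time, which suits large data volumes; a new protocol is defined simply by choosing and ordering components.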
Automating Laboratory Operations by Integrating Laboratory Information Management Systems (LIMS) with Analytical Instruments and Scientific Data Management System (SDMS)
(2005-06) Zhu, Jianyong; Merchant, Mahesh
The large volume of data generated by commercial and research laboratories, along with requirements mandated by regulatory agencies, has forced companies to use laboratory information management systems (LIMS) to improve efficiencies in tracking, managing samples, and precisely reporting test results. However, most general-purpose LIMS do not provide an interface to automatically collect data from analytical instruments and store it in a database. A scientific data management system (SDMS) provides a “Print-to-Database” technology, which facilitates the entry of reports generated by instruments directly into the SDMS database as Windows enhanced metafiles, thereby minimizing data entry errors. Unfortunately, SDMS does not allow further analysis to be performed. Many LIMS vendors provide plug-ins for a single instrument, but none of them provides a general-purpose interface to extract the data from SDMS and store it in LIMS. In this project, a general-purpose middle layer named LabTechie was designed, built and tested for seamless integration between instruments, SDMS and LIMS. This project was conducted at American Institute of Technology (AIT) Laboratories, an analytical laboratory that specializes in trace chemical measurement of biological fluids. Data are generated from 20 analytical instruments, including gas chromatography/mass spectrometer (GC/MS), high performance liquid chromatography (HPLC), and liquid chromatography/mass spectrometer (LC/MS), and are currently stored in NuGenesis SDMS iv (Waters, Milford, MA). This approach can be easily expanded to include additional instruments.
Intrinsic Disorder in Transcription Factors
(2005-08) Liu, Jiangang (Al); Perumal, Narayanan B.
Reported evidence suggested that the high abundance of intrinsic disorder in eukaryotic genomes in comparison to bacteria and archaea may reflect the greater need for disorder-associated signaling and transcriptional regulation in nucleated cells. The major advantage of intrinsically disordered proteins or disordered regions is their inherent plasticity for molecular recognition, and this advantage promotes disordered proteins or disordered regions in binding their targets with high specificity and low affinity and with numerous partners. Although several well-characterized examples of intrinsically disordered proteins in transcriptional regulation have been reported and the biological functions associated with their corresponding structural properties have been examined, so far no specific systematic analysis of intrinsically disordered proteins has been reported. To test for a generalized prevalence of intrinsic disorder in transcriptional regulation, we first used the Predictor Of Natural Disorder Regions (PONDR VL-XT) to systematically analyze the intrinsic disorder in three Transcription Factor (TF) datasets (TFSPTRENR25, TFSPNR25, TFNR25) and two control sets (PDBs25 and RandomACNR25). PONDR VL-XT predicts regions of ≥30 consecutive disordered residues for 94.13%, 85.19%, 82.63%, 54.51%, and 18.64% of the proteins from TFNR25, TFSPNR25, TFSPTRENR25, RandomACNR25, and PDBs25, respectively, indicating significant abundance of intrinsic disorder in TFs as compared to the two control sets. We then used Cumulative Distribution Function (CDF) and charge-hydropathy plots to further confirm this propensity for intrinsic disorder in TFs. The amino acid composition results showed that the three TF datasets differed significantly from the two control sets. All three TF datasets were substantially depleted in order-promoting residues such as W, F, I, Y, and V, and significantly enriched in disorder-promoting residues such as Q, S, and P.
H and C were highly over-represented in TF datasets because nearly half of TFs contain several zinc-fingers and the most popular type of zinc-finger is C2H2. High occurrence of proline and glutamine in these TF datasets suggests that these residues might contribute to conformational flexibility needed during the process of binding by co-activators or repressors during transcriptional activation or repression. The data for disorder predictions on TF domains showed that the AT-hooks and basic regions of DNA Binding Domains (DBDs) were highly disordered (the overall disorder scores are 99% and 96%, respectively). The C2H2 zinc-fingers were predicted to be highly ordered; however, the longer the zinc finger linkers, the higher the predicted magnitude of disorder. Overall, the degree of disorder in TF activation regions was much higher than that in DBDs. Our studies also confirmed that the degree of disorder was significantly higher in eukaryotic TFs than in prokaryotic TFs, and the results reflected the fact that eukaryotes have a well-developed, elaborate gene transcription mechanism, and such a system is in great need of TF flexibility. Taken together, our data suggest that intrinsically disordered TFs or partially unstructured regions in TFs play key roles in transcriptional regulation, where folding coupled to binding is a common mechanism.
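The dataset-level statistic reported above (the percentage of proteins with a predicted region of ≥30 consecutive disordered residues) can be sketched as follows. The sketch assumes per-residue disorder scores are already available as lists of floats and that a residue counts as disordered at a score of 0.5 or above; that cutoff and the function names are assumptions, and no PONDR interface is being reproduced here.

```python
def longest_disordered_run(scores, threshold=0.5):
    """Length of the longest run of consecutive residues scored as disordered."""
    longest = current = 0
    for score in scores:
        current = current + 1 if score >= threshold else 0
        longest = max(longest, current)
    return longest


def percent_with_long_disorder(proteins, min_len=30, threshold=0.5):
    """Percentage of proteins containing a disordered run of at least min_len.

    `proteins` maps a protein identifier to its per-residue score list.
    """
    if not proteins:
        return 0.0
    hits = sum(
        1
        for scores in proteins.values()
        if longest_disordered_run(scores, threshold) >= min_len
    )
    return 100.0 * hits / len(proteins)


# Made-up score profiles for illustration; not predictions from the thesis.
toy_dataset = {
    "tf1": [0.9] * 35 + [0.1] * 10,
    "tf2": [0.9] * 29 + [0.1] + [0.9] * 29,
    "ctrl": [0.2] * 50,
}
percentage = percent_with_long_disorder(toy_dataset)
```

Applying the same function to each TF dataset and each control set would yield one percentage per dataset, which is the form of comparison the abstract reports.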
Bioinformatics Analysis and Annotation of Microtubule Binding and Associated Proteins (MAPs) - Creating a Database of MAPs
(2005-08) Shenoy, Narmada; Guenther, Brian
Microtubules have many roles in the cytoskeletal infrastructure. This infrastructure underlies vital processes of cellular life such as motility, division, morphology, and intracellular organization and transport. These different roles are carried out by the creation of different microtubule (MT) systems (such as basal bodies, centrioles, flagella, kinetochores, and mitotic spindles). The changing roles require the cytoskeleton to be both dynamic and static in nature. Guiding these processes is a network of proteins that direct cellular behavior through their ability to bind microtubules (MTs) in a spatial- and temporal-specific manner. The identification and characterization of the suite of microtubule binding and associated proteins (MAPs) involved in MT systems is important for the understanding of the biological form and function of each MT system. This research involved the analysis and annotation of four MAPs: Ensconsin in humans, Hook (homolog 3) in humans, Protein Regulator of Cytokinesis 1 (PRC1) in humans and Anaphase Spindle Elongation protein (ASE1) in yeast. A bioinformatics approach was used for the annotation and analysis. A protocol for analysis and annotation of MAPs was developed. During the process, some limitations in using bioinformatics tools and procedures were encountered. These limitations were overcome, the initial protocol was improved on and a modified protocol of analysis was developed. A database was designed and built to hold annotated information on the MAPs. We seek to disseminate this database and its functionalities as a web resource to the scientific community. It will provide an excellent forum for researchers to obtain relevant information on MT binding and associated proteins (MAPs). Infection by parasitic protozoa causes incalculable morbidity and mortality to humans and agricultural animals. In this research, we have also focused on MAPs in parasitic organisms of the Apicomplexan and Trypanosomatid genera.
The protocol for analysis incorporates steps to analyze MAPs from these organisms as well. Malaria (a potentially life-threatening disease) is caused by Plasmodium, an Apicomplexan parasite. This parasite is transmitted to people by the female Anopheles mosquito, which feeds on human blood. African Sleeping Sickness is an acute disease caused by Trypanosoma brucei that typically leads to death within weeks or months if not treated. Microtubule-associated proteins (MAPs) and their alteration of the unique microtubule (MT) systems play major roles in these organisms throughout their life cycle and are required for their pathogenic mechanisms. Each parasite contains unique MT systems that will test our annotation process as well as prepare the DB for addition of other novel MT systems, such as those contained within plants. Additionally, these single-cell organisms have a multistage life cycle that provides annotation challenges similar to those encountered when one considers multi-cellular organisms. Therefore, a researcher working on any MT system within the database will find useful information regardless of the organism that they are studying. This will leave us with a subset of MAPs from parasitic organisms in our database that are potential drug targets.
Conservation of Intrinsic Disorder in Protein Domains and Families
(2005-08) Chen, Jessica Walton; Dunker, A. Keith
Protein regions which lack a fixed structure are called ‘disordered’. These intrinsically disordered regions are not only very common in many proteins, they are also crucial to the function of many proteins, especially proteins involved in signaling and regulation. The goal of this work was to identify the prevalence, characteristics, and functions of conserved disordered regions within protein domains and families. A database was created to store the amino acid sequences of nearly one million proteins and their domain matches from the InterPro database, a resource integrating eight different protein family and domain databases. Disorder prediction was performed on these protein sequences. Regions of sequence corresponding to domains were aligned using a multiple sequence alignment tool. From this initial information, regions of conserved predicted disorder were found within the domains. The methodology for this search consisted of finding regions of consecutive positions in the multiple sequence alignments in which 90% or more of the sequences were predicted to be disordered. This procedure was constrained to find such regions of conserved disorder prediction that were at least 20 amino acids in length. The results of this work were 3,653 regions of conserved disorder prediction, found within 2,898 distinct InterPro entries. Most regions of conserved predicted disorder detected were short, with less than 10% of those found exceeding 30 residues in length. Regions of conserved disorder prediction were found in protein domains from all available InterPro member databases, although with varying frequency. Regions of conserved disorder prediction were found in proteins from all kingdoms of life, including viruses. However, domains found in eukaryotes and viruses contained a higher proportion of long regions of conserved disorder than did domains found in bacteria and archaea.
In both this work and previous work, eukaryotes had on the order of ten times more proteins containing long disordered regions than did archaea and bacteria. Sequence conservation in regions of conserved disorder varied, but was on average slightly lower than in regions of conserved order. Both this work and previous work indicate that in some cases, disordered regions evolve faster, in others they evolve slower, and in the rest they evolve at roughly the same rate. A variety of functions were found to be associated with domains containing conserved disorder. The most common were DNA/RNA binding, and protein binding. Many ribosomal protein families also were found to contain conserved disordered regions. Other functions identified included membrane translocation and amino acid storage for germination. Due to limitations of current knowledge as well as the methodology used for this work, it was not determined whether or not these functions were directly associated with the predicted disordered region. However, the functions associated with conserved disorder in this work are in agreement with the functions found in other studies to correlate to disordered regions. This work has shown that intrinsic disorder may be more common in bacterial and archaeal proteins than previously thought, but this disorder is likely to be used for different purposes than in eukaryotic proteins, as well as occurring in shorter stretches of protein. Regions of predicted disorder were found to be conserved within a large number of protein families and domains. Although many think of such conserved domains as being ordered, in fact a significant number of them contain regions of disorder that are likely to be crucial to their function.
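The search for conserved predicted disorder described in this abstract (consecutive alignment positions where 90% or more of the sequences are predicted disordered, spanning at least 20 positions) can be sketched in Python. The input encoding is an assumption: each aligned row is a string with `D` for predicted disorder, `O` for predicted order, and `-` for a gap. Counting gaps as not disordered is also an assumption, since the abstract does not say how gaps were handled, and the function name is hypothetical.

```python
def conserved_disorder_regions(rows, min_fraction=0.9, min_len=20):
    """Find runs of alignment columns with conserved predicted disorder.

    Returns half-open (start, end) column index pairs. A column qualifies
    when at least min_fraction of ALL rows are labeled 'D' there, so gaps
    count against conservation (an assumption).
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")

    regions = []
    start = None
    # Iterate one column past the end so a run touching the last column closes.
    for col in range(width + 1):
        conserved = (
            col < width
            and sum(row[col] == "D" for row in rows) / len(rows) >= min_fraction
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                regions.append((start, col))
            start = None
    return regions


# Toy alignment for illustration: ten identical rows with a 25-column disordered core.
toy_rows = ["OOOOO" + "D" * 25 + "OOOOO"] * 10
regions = conserved_disorder_regions(toy_rows)
```

Running this per domain alignment and collecting the returned intervals would give a list of candidate regions whose lengths can then be summarized.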
Interactive Communication Technology and Processing of Behavioral Health Change Messages
(2005-06) McCracken-Stratton, Renee Marie; McDaniel, Anna M.
Consumer processing of interactive communication technology (ICT) messages is an understudied area. It is incumbent upon the Informatics community to partner with various health content and population domain experts to design healthcare information products that increase reach, improve awareness, and meet consumer needs. This research is a secondary analysis of a larger study to develop and pilot test an interactive, multimedia computer program as an adjunct to usual clinical care in an effort to reduce smoking in low-income rural Indiana communities. The objective of this research was to measure the degree of consumer processing of health behavioral change messages delivered by ICT. The sample size for this research was 30 subjects. Degree of consumer message processing was high (mean processing score=80.5, SD=6.837). Instruments to assess the number of actionable cessation responses (ACRs) and cognitive changes were completed at the 3-month follow-up. A relationship was observed between degree of message processing and making a quit attempt (rbis=.384, p=.044). Knowledge scores improved over baseline measures (t=3.123, p=.004). These results suggest that ICT is feasible for promoting the processing of cessation messages and increasing consideration of ACRs in low-income rural Indiana populations.
Computational Mining and Survey of Simple Sequence Repeats (SSRs) in Expressed Sequence Tags (ESTs) of Dicotyledonous Plants
(2004-07) Kumpatla, Siva Prasad; Mukhopadhyay, Snehasis
DNA markers have revolutionized the field of genetics by increasing the pace of genetic analysis. Simple sequence repeats (SSRs) are repetitions of nucleotide motifs of 1 to 5 bases and are currently the markers of choice in many plant and animal genomes due to their abundant distribution in the genomes, hypervariable nature and suitability for high-throughput analysis. While SSRs, once developed, are extremely valuable, their development is time consuming, laborious and expensive. Sequences from many genomes are continuously made freely available in the public databases, and mining of these sources using computational approaches permits rapid and economical marker development. Expressed sequence tags (ESTs) are ideal candidates for mining SSRs not only because of their availability in large numbers but also due to the fact that they represent expressed genes. Large-scale SSR mining efforts in plants to date have focused on monocotyledonous plants. In this project, an efficient SSR identification tool was developed and used to mine SSRs from more than 53 dicotyledonous species. A total of 92,648 non-redundant ESTs, or 6.0% of the 1.54 million dicotyledonous ESTs investigated in this study, were found to contain SSRs. The frequency of non-redundant ESTs containing SSRs among the species investigated ranged from 2.65% to 16.82%. More than 80% of the non-redundant ESTs having SSRs contained a single SSR, while the others contained 2 or more SSRs. An extensive analysis of the occurrence and frequencies of various SSR types revealed that the A/T mononucleotide, AG/GA/CT/TC dinucleotide, AAG/AGA/GAA/CTT/TTC/TCT trinucleotide and TTTA and TTAA tetranucleotide repeats are the most abundant in dicotyledonous species. In addition, an analysis of the number of repeats across species revealed that the majority of the mononucleotide SSRs contained 15-25 repeats, while the majority of the di- and tri-nucleotide SSRs contained 5-10 repeats.
By providing valuable information on the abundance of SSRs in ESTs of a large number of dicotyledonous species, this study demonstrates the potential of a computational mining approach for rapid discovery of SSRs towards the development of markers for genetic analysis and related applications.
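A minimal Python sketch of SSR detection in a single sequence is given here for illustration. It is not the tool developed in the thesis: the minimum repeat counts in `MIN_REPEATS` are illustrative assumptions (the abstract does not give the thresholds used), and the names `is_primitive` and `find_ssrs` are hypothetical. Motifs that are themselves repeats of a shorter unit (for example `AGAG`) are skipped so that each SSR is reported under its shortest motif.

```python
# Illustrative minimum repeat counts per motif length (assumed, not from the thesis).
MIN_REPEATS = {1: 15, 2: 5, 3: 5, 4: 4, 5: 4}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for each SSR found, sorted by start."""
    seq = sequence.upper()
    hits = []
    for k, needed in sorted(min_repeats.items()):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if not set(motif) <= set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            # Extend the run while the next k bases repeat the motif.
            j = i + k
            while seq[j:j + k] == motif:
                j += k
            repeats = (j - i) // k
            if repeats >= needed:
                hits.append((i, motif, repeats))
                i = j  # continue scanning after the reported SSR
            else:
                i += 1
    hits.sort()
    return hits


# Toy sequence for illustration only.
toy_est = "CC" + "AG" * 6 + "TT" + "A" * 16 + "C"
ssrs = find_ssrs(toy_est)
```

Counting how many sequences in a collection return a non-empty list from `find_ssrs` would give the kind of per-species frequency the abstract reports.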
