The Database of Genotypes and Phenotypes (dbGaP)

The Database of Genotypes and Phenotypes (dbGaP) was developed by the National Center for Biotechnology Information (NCBI) to archive and distribute the results of studies that have examined the interaction of genotype and phenotype. It is a public repository for individual-level phenotype, exposure, genotype, and sequence data and the associations between them. Finding the studies relevant to a particular interest accurately and completely is a challenging task because the dbGaP Entrez system relies on keyword-based search. Text mining is an emerging research field that enables users to extract useful information from text documents; it draws on retrieval, classification, clustering, and machine learning techniques to organize those documents.

In this work we propose and implement a text classification algorithm (naïve Bayes) and a text clustering algorithm (K-means), trained on dbGaP study text, to identify heart, lung, and blood studies. The classifiers' performance was compared with the keyword-based search results of dbGaP. We conclude that text classifiers are a strong complement to the document retrieval system of dbGaP.

Keywords: Bioinformatics, Data Mining, Text Classification, Database of Genotypes and Phenotypes.

Introduction

1.1 dbGaP

The National Library of Medicine (NLM), part of the National Institutes of Health (NIH), has announced dbGaP, a new database designed to archive and distribute data from genome-wide association (GWA) studies. GWA studies look for associations between specific genes and observable traits, such as weight and blood pressure, or the presence or absence of a disease or condition (phenotype information). Connecting phenotype and genotype data gives information about the genes that may be involved in a disease process or c...

... middle of paper ...

...ext mining the patterns are extracted from natural language text rather than structured databases.

Text mining, data mining, and machine learning algorithms are in great demand in bioinformatics. Text mining techniques applied to bioinformatics include methods such as:

Classification: Text documents are assigned to pre-labeled classes. A learning scheme is trained on labeled training documents, and its effectiveness is then tested on separate test documents. Common algorithms include decision tree learning, naive Bayesian classification, nearest neighbor, and neural networks. This is called supervised learning.
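To make the supervised setting concrete, the following is a minimal sketch of a multinomial naive Bayes text classifier with add-one smoothing, written with the Python standard library only. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the label names "hlb" (heart, lung, and blood) and "other", and the training sentences are all hypothetical and are not real dbGaP study descriptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(labeled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior for the given text."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = word_counts[label][token] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training snippets (assumption: NOT real dbGaP study text).
training_docs = [
    ("cohort study of coronary heart disease and blood pressure", "hlb"),
    ("genome wide association study of asthma and lung function", "hlb"),
    ("blood lipid levels and risk of myocardial infarction", "hlb"),
    ("genetic risk factors for age related macular degeneration", "other"),
    ("case control study of schizophrenia and bipolar disorder", "other"),
    ("susceptibility loci for type 1 diabetes in families", "other"),
]

model = train_naive_bayes(training_docs)
predicted_heart = classify("association study of heart failure and blood pressure", model)
predicted_eye = classify("genetic study of macular degeneration", model)
```

With this toy training set, the first query is assigned to "hlb" and the second to "other". A realistic evaluation would hold out a separate set of labeled test documents, as the paragraph above describes.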

Clustering: This is an unsupervised learning method. The text documents are unlabeled, and inherent patterns in the text are revealed through cluster formation. Clustering can also be used as a preliminary step for other text mining methods.
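As an illustration of cluster formation on unlabeled text, the sketch below runs K-means over simple word-count vectors using only the Python standard library. It rests on several assumptions that do not come from this work: the documents are invented examples, the farthest-first seeding is chosen only to keep the run deterministic, and all function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(docs):
    """Turn each document into a word-count vector over a shared, sorted vocabulary."""
    vocabulary = sorted({token for doc in docs for token in tokenize(doc)})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for doc in docs:
        vector = [0.0] * len(vocabulary)
        for token in tokenize(doc):
            vector[index[token]] += 1.0
        vectors.append(vector)
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial_centroids(vectors, k):
    """Farthest-first seeding (an assumption made here for determinism)."""
    centroids = [vectors[0][:]]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(farthest[:])
    return centroids


def k_means(vectors, k, max_iterations=100):
    """Alternate assignment and centroid update until the labels stop changing."""
    centroids = initial_centroids(vectors, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged: no document changed cluster
        labels = new_labels
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical unlabeled documents (assumption: NOT real dbGaP study text).
documents = [
    "heart disease and blood pressure",
    "blood pressure and heart failure",
    "macular degeneration of the eye",
    "eye disease and macular degeneration",
]

cluster_labels = k_means(vectorize(documents), k=2)
```

On these four documents the two cardiovascular sentences end up in one cluster and the two eye-related sentences in the other. The cluster numbers themselves carry no meaning; interpreting each cluster still requires inspecting its members.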
