The 1000 Genomes Project is intended to be one of the largest publicly accessible databases of gene sequences in biology. The purpose of creating the database is to provide an extensive and detailed catalog of human genetic variation. The project also organizes the genomes into a structured library that can be accessed by the public. The article "The 1000 Genomes Project: data management and community access" by Clarke, Zheng-Bradley, Smith, et al. (2012) discusses the project and its details.
The 1000 Genomes Project was made possible by the development of high-throughput sequencing technologies, which allowed whole-genome sequencing to be carried out at a scale and cost that were not previously attainable. The intention of the project is to provide a browsable database, and data flow management was a primary consideration. The most complex task was collecting data from various sources, organizing it, and making it rapidly available across institutions. The project uses two mirror sites, one located in the UK and one in the US, to provide faster download times.
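The mirroring strategy can be sketched in code: a client measures how quickly each mirror responds and downloads from whichever answers first. This is only an illustration of the idea; the mirror URLs below are assumptions made for the example and are not taken from the article, which states only that one mirror is in the UK and one in the US.

```python
import time
import urllib.request

# Assumed, illustrative mirror URLs (not taken from the article).
MIRRORS = [
    "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/",       # UK mirror (EBI)
    "https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/",  # US mirror (NCBI)
]

def fastest_mirror(urls, timeout=10):
    """Return the mirror that answers a small request fastest."""
    best_url, best_time = None, float("inf")
    for url in urls:
        try:
            start = time.monotonic()
            with urllib.request.urlopen(url, timeout=timeout):
                elapsed = time.monotonic() - start
        except OSError:
            continue  # mirror unreachable from here; try the next one
        if elapsed < best_time:
            best_url, best_time = url, elapsed
    return best_url

if __name__ == "__main__":
    print("Preferred mirror:", fastest_mirror(MIRRORS))
```

In practice a user would simply pick the geographically closer mirror, but timing a small request makes the same choice automatically.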
Another consideration in the project was how to format the information so that it would be most useful to the user. Raw genome data is meaningless to the average public user. User experience with the data led to the development of an annotated version that includes analysis of the results. Users can browse and download project data, including both sequence reads and individual genotypes, for any region of the genome; sequence reads are distributed in BAM format and genotypes in VCF format.
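As a minimal sketch of what region-based access to these formats looks like, the example below uses the third-party pysam library (not mentioned in the article) to pull reads from a BAM file and variant records from an indexed VCF for a chosen region. The file names and coordinates are placeholders, not actual project files.

```python
import pysam  # third-party library: pip install pysam

# Placeholder file names; an index (.bai / .tbi) must sit beside each file.
BAM_PATH = "sample.bam"        # sequence reads for one individual
VCF_PATH = "genotypes.vcf.gz"  # genotype calls, bgzip-compressed and indexed

def reads_in_region(bam_path, contig, start, stop):
    """Count aligned reads overlapping a genomic region in a BAM file."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return sum(1 for _ in bam.fetch(contig, start, stop))

def variants_in_region(vcf_path, contig, start, stop):
    """Yield (position, reference allele, alternate alleles) from a VCF region."""
    with pysam.VariantFile(vcf_path) as vcf:
        for record in vcf.fetch(contig, start, stop):
            yield record.pos, record.ref, record.alts

if __name__ == "__main__":
    region = ("20", 1_000_000, 1_100_000)
    print("reads in region:", reads_in_region(BAM_PATH, *region))
    for pos, ref, alts in variants_in_region(VCF_PATH, *region):
        print(pos, ref, alts)
```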
Data submission and access were the main challenges faced by the 1000 Genomes Project. The project needed to make the information available as quickly as possible while presenting it in a way that was organized and useful to the end user. The streamlined archival process developed here provides an example that will benefit other large data projects. Throughout the process, protocols are in place to ensure data integrity and quality. The data management experience gained through this project demonstrates the difficulty of adopting bioinformatics systems that involve high-volume data management.
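One common protocol for assuring the integrity of large transferred files is checksum verification: the archive publishes a checksum for each file, and the user recomputes it after download. The sketch below assumes an expected MD5 value is available for comparison; it illustrates the general practice rather than the project's exact pipeline, and the file name and checksum are placeholders.

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected_md5):
    """Return True if the downloaded file matches its published checksum."""
    return md5_of(path) == expected_md5.strip().lower()

if __name__ == "__main__":
    # Placeholder file name and checksum, for illustration only.
    print(verify_download("downloaded_data.bam",
                          "d41d8cd98f00b204e9800998ecf8427e"))
```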
This article focused on the challenges encountered during the development of the 1000 Genomes Project and the solutions to those problems in terms of data management, access, and storage. The project claims that the information it provides is intended for both professional organizations and the general public. However, this article, and the data in the project, are highly technical. It is doubtful that, in its current form, the average user will be able to access and understand the data.
The stated purpose of the 1000 Genomes Project is to help better understand diseases in the human population. The project makes it possible to study not only a single mutation but the many different variants that can occur. The data are intended to be versatile, allowing users to compare copy-number and sequence variation in order to understand them better. This technology will allow rapid improvements in our understanding of the human genome and how it varies across populations. By making the information rapidly accessible, the project saves both the time and money involved in gene sequencing, leading to the discovery of disease mechanisms and new treatments more rapidly than in the past. Perhaps the greatest contribution of the project, and the data management system that developed from it, is that as next-generation technologies emerge, a system will already be in place for cataloging and distributing the information.