The Genesis of Bioinformatics: History, Goals, and Roadmap

Understanding how bioinformatics came into being is an important first step for beginners, since it helps them see where the field, and their own path within it, is heading.
Bioinformatics is an integrated discipline that originally arose for the utilitarian purpose of introducing order into the massive data sets produced by the new laboratory technologies of molecular biology. These tools originated with large-scale DNA sequencing, carried out with next-generation sequencing (NGS) techniques, and with the need for tools for sequence assembly and sequence annotation, i.e., the determination of the locations of protein-coding regions in DNA. A parallel development was the construction of sequence repositories. The crowning achievement has been the sequencing of the human genome and, subsequently, of many other genomes.
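To make the annotation idea concrete, here is a minimal sketch in Python of one of its simplest ingredients: scanning a DNA string for open reading frames, i.e., stretches that begin with a start codon and end with a stop codon in the same reading frame. The sequence, the length cutoff, and the single-strand restriction are illustrative simplifications, not part of any real annotation pipeline.

```python
# Minimal sketch: naive ORF finding as a toy example of sequence annotation.
# Scans one strand in all three reading frames for ATG...stop stretches.
# The sequence and the length cutoff are illustrative, not real data.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=30):
    """Return (start, end, frame) for ORFs on the forward strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # open a candidate ORF
            elif codon in STOP_CODONS and start is not None:
                if i + 3 - start >= min_len:   # keep only sufficiently long ORFs
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

if __name__ == "__main__":
    toy = "CCATGAAATTTGGGCCCAAATTTGGGTAACCATGTTTTAA"   # made-up sequence
    print(find_orfs(toy, min_len=12))
```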
Another new technology, which has started to provide a wealth of new data, is the measurement of multiple gene expression. It employs various physical media, including glass slides and nylon membranes. The idea is to expose a probe (a DNA chip) containing thousands of DNA nucleotide sequences, each uniquely identifying a gene, to a sample of coding DNA extracted from a specimen of interest. Multiple-gene-expression techniques are usually employed either to identify subsets of genes that discriminate between two or more biological conditions (supervised classification) or to identify clusters in the gene and sample spaces, which leads to a classification of both samples and genes (unsupervised classification). Analysis of gene expression data has led to new developments in computational algorithms: existing techniques originating in computer science, such as self-organizing maps and support vector machines, and in statistics, such as principal-component analysis and analysis of variance, have been adapted, and new techniques have been developed.
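To illustrate the unsupervised route on toy data, the sketch below applies principal-component analysis followed by k-means clustering to a small synthetic expression matrix (rows are samples, columns are genes). The matrix, the number of clusters, and the use of the numpy and scikit-learn packages are choices made for this example only.

```python
# Illustrative sketch only: unsupervised analysis of a gene-expression matrix.
# Rows are samples, columns are genes; the data here are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic biological conditions, 10 samples each, 200 "genes".
group_a = rng.normal(loc=0.0, scale=1.0, size=(10, 200))
group_b = rng.normal(loc=0.0, scale=1.0, size=(10, 200))
group_b[:, :20] += 3.0            # 20 genes are up-regulated in condition B
expr = np.vstack([group_a, group_b])

# Principal-component analysis: project the samples onto the top 2 components.
pcs = PCA(n_components=2).fit_transform(expr)

# Unsupervised classification: cluster the samples in PC space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pcs)
print(labels)                      # the two conditions should separate
```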
The next step in the development of the technology is proteomics, whose techniques allow measurement of the abundance and activity of thousands of protein species at once. These are usually multistep procedures. The initial phase involves the physical separation of proteins from the sample according to one or more (typically two) variables, for example molecular weight and isoelectric point. This is accomplished physically using two-dimensional gels, on which different proteins appear as individual spots. The next step is the identification of proteins sampled from different spots on the gel. This involves cleavage of the amino acid chains and the production of mass spectra using extremely precise mass spectrometry instruments. Finally, on the basis of the distribution of molecular weights of the fragmented chains, it is possible to identify known proteins or even to sequence unknown ones. Various more refined versions of the technology exist, which allow the labeling of activated proteins, of various protein subsets, and so forth.
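In highly simplified form, the identification step can be pictured as peptide mass fingerprinting: digest a candidate protein in silico, compute the masses of the resulting peptides, and count how many observed fragment masses they explain. The sketch below is a toy version of this matching; the candidate sequence, mass tolerance, and rounded residue masses are illustrative, and the proline rule of tryptic cleavage is ignored.

```python
# Toy peptide-mass-fingerprint matching (illustrative, not a real search engine).
# Monoisotopic residue masses (Da), rounded; water mass added per peptide.
RESIDUE_MASS = {
    "G": 57.0215, "A": 71.0371, "S": 87.0320, "P": 97.0528, "V": 99.0684,
    "T": 101.0477, "L": 113.0841, "N": 114.0429, "D": 115.0269, "K": 128.0949,
    "E": 129.0426, "M": 131.0405, "F": 147.0684, "R": 156.1011, "Y": 163.0633,
}
WATER = 18.0106

def tryptic_peptides(protein):
    """Simplified tryptic digest: cut after K or R (proline rule ignored)."""
    peptides, current = [], ""
    for aa in protein:
        current += aa
        if aa in "KR":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def count_matches(observed_masses, protein, tol=0.5):
    """Count observed fragment masses explained by the candidate protein."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(obs - theo) <= tol for theo in theoretical)
               for obs in observed_masses)

if __name__ == "__main__":
    candidate = "MAGTKLDFESRVPYK"          # made-up candidate sequence
    observed = [peptide_mass(p) for p in tryptic_peptides(candidate)][:2]
    print(count_matches(observed, candidate))   # -> 2
```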
The interpretation of proteomic data has led to the development of warping and deconvolution techniques. Two-dimensional protein gels are distorted with respect to the perfect Cartesian coordinates of the two variables describing each protein. To allow comparison with standards and with results obtained under other experimental conditions, it is necessary to transform the gel coordinates into Cartesian ones, a procedure known as warping. As mentioned above, once this is accomplished, a gel spot representing a protein can be analyzed using mass spectrometry. Deciphering the sequence of the polypeptide chain from mass spectra of fragments 5–10 amino acids long is accomplished using deconvolution.
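One simple way to picture the warping step (a sketch only, not the method used by any particular gel-analysis package) is to fit an affine transform from landmark spots on the distorted gel to their known reference positions by least squares, and then map every other spot through it. The landmark coordinates below are invented.

```python
# Minimal landmark-based affine warping sketch (illustrative data).
# Fit x_ref ~ A @ x_obs + b by least squares from matched landmark spots,
# then map every observed spot into the reference (Cartesian) frame.
import numpy as np

# Matched landmark spots: observed (distorted gel) vs. reference positions.
observed = np.array([[1.0, 1.2], [4.1, 0.9], [2.2, 3.8], [5.0, 4.3]])
reference = np.array([[1.0, 1.0], [4.0, 1.0], [2.0, 4.0], [5.0, 4.0]])

# Build the design matrix [x, y, 1] so the affine map is a single lstsq solve.
design = np.hstack([observed, np.ones((len(observed), 1))])
coeffs, *_ = np.linalg.lstsq(design, reference, rcond=None)  # shape (3, 2)

def warp(points):
    """Map distorted gel coordinates into the reference frame."""
    pts = np.atleast_2d(points)
    return np.hstack([pts, np.ones((len(pts), 1))]) @ coeffs

print(warp([[3.0, 2.5]]))   # a new spot, expressed in reference coordinates
```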
One of the more notable consequences of the developments in genomics and proteomics has been an explosion in the methodology of genetic and metabolic networks. As is known, the expression of genes is regulated by proteins, which are activated by cascades of reactions involving interactions with other proteins, as well as by the promotion or inhibition of the expression of other genes. The resulting feedback loops are largely unknown. They can be identified by perturbing the system in various ways and synthesizing a network on the basis of genomic and proteomic measurements made in the presence of the perturbations. A variety of network types can be used, ranging from Boolean networks (discrete automata) and their probabilistic versions to Bayesian networks and others. Although these techniques are still unsatisfactory in practice, in many cases they have allowed us to gain insight into the structure of the feedback loops, which can then be analyzed using more conventional tools, including, for example, systems of nonlinear differential equations.
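To make the network formalism concrete, the sketch below simulates a tiny synchronous Boolean network of three hypothetical genes with made-up regulatory rules, iterating until the trajectory revisits a state (an attractor). In practice such rules would be inferred from perturbation experiments rather than written by hand.

```python
# Toy synchronous Boolean network: three hypothetical genes A, B, C.
# The update rules are invented to illustrate the formalism, not real biology.

def step(state):
    """One synchronous update: every gene reads the previous state."""
    a, b, c = state["A"], state["B"], state["C"]
    return {
        "A": not c,          # C represses A
        "B": a and not c,    # A activates B unless C is present
        "C": a or b,         # either A or B activates C
    }

def run(state, max_steps=20):
    """Iterate until the trajectory revisits a state (an attractor) or max_steps."""
    seen = []
    for _ in range(max_steps):
        seen.append(tuple(state.values()))
        state = step(state)
        if tuple(state.values()) in seen:
            break
    return seen + [tuple(state.values())]

if __name__ == "__main__":
    for s in run({"A": True, "B": False, "C": False}):
        print(s)
```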
