According to figure 7.2 which of the following is not a common trait of a global strategy

Learning outcomes

When you have read Chapter 7, you should be able to:

  • Describe the strengths and weaknesses of the computational and experimental methods used to analyze genome sequences

  • Describe the basis of open reading frame (ORF) scanning, and explain why this approach is not always successful in locating genes in eukaryotic genomes

  • Outline the various experimental methods used to identify parts of a genome sequence that specify RNA molecules

  • Define the term ‘homology’ and explain why homology is important in computer-based studies of gene function

  • Evaluate the limitations of homology analysis, using the yeast genome project as an example

  • Describe the methods used to inactivate individual genes in yeast and mammals, and explain how inactivation can lead to identification of the function of a gene

  • Give outline descriptions of techniques that can be used to obtain more detailed information on the activity of a protein coded by an unknown gene

  • Describe how the transcriptome and proteome are studied

  • Explain how protein interaction maps are constructed and indicate the key features of the yeast map

  • Evaluate the potential and achievements of comparative genomics as a means of understanding a genome sequence

A genome sequence is not an end in itself. A major challenge still has to be met in understanding what the genome contains and how the genome functions. The former is addressed by a combination of computer analysis and experimentation, with the primary aim of locating the genes and their control regions. The first part of this chapter is devoted to these methods. The second question - understanding how the genome functions - is, to a certain extent, merely a different way of stating the objectives of molecular biology over the last 30 years. The difference is that in the past attention has been directed at the expression pathways for individual genes, with groups of genes being considered only when the expression of one gene is linked to that of another. Now the question has become more general and relates to the expression of the genome as a whole. The techniques used to address this topic will be covered in the latter parts of this chapter.

7.1. Locating the Genes in a Genome Sequence

Once a DNA sequence has been obtained, whether it is the sequence of a single cloned fragment or of an entire chromosome, then various methods can be employed to locate the genes that are present. These methods can be divided into those that involve simply inspecting the sequence, by eye or more frequently by computer, to look for the special sequence features associated with genes, and those methods that locate genes by experimental analysis of the DNA sequence. The computer methods form part of the methodology called bioinformatics, and it is with these that we begin.

7.1.1. Gene location by sequence inspection

Sequence inspection can be used to locate genes because genes are not random series of nucleotides but instead have distinctive features. These features determine whether a sequence is a gene or not, and so by definition are not possessed by non-coding DNA. At present we do not fully understand the nature of these specific features, and sequence inspection is not a foolproof way of locating genes, but it is still a powerful tool and is usually the first method that is applied to analysis of a new genome sequence.

The coding regions of genes are open reading frames

Genes that code for proteins comprise open reading frames (ORFs) consisting of a series of codons that specify the amino acid sequence of the protein that the gene codes for (see Figure 1.17). The ORF begins with an initiation codon - usually (but not always) ATG - and ends with a termination codon: TAA, TAG or TGA (Section 3.3.2). Searching a DNA sequence for ORFs that begin with an ATG and end with a termination triplet is therefore one way of looking for genes. The analysis is complicated by the fact that each DNA sequence has six reading frames, three in one direction and three in the reverse direction on the complementary strand (Figure 7.1), but computers are quite capable of scanning all six reading frames for ORFs. How effective is this as a means of gene location?

Figure 7.1

A double-stranded DNA molecule has six reading frames. Both strands are read in the 5′→3′ direction. Each strand has three reading frames, depending on which nucleotide is chosen as the starting position.
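The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a real gene finder: the function names and the `min_codons` parameter are assumptions introduced here, and the sketch only recognizes ORFs that start with ATG and end with TAA, TAG or TGA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=1):
    """Yield (strand, frame, start, end, codons) for each ATG...stop ORF.

    start and end are 0-based offsets on the strand that was read; end is
    exclusive and includes the stop codon. codons counts the ATG but not
    the stop codon. min_codons is a hypothetical length cut-off.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # first ATG seen since the previous stop codon
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    codons = (i - start) // 3
                    if codons >= min_codons:
                        yield strand, frame, start, i + 3, codons
                    start = None


# Toy sequence invented for illustration: one short ORF on the forward strand.
for orf in find_orfs("CCATGAAATTTGGGTAACC", min_codons=3):
    print(orf)
```

Each strand is read 5′→3′, so positions reported for the "-" strand are offsets on the reverse complement, not on the original sequence.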

The key to the success of ORF scanning is the frequency with which termination codons appear in the DNA sequence. If the DNA has a random sequence and a GC content of 50% then each of the three termination codons - TAA, TAG and TGA - will appear, on average, once every 4³ = 64 bp. If the GC content is > 50% then the termination codons, being AT-rich, will occur less frequently but one will still be expected every 100–200 bp. This means that random DNA should not show many ORFs longer than 50 codons in length, especially if the presence of a starting ATG is used as part of the definition of an ‘ORF’. Most genes, on the other hand, are longer than 50 codons: the average lengths are 317 codons for Escherichia coli, 483 codons for Saccharomyces cerevisiae, and approximately 450 codons for humans. ORF scanning, in its simplest form, therefore takes a figure of, say, 100 codons as the shortest length of a putative gene and records positive hits for all ORFs longer than this.
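The reasoning behind the length cut-off can be made concrete with a small calculation. The sketch below assumes a simple model that is not part of the text: each nucleotide is drawn independently, with G and C equally likely and A and T equally likely. Under that model, a run of 50 random triplets escapes a stop codon about 9% of the time at 50% GC, and a run of 100 triplets less than 1% of the time. The function names are hypothetical.

```python
def stop_codon_probability(gc_content):
    """Chance that a random triplet is TAA, TAG or TGA (independent-base model)."""
    g = gc_content / 2          # G and C share the GC fraction equally
    a = t = (1 - gc_content) / 2
    return t * a * a + t * a * g + t * g * a


def chance_of_open_run(n_codons, gc_content=0.5):
    """Chance that n_codons consecutive random triplets contain no stop codon."""
    return (1 - stop_codon_probability(gc_content)) ** n_codons


for n in (50, 100):
    print(n, round(chance_of_open_run(n), 4))
```

The model ignores the requirement for a starting ATG, which would make long chance ORFs rarer still.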

How well does this strategy work in practice? With bacterial genomes, simple ORF scanning is an effective way of locating most of the genes in a DNA sequence. This is illustrated by Figure 7.2, which shows a segment of the E. coli genome with all ORFs longer than 50 codons highlighted. The real genes in the sequence cannot be mistaken because they are much longer than 50 codons in length. With bacteria the analysis is further simplified by the fact that there is relatively little non-coding DNA in the genome (only 11% for E. coli, see Section 2.3.2). If we assume that the real genes do not overlap, and that there are no genes-within-genes (see Box 2.2), which are valid assumptions for most bacterial genomes, then it is only in the non-coding regions that there is a possibility of mistaking a short spurious ORF for a real gene. So if the non-coding component of a genome is small then there is a reduced chance of making mistakes in interpreting the results of a simple ORF scan.

Figure 7.2

ORF scanning is an effective way of locating genes in a bacterial genome. The diagram shows 4522 bp of the lactose operon of Escherichia coli with all ORFs longer than 50 codons marked. The sequence contains two real genes - lacZ and lacY - indicated (more...)

Simple ORF scans are less effective with higher eukaryotic DNA

Although ORF scans work well for bacterial genomes, they are less effective for locating genes in DNA sequences from higher eukaryotes. This is partly because there is substantially more space between the real genes in a eukaryotic genome (62% of the human genome is intergenic - Box 1.4), increasing the chances of finding spurious ORFs. But the main problem with the human genome and the genomes of higher eukaryotes in general is that their genes are often split by introns (Section 1.2.1), and so do not appear as continuous ORFs in the DNA sequence. Many exons are shorter than 100 codons, some fewer than 50 codons, and continuing the reading frame into an intron usually leads to a termination sequence that appears to close the ORF (Figure 7.3). In other words, the genes of a higher eukaryote do not appear in the genome sequence as long ORFs, and simple ORF scanning cannot locate them.

Figure 7.3

ORF scans are complicated by introns. The nucleotide sequence of a short gene containing a single intron is shown. The correct amino acid sequence of the protein translated from the gene is given immediately below the nucleotide sequence: in this sequence (more...)

Solving the problem posed by introns is the main challenge for bioinformaticists writing new software programs for ORF location. Three modifications to the basic procedure for ORF scanning have been adopted (Fickett, 1996):

  • Codon bias is taken into account. ‘Codon bias’ refers to the fact that not all codons are used equally frequently in the genes of a particular organism. For example, leucine is specified by six codons in the genetic code (TTA, TTG, CTT, CTC, CTA and CTG; see Figure 3.20), but in human genes leucine is most frequently coded by CTG and is only rarely specified by TTA or CTA. Similarly, of the four valine codons, human genes use GTG four times more frequently than GTA. The biological reason for codon bias is not understood, but all organisms have a bias, which is different in different species. Real exons are expected to display the codon bias whereas chance series of triplets do not. The codon bias of the organism being studied is therefore written into the ORF scanning software.

  • Exon-intron boundaries can be searched for as these have distinctive sequence features, although unfortunately the distinctiveness of these sequences is not so great as to make their location a trivial task. The sequence of the upstream, exon-intron boundary is usually described as:

    5′-AG↓GTAAGT-3′

    the arrow indicating the precise boundary point. However, only the ‘GT’ immediately after the arrow is invariable; elsewhere in the sequence nucleotides other than the ones shown are quite often found. In other words, the sequence shown is a consensus - the average of a range of variabilities. The downstream intron-exon boundary is even less well defined:

    5′-PyPyPyPyPyPyNCAG↓-3′

    where ‘Py’ means one of the pyrimidine nucleotides (T or C) and ‘N’ is any nucleotide. Simply searching for the consensus sequences will not locate more than a few exon-intron boundaries because most have sequences other than the ones shown. Writing software that takes account of the known variabilities has proven difficult (Frech et al., 1997), and at present locating exon-intron boundaries by sequence analysis is a hit-and-miss affair.

  • Upstream regulatory sequences can be used to locate the regions where genes begin. This is because these regulatory sequences, like exon-intron boundaries, have distinctive sequence features that they possess in order to carry out their role as recognition signals for the DNA-binding proteins involved in gene expression (Chapter 9). Unfortunately, as with exon-intron boundaries, the regulatory sequences are variable, more so in eukaryotes than in prokaryotes, and in eukaryotes not all genes have the same collection of regulatory sequences. Using these to locate genes is therefore problematic (Ohler and Niemann, 2001).
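Of the three modifications listed above, codon bias is the easiest to illustrate. The sketch below is one simple way such a score could be computed, not the method used by any particular program: codon usage is estimated from a set of known in-frame coding sequences, and a candidate is scored by its average log-odds against uniform usage of the 64 triplets. The function names, the add-one smoothing and the toy training sequence are all assumptions made for illustration; the training data are not real human codon frequencies.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codon_frequencies(known_genes):
    """Estimate codon usage from in-frame coding sequences."""
    counts = Counter()
    for gene in known_genes:
        gene = gene.upper()
        for i in range(0, len(gene) - 2, 3):
            counts[gene[i:i + 3]] += 1
    total = sum(counts.values())
    # Add-one smoothing so that codons never seen in training keep a small probability.
    return {codon: (counts[codon] + 1) / (total + 64) for codon in ALL_CODONS}


def codon_bias_score(candidate, freqs):
    """Average log-odds per codon relative to uniform usage (1/64)."""
    candidate = candidate.upper()
    codons = [candidate[i:i + 3] for i in range(0, len(candidate) - 2, 3)]
    if not codons:
        return 0.0
    # Triplets containing unexpected characters are treated as uninformative.
    return sum(math.log(freqs.get(c, 1 / 64) * 64) for c in codons) / len(codons)


# Toy training gene that favors CTG (Leu) and GTG (Val).
usage = codon_frequencies(["ATGCTGGTGCTGCTGGTGTAA"])
print(codon_bias_score("CTGGTGCTG", usage))  # favored codons: positive score
print(codon_bias_score("TTAGTACTA", usage))  # rare codons: negative score
```

A positive score means the candidate uses the organism's preferred codons more often than chance would predict, which is the property expected of real exons.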

These three extensions of simple ORF scanning are generally applicable to all higher eukaryotic genomes. Additional strategies are also possible with individual organisms, based on the special features of their genomes. For example, vertebrate genomes contain CpG islands upstream of many genes (Bird, 1986), these being sequences of approximately 1 kb in which the GC content is greater than the average for the genome as a whole. Some 40–50% of human genes are associated with an upstream CpG island. These sequences are distinctive and when one is located in vertebrate DNA, a strong assumption can be made that a gene begins in the region immediately downstream.
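Taking the description above at face value, a first-pass search for candidate CpG islands could slide a window of about 1 kb along the sequence and flag windows whose GC content is well above the average. The sketch below does only that. The window size, step and `margin` threshold are illustrative assumptions, and practical CpG-island finders also consider how often the CpG dinucleotide itself occurs, which this sketch ignores.

```python
def gc_fraction(seq):
    """Fraction of G and C nucleotides in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_rich_windows(seq, window=1000, step=100, margin=0.10):
    """Return (start, end, gc) for windows whose GC content exceeds the
    whole-sequence average by at least `margin` (a hypothetical threshold)."""
    average = gc_fraction(seq)
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        end = min(start + window, len(seq))
        gc = gc_fraction(seq[start:end])
        if gc >= average + margin:
            hits.append((start, end, gc))
    return hits


# Toy sequence: an AT-only background with a 1 kb GC-only block in the middle.
toy = "AT" * 1000 + "GC" * 500 + "AT" * 1000
hits = gc_rich_windows(toy)
print(hits[0], hits[-1])
```

Overlapping hits would normally be merged into a single region before looking for a gene immediately downstream.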

Homology searches give an extra dimension to sequence inspection

Most of the various software programs available for gene location can identify up to 95% of the coding regions in a eukaryotic genome, but even the best ones tend to make frequent mistakes in their positioning of the exon-intron boundaries (Reese et al., 2000). Identification of spurious ORFs as real genes is still a major problem. These limitations can be offset to a certain extent by the use of a homology search to test whether a series of triplets is a real exon or a chance sequence. In this analysis the DNA databases are searched to determine if the test sequence is identical or similar to any genes that have already been sequenced. Obviously, if the test sequence is part of a gene that has already been sequenced by someone else then an identical match will be found, but this is not the point of a homology search. Instead the intention is to determine if an entirely new sequence is similar to any known genes, because if it is then there is a chance that the test and match sequences are homologous, meaning that they represent genes that are evolutionarily related. The main use of homology searching is to assign functions to newly discovered genes, and we will therefore return to it when we deal with this aspect of genome analysis later in the chapter (Section 7.2.1). At this point, we will note simply that the technique is also central to gene location because it enables tentative exon sequences located by ORF scanning to be tested for functionality. If the tentative exon sequence gives one or more positive matches after a homology search then it is probably a real exon, but if it gives no match then its authenticity must remain in doubt until it is assessed by one or other of the experiment-based gene location techniques.

7.1.2. Experimental techniques for gene location

Most experimental methods for gene location are not based on direct examination of DNA molecules but instead rely on detection of the RNA molecules that are transcribed from genes. All genes are transcribed into RNA, and if the gene is discontinuous then the primary transcript is subsequently processed to remove the introns and link up the exons (Sections 1.2.1 and 10.1.3). Techniques that map the positions of transcribed sequences in a DNA fragment can therefore be used to locate exons and entire genes. The only problem to be kept in mind is that the transcript is usually longer than the coding part of the gene because it begins several tens of nucleotides upstream of the initiation codon and continues several tens or hundreds of nucleotides downstream of the termination codon (see Figure 1.17). Transcript analysis does not therefore give a precise definition of the start and end of the coding region of a gene, but it does tell you that a gene is present in a particular region and it can locate the exon-intron boundaries. Often this is sufficient information to enable the coding region to be delineated.

Hybridization tests can determine if a fragment contains transcribed sequences

The simplest procedures for studying transcribed sequences are based on hybridization analysis. RNA molecules can be separated by specialized forms of agarose gel electrophoresis and transferred to a nitrocellulose or nylon membrane by the process called northern blotting (see Technical Note 4.4). This differs from Southern blotting (Section 4.1.2) only in the precise conditions under which the transfer is carried out, and the fact that it was not invented by a Dr Northern and so does not have a capital ‘N’. If a northern blot of cellular RNA is probed with a labeled fragment of the genome, then RNAs transcribed from genes within that fragment will be detected (Figure 7.4). Northern hybridization is therefore, theoretically, a means of determining the number of genes present in a DNA fragment and the size of each coding region. There are two weaknesses with this approach:

Figure 7.4

Northern hybridization. An RNA extract is electrophoresed under denaturing conditions in an agarose gel (see Technical Note 4.4). After ethidium bromide staining, two bands are seen. These are the two largest rRNA molecules (Section 3.2.1) which are abundant (more...)

  • Some individual genes give rise to two or more transcripts of different lengths because some of their exons are optional and may or may not be retained in the mature RNA (Section 10.1.3). If this is the case, then a fragment that contains just one gene could detect two or more hybridizing bands in the northern blot. A similar problem can occur if the gene is a member of a multigene family (Section 2.2.1).

  • With many species, it is not practical to make an mRNA preparation from an entire organism so the extract is obtained from a single organ or tissue. Consequently any genes not expressed in that organ or tissue will not be represented in the RNA population, and so will not be detected when the RNA is probed with the DNA fragment being studied. Even if the whole organism is used, not all genes will give hybridization signals because many are expressed only at a particular developmental stage, and others are weakly expressed, meaning that their RNA products are present in amounts too low to be detected by hybridization analysis.

A second type of hybridization analysis avoids the problems with poorly expressed and tissue-specific genes by searching not for RNAs but for related sequences in the DNAs of other organisms. This approach, like homology searching, is based on the fact that homologous genes in related organisms have similar sequences, whereas the non-coding DNA is usually quite different. If a DNA fragment from one species is used to probe a Southern blot of DNAs from related species, and one or more hybridization signals are obtained, then it is likely that the probe contains one or more genes (Figure 7.5). This is called zoo-blotting.

Figure 7.5

Zoo-blotting. The objective is to determine if a fragment of human DNA hybridizes to DNAs from related species. Samples of human, chimp, cow and rabbit DNAs are therefore prepared, restricted, and electrophoresed in an agarose gel. Southern hybridization (more...)

cDNA sequencing enables genes to be mapped within DNA fragments

Northern hybridization and zoo-blotting enable the presence or absence of genes in a DNA fragment to be determined, but give no positional information relating to the location of those genes in the DNA sequence. The easiest way to obtain this information is to sequence the relevant cDNAs. A cDNA is a copy of an mRNA (see Figure 5.32) and so corresponds to the coding region of a gene, plus any leader or trailer sequences that are also transcribed. Comparing a cDNA sequence with a genomic DNA sequence therefore delineates the position of the relevant gene and reveals the exon-intron boundaries.
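The comparison of a cDNA with the genomic sequence can be sketched as follows. This is a deliberately naive, greedy approach that assumes the cDNA matches the genomic DNA exactly (no sequencing errors or polymorphisms) and that exons appear in the same order in both; real spliced-alignment tools are far more tolerant. The function name and the `seed` parameter are assumptions introduced here.

```python
def map_cdna_to_genome(cdna, genomic, seed=8):
    """Return exon blocks as (genomic_start, genomic_end), end exclusive.

    Greedy sketch: place the next `seed` bases of the cDNA in the genomic
    sequence, extend the match base by base, then repeat for whatever
    cDNA remains. Raises ValueError if a seed cannot be placed.
    """
    cdna, genomic = cdna.upper(), genomic.upper()
    exons = []
    c_pos = g_pos = 0
    while c_pos < len(cdna):
        probe = cdna[c_pos:c_pos + seed]
        g_start = genomic.find(probe, g_pos)
        if g_start == -1:
            raise ValueError(f"cDNA position {c_pos} not found in genomic sequence")
        length = len(probe)
        # Extend the exact match until the two sequences diverge.
        while (c_pos + length < len(cdna)
               and g_start + length < len(genomic)
               and cdna[c_pos + length] == genomic[g_start + length]):
            length += 1
        exons.append((g_start, g_start + length))
        c_pos += length
        g_pos = g_start + length
    return exons


# Toy gene invented for illustration: two exons separated by one intron.
exon1, intron, exon2 = "ATGGCCAAGCTGACC", "GTAAGTCCCTTTTTTTCTCTAG", "CATTGGAACGGTTAA"
genomic = "TTTT" + exon1 + intron + exon2 + "CCCC"
print(map_cdna_to_genome(exon1 + exon2, genomic))
```

The gaps between consecutive blocks are the inferred introns. When the first base of an intron happens to match the first base of the next exon, the greedy extension places the boundary one base too far, which is one reason real tools also check the splice-site signals.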

In order to obtain an individual cDNA, a cDNA library must first be prepared from all of the mRNA in the tissue being studied. Once the library has been prepared, the success of cDNA sequencing as a means of gene location depends on two factors. The first concerns the frequency of the desired cDNAs in the library. As with northern hybridization, the problem relates to the different expression levels of different genes. If the DNA fragment being studied contains one or more poorly expressed genes, then the relevant cDNAs will be rare in the library and it might be necessary to screen many clones before the desired one is identified. To get around this problem, various methods of cDNA capture or cDNA selection have been devised, in which the DNA fragment being studied is repeatedly hybridized to the pool of cDNAs in order to enrich the pool for the desired clones (Lovett, 1994). Because the cDNA pool contains so many different sequences, it is generally not possible to discard all the irrelevant clones by these repeated hybridizations, but it is possible to increase significantly the frequency of those clones that specifically hybridize to the DNA fragment. This reduces the size of the library that must subsequently be screened under stringent conditions to identify the desired clones.

A second factor that determines success or failure is the completeness of the individual cDNA molecules. Usually, cDNAs are made by copying RNA molecules into single-stranded DNA with reverse transcriptase and then converting the single-stranded DNA into double-stranded DNA with a DNA polymerase (see Figure 5.32). There is always a chance that one or other of the strand synthesis reactions will not proceed to completion, resulting in a truncated cDNA. The presence of intramolecular base pairs in the RNA can also lead to incomplete copying. Truncated cDNAs may lack some of the information needed to locate the start and end points of a gene and all its exon-intron boundaries.

Methods are available for precise mapping of the ends of transcripts

The problems with incomplete cDNAs mean that more robust methods are needed for locating the precise start and end points of gene transcripts. One possibility is a special type of PCR which uses RNA rather than DNA as the starting material. The first step in this type of PCR is to convert the RNA into cDNA with reverse transcriptase, after which the cDNA is amplified with Taq polymerase in the same way as in a normal PCR. These methods go under the collective name of reverse transcriptase PCR (RT-PCR) but the particular version that interests us at present is rapid amplification of cDNA ends (RACE; Frohman et al., 1988). In the simplest form of this method one of the primers is specific for an internal region close to the beginning of the gene being studied. This primer attaches to the mRNA for the gene and directs the first reverse-transcriptase-catalyzed stage of the process, during which a cDNA corresponding to the start of the mRNA is made (Figure 7.6). Because only a small segment of the mRNA is being copied, the expectation is that the cDNA synthesis will not terminate prematurely, so one end of the cDNA will correspond exactly with the start of the mRNA. Once the cDNA has been made, a short poly(A) tail is attached to its 3′ end. The second primer anneals to this poly(A) sequence and, during the first round of the normal PCR, converts the single-stranded cDNA into a double-stranded molecule, which is subsequently amplified as the PCR proceeds. The sequence of this amplified molecule will reveal the precise position of the start of the transcript.

Figure 7.6

RACE - rapid amplification of cDNA ends. The RNA being studied is converted into a partial cDNA by extension of a DNA primer that anneals at an internal position not too distant from the 5′ end of the molecule. The 3′ end of the cDNA is (more...)

Other methods for precise transcript mapping involve heteroduplex analysis. If the DNA region being studied is cloned as a restriction fragment in an M13 vector (Section 6.1.1) then it can be obtained as single-stranded DNA. When mixed with an appropriate RNA preparation, the transcribed sequence in the cloned DNA hybridizes with the equivalent mRNA, forming a double-stranded heteroduplex. In the example shown in Figure 7.7 the start of this mRNA lies within the cloned restriction fragment, so some of the cloned fragment participates in the heteroduplex, but the rest does not. The single-stranded regions can be digested by treatment with a single-strand-specific nuclease such as S1. The size of the heteroduplex is determined by degrading the RNA component with alkali and electrophoresing the resulting single-stranded DNA in an agarose gel. This size measurement is then used to position the start of the transcript relative to the restriction site at the end of the cloned fragment.

Figure 7.7

S1 nuclease mapping. This method of transcript mapping makes use of S1 nuclease, an enzyme that degrades single-stranded DNA or RNA polynucleotides, including single-stranded regions in predominantly double-stranded molecules, but has no effect on double-stranded (more...)

Exon-intron boundaries can also be located with precision

Heteroduplex analysis can also be used to locate exon-intron boundaries. The method is almost the same as that shown in Figure 7.7 with the exception that the cloned restriction fragment spans the exon-intron boundary being mapped rather than the start of the transcript.

A second method for finding exons in a genome sequence is called exon trapping (Church et al., 1994). This requires a special type of vector that contains a minigene consisting of two exons flanking an intron sequence, the first exon being preceded by the sequence signals needed to initiate transcription in a eukaryotic cell (Figure 7.8). To use the vector the piece of DNA to be studied is inserted into a restriction site located within the vector's intron region. The vector is then introduced into a suitable eukaryotic cell line, where it is transcribed and the RNA produced from it is spliced. The result is that any exon contained in the genomic fragment becomes attached between the upstream and downstream exons from the minigene. RT-PCR with primers annealing within the two minigene exons is now used to amplify a DNA fragment, which is sequenced. As the minigene sequence is already known, the nucleotide positions at which the inserted exon starts and ends can be determined, precisely delineating this exon.

Figure 7.8

Exon trapping. The exon-trap vector consists of two exon sequences preceded by promoter sequences - the signals required for gene expression in a eukaryotic host (Section 9.2.2). New DNA containing an unmapped exon is ligated into the vector and the recombinant (more...)
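Because the minigene sequence is known, reading the trapped exon out of the sequenced RT-PCR product is a matter of finding the two flanking minigene segments. The sketch below assumes the product contains the 3′ end of the upstream minigene exon and the 5′ start of the downstream one, both as exact matches; the function and parameter names, and the toy sequences, are hypothetical.

```python
def trapped_exon(product, upstream_tail, downstream_head):
    """Return the sequence spliced between the two minigene exons.

    upstream_tail: known 3' end of the upstream minigene exon.
    downstream_head: known 5' start of the downstream minigene exon.
    An empty string means the two minigene exons were joined directly,
    i.e. no exon was trapped.
    """
    product = product.upper()
    up, down = upstream_tail.upper(), downstream_head.upper()
    up_at = product.find(up)
    if up_at == -1:
        raise ValueError("upstream minigene exon not found in product")
    start = up_at + len(up)
    end = product.find(down, start)
    if end == -1:
        raise ValueError("downstream minigene exon not found in product")
    return product[start:end]


# Toy sequences invented for illustration.
up_tail, down_head = "GGATCCAAG", "GTTCGAGCT"
print(trapped_exon("TT" + up_tail + "ATGCCCTTTAAA" + down_head + "CC", up_tail, down_head))
```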

7.2. Determining the Functions of Individual Genes

Once a new gene has been located in a genome sequence, the question of its function has to be addressed. This is turning out to be an important area of genomics research, because completed sequencing projects have revealed that we know rather less than we thought about the content of individual genomes. E. coli and S. cerevisiae, for example, were studied intensively by conventional genetic analysis before the advent of sequencing projects, and geneticists were at one time fairly confident that most of their genes had been identified. The genome sequences revealed that in fact there are large gaps in our knowledge. Of the 4288 protein-coding genes in the E. coli genome sequence, only 1853 (43% of the total) had been previously identified (Blattner et al., 1997). For S. cerevisiae the figure was only 30% (Dujon, 1996).

As with gene location, attempts to determine the functions of unknown genes are made by computer analysis and by experimental studies.

7.2.1. Computer analysis of gene function

We have already seen that computer analysis plays an important role in locating genes in DNA sequences, and that one of the most powerful tools available for this purpose is homology searching, which locates genes by comparing the DNA sequence under study with all the other DNA sequences in the databases. The basis of homology searching is that related genes have similar sequences and so a new gene can be discovered by virtue of its similarity to an equivalent, already sequenced, gene from a different organism. Now we will look more closely at homology analysis and see how it can be used to assign a function to a new gene.

Homology reflects evolutionary relationships

Homologous genes are ones that share a common evolutionary ancestor, revealed by sequence similarities between the genes. These similarities form the data on which molecular phylogenies are based, as we will see in Chapter 16. Homologous genes fall into two categories:

  • Orthologous genes are those homologs that are present in different organisms and whose common ancestor predates the split between the species.

  • Paralogous genes are present in the same organism and are often members of a recognized multigene family (Section 2.2.1). Their common ancestor may or may not predate the species in which the genes are now found.

A pair of homologous genes do not usually have identical nucleotide sequences, because the two genes undergo different random changes by mutation, but they have similar sequences because these random changes have operated on the same starting sequence, the common ancestral gene. Homology searching makes use of these sequence similarities. The basis of the analysis is that if a newly sequenced gene turns out to be similar to a previously sequenced gene, then an evolutionary relationship can be inferred and the function of the new gene is likely to be the same as, or at least similar to, the function of the known gene.

It is important not to confuse the words homology and similarity. It is incorrect to describe a pair of related genes as ‘80% homologous’ if their sequences have 80% nucleotide identity (Figure 7.9). A pair of genes are either evolutionarily related or they are not; there are no in-between situations and it is therefore meaningless to ascribe a percentage value to homology.

Figure 7.9

Two DNA sequences with 80% sequence identity.
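Sequence identity, unlike homology, is a quantity that can be computed. A minimal sketch, assuming the two sequences have already been aligned to the same length without gaps (the function name and the toy sequences are illustrative only):

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions at which two sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences are empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


# Eight of the ten positions agree in this toy pair.
print(percent_identity("ACGTACGTAC", "ACGTACGTGG"))
```

The resulting percentage describes similarity only. Whether the two sequences are homologous is a separate yes-or-no inference about their evolutionary history.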

Homology analysis can provide information on the function of an entire gene or of segments within it

A homology search can be conducted with a DNA sequence but usually a tentative gene sequence is converted into an amino acid sequence before the search is carried out. One reason for this is that there are 20 different amino acids in proteins but only four nucleotides in DNA, so genes that are unrelated usually appear to be more different from one another when their amino acid sequences are compared (Figure 7.10). A homology search is therefore less likely to give spurious results if the amino acid sequence is used. The practicalities of homology searching are not at all daunting. Several software programs exist for this type of analysis, the most popular being BLAST (Basic Local Alignment Search Tool; Altschul et al., 1990). The analysis can be carried out simply by logging on to the web site for one of the DNA databases and entering the sequence into the online search tool.

Figure 7.10

Lack of homology between two sequences is often more apparent when comparisons are made at the amino acid level. Two nucleotide sequences are shown, with nucleotides that are identical in the two sequences given in red and non-identities given in blue. (more...)
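The effect can be reproduced with a small sketch that translates two sequences using the standard genetic code and compares identity at both levels. The two toy sequences are invented for illustration and are not those shown in the figure; the function names are assumptions.

```python
# Standard genetic code, written with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA sequence; unknown triplets become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def percent_identity(seq_a, seq_b):
    """Percentage of positions that agree in two equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(seq_a, seq_b)) / len(seq_a)


seq1, seq2 = "ATGGCTAAAGAT", "ATGGATAAGGAA"[:6] + "AACGAA"
print(percent_identity(seq1, seq2))                        # nucleotide level
print(percent_identity(translate(seq1), translate(seq2)))  # amino acid level
```

In this toy pair, nine of the twelve nucleotides agree but only one of the four amino acids does, so the difference between the sequences is much more obvious after translation.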

A positive match to a gene already in the database may give a clear indication of the function of the new gene, or the implications of the match might be more subtle. In particular, genes that have no obvious evolutionary relatedness might have short segments that are similar to one another. The explanation of this is often that, although the genes are unrelated, their proteins have similar functions and the shared sequence encodes a domain within each protein that is central to that shared function. Although the genes themselves have no common ancestor, the domains do, but with their common ancestor occurring at a very ancient time, the homologous domains having subsequently evolved not only by single nucleotide changes, but also by more complex rearrangements that have created new genes within which the domains are found (Section 15.2.1). An interesting example is provided by the tudor domain, an approximately 120-amino-acid motif which was first identified in the sequence of the Drosophila melanogaster gene called tudor (Ponting, 1997). The protein coded by the tudor gene, whose function is unknown, is made up of ten copies of the tudor domain, one after the other (Figure 7.11). A homology search using the tudor domain as the test revealed that several known proteins contain this domain. The sequences of these proteins are not highly similar to one another and there is no indication that they are true homologs, but they all possess the tudor domain. These proteins include one involved in RNA transport during Drosophila oogenesis, a human protein with a role in RNA metabolism, and others whose activities appear to involve RNA in one way or another. The homology analysis therefore suggests that the tudor sequence plays some part in the interaction between the protein and its RNA substrate. 
The information from the computer analysis is incomplete by itself, but it points the way to the types of experiment that should be done to obtain more clear-cut data on the function of the tudor domain.
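
The idea that two otherwise unrelated proteins can share one short, similar segment can be illustrated with a minimal sketch. It slides a fixed-length window over two sequences and reports the best-matching pair of windows. The sequences, the 10-residue "domain" and the function name are invented for illustration. Real homology searches use substitution matrices and allow gaps, which this sketch does not attempt.

```python
def best_shared_window(seq_a, seq_b, window=10):
    """Return (identity, start_in_a, start_in_b) for the best ungapped window pair."""
    best = (0.0, None, None)
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count positions at which the two windows carry the same residue
            matches = sum(a == b for a, b in zip(seq_a[i:i + window], seq_b[j:j + window]))
            identity = matches / window
            if identity > best[0]:
                best = (identity, i, j)
    return best

# Hypothetical sequences: different proteins that both contain the same short domain
DOMAIN = "WKVGDKCAAV"  # invented, not the real tudor domain
protein_1 = "MSTNLQERPH" + DOMAIN + "GGSLLTEE"
protein_2 = "MAPRRYIIFDQQ" + DOMAIN + "KLMNPT"

identity, pos_1, pos_2 = best_shared_window(protein_1, protein_2)
print(f"best window identity {identity:.0%} at positions {pos_1} and {pos_2}")
```

Outside the shared window the two toy sequences have little in common, which mirrors the situation described for the proteins that contain the tudor domain.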

Figure 7.11

The tudor domain. The top drawing shows the structure of the Drosophila tudor protein, which contains ten copies of the tudor domain. The domain is also found in a second Drosophila protein, homeless, and in the human A-kinase anchor protein (AKAP149), (more...)

Homology analysis in the yeast genome project

The S. cerevisiae genome project has illustrated both the potential and limitations of homology analysis as a means of assigning functions to new genes. The yeast genome contains approximately 6000 genes, 30% of which had been identified by conventional genetic analysis before the sequencing project got underway. The remaining 70% were studied by homology analysis, giving the following results (Figure 7.12; Dujon, 1996):

  • Almost another 30% of the genes in the genome could be assigned functions after homology searching of the sequence databases. About half of these were clear homologs of genes whose functions had been established previously, and about half had less striking similarities, including many where the similarities were restricted to discrete domains. For all these genes the homology analysis could be described as successful, but with various degrees of usefulness (Oliver, 1996a). For some genes the identification of a homolog enabled the function of the yeast gene to be comprehensively determined; examples included identification of yeast genes for DNA polymerase subunits. For other genes the functional assignment could only be to a broad category, such as ‘gene for a protein kinase’; in other words, the biochemical properties of the gene product could be inferred, but not the exact role of the protein in the cell. Some identifications were initially puzzling, the best example being the discovery of a yeast homolog of a bacterial gene involved in nitrogen fixation. Yeasts do not fix nitrogen so this could not be the function of the yeast gene. In this case, the discovery of the yeast homolog refocused attention on the previously characterized bacterial gene, with the subsequent realization that, although it is involved in nitrogen fixation, the primary role of the bacterial gene product is in the synthesis of metal-containing proteins, which have broad roles in all organisms, not just nitrogen-fixing ones.

  • About 10% of all the yeast genes had homologs in the databases, but the functions of these homologs were unknown. The homology analysis was therefore unable to help in assigning functions to these yeast genes. These yeast genes and their homologs are called orphan families.

  • The remaining yeast genes, about 30% of the total, had no homologs in the databases. A proportion of these (about 7% of the total) were questionable ORFs which might not be real genes, being rather short or having an unusual codon bias. The remainder looked like genes but were unique. These are called single orphans.
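
The proportions listed above can be tallied in a short sketch to show roughly how many genes fall into each category. The fractions are the rounded values quoted in the text. The 23% figure for single orphans is derived here by subtracting the questionable ORFs from the 30% with no homologs, and the gene counts are approximations based on a total of 6000.

```python
TOTAL_GENES = 6000  # "approximately 6000 genes"

# Rounded fractions taken from the text; "single orphans" is derived (0.30 - 0.07)
categories = {
    "identified before sequencing": 0.30,
    "function assigned by homology": 0.30,
    "orphan families": 0.10,
    "questionable ORFs": 0.07,
    "single orphans": 0.23,
}

def summarize(fractions, total):
    """Convert each fraction into an approximate number of genes."""
    return {name: round(frac * total) for name, frac in fractions.items()}

counts = summarize(categories, TOTAL_GENES)
for name, n in counts.items():
    print(f"{name:32s} ~{n} genes")

# Genes still lacking a functional assignment after homology analysis
unassigned = counts["orphan families"] + counts["questionable ORFs"] + counts["single orphans"]
print(f"without an assigned function: ~{unassigned} genes")
```

On these rounded figures about 40% of the yeast genes had no assigned function after the homology analysis, which is the gap that the experimental methods aim to fill.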

7.2.2. Assigning gene function by experimental analysis

It is clear that homology analysis is not a panacea that can identify the functions of all new genes. Experimental methods are therefore needed to complement and extend the results of homology studies. This is proving to be one of the biggest challenges in genomics research, and most molecular biologists agree that the methodologies and strategies currently in use are not entirely adequate for assigning functions to the vast numbers of unknown genes being discovered by sequencing projects. The problem is that the objective - to plot a course from gene to function - is the reverse of the route normally taken by genetic analysis, in which the starting point is a phenotype and the objective is to identify the underlying gene or genes. The problem we are currently addressing takes us in the opposite direction: starting with a new gene and hopefully leading to identification of the associated phenotype.

Functional analysis by gene inactivation

In conventional genetic analysis, the genetic basis of a phenotype is usually studied by searching for mutant organisms in which the phenotype has become altered. The mutants might be obtained experimentally, for example by treating a population of organisms (e.g. a culture of bacteria) with ultraviolet radiation or a mutagenic chemical (see Section 14.1.1), or the mutants might be present in a natural population. The gene or genes that have been altered in the mutant organism are then studied by genetic crosses (Section 5.2.4), which can locate the position of a gene in a genome and also determine if the gene is the same as one that has already been characterized. The gene can then be studied further by molecular biology techniques such as cloning and sequencing.

The general principle of this conventional analysis is that the genes responsible for a phenotype can be identified by determining which genes are inactivated in organisms that display a mutant version of the phenotype. If the starting point is the gene, rather than the phenotype, then the equivalent strategy would be to mutate the gene and identify the phenotypic change that results. This is the basis of most of the techniques used to assign functions to unknown genes.

Individual genes can be inactivated by homologous recombination

The easiest way to inactivate a specific gene is to disrupt it with an unrelated segment of DNA (Figure 7.13). This can be achieved by homologous recombination between the chromosomal copy of the gene and a second piece of DNA that shares some sequence identity with the target gene. Homologous (and other types of) recombination are complex events, which we will deal with in detail in Section 14.3.1. For present purposes it is enough to know that if two DNA molecules have similar sequences, then recombination can result in segments of the molecules being exchanged.

Figure 7.13

Gene inactivation by homologous recombination. The chromosomal copy of the target gene recombines with a disrupted version of the gene carried by a cloning vector. As a result, the target gene becomes inactivated. For more information on recombination (more...)

How is gene inactivation carried out in practice? We will consider two examples, the first with S. cerevisiae. Since completing the genome sequence in 1996, yeast molecular biologists have embarked on a coordinated, international effort to determine the functions of as many orphan genes as possible (Oliver, 1996b). One technique that is being used is shown in Figure 7.14 (Wach et al., 1994). The central component is the ‘deletion cassette’, which carries a gene for antibiotic resistance. This gene is not a normal component of the yeast genome but it will work if transferred into a yeast chromosome, giving rise to a transformed yeast cell that is resistant to the antibiotic geneticin. Before using the deletion cassette, new segments of DNA are attached as tails to either end. These segments have sequences identical to parts of the yeast gene that is going to be inactivated. After the modified cassette is introduced into a yeast cell, homologous recombination occurs between the DNA tails and the chromosomal copy of the yeast gene, replacing the latter with the antibiotic-resistance gene. Cells which have undergone the replacement are therefore selected by plating the culture onto agar medium containing geneticin. The resulting colonies lack the target gene activity and their phenotypes can be examined to gain some insight into the function of the gene.
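
The logic of the deletion cassette can be sketched as simple string manipulation. This is an illustration only: the sequences are randomly generated, the marker is a placeholder rather than a real resistance gene, and the 40-bp tail length is an assumption that the text does not specify.

```python
import random

def build_deletion_cassette(target_gene, marker, tail_len=40):
    """Attach tails copied from the two ends of the target gene to the marker."""
    if len(target_gene) < 2 * tail_len:
        raise ValueError("target gene is shorter than the two tails combined")
    return target_gene[:tail_len] + marker + target_gene[-tail_len:]

def replace_by_recombination(chromosome, cassette, tail_len=40):
    """Swap the chromosomal segment bounded by the two tails for the cassette."""
    left_tail, right_tail = cassette[:tail_len], cassette[-tail_len:]
    start = chromosome.find(left_tail)
    end = chromosome.find(right_tail, start + tail_len) if start != -1 else -1
    if start == -1 or end == -1:
        return chromosome  # no matching sequence, so no recombination
    return chromosome[:start] + cassette + chromosome[end + tail_len:]

# Hypothetical toy sequences (seeded so the example is reproducible)
rng = random.Random(1)
def random_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))

upstream, gene, downstream = random_dna(100), random_dna(300), random_dna(100)
MARKER = "ATG" + "GCC" * 10 + "TAA"  # placeholder for the antibiotic-resistance gene

cassette = build_deletion_cassette(gene, MARKER)
engineered = replace_by_recombination(upstream + gene + downstream, cassette)
print("marker present:", MARKER in engineered, "| intact gene present:", gene in engineered)
```

In the sketch, as in the experiment, only the ends of the target gene survive. The interior is replaced by the marker, so a cell carrying the engineered chromosome would be selectable on geneticin.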

Figure 7.14

The use of a yeast deletion cassette. The deletion cassette consists of an antibiotic-resistance gene preceded by the promoter sequences needed for expression in yeast, and flanked by two restriction sites. The start and end segments of the target gene (more...)

The second example of gene inactivation uses an analogous process but with mice rather than yeast. The mouse is frequently used as a model organism for humans because the mouse genome is similar to the human genome, containing many of the same genes. Identifying the functions of unknown human genes is therefore being carried out largely by inactivating the equivalent genes in the mouse, these experiments being ethically unthinkable with humans. The homologous recombination part of the procedure is identical to that described for yeast and once again results in a cell in which the target gene has been inactivated. The problem is that we do not want just one mutated cell, we want a whole mutant mouse, as only with the complete organism can we make a full assessment of the effect of the gene inactivation on the phenotype. To achieve this it is necessary to use a special type of mouse cell, an embryonic stem or ES cell (Evans et al., 1997). Unlike most mouse cells, ES cells are totipotent, meaning that they are not committed to a single developmental pathway and can therefore give rise to all types of differentiated cell. The engineered ES cell is therefore injected into a mouse embryo, which continues to develop and eventually gives rise to a chimera, a mouse whose cells are a mixture of mutant ones, derived from the engineered ES cells, and non-mutant ones, derived from all the other cells in the embryo. This is still not quite what we want, so the chimeric mice are allowed to mate with one another. Some of the offspring result from fusion of two mutant gametes, and will therefore be non-chimeric, as every one of their cells will carry the inactivated gene. These are knockout mice, and with luck their phenotypes will provide the desired information on the function of the gene being studied. This works well for many gene inactivations but some are lethal and so cannot be studied in a homozygous knockout mouse. 
Instead, a heterozygous mouse is obtained, the product of fusion between one normal and one mutant gamete, in the hope that the phenotypic effect of the gene inactivation will be apparent even though the mouse still has one correct copy of the gene being studied.

Gene inactivation without homologous recombination

Homologous recombination is not the only way to disrupt a gene in order to study its function. One alternative is to use transposon tagging, in which inactivation is achieved by the insertion of a transposable element into the gene. Most genomes contain transposable elements (Section 2.4.2) and although the bulk of these are inactive, there are usually a few that retain their ability to transpose. Under normal circumstances, transposition is a relatively rare event, but it is sometimes possible to use recombinant DNA techniques to make modified transposons that change their position in response to an external stimulus. One way of doing this, involving the yeast retrotransposon Ty1, is shown in Figure 7.15.

Figure 7.15

Artificial induction of transposition. Recombinant DNA techniques have been used to place a promoter sequence (Section 3.2.2) that is responsive to galactose upstream of a Ty1 element in the yeast genome. When galactose is absent, the Ty1 element is not (more...)

Transposon tagging is central to the technique called genetic footprinting (Smith et al., 1995), which has been used to inactivate many of the yeast orphans as a first step to assessing their function. Transposon tagging is also important in analysis of the fruit-fly genome, using the endogenous Drosophila transposon called the P element (Engels, 2000). The weakness with transposon tagging is that it is difficult to target individual genes, because transposition is more or less a random event and it is impossible to predict where a transposon will end up after it has jumped. If the intention is to inactivate a particular gene then it is necessary to induce a substantial number of transpositions and then to screen the resulting organisms to find one with the correct insertion. Transposon tagging is therefore more applicable to global studies of genome function, in which genes are inactivated at random and groups of genes with similar functions identified by examining the progeny for interesting phenotype changes.
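
A rough calculation shows why a substantial number of transpositions is needed to hit one particular gene. The sketch below assumes that each insertion lands at a uniformly random position and independently of the others, which is a simplification because real transposons show insertion preferences. The gene length (1500 bp) and genome size (12 Mb, roughly that of yeast) are illustrative values.

```python
import math

def hit_probability(gene_bp, genome_bp, insertions):
    """Probability that at least one of the random insertions lands in the gene."""
    p_single = gene_bp / genome_bp
    return 1.0 - (1.0 - p_single) ** insertions

def insertions_needed(gene_bp, genome_bp, confidence=0.95):
    """Smallest number of insertions giving the requested chance of a hit."""
    p_single = gene_bp / genome_bp
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_single))

GENE_BP = 1_500          # illustrative gene length
GENOME_BP = 12_000_000   # illustrative genome size

n = insertions_needed(GENE_BP, GENOME_BP)
print(f"insertions needed for a 95% chance of a hit: {n}")
print(f"chance of a hit with 1000 insertions: {hit_probability(GENE_BP, GENOME_BP, 1000):.1%}")
```

Under these assumptions tens of thousands of independent insertions would have to be screened to be reasonably sure of disrupting a single chosen gene, which is why the method suits random, genome-wide surveys better than targeted inactivation.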

A completely different approach to gene inactivation is provided by RNA interference. In this technique, rather than disrupting the gene itself, its mRNA is destroyed. This is accomplished by introducing into the cell short double-stranded RNA molecules whose sequences match that of the mRNA being targeted. The double-stranded RNAs are broken down into shorter molecules which induce degradation of the mRNA (Figure 7.16). The process has been shown to work effectively in the worm Caenorhabditis elegans (Fire et al., 1998), whose genome has been completely sequenced (see Table 2.1) and which is looked on as an important model organism for higher eukaryotes (Section 12.3.2). Almost 2500 of the 2769 predicted genes on chromosome I of C. elegans have been individually inactivated by RNA interference, simply by placing the worms in a solution containing the double-stranded RNA and allowing normal uptake processes to transport the molecules into the cells (Fraser et al., 2000). Similar projects are being directed at the other C. elegans chromosomes.

Figure 7.16

RNA interference. The double-stranded RNA molecule is broken down by the Dicer ribonuclease into ‘short interfering RNAs’ (siRNAs) of 21–25 bp in length. One strand of each siRNA base pairs to the target mRNA, which is then degraded (more...)

RNA interference is known to occur naturally in a range of eukaryotes, but applying it to mammalian cells was expected to be difficult because these organisms display a parallel response to double-stranded RNA, in which protein synthesis is generally inhibited, resulting in cell death (Bass, 2001). These worries were unfounded, however, because it has now been shown that introduction of double-stranded RNAs into cultured human cells by fusion with liposomes (Figure 7.17) results in inactivation of the target mRNA, with no measurable decrease in overall protein synthesis (Elbashir et al., 2001). The drawback to using this technique with mammals is that it is only possible to work with single cells, rather than whole organisms, because the double-stranded RNAs have a limited lifetime within the cell and cannot be used to engineer permanent changes such as those necessary in the construction of knockout mice.

Figure 7.17

Fusion with liposomes can be used to deliver double-stranded RNA into a human cell.

Gene overexpression can also be used to assess function

So far we have concentrated on techniques that result in inactivation of the gene being studied (‘loss of function’). The complementary approach is to engineer an organism in which the test gene is much more active than normal (‘gain of function’) and to determine what changes, if any, this has on the phenotype. The results of these experiments must be treated with caution because of the need to distinguish between a phenotype change that is due to the specific function of an overexpressed gene, and a less specific phenotype change that reflects the abnormality of the situation where a single gene product is being synthesized in excessive amounts, possibly in tissues in which the gene is normally inactive. Despite this qualification, overexpression has provided some important information on gene function.

To overexpress a gene a special type of cloning vector must be used, one designed to ensure that the cloned gene directs the synthesis of as much protein as possible. The vector is therefore multicopy, meaning that it multiplies inside the host organism to 40–200 copies per cell, so there are many copies of the test gene. The vector must also contain a highly active promoter (Section 9.2.2) so that each copy of the test gene is converted into large quantities of mRNA, again ensuring that as much protein as possible is made. An example of the technique used with mouse genes is shown in Figure 7.18 (Simonet et al., 1997). In this project the genes to be studied were selected because their sequences suggested that they code for proteins that are secreted into the bloodstream. The cloning vector that was used contained a highly active promoter that is expressed only in the liver, so each transgenic mouse overexpressed the test gene in its liver and then secreted the resulting protein into the blood. The phenotype of each transgenic mouse was examined in the search for clues regarding the functions of the cloned genes. An interesting discovery was made when it was realized that one transgenic mouse had bones that were significantly more dense than those of normal mice. This was important for two reasons: first, it enabled the relevant gene to be identified as one involved in bone synthesis; second, the discovery of a protein that increases bone density has implications for the development of treatments for human osteoporosis, a fragile-bone disease.

Figure 7.18

Functional analysis by gene overexpression. The objective is to determine if overexpression of the gene being studied has an effect on the phenotype of a transgenic mouse. A cDNA of the gene is therefore inserted into a cloning vector carrying a highly (more...)

Box 7.1

Analysis of chromosome I of Caenorhabditis elegans by RNA interference. Functions have been assigned to 339 genes on C. elegans chromosome I after individual inactivation by the RNA interference technique. C. elegans is a tiny nematode worm (see Figure (more...)

7.2.3. More detailed studies of the activity of a protein coded by an unknown gene

Gene inactivation and overexpression are the primary techniques used by genome researchers to determine the function of a new gene, but these are not the only procedures that can provide information on gene activity. Other methods can extend and elaborate the results of inactivation and overexpression. These can be used to provide additional information that will aid identification of a gene function, or might form the basis of a more comprehensive examination of the activity of a protein whose gene has already been characterized.

Directed mutagenesis can be used to probe gene function in detail

Inactivation and overexpression can determine the general function of a gene, but they cannot provide detailed information on the activity of a protein coded by a gene. For example, it might be suspected that part of a gene codes for an amino acid sequence that directs its protein product to a particular compartment in the cell, or is responsible for the ability of the protein to respond to a chemical or physical signal. To test these hypotheses it would be necessary to delete or alter the relevant part of the gene sequence, but to leave the bulk unmodified so that the protein is still synthesized and retains the major part of its activity. The various procedures of site-directed or in vitro mutagenesis (Technical Note 7.1) can be used to make these subtle changes. These are important techniques whose applications lie not only with the study of gene activity but also in the area of protein engineering, where the objective is to create novel proteins with properties that are better suited for use in industrial or clinical settings.

Technical Note 7.1

Site-directed mutagenesis. Methods for making a precise alteration in a gene sequence in order to change the structure and possibly the activity of a protein. Changes in protein structure can be engineered by site-directed mutagenesis techniques, which (more...)

After mutagenesis the gene sequence must be introduced into the host cell so that homologous recombination can replace the existing copy of the gene with the modified version. This presents a problem because we must have a way of knowing which cells have undergone homologous recombination. Even with yeast this will only be a fraction of the total, and with mice the fraction will be very small. Normally we would solve this problem by placing a marker gene (e.g. one coding for antibiotic resistance) next to the mutated gene and looking for cells that take on the phenotype conferred by this marker. In most cases, cells that insert the marker gene into their genome also insert the closely attached mutated gene and so are the ones we want. The problem is that in a site-directed mutagenesis experiment we must be sure that any change in the activity of the gene being studied is the result of the specific mutation that was introduced into the gene, rather than the indirect result of changing its environment in the genome by inserting a marker gene next to it. The answer is to use a more complex two-step gene replacement (Figure 7.19). In this procedure the target gene is first replaced with the marker gene on its own, the cells in which this recombination takes place being identified by selecting for the marker gene phenotype. These cells are then used in the second stage of the gene replacement, when the marker gene is replaced by the mutated gene, success now being monitored by looking for cells that have lost the marker gene phenotype. These cells contain the mutated gene and their phenotypes can be examined to determine the effect of the directed mutation on the activity of the protein product.

Figure 7.19

Two-step gene replacement. See the text for details.

Reporter genes and immunocytochemistry can be used to locate where and when genes are expressed

Clues to the function of a gene can often be obtained by determining where and when the gene is active. If gene expression is restricted to a particular organ or tissue of a multicellular organism, or to a single set of cells within an organ or tissue, then this positional information can be used to infer the general role of the gene product. The same is true of information relating to the developmental stage at which a gene is expressed. This type of analysis has proved particularly useful in understanding the activities of genes involved in the earliest stages of development in Drosophila (Section 12.3.3) and is increasingly being used to unravel the genetics of mammalian development. It is also applicable to those unicellular organisms, such as yeast, which have distinctive developmental stages in their life cycle.

Determining the pattern of gene expression within an organism is possible with a reporter gene. This is a gene whose expression can be monitored in a convenient way, ideally by visual examination (Table 7.1), cells that express the reporter gene becoming blue, fluorescing or giving off some other visible signal. For the reporter gene to give a reliable indication of where and when a test gene is expressed, the reporter must be subject to the same regulatory signals as the test gene. This is achieved by replacing the ORF of the test gene with the ORF of the reporter gene (Figure 7.20). Most of the regulatory signals that control gene expression are contained in the region of DNA upstream of the ORF, so the reporter gene should now display the same expression pattern as the test gene. The expression pattern can therefore be determined by examining the organism for the reporter signal.

Figure 7.20

A reporter gene. The open reading frame of the reporter gene replaces the open reading frame of the gene being studied. The result is that the reporter gene is placed under control of the regulatory sequences that usually dictate the expression pattern (more...)

As well as knowing in which cells a gene is expressed, it is often useful to locate the position within the cell where the protein coded by the gene is found. For example, key data regarding gene function can be obtained by showing that the protein product is located in mitochondria, in the nucleus, or on the cell surface. Reporter genes cannot help here because the DNA sequence upstream of the gene - the sequence to which the reporter gene is attached - is not involved in targeting the protein product to its correct intracellular location. Instead it is the amino acid sequence of the protein itself that is important. Therefore the only way to determine where the protein is located is to search for it directly. This is done by immunocytochemistry, which makes use of an antibody that is specific for the protein of interest and so binds to this protein and no other. The antibody is labeled so that its position in the cell, and hence the position of the target protein, can be visualized (Figure 7.21). Fluorescent labeling and light microscopy are used for low-resolution studies; alternatively, high-resolution immunocytochemistry can be carried out by electron microscopy using an electron-dense label such as colloidal gold.

Figure 7.21

Immunocytochemistry. The cell is treated with an antibody that is labeled with a blue fluorescent marker. Examination of the cell shows that the fluorescent signal is associated with the inner mitochondrial membrane. A working hypothesis would therefore (more...)

7.3. Global Studies of Genome Activity

Even if every gene in a genome can be identified and assigned a function, a challenge still remains. This is to understand how the genome as a whole operates within the cell, specifying and coordinating the various biochemical activities that take place. These global studies of genome activity must address not the genome itself but the transcriptome and proteome that are synthesized and maintained by the genome (Chapter 3). The objective is to understand the key features of the transcriptomes and proteomes that are present in different tissues and during different developmental stages and, in the case of humans, in different disease states (Section 3.2.3).

7.3.1. Studying the transcriptome

The transcriptome comprises the mRNAs that are present in a cell at a particular time. Transcriptomes can have highly complex compositions, with hundreds or thousands of different mRNAs represented, each making up a different fraction of the overall population (Section 3.2.3). To characterize a transcriptome it is therefore necessary to identify the mRNAs that it contains and, ideally, to determine their relative abundances.

The composition of a transcriptome can be assayed by SAGE

The most direct way to characterize a transcriptome is to convert its mRNA into cDNA (see Figure 5.32), and then to sequence every clone in the resulting cDNA library. Comparisons between the cDNA sequences and the genome sequence will reveal the identities of the genes whose mRNAs are present in the transcriptome. This approach is feasible but it is laborious, with many different cDNA sequences being needed before a near-complete picture of the composition of the transcriptome begins to emerge. If two or more transcriptomes are being compared then the time needed to complete the project increases. Can any shortcuts be used to obtain the vital sequence information more quickly?

Serial analysis of gene expression (SAGE) provides a solution (Velculescu et al., 2000). Rather than studying complete cDNAs, SAGE yields short sequences, as little as 12 bp in length, each of which represents an mRNA present in the transcriptome. The basis of the technique is that these 12-bp sequences, despite their shortness, are sufficient to enable the gene that codes for the mRNA to be identified. The argument is that any particular 12-bp sequence should appear in the genome once every 4¹² = 16 777 216 bp. The average size of a eukaryotic mRNA is about 1500 bp, so 4¹² bp is equivalent to the combined length of over 11 000 transcripts. This number is higher than the number of transcripts expected in all but the most complex transcriptomes, so the 12-bp sequence tags should be able to identify unambiguously the genes coding for all the mRNAs that are present.
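
The arithmetic behind this argument can be checked with a few lines of code. The sketch assumes, as the argument itself does, that the four nucleotides are equally frequent and that sequences are effectively random. The function name is illustrative.

```python
TAG_LENGTH = 12       # length of a SAGE tag in bp
MEAN_MRNA_BP = 1500   # average eukaryotic mRNA length quoted in the text

def expected_spacing_bp(tag_len):
    """Average distance between chance occurrences of one specific tag sequence."""
    return 4 ** tag_len  # four possible nucleotides at each position

spacing = expected_spacing_bp(TAG_LENGTH)
equivalent_transcripts = spacing / MEAN_MRNA_BP

print(f"a given {TAG_LENGTH}-bp tag is expected once every {spacing:,} bp")
print(f"equivalent to about {equivalent_transcripts:,.0f} average-sized transcripts")

# Shorter tags would be far less discriminating
for length in (8, 10, 12):
    print(length, "bp tag:", round(expected_spacing_bp(length) / MEAN_MRNA_BP), "transcripts")
```

The loop at the end shows how quickly the discriminating power falls off with tag length: under the same assumptions a 10-bp tag corresponds to only a few hundred average-sized transcripts.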

The procedure used to generate the 12-bp tags is shown in Figure 7.22. First, the mRNA is immobilized in a chromatography column by annealing the poly(A) tails present at the 3′ ends of these molecules to oligo(dT) strands that have been attached to cellulose beads. The mRNA is converted into double-stranded cDNA and then treated with a restriction enzyme that recognizes a 4-bp target site and so cuts frequently in each cDNA. The terminal restriction fragment of each cDNA remains attached to the cellulose beads, enabling all the other fragments to be eluted and discarded. A short oligonucleotide is now attached to the free end of each cDNA, this oligonucleotide containing a recognition sequence for Bsm FI. This is an unusual restriction enzyme in that rather than cutting within its recognition sequence, it cuts 10–14 nucleotides downstream. Treatment with Bsm FI therefore removes a fragment with an average length of 12 bp from the end of each cDNA. The fragments are collected, ligated head-to-tail to produce a concatamer, and sequenced. The individual tag sequences are identified within the concatamer and compared with the sequences of the genes in the genome.
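
The final, computational part of the procedure can be sketched as follows. The code is a simplified model rather than a real SAGE pipeline. It takes the 12 bp lying immediately beside the 3′-most Alu I site (5′-AGCT-3′, the enzyme used in Figure 7.22) as the tag, joins the tags into a concatamer, then splits the concatamer and counts each tag. The cDNA sequences are invented. Each is written 5′→3′ with the poly(A) end on the right, and the tags are assumed to be exactly 12 bp long.

```python
from collections import Counter

ANCHOR_SITE = "AGCT"  # Alu I recognition sequence
TAG_LENGTH = 12

def extract_tag(cdna, site=ANCHOR_SITE, tag_len=TAG_LENGTH):
    """Return the tag next to the restriction site nearest the 3' end, or None."""
    pos = cdna.rfind(site)
    if pos == -1:
        return None  # this cDNA is not cut, so it yields no tag
    start = pos + len(site)
    tag = cdna[start:start + tag_len]
    return tag if len(tag) == tag_len else None

def split_concatamer(concatamer, tag_len=TAG_LENGTH):
    """Cut a concatamer of equal-length tags back into individual tags."""
    return [concatamer[i:i + tag_len]
            for i in range(0, len(concatamer) - tag_len + 1, tag_len)]

# Hypothetical cDNAs: cdna_a is three times as abundant as cdna_b
cdna_a = "GGATCC" + "AGCT" + "TTGACC" + "AGCT" + "GATTACAGGCAT" + "AAAAAA"
cdna_b = "CCC" + "AGCT" + "TTTGGGCCCAAA" + "AAAAAA"
population = [cdna_a, cdna_a, cdna_b, cdna_a]

tags = [t for t in (extract_tag(c) for c in population) if t is not None]
concatamer = "".join(tags)
counts = Counter(split_concatamer(concatamer))
for tag, n in counts.most_common():
    print(tag, n)
```

Because each tag is counted as often as it occurs in the concatamer, the tallies reflect the relative abundances of the corresponding mRNAs as well as their identities.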

Figure 7.22

SAGE. See the text for details. In this example, the first restriction enzyme to be used is Alu I, which recognizes the 4-bp target site 5′-AGCT-3′ (see Table 4.3). The oligonucleotide that is ligated to the cDNA contains the recognition (more...)

Using chip and microarray technology to study a transcriptome

DNA chips and microarrays (see Technical Note 5.1) can also be used to study transcriptomes. With a small genome such as that of S. cerevisiae, chips that carry oligonucleotides representing every gene can be constructed. A transcriptome is then characterized by converting its mRNA into cDNA, labeling the cDNA, and applying it to the chip. The positions at which hybridization occurs indicate the oligonucleotides representing the genes whose transcripts are present in the transcriptome (Figure 7.23A). Compared with SAGE, this approach has the advantage that a rapid evaluation of the differences between two or more transcriptomes can be made by hybridizing the different cDNA preparations to identical chips and comparing the hybridization patterns. A further embellishment can be achieved by probing the chip with cDNA that has been prepared from the mRNA fraction that is bound to ribosomes in the cells being studied, rather than from total mRNA. These mRNAs correspond to the part of the transcriptome that is actively directing protein synthesis, giving a slightly different picture of genome activity (Pradet-Balade et al., 2001).

Figure 7.23

Transcriptome analysis. (A) Transcriptome analysis with a DNA chip carrying oligonucleotides representing all the genes in a small genome. After adding labeled cDNA, the positions of the hybridization signals on the chip indicate which genes have contributed (more...)

Microarrays are used in a similar way to chips (Marshall, 1999; Knight, 2001), but instead of immobilized oligonucleotides, they carry samples of cloned DNA. Often these are cDNA clones derived from one of the transcriptomes that is being studied. This might appear illogical but the approach enables two related transcriptomes to be compared, differences in their mRNA compositions being visualized as differences in the intensities of the hybridization signals emanating from the immobilized cDNAs when the microarray is probed with each transcriptome in turn (Figure 7.23B).
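The comparison of two transcriptomes from their hybridization intensities can be illustrated with a minimal Python sketch. It assumes that the signal intensity for each immobilized cDNA has already been measured for both probings and stored in a dictionary keyed by gene name. The log2 ratio, the intensity floor and the two-fold threshold are conventional choices made for this illustration, not details taken from the text.

```python
import math


def log2_ratios(signal_a, signal_b, floor=1.0):
    """Return log2(B/A) for every gene measured in both transcriptomes.

    `floor` is an assumed minimum intensity that avoids division by zero
    when a gene gives no detectable signal in one transcriptome.
    """
    ratios = {}
    for gene in signal_a.keys() & signal_b.keys():
        a = max(signal_a[gene], floor)
        b = max(signal_b[gene], floor)
        ratios[gene] = math.log2(b / a)
    return ratios


def changed_genes(ratios, threshold=1.0):
    """Genes whose signal differs at least 2**threshold-fold between the two probings."""
    return sorted(gene for gene, r in ratios.items() if abs(r) >= threshold)
```

A positive ratio indicates a stronger signal in the second transcriptome, a negative ratio a stronger signal in the first.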

7.3.2. Studying the proteome

Proteome studies are important because of the central role that the proteome plays as the link between the genome and the biochemical capability of the cell (Section 3.3). Characterization of the proteomes of different cells is therefore the key to understanding how the genome operates and how dysfunctional genome activity can lead to diseases such as cancer. Transcriptome studies can only partly address these issues. Examination of the transcriptome gives an accurate indication of which genes are active in a particular cell, but gives a less accurate indication of the proteins that are present. There are several reasons for this lack of equivalence between transcriptome and proteome, the most important being:

  • Not all mRNAs are actively translated at any particular time.

  • The protein content of the cell is determined by both synthesis of new proteins and degradation of existing ones.

Methods for studying the proteome are therefore needed in order to obtain a complete picture of genome expression.

Proteomics - methodology for characterizing the protein content of a cell

The methodology used to study proteomes is collectively called proteomics. It is based on two techniques - protein electrophoresis and mass spectrometry - both of which have long pedigrees but which were rarely applied together in the pre-genomics era. Today they have been combined into one of the major growth areas of modern research.

In order to characterize a proteome it is first necessary to prepare pure samples of its constituent proteins. This is a far from trivial undertaking in view of the complexity of the average proteome: remember that a mammalian cell may contain 10 000–20 000 different proteins (Section 3.3). Polyacrylamide gel electrophoresis (see Technical Note 6.1) is the standard method for separating proteins, but the usual procedure, in which proteins are separated according to their molecular weights, is unable to resolve the many proteins in an average proteome. To separate individual proteins, the polyacrylamide gel is rotated by 90° and a second electrophoresis carried out at right angles to the first (Figure 7.24A). Usually, different conditions are employed in this second run so that the proteins are now separated on the basis of their charges. The result of this two-dimensional gel electrophoresis is a series of spots, each one representing a different protein. Not all the components of the proteome will be visible because the staining methods used to reveal the spots have a detection limit, but a clear picture of the most abundant proteins is obtained. Differences between two proteomes are seen as changes in the position and/or intensity of one or more spots on the gel.

Figure 7.24

Studying a proteome by two-dimensional gel electrophoresis followed by MALDI-TOF. (A) After two-dimensional gel electrophoresis a protein of interest is excised from the gel and digested with a protease such as trypsin, which cuts immediately after arginine...

How do we identify which protein is present in which spot? This used to be a difficult procedure but advances in mass spectrometry have provided the rapid and accurate identification procedure dictated by the requirements of genome studies. Mass spectrometry was originally designed as a means of identifying a compound on the basis of the mass-to-charge ratios of the ionized forms that are produced when molecules are exposed to a high-energy field. The standard technique could not be used with proteins because they are too large to be ionized effectively, but a new procedure, called matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF), gets around this problem, at least with peptides of up to 50 amino acids in length (Yates, 2000). Once ionized, the mass-to-charge ratio of a peptide is determined from its ‘time of flight’ within the mass spectrometer as it passes from the ionization source to the detector (Figure 7.24B). The mass-to-charge ratio enables the molecular weight to be worked out, which in turn allows the amino acid composition of the peptide to be deduced. If the protein in a single spot from the two-dimensional gel is treated with a protease such as trypsin, and several of the resulting peptides are analyzed, then the compositional information can be related to the genome sequence in order to identify the gene that specifies that protein (Figure 7.24C).
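The way in which peptide masses are related back to candidate genes can be sketched as follows. This example digests each candidate protein sequence in silico, calculates the monoisotopic mass of every predicted peptide, and picks the candidate that explains the most observed masses. Note that it matches masses directly rather than deducing amino acid compositions, which is a simplification of the procedure described above. The cleavage rule used here (after lysine or arginine, but not before proline) is the usual approximation of trypsin specificity, and the tolerance and function names are assumptions.

```python
import re

# Monoisotopic residue masses in daltons (standard values)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N and C termini


def tryptic_peptides(protein):
    """Split after K or R unless the next residue is P (needs Python 3.7+)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def best_match(observed_masses, candidates, tolerance=0.01):
    """Return the name of the candidate protein that explains most observed masses."""
    def score(sequence):
        predicted = [peptide_mass(p) for p in tryptic_peptides(sequence)]
        return sum(
            any(abs(m - q) <= tolerance for q in predicted)
            for m in observed_masses
        )
    return max(candidates, key=lambda name: score(candidates[name]))
```

In practice the candidate sequences would be the translations of the genes predicted in the genome sequence, and allowance would have to be made for missed cleavages and modified amino acids, which this sketch ignores.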

Proteomics can also be taken beyond simple characterization of proteome content. For example, the compositions of the peptides derived from a single protein can be used to check a gene sequence (Mann and Pandey, 2001), and in particular to ensure that exon-intron boundaries have been correctly located. This not only helps to delineate the exact position of a gene in a genome (Section 7.1.1), it also allows differential splicing pathways to be identified in cases where two or more proteins are derived from the same gene.

Identifying proteins that interact with one another

Important data pertaining to genome activity can also be obtained by identifying pairs and groups of proteins that interact with one another. At a detailed level, this information is often valuable when attempts are made to assign a function to a newly discovered gene or protein (Section 7.2) because an interaction with a second well-characterized protein can often indicate the role of an unknown protein. For example, an interaction with a protein that is located on the cell surface might indicate that an unknown protein is involved in cell-cell signaling (Section 12.1.2). At a global level, the construction of protein interaction maps is looked on as an important step in linking the proteome with the cellular biochemistry.

There are several methods for studying protein-protein interactions, the two most useful being phage display and the yeast two-hybrid system. In phage display a special type of cloning vector is used, one based on λ bacteriophage or one of the filamentous bacteriophages such as M13 (Clackson and Wells, 1994). The vector is designed so that a new gene that is cloned into it is expressed in such a way that its protein product becomes fused with one of the phage coat proteins (Figure 7.25A). The phage protein therefore carries the foreign protein into the phage coat, where it is ‘displayed’ in a form that enables it to interact with other proteins that the phage encounters. There are several ways in which phage display can be used to study protein interactions. In one method, the test protein is displayed and interactions sought with a series of purified proteins or protein fragments of known function. This approach is limited because it takes time to carry out each test, so is feasible only if some prior information has been obtained about likely interactions. A more powerful strategy is to prepare a phage display library, a collection of clones displaying a range of proteins, and identify which members of the library interact with the test protein (Figure 7.25B).

Figure 7.25

Phage display. (A) The cloning vector used for phage display is a bacteriophage genome with a unique restriction site located within a gene for a coat protein. The technique was originally carried out with the gene III coat protein of the filamentous...

The yeast two-hybrid system detects protein interactions in a more complex way (Fields and Sternglanz, 1994). In Section 9.3.2 we will see that proteins called activators are responsible for controlling the expression of genes in eukaryotes. To carry out this function an activator must bind to a DNA sequence upstream of a gene and stimulate the RNA polymerase enzyme that copies the gene into RNA. These two abilities - DNA-binding and polymerase activation - are specified by different parts of the activator, and some activators will work even after cleavage into two segments, one segment containing the DNA-binding domain and one the activation domain. In the cell, the two segments interact to form the functional activator.

The two-hybrid system makes use of an S. cerevisiae strain that lacks an activator for a reporter gene. This gene is therefore switched off. An artificial gene that codes for the DNA-binding domain of the activator is ligated to the gene for the protein whose interactions we wish to study. This protein can come from any organism, not just yeast: in the example shown in Figure 7.26A it is a human protein. After introduction into yeast, this construct specifies synthesis of a fusion protein made up of the DNA-binding domain of the activator attached to the human protein. The recombinant yeast strain is still unable to express the reporter gene because the modified activator only binds to DNA; it cannot influence the RNA polymerase. Activation only occurs after the yeast strain has been cotransformed with a second construct, one comprising the coding sequence for the activation domain fused to a DNA fragment that specifies a protein able to interact with the human protein that is being tested (Figure 7.26B). As with phage display, if there is some prior knowledge about possible interactions then individual DNA fragments can be tested one by one in the two-hybrid system. Usually, however, the gene for the activation domain is ligated with a mixture of DNA fragments so that many different constructs are made. After transformation, cells are plated out and those that express the reporter gene identified. These are cells that have taken up a copy of the gene for the activation domain fused to a DNA fragment that encodes a protein able to interact with the test protein.

Figure 7.26

The yeast two-hybrid system. (A) On the left, a gene for a human protein has been ligated to the gene for the DNA-binding domain of a yeast activator. After transformation of yeast, this construct specifies a fusion protein, part human protein and part...

Protein interaction maps

Protein interaction maps display all of the interactions that occur between the components of a proteome (Legrain et al., 2001). Although constructing such a map is a major undertaking, one has been completed for the bacterium Helicobacter pylori, comprising over 1200 interactions involving almost half of the proteins in the proteome (Rain et al., 2001), and another for S. cerevisiae, covering 2240 interactions between 1870 proteins from the proteome (Jeong et al., 2001). These two maps were constructed almost entirely from two-hybrid experiments, but various researchers are developing more innovative ways of identifying possible links between proteins. One approach is based on the observation that pairs of proteins that are separate molecules in some organisms are fused into a single polypeptide chain in others. An example is provided by the yeast gene HIS2, which codes for an enzyme involved in histidine biosynthesis. In E. coli, two genes are homologous to HIS2. One of these, itself called his2, has sequence similarity with the 5′ region of the yeast gene, and the second, his10, is similar to the 3′ region (Figure 7.27). The implication is that the proteins coded by his2 and his10 interact within the E. coli proteome to provide part of the histidine biosynthesis activity. Analysis of the sequence databases reveals many examples of this type, where two proteins in one organism have become fused into a single protein in another organism (Enright et al., 1999; Marcotte et al., 1999), and this information is proving valuable in extending the results of two-hybrid and other experimental studies.

Figure 7.27

Using homology analysis to deduce protein-protein interactions. The 5′ region of the yeast HIS2 gene is homologous to Escherichia coli his2, and the 3′ region is homologous to E. coli his10.
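The gene-fusion argument illustrated in Figure 7.27 lends itself to a simple computational rule: if two separate genes in one organism are homologous to non-overlapping regions of a single gene in another organism, the two proteins are predicted to interact. The Python sketch below applies that rule to a list of homology hits. The `Hit` record, the function name and the coordinates in the usage example are hypothetical; real analyses start from database search output and apply additional statistical filters.

```python
from itertools import combinations
from typing import NamedTuple


class Hit(NamedTuple):
    query: str   # gene in the organism of interest (e.g. E. coli)
    target: str  # homologous gene in the second organism (e.g. yeast)
    start: int   # first position of the similar region in the target
    end: int     # last position of the similar region in the target


def fusion_predicted_pairs(hits):
    """Predict interacting pairs from hits to separate regions of one target gene."""
    by_target = {}
    for hit in hits:
        by_target.setdefault(hit.target, []).append(hit)

    pairs = set()
    for target_hits in by_target.values():
        for a, b in combinations(target_hits, 2):
            non_overlapping = a.end < b.start or b.end < a.start
            if a.query != b.query and non_overlapping:
                pairs.add(tuple(sorted((a.query, b.query))))
    return sorted(pairs)
```

For example, with made-up coordinates, `fusion_predicted_pairs([Hit("his2", "HIS2", 1, 500), Hit("his10", "HIS2", 520, 1000)])` reports the pair `("his10", "his2")`, mirroring the HIS2 example in the text.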

What are the interesting features of the protein interaction maps that have been generated? The yeast map (Figure 7.28) is particularly intriguing because the network is made up of a small number of proteins that have many interactions, and a much larger number of proteins with few individual connections. This architecture, which is also displayed by the internet, is thought to minimize the disruption to the proteome caused by mutations that inactivate individual proteins. Only if a mutation affects one of the proteins at a highly interconnected node will the network as a whole be damaged. This hypothesis is consistent with the discovery, from gene inactivation studies (Section 7.2.2), that a substantial number of yeast proteins are apparently redundant, meaning that if the protein activity is destroyed the proteome as a whole continues to function normally, and there is no discernible impact on the phenotype of the cell (see Box 7.1).

Figure 7.28

The yeast protein interaction map. Each dot represents a protein, with connecting lines indicating interactions between pairs of proteins. Red dots are essential proteins: an inactivating mutation in the gene for one of these proteins is lethal. Mutations...
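The architecture described for the yeast map, with a few highly connected proteins and many proteins with few connections, can be examined by counting the number of interactions per protein. The following Python sketch does this for a list of pairwise interactions. The function names and the cutoff used to call a protein a highly connected node are assumptions made for illustration.

```python
from collections import Counter


def degrees(interactions):
    """Number of interaction partners recorded for each protein."""
    degree = Counter()
    for protein_a, protein_b in interactions:
        degree[protein_a] += 1
        degree[protein_b] += 1
    return degree


def degree_distribution(degree):
    """How many proteins have 1, 2, 3, ... interactions."""
    return Counter(degree.values())


def hubs(degree, min_degree):
    """Proteins with at least `min_degree` interactions (an arbitrary cutoff)."""
    return sorted(p for p, d in degree.items() if d >= min_degree)
```

In a network of the type described in the text, the degree distribution is dominated by proteins with one or two interactions, and `hubs` returns only a short list even for a modest cutoff.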

7.4. Comparative Genomics

The final method for understanding a genome sequence that we will consider is comparative genomics. We have already seen how similarities between homologous genes from different organisms provide one way of assigning a function to an unknown gene (Section 7.2.1). This is an example of how knowledge about the genome of one organism can help in understanding the genome of a second organism. The possibility that a more general comparison with other genomes might be a valuable means of deciphering the human sequence was recognized when the Human Genome Project was planned in the late 1980s, and the Project has actively stimulated the development of genome projects for model organisms such as the mouse and fruit fly. In this section we will explore the extent to which comparisons between different genomes are proving useful.

7.4.1. Comparative genomics as an aid to gene mapping

The basis of comparative genomics is that the genomes of related organisms are similar. The argument is the same one that we considered when looking at homologous genes (Section 7.2.1). Two organisms with a relatively recent common ancestor will have genomes that display species-specific differences built onto the common plan possessed by the ancestral genome. The closer two organisms are on the evolutionary scale, the more related their genomes will be (Nadeau and Sankoff, 1998).

If the two organisms are sufficiently closely related then their genomes might display synteny, the partial or complete conservation of gene order. Then it is possible to use map information from one genome to locate genes in the second genome. At one time it was thought that mapping the genomes of the mouse and other mammals, which are at least partially syntenic with the human genome, might provide valuable information that could be used in construction of the human genome map. The problem with this approach is that all the close relatives of humans have equally large genomes that are just as difficult to study, the only advantage being that a genetic map is easier to construct with an animal which, unlike humans, can be subjected to experimental breeding programs (Section 5.2.4). Despite the limitations of human pedigree analysis, progress has been more rapid in mapping the human genome than in mapping those of any of our close relatives, so in this respect comparative genomics is proving more useful in mapping the animal genomes than in mapping our own. This in itself is a useful corollary to the Human Genome Project because it is revealing animal homologs of human genes involved in diseases, providing animal models for the study of these diseases.

Mapping is significantly easier with a small genome than with a large one. This means that if one member of a pair of syntenic genomes is substantially smaller than the other, then mapping studies with this small genome are likely to provide a real boost to equivalent work with the larger genome. The pufferfish, Fugu rubripes, has been proposed in this capacity with respect to the human genome. The pufferfish genome is just 400 Mb, less than one-seventh the size of the human genome but containing approximately the same number of genes. The mapping work carried out to date with the pufferfish indicates that there is some similarity with the human gene order, at least over short distances. This means that it should be possible, to a certain extent, to use the pufferfish map to find human homologs of pufferfish genes, and vice versa. This may be useful in locating undiscovered human genes, but holds greatest promise in identifying essential sequences such as promoters and other regulatory signals upstream of human genes. This is because these signals are likely to be similar in the two genomes, and recognizable because they are surrounded by non-coding DNA that has diverged quite considerably by random mutations (Elgar et al., 1996; Hardison, 2000).
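Conservation of gene order over short distances can be detected by comparing the order of homologous genes in two genomes. The Python sketch below finds runs of genes that are adjacent, in the same or in reverse order, in both genomes. It assumes that each genome is represented as a simple list of gene names, that homologs share the same name, and that each gene appears only once; these are simplifications for illustration, and the function name is hypothetical.

```python
def syntenic_blocks(order_a, order_b, min_len=2):
    """Return runs of genes from genome A that are also contiguous in genome B."""
    position_b = {gene: i for i, gene in enumerate(order_b)}
    blocks = []
    current = []   # genes in the block currently being extended
    direction = 0  # +1 (same order) or -1 (inverted) once two genes are in the block

    for gene in order_a:
        if current and gene in position_b:
            step = position_b[gene] - position_b[current[-1]]
            if step in (1, -1) and direction in (0, step):
                current.append(gene)
                direction = step
                continue
        # The run is broken: store it if long enough, then start a new one.
        if len(current) >= min_len:
            blocks.append(current)
        current = [gene] if gene in position_b else []
        direction = 0

    if len(current) >= min_len:
        blocks.append(current)
    return blocks
```

Once a block has been found, the map position of any one gene in the smaller genome suggests where its neighbors' homologs should be sought in the larger genome.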

One area where comparative genomics has a definite advantage is in the mapping of plant genomes. Wheat provides a good example. Wheat is the most important food plant in the human diet, being responsible for approximately 20% of the human calorific intake, and is therefore one of the crop plants that we most wish to study and possibly manipulate in the quest for improved crops. Unfortunately, the wheat genome is huge at 16 000 Mb, five times larger than even the human genome. A small model genome with a gene order similar to that of wheat would therefore be useful as a means of mapping desirable genes which might then be obtained from their equivalent positions in the wheat genome. Wheat, and other cereals such as rice, are members of the Gramineae, a large and diverse family of grasses. The rice genome is only 430 Mb, substantially smaller than that of wheat, and there are probably other grasses with even smaller genomes. Comparative mapping of the rice and wheat genomes has revealed many similarities, and the possibility therefore exists that genes from the wheat genome might be isolated by first mapping the positions of the equivalent genes in a smaller Gramineae genome (Gura, 2000).

7.4.2. Comparative genomics in the study of human disease genes

One of the main reasons for sequencing the human genome is to gain access to the sequences of genes involved in human disease. The hope is that the sequence of a disease gene will provide an insight into the biochemical basis of the disease and hence indicate a way of preventing or treating the disease. Comparative genomics has an important role to play in the study of disease genes because the discovery of a homolog of a human disease gene in a second organism is often the key to understanding the biochemical function of the human gene. If the homolog has already been characterized then the information needed to understand the biochemical role of the human gene may already be in place; if it has not been characterized then the necessary research can be directed at the homolog.

To be useful in the study of disease-causing genes, the second genome does not need to be syntenic with the human genome, nor even particularly closely related. Drosophila holds great promise in this respect, as the phenotypic effects of many Drosophila genes are well known, so the data already exist for inferring the mode of action of human disease genes that have homologs in the Drosophila genome (Guffanti et al., 1997). But the greatest success has been with yeast. Several human disease genes have homologs in the S. cerevisiae genome (Table 7.2). These disease genes include ones involved in cancer, cystic fibrosis, and neurological syndromes, and in several cases the yeast homolog has a known function that provides a clear indication of the biochemical activity of the human gene. In some cases it has even been possible to demonstrate a physiological similarity between the gene activity in humans and yeast. For example, the yeast gene SGS1 is a homolog of a human gene involved in the diseases called Bloom's and Werner's syndromes, which are characterized by growth disorders. Yeasts with a mutant SGS1 gene live for shorter periods than normal yeasts and display accelerated onset-of-aging indicators such as sterility (Sinclair et al., 1997). The yeast gene has been shown to code for one of a pair of related DNA helicases that are required for transcription of rRNA genes and for DNA replication (Lee et al., 1999). The link between SGS1 and the genes for Bloom's and Werner's syndromes, provided by comparative genomics, has therefore indicated the possible biochemical basis of the human diseases.

Table 7.2

Examples of human disease genes that have homologs in Saccharomyces cerevisiae.

Study Aids For Chapter 7

Key terms

Give short definitions of the following terms:

  • Embryonic stem cell

  • Homology search

  • In vitro mutagenesis

  • Knockout mice

  • Two-dimensional gel electrophoresis

  • Zoo-blotting

Self study questions

  1. Explain why ORF scanning is a feasible way of identifying genes in a prokaryotic DNA sequence.

  2. What modifications are introduced when ORF scanning is applied to a eukaryotic DNA sequence?

  3. Describe how homology searching is used to locate genes in a DNA sequence and to assign possible functions to those genes.

  4. Distinguish between northern blotting and zoo-blotting. What are the applications of these two techniques in gene location?

  5. Explain how cDNA capture or cDNA selection are used to enrich a clone library for a particular cDNA sequence.

  6. Draw a fully annotated diagram illustrating the procedure called 5′-RACE.

  7. Describe how S1 nuclease is used to map the positions of the ends of a transcript on to a DNA sequence.

  8. What experimental methods can be used to locate exon-intron boundaries in a DNA sequence?

  9. Using the yeast genome project as an example, illustrate the strengths and weaknesses of homology analysis as a means of assigning functions to unknown genes.

  10. Describe how gene inactivation can be used to determine the function of an unknown gene.

  11. Give an example of the use of gene overexpression to determine the function of an unknown gene.

  12. Describe how oligonucleotide-directed mutagenesis is carried out and outline the use of this technique in studying the activity of the protein coded by an unknown gene.

  13. What is a reporter gene and how is it used?

  14. Describe the methods used to study transcriptomes.

  15. Explain how two-dimensional gel electrophoresis combined with mass spectrometry is used to study a proteome.

  16. Draw diagrams illustrating the techniques called (a) phage display, and (b) the yeast two-hybrid system. What are the similarities and differences between these two techniques?

  17. What is a protein interaction map? What has the yeast protein interaction map told us about the construction of the proteome of this organism?

  18. Define the term ‘synteny’ and, using examples, explain how synteny can predict the positions of genes in a genome sequence.

  19. Describe the applications of comparative genomics in the study of human disease genes.

Problem-based learning

  1. Defend one of the following statements:

     ‘In future years it will be possible to use bioinformatics to obtain a complete description of the locations and functions of the genes in a genome sequence.’

     ‘In future years bioinformatics will become obsolete because of the development of rapid and effective experimental methods for locating and assigning functions to the genes in a genome sequence.’

  2. Devise a hypothesis to explain the codon biases that occur in the genomes of various organisms. Can your hypothesis be tested?

  3. Gene inactivation studies have suggested that at least some genes in a genome are redundant, meaning that they have the same function as a second gene and so can be inactivated without affecting the phenotype of the organism. What evolutionary questions are raised by genetic redundancy? What are the possible answers to these questions?

  4. Explore the natural role of RNA interference in living organisms.

  5. Gene overexpression has so far provided limited but important information on the function of unknown genes. Assess the overall potential of this approach in functional analysis.

  6. ‘Comparative genomics has an important role to play in the study of disease genes.’ Evaluate this statement.

References

  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. (1990);215:403–410. [PubMed: 2231712]

  2. Bass BL. The short answer. Nature. (2001);411:428–429. [PubMed: 11373658]

  3. Bassett DE, Boguski MS, Hieter P. Yeast genes and human disease. Nature. (1996);379:589–590. [PubMed: 8628392]

  4. Bird A. CpG-rich islands and the function of DNA methylation. Nature. (1986);321:209–213. [PubMed: 2423876]

  5. Blattner FR, Plunkett G, Bloch CA. et al. The complete genome sequence of Escherichia coli K-12. Science. (1997);277:1453–1462. [PubMed: 9278503]

  6. Church DM, Stotler CJ, Rutter JL, Murrell JR, Trofatter JA, Buckler AJ. Isolation of genes from complex sources of mammalian genomic DNA using exon amplification. Nature Genet. (1994);6:98–105. [PubMed: 8136842]

  7. Clackson T, Wells JA. In vitro selection from protein and peptide libraries. Trends Biotechnol. (1994);12:173–184. [PubMed: 7764900]

  8. Cornish-Bowden A, Cárdenas ML. Silent genes given voice. Nature. (2001);409:571–572. [PubMed: 11214302]

  9. Dujon B. The yeast genome project: what did we learn? Trends Genet. (1996);12:263–270. [PubMed: 8763498]

  10. Elbashir SM, Harborth J, Lendeckel W, Yalcin A, Weber K, Tuschl T. Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature. (2001);411:494–498. [PubMed: 11373684]

  11. Elgar G, Sandford R, Aparicio S, Macrae A, Venkatesh B, Brenner S. Small is beautiful: comparative genomics with the pufferfish (Fugu rubripes). Trends Genet. (1996);12:145–150. [PubMed: 8901419]

  12. Engels WR. Reversal of fortune for Drosophila geneticists? Science. (2000);288:1973–1975. [PubMed: 10877715]

  13. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA. Protein interaction maps for complete genomes based on gene fusion events. Nature. (1999);402:86–90. [PubMed: 10573422]

  14. Evans MJ, Carlton MBL, Russ AP. Gene trapping and functional genomics. Trends Genet. (1997);13:370–374. [PubMed: 9287493]

  15. Fickett JW. Finding genes by computer: the state of the art. Trends Genet. (1996);12:316–320. [PubMed: 8783942]

  16. Fields S, Sternglanz R. The two-hybrid system: an assay for protein-protein interactions. Trends Genet. (1994);10:286–292. [PubMed: 7940758]

  17. Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature. (1998);391:806–811. [PubMed: 9486653]

  18. Fraser AG, Kamath RS, Zipperlen P, Martinez-Campos M, Sohrmann M, Ahringer J. Functional genomic analysis of C. elegans chromosome I by systematic RNA interference. Nature. (2000);408:325–330. [PubMed: 11099033]

  19. Frech K, Quandt K, Werner T. Finding protein-binding sites in DNA sequences: the next generation. Trends Biochem. Sci. (1997);22:103–104. [PubMed: 9066261]

  20. Frohman MA, Dush MK, Martin GR. Rapid production of full-length cDNAs from rare transcripts: amplification using a single gene-specific oligonucleotide primer. Proc. Natl Acad. Sci. USA. (1988);85:8998–9002. [PMC free article: PMC282649] [PubMed: 2461560]

  21. Guffanti A, Banfi S, Simon G, Ballabio A, Borsani G. DRES search engine: of flies, men and ESTs. Trends Genet. (1997);13:79–80. [PubMed: 9055610]

  22. Gura T. Reaping the plant gene harvest. Science. (2000);287:412–414. [PubMed: 10671160]

  23. Hardison RC. Conserved non-coding sequences are reliable guides to regulatory elements. Trends Genet. (2000);16:369–372. [PubMed: 10973062]

  24. Jeong H, Mason SP, Barabási A-L, Oltvai ZN. Lethality and centrality in protein networks. Nature. (2001);411:41–42. [PubMed: 11333967]

  25. Knight J. When the chips are down. Nature. (2001);410:860–861. [PubMed: 11309581]

  26. Lee S-K, Johnson RE, Yu S-L, Prakash L, Prakash S. Requirement of yeast SGS1 and SRS2 genes for replication and transcription. Science. (1999);286:2339–2342. [PubMed: 10600744]

  27. Legrain P, Wojcik J, Gauthier J-M. Protein-protein interaction maps: a lead towards cellular functions. Trends Genet. (2001);17:346–352. [PubMed: 11377797]

  28. Lovett M. Fishing for complements: finding genes by direct selection. Trends Genet. (1994);10:352–357. [PubMed: 7985239]

  29. Mann M, Pandey A. Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases. Trends Biochem. Sci. (2001);26:54–61. [PubMed: 11165518]

  30. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D. A combined algorithm for genome-wide prediction of protein function. Nature. (1999);402:83–86. [PubMed: 10573421]

  31. Marshall E. Do-it-yourself gene watching. Science. (1999);286:444–447. [PubMed: 10577207]

  32. Nadeau JH, Sankoff D. Counting on comparative maps. Trends Genet. (1998);14:495–501. [PubMed: 9865155]

  33. Ohler U, Niemann H. Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet. (2001);17:56–60. [PubMed: 11173099]

  34. Oliver SG. From DNA sequence to biological function. Nature. (1996a);379:597–600. [PubMed: 8628394]

  35. Oliver SG. A network approach to the systematic analysis of yeast gene function. Trends Genet. (1996b);12:241–242. [PubMed: 8763491]

  36. Ponting CP. Tudor domains in proteins that interact with RNA. Trends Biochem. Sci. (1997);22:51–52. [PubMed: 9048482]

  37. Pradet-Balade B, Boulmé F, Beug H, Müllner EW, Garcia-Sanz JA. Translation control: bridging the gap between genomics and proteomics? Trends Biochem. Sci. (2001);26:225–229. [PubMed: 11295554]

  38. Rain J-C, Selig L, De Reuse H. et al. The protein-protein interaction map of Helicobacter pylori. Nature. (2001);409:211–215. [PubMed: 11196647]

  39. Reese MG, Hartzell G, Harris NL, Ohler U, Abril JF, Lewis SE. Genome annotation assessment in Drosophila melanogaster. Genome Res. (2000);10:483–501. [PMC free article: PMC310877] [PubMed: 10779488]

  40. Simonet WS, Lacey DL, Dunstan CR. et al. Osteoprotegerin: a novel secreted protein involved in the regulation of bone density. Cell. (1997);89:309–319. [PubMed: 9108485]

  41. Sinclair DA, Mills K, Guarente L. Accelerated aging and nucleolar fragmentation in yeast sgs1 mutants. Science. (1997);277:1313–1316. [PubMed: 9271578]

  42. Smith V, Botstein D, Brown PO. Genetic footprinting: a genomic strategy for determining a gene's function given its sequence. Proc. Natl Acad. Sci. USA. (1995);92:6479–6483. [PMC free article: PMC41541] [PubMed: 7604017]

  43. Velculescu VE, Vogelstein B, Kinzler KW. Analysing uncharted transcriptomes with SAGE. Trends Genet. (2000);16:423–425. [PubMed: 11050322]

  44. Wach A, Brachat A, Pohlmann R, Philippsen P. New heterologous modules for classical or PCR-based gene disruptions in Saccharomyces cerevisiae. Yeast. (1994);10:1793–1808. [PubMed: 7747518]

  45. Yates JR. Mass spectrometry: from genomics to proteomics. Trends Genet. (2000);16:5–8. [PubMed: 10637622]

Further Reading

  1. Ambros V. Dicing up RNAs. Science. (2001);293:811–813. —Describes current thinking on the natural role of RNA interference in living organisms. [PubMed: 11486075]

  2. Birney E, Bateman A, Clamp ME, Hubbard TJ. Mining the draft human genome. Nature. (2001);409:827–828. —A guide to the bioinformatics tools available for analyzing the human genome. [PMC free article: PMC2658632] [PubMed: 11236999]

  3. Fields S. Proteomics in genomeland. Science. (2001);291:1221–1224. —Explains the importance of proteomics in understanding the human genome sequence. [PubMed: 11233445]

  4. Galas DJ. Making sense of the sequence. Science. (2001);291:1257–1260. —Another description of the bioinformatics tools available for analysis of the human genome sequence. [PubMed: 11233451]

  5. Mann M, Hendrickson RC, Pandey A. Analysis of proteins and proteomes by mass spectrometry. Annu. Rev. Biochem. (2001);70:437–473. [PubMed: 11395414]

  6. O'Brien SJ, Menotti-Raymond M, Murphy WJ. et al. The promise of comparative genomics in mammals. Science. (1999);286:458–462, 479–481. —A review that draws out the importance of comparative methods in studies of mammalian genomes. [PubMed: 10521336]

  7. Pennisi E. Keeping genome databases clean and up to date. Science. (1999);286:447–450. —Highlights some of the problems with the DNA databases. [PubMed: 10577208]

  8. Roos DS. Bioinformatics - trying to swim in a sea of data. Science. (2001);291:1260–1261. —Outlines the challenges to bioinformatics posed by large sequences such as the human genome. [PubMed: 11233452]

  9. Searls DB. Bioinformatics tools for whole genomes. Annu. Rev. Genomics Hum. Genet. (2000);1:251–279. —A review of the applications of computer-based studies in analysis of genome sequences. [PubMed: 11701631]

  10. Various authors (2000) Proteomics: A Trends Guide. Elsevier Science, London. —Reviews and commentaries on various aspects of proteomics.