Predicting Coding Potential from Genome Sequence:
     Department of Mathematics

    Department of Microbiology & Immunology, Stanford University, Stanford, California 94305

    ABSTRACT

    Prediction of protein-coding regions and other features of primary DNA sequence has greatly contributed to experimental biology. Significant challenges remain in genome annotation methods, including the identification of small or overlapping genes and the assessment of mRNA splicing or unconventional translation signals in expression. We have employed a combined analysis of compositional biases and conservation together with frame-specific G+C representation to reevaluate and annotate the genome sequences of mouse and rat cytomegaloviruses. Our analysis predicts that there are at least 34 protein-coding regions in these genomes that were not apparent in earlier annotation efforts. These include 17 single-exon genes, three new exons of previously identified genes, a newly identified four-exon gene for a lectin-like protein (in rat cytomegalovirus), and 10 probable frameshift extensions of previously annotated genes. This expanded set of candidate genes provides an additional basis for investigation in cytomegalovirus biology and pathogenesis.

    INTRODUCTION

    Sequence analysis has been crucial to understanding the biology of cytomegalovirus (CMV) as well as other herpesviruses (13, 30). Human CMV is an important pathogen, causing neurological damage following congenital infection (37) as well as opportunistic infections in immunocompromised individuals. Models of human CMV pathogenesis and immune control have employed related betaherpesviruses that naturally infect guinea pigs (42), rats (8), and mice (21, 25). The initial annotation of a laboratory-propagated human CMV strain, AD169 (10), predicted 194 unique open reading frames (ORFs). Following this report, reevaluation of genome organization has occurred through correction of errors in the AD169 strain sequence (12, 31, 35, 43), recognition of mRNA splicing events (15, 39), and empirical identification of genes that had escaped annotation (3, 24, 26). The human CMV sequence has been updated through analyses of additional strains (9, 16, 17, 33) as well as by comparison to rhesus CMV (20) and chimpanzee CMV (14) genome sequences. Several revisions of the full genome complement of natural CMV have resulted from these studies. The number of genes in human CMV was estimated to range from under 150 to over 200 genes, and the current estimate of 165 genes is considered reasonable (16). Different estimates depend on the information considered, including homology with other genes in available databases, codon bias, preservation of known protein motifs, and the presence of transcription signals (13, 32).

    The annotated human CMV (HCMV) genome sequence has formed a basis for comparisons to other betaherpesviruses. Murine CMV (MCMV) (40) and rat CMV (RCMV) (45) retain obvious sequence homologs of about 80 HCMV ORFs, or roughly 50% of the annotated genes in these viruses. Non-CMV betaherpesviruses infecting humans, such as herpesvirus 6 (19) and herpesvirus 7 (36), as well as those infecting lower primates, such as herpesvirus tupaia (2), retain similar core sets of ORFs. Approximately 40 of these 80 betaherpesvirus-specific ORFs are shared with all mammalian and avian herpesviruses (13) and are considered to be herpesvirus common. Despite obvious levels of divergence in the betaherpesviruses, common biological characteristics have emerged from studies of viruses infecting laboratory animals, and these have helped us to define immune control by the host and immune escape by the virus and to accumulate a myriad of additional basic information on replication, pathogenesis, and latency (25, 28-30, 41).

    Prediction of the protein-coding potential of genomes is by nature provisional. In particular, herpesviruses and other eukaryotic viruses have been difficult to annotate accurately using conventional criteria, as evidenced by the recognition of additional genes as well as the elimination of ORFs found to be spurious based on additional investigation. For example, evidence suggests that the commonly employed limitations of ORF length (100 codons) and maximum ORF overlap (<60%) lead to the exclusion of known CMV gene products, such as the multiply spliced immunomodulatory function, viral interleukin 10 (24, 26), and the 73-amino-acid herpesvirus-conserved smallest capsid protein (18). Similarly, recent efforts to identify structural proteins in MCMV have also resulted in several revisions to genome annotation (23). In addition to the small sizes of ORFs, biologically relevant events that may confound conventional annotation methods include posttranscriptional modification, mRNA splicing, alternate translation initiation sites, and stop codon suppression. Finally, automated annotation procedures may also be confounded due to unrecognized errors in underlying sequencing. Current limitations of analysis might be overcome by new approaches that are less restrictive and provide an extended list of candidate genes for experimental verification.

    In the present study, we investigated the protein-coding potential of the MCMV and RCMV genomes, taking into account the conservation of ORFs and genome-specific sequence features. Analogously to the human and chimpanzee CMV genomes (14), MCMV and RCMV retain a remarkable level of evolutionary relatedness and similarity in both functional organization and arrangement of genes (40, 45). Our analysis of genome-specific sequence features focused on translational "frame analysis" (5), exploiting the differential G+C distribution among codon base positions in genomes of high G+C content (see Materials and Methods and Fig. S1 in the supplemental material). To provide an objective means to evaluate the extent to which G+C content influences the translational frames and to reveal a potential coding region in any sequence, we also defined a new measure of gene compositional bias and a related measure of coding potential. Our approach makes no assumptions about the minimum length of coding sequences, although we focused on ORFs of at least 20 codons, and does not impose restrictions on the degree of overlap between putative protein-coding regions. This procedure represents a marked modification of standard methods and produces a substantial revision of the current annotations for the MCMV and RCMV genomes. Our analysis suggests that CMV genomes likely encode a greater number of overlapping genes than previously thought.

    MATERIALS AND METHODS

    The genome sequence of the MCMV Smith strain (GenBank accession number U68299) and the genome sequence of the RCMV Maastricht strain (GenBank accession number NC_002512) were compared, and sequence features were analyzed using frame-specific profiles of G+C contents.

    Homologies. Similarity between ORF products of MCMV and RCMV was evaluated by the significant segment pair alignment (SSPA) program, and regions of similarity were identified by the multiple alignment program ITERALIGN (7). The alignment of viral genomes employed ORF products with lengths of at least 20 codons. The predicted products of all ORFs with lengths of at least 20 codons (60 nucleotides [nt]) were queried against a large nonredundant database of protein sequences using the BLASTP program (1).

    Frame-specific G+C profiles (S-profiles). We characterized the G+C contents and distribution of genomic sequences of MCMV and RCMV by three measures of frame-specific G+C content (5). The G+C content of the genome was evaluated within a moving window of fixed length (201 nt or 102 nt) with respect to every third nucleotide of the genome. First, genome positions 1, 4, 7, and so on, up to the end of the genome sequence, were scanned, and then genome positions 2, 5, 8, and so on, were scanned, followed by genome positions 3, 6, 9, and so on (Fig. S1). With this procedure, variations in G+C contents along the genome were represented by three profiles, each representing a frame, referred to as "S-profiles." The relationships among S-profiles were used to assess the presence of protein-coding genes in genome regions of high G+C content (5), qualitatively by visual examination and quantitatively through the definitions of a bias in a frame-specific G+C distribution (S-bias) and of a related measure of coding potential (see below).
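    The windowed, every-third-nucleotide scan described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `s_profiles`, the 0-based frame offsets, and the exact alignment of windows between frames are hypothetical choices, not details taken from the original procedure.

```python
def s_profiles(genome, window=102):
    """Return three lists of frame-specific G+C fractions (one per frame offset).

    profiles[k][j] is the G+C fraction among sampled positions
    k + 3j, k + 3(j + 1), ... (window // 3 positions in total).
    Assumption: windows are anchored at their first sampled position.
    """
    if window % 3 != 0:
        raise ValueError("window must be a multiple of 3")
    is_gc = [1 if base in "GCgc" else 0 for base in genome]
    per_frame = window // 3  # sampled positions per window in one frame
    profiles = []
    for k in range(3):
        frame_bits = is_gc[k::3]  # every third nucleotide, starting at offset k
        values = []
        if len(frame_bits) >= per_frame:
            running = sum(frame_bits[:per_frame])
            values.append(running / per_frame)
            # Slide the window one sampled position (3 nt) at a time.
            for j in range(per_frame, len(frame_bits)):
                running += frame_bits[j] - frame_bits[j - per_frame]
                values.append(running / per_frame)
        profiles.append(values)
    return profiles
```

    Both window lengths mentioned in the text (102 nt and 201 nt) are multiples of 3, so each window samples the same number of positions in every frame.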

    S-bias. For a potential coding region of G+C content S, we defined a measure of how the G+C contents at codon base positions 1, 2, and 3 (S1, S2, and S3) compared to expectations (S-bias). Expectations $E_1(S)$, $E_2(S)$, and $E_3(S)$ of S1, S2, and S3 for a potential coding sequence of G+C content S were defined by the linear regressions of S1, S2, and S3 over S, measured in a set of 2,813 published herpesvirus genes (Fig. 1B to D). S1, S2, and S3 values were normalized to these expectations by the differences $S_1' = S_1 - E_1(S)$, $S_2' = S_2 - E_2(S)$, and $S_3' = S_3 - E_3(S)$. The obvious relation $S_1 + S_2 + S_3 = 3S$ holds for each gene. Since also $E_1(S) + E_2(S) + E_3(S) = 3S$, the normalized G+C content values project onto the plane $S_1' + S_2' + S_3' = 0$, which can be represented in two orthogonal dimensions, $T_1$ and $T_2$. The scaling factor K was specified so that Var($T_1$) was equal to Var($T_2$) for the set of 2,813 published herpesvirus genes (see Results). The S-bias of a putative coding region of G+C contents S, S1, S2, and S3 was defined as the magnitude of the corresponding vector $(T_1, T_2)$: S-bias$(S_1, S_2, S_3 \mid S) = (T_1^2 + T_2^2)^{1/2}$. With this definition, a sequence with a distribution of G+C nucleotides among codon base positions corresponding to expectations will have an S-bias of 0.0, independently of its overall G+C content. As the S-bias increases, the likelihood of an ORF to code for a protein decreases.
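    As a small illustration of the quantities entering the S-bias, the sketch below computes S, S1, S2, and S3 for an in-frame sequence and shows why S1 + S2 + S3 = 3S. The function name `codon_position_gc` is hypothetical, and the regression expectations and the (T1, T2) transformation are deliberately not implemented, since their exact form is not given here.

```python
def codon_position_gc(coding_seq):
    """Return (S, S1, S2, S3) as fractions for an in-frame coding sequence.

    Trailing bases that do not complete a codon are ignored.
    """
    usable = len(coding_seq) - len(coding_seq) % 3
    if usable == 0:
        raise ValueError("sequence shorter than one codon")
    seq = coding_seq[:usable].upper()
    n_codons = usable // 3
    # G+C fraction at codon base positions 1, 2, and 3.
    by_pos = [sum(base in "GC" for base in seq[k::3]) / n_codons
              for k in range(3)]
    overall = sum(base in "GC" for base in seq) / usable
    return (overall, *by_pos)


s, s1, s2, s3 = codon_position_gc("ATGGCCAAG")
# Each codon position holds one third of the bases, so the three
# position-specific fractions sum to three times the overall fraction.
assert abs((s1 + s2 + s3) - 3 * s) < 1e-12
```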

    Biases in codon and amino acid usage. Biases in codon usage were calculated as described in reference 22. Let g(x, y, z) be the frequency of the codon composed of bases (x, y, z) in gene group G (G could be a single gene), normalized for each amino acid codon family such that

    $$\sum_{(x,y,z)=a} g(x, y, z) = 1$$

    where the sum extends over all codons translated to amino acid a. Let f(x, y, z) be the codon frequencies for a gene group F normalized to 1 in each amino acid codon family. The codon usage bias (C-bias) of gene group G relative to F was calculated by the formula (22)

    $$B(G \mid F) = \sum_a p_a(G) \left[ \sum_{(x,y,z)=a} \left| g(x, y, z) - f(x, y, z) \right| \right]$$

    where [pa(G)] is the amino acid frequencies of gene group G. For each genome, two reference groups of genes F were chosen to calculate C-biases, one corresponding to all annotated genes inside the regions of high G+C content of the respective genome and a second group including all other annotated genes (see Results for a definition of these regions in the MCMV and RCMV genomes).
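    A codon usage bias of this kind can be sketched as follows. The sketch assumes the commonly used form of the measure, namely the sum over amino acids of the amino acid frequency in G multiplied by the summed absolute differences between the family-normalized codon frequencies of G and F; it also assumes the standard genetic code. The names `codon_family_freqs` and `c_bias` are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stops).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_family_freqs(seqs):
    """Return (codon frequencies normalized to 1 per amino acid family,
    amino acid frequencies) for a group of in-frame coding sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            # Skip stop codons and codons with ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODON_TABLE[codon]] += n
    total = sum(aa_totals.values())
    codon_freq = {c: n / aa_totals[CODON_TABLE[c]] for c, n in counts.items()}
    aa_freq = {a: n / total for a, n in aa_totals.items()}
    return codon_freq, aa_freq


def c_bias(group_g, group_f):
    """Assumed form: sum_a p_a(G) * sum_{codons of a} |g(codon) - f(codon)|."""
    g, p_g = codon_family_freqs(group_g)
    f, _ = codon_family_freqs(group_f)
    bias = 0.0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        bias += p_g.get(aa, 0.0) * abs(g.get(codon, 0.0) - f.get(codon, 0.0))
    return bias
```

    Under this form the bias is 0 when G and F use synonymous codons identically and reaches 2 when they use entirely different codons for every amino acid.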

    The amino acid bias (aa-bias) of gene group G relative to gene group F was defined as

    where pa(G) and pa(F) are the frequencies of the amino acid a in G and in F, respectively, and the sum extends over all amino acids.

    Coding potentials. Local coding potentials were evaluated from nucleotide composition as follows. For each of the six coding frames (three on the direct strand and three on the complementary strand), an S-bias was evaluated within a window of 102 nt. The probability distribution of S-biases in coding regions was determined from similar windows extracted from all annotated coding regions of 28 herpesvirus genomes. Corresponding probabilities were obtained for random distributions of S1, S2, and S3 given G+C content S. In the case of random distributions, S1, S2, and S3 values have the same expectation (S) and same distribution. The S-bias for random distribution (S-bias$_{\mathrm{rand}}$) was calculated by normalizing S1, S2, and S3 values as follows: $S_1' = S_1 - S$, $S_2' = S_2 - S$, and $S_3' = S_3 - S$, and biases were calculated directly from these normalized values. The conditional probability (coding potential) $P(\mathrm{COD}_i \mid F)$ of a sequence, F, to be coding in frame i was evaluated as follows:

    where P[S-bias$_i$(F)] is the frequency of S-biases greater than or equal to S-bias$_i$(F) among all windows extracted from herpesvirus annotated coding regions and P[S-bias$_{\mathrm{rand}}$(F)] is the frequency compared to the random distribution obtained for sequences of the same G+C content.

    GeneMark coding potentials (6) were also evaluated based on predictions obtained with the program GeneMarkS (4) as implemented at the website http://opal.biology.gatech.edu/GeneMark/genemarks.cgi.

    RESULTS

    G+C contents in coding regions of the MCMV and RCMV genomes. Analysis of genes and genomes of prokaryotic origin has shown that global G+C content correlates distinctively with the G+C content at the first (S1), second (S2), and third (S3) codon positions (5, 34). As expected, similar relationships extended to complete herpesvirus genomes (Fig. 1A) as well as to individual herpesvirus genes (Fig. 1B to D). Variations in global G+C content corresponded to large differences in S3, whereas smaller differences occurred in S1 and, especially, in S2. The relationship S3 >> S1 >> S2 is typical of the three codon positions of expressed genes in G+C-rich genomes.

    The high overall G+C contents of the MCMV and RCMV genomes (58.7% and 61.0%, respectively) and the corresponding high contrasts in G+C usage at different codon positions (Fig. 1A) were expected to provide a means to reliably identify protein-coding regions. However, C and G bases were not evenly distributed across the two genomes (Fig. 2). In both MCMV and RCMV, the G+C contents were greatest (61.7% and 69.2%, respectively) in the large genomic segment containing the betaherpesvirus-conserved protein-coding regions. G+C contents were more varied and generally lower (54.3% and 47.7%, respectively) in the remaining genomic segments. These differences in G+C contents resulted in varied contrasts of G+C usage at each codon position for genes expressed in different genomic regions.

    Compositional analysis of ORFs of at least 20 codons from the MCMV and RCMV genomes. We identified a total of 5,541 MCMV and 4,741 RCMV ORFs with lengths of at least 20 codons (defined without regard to AUG codons, from stop codon to stop codon). All ORFs were analyzed in terms of S-bias, C-bias, and aa-bias (see Materials and Methods). The compositional biases of ORFs corresponding to previously annotated coding sequences were determined over the previously reported length (40, 45), and newly annotated ORFs were evaluated over the entire stop codon-to-stop codon distance as well as beginning at AUG codons when these were at least 60 nt upstream of a stop codon.
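    The stop codon-to-stop codon definition of an ORF can be illustrated with a short six-frame scan. This is a sketch only: the function names are hypothetical, coordinates are 0-based and end-exclusive on the scanned strand, and stop-free stretches that run into a sequence end are reported as well, which is an assumption rather than a documented detail of the analysis.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def stop_to_stop_orfs(genome, min_codons=20):
    """Yield (strand, frame, start, end) for stop-free stretches of at
    least min_codons codons, without regard to ATG codons."""
    genome = genome.upper()
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            run_start = frame
            for i in range(frame, len(seq) - 2, 3):
                if seq[i:i + 3] in STOPS:
                    if (i - run_start) // 3 >= min_codons:
                        yield strand, frame, run_start, i
                    run_start = i + 3  # next stretch begins after the stop
            # Stretch that reaches the end of the sequence without a stop.
            end = frame + ((len(seq) - frame) // 3) * 3
            if (end - run_start) // 3 >= min_codons:
                yield strand, frame, run_start, end
```

    For example, `list(stop_to_stop_orfs("TAA" + "GCA" * 25 + "TAG"))` includes `("+", 0, 3, 78)`, the 25-codon stretch between the two stop codons.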

    S-biases. We devised a scoring system to quantify biases in G+C contents at codon positions across a putative coding region (see Materials and Methods). The expected G+C contents at codon positions 1, 2, and 3 were determined from the regression lines over 2,813 annotated ORFs from 28 herpesvirus genomes (Fig. 1B to D). Normalized G+C values were transformed into the coordinate system T1 and T2 with a scaling factor (K) equal to 1.804 (see Materials and Methods). The distribution of S-biases of all MCMV and RCMV ORFs of at least 20 codons is shown by the black lines in Fig. 3A and B, where we chose the starting position associated with the lowest bias for each newly annotated ORF. The distribution of all such ORFs with G+C contents of >50% (4,348 ORFs from MCMV and 3,095 ORFs from RCMV) is shown by the gray lines. The distribution of biases among previously annotated ORFs is also shown for comparison. A large proportion (46% in MCMV and 51% in RCMV) of all previously nonannotated ORFs had a low S-bias typical of coding regions (lower than the threshold corresponding to 95% of the annotated genes). Virtually all ORFs with high S-biases (>40) had high G+C contents (>50%), as expected from the great asymmetries in frame-specific G+C usage that distinguish G+C-rich coding from noncoding sequences. However, 34% of the ORFs with high G+C contents showed low S-biases. We also evaluated the biases in codon usage and amino acid usage for all ORFs (see Materials and Methods). We computed all biases relative to the average frequencies observed either among previously annotated genes encoded in the region of high G+C content or among all other annotated genes of the respective genome. For each ORF we selected the smaller of the two biases. The distributions of C-biases and aa-biases among ORFs of MCMV and RCMV are shown in Fig. 3C to F. 
The C-biases and aa-biases of previously annotated ORFs were low compared to those of the ORF sets analyzed here, although, as for S-biases, a large number of these ORFs have C-biases (18% in MCMV and 32% in RCMV) and aa-biases (19% in MCMV and 33% in RCMV) within the 95th percentile range of the corresponding annotated genes.

    We identified 227 ORFs from MCMV and 350 ORFs from RCMV (supplemental Tables S1 and S2) that had not been previously annotated but were characterized by compositional biases typical of annotated genes (within the 95% confidence interval). For each of these ORFs, all starting positions associated with low biases are listed in supplemental Tables S1 and S2. The average length of these ORFs was 108 codons in RCMV and 129 codons in MCMV. Most of the ORFs from MCMV (91.6%) and the majority of those from RCMV (72.3%) were characterized by a high G+C content (>50%), with averages of 59.6% in MCMV and 58.0% in RCMV. By compositional criteria alone, these ORFs were considered potentially expressed.

    Conservation between MCMV and RCMV genomes. We searched for similarities between proteins potentially encoded by ORFs with lengths of at least 20 codons from MCMV and RCMV using the computer protocol SSPA (7). Pairwise comparisons between 5,541 MCMV and 4,734 RCMV ORFs resulted in 73,330 pairs (0.28%) that exhibited statistically significant similarity. As expected, extended similarities (50% SSPA similarity) were distributed along the two viral genomes in a collinear fashion. We then applied the ITERALIGN multiple sequence alignment program (7) to identify all ungapped blocks of aligned positions with lengths of >10 codons. To select the most-reliable regions of homology among all blocks, we constructed a pairwise alignment of the MCMV and RCMV genomes, starting from the longest blocks and progressively adding shorter blocks. Blocks that were not collinear with the partial alignment obtained from the longer blocks were excluded. The resulting genome alignment, shown in Fig. 4, involved 107,739 positions, covering about 47% of each genome sequence. Of these, 94,261 (87.5%) coincided with alignments between amino acids of annotated proteins and were plotted in black. A remarkable 13,478 positions (12.5%), plotted in red, corresponded to alignments between ORF pairs involving at least one ORF not appearing in the original genome annotations.
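    The longest-first selection of collinear blocks can be sketched as a simple greedy filter. The representation of a block as `(start_a, start_b, length)`, the function name, and the restriction to blocks in the same orientation are assumptions made for illustration; they are not details of the SSPA or ITERALIGN programs.

```python
def select_collinear_blocks(blocks):
    """Greedy selection of mutually collinear ungapped blocks.

    blocks: iterable of (start_a, start_b, length) tuples giving the start
    of each block in genome A and genome B and its length.
    Longer blocks are considered first; a block is kept only if it lies
    entirely before, or entirely after, every block already kept, in both
    genomes at once.
    """
    accepted = []
    for block in sorted(blocks, key=lambda b: b[2], reverse=True):
        a, b, n = block
        collinear = True
        for a2, b2, n2 in accepted:
            before = a + n <= a2 and b + n <= b2
            after = a >= a2 + n2 and b >= b2 + n2
            if not (before or after):
                collinear = False
                break
        if collinear:
            accepted.append(block)
    return sorted(accepted)
```

    In this sketch, a block at (60, 200) is rejected once blocks at (0, 0) and (100, 100) have been accepted, because it precedes the second block in one genome but follows it in the other.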

    We found 66 ORF pairs corresponding to conserved regions involving nonannotated ORFs (Table 1), comprising 49 ORFs from MCMV and 56 ORFs from RCMV. ORFs whose alignment regions were not closely interspersed within regions otherwise conserved between annotated genes are shown in Table 1 in boldface. The conservation of 10 MCMV ORFs and 18 RCMV ORFs (Table 1) strongly suggested that they were expressed genes (see alignments in the supplemental material). Among these, two ORFs from MCMV showed similarities to annotated genes of RCMV. One of them (newly annotated as m120.1) was similar to RCMV r119.4 and included a putative start codon (AUG) at position 175665. Based on position and conservation, a second ORF similar to RCMV R31 and annotated as M31b appeared to be a likely 3' frameshift extension of M31, as recently reported (23). ORF M73.5e2 was not published in the initial report of the MCMV genome sequence but was subsequently described as exon 2 of M73.5 in MCMV TS9 (42a) (GenBank entry L34342). Seven additional ORFs from MCMV (m38.5, m44.1, m44.3, m45.2, m106.1, m106.3, and m123.1) showed similarities to corresponding newly annotated ORFs from RCMV.

    In RCMV we found 11 newly annotated ORFs (r4, R23a, r25.3b, r48.2, R71, R73.5e2, R98a, R102b, r115.1, r124.1, and r132e2) with similarity to published genes of MCMV. Four of these were interpreted, for their relative positions, as likely to derive from frameshifts (likely resulting from sequencing errors) within the sequence of neighboring annotated genes. In particular, we found a 5' extension of R23 (R23a), a 3' extension of r25.3 (r25.3b), a 5' extension of R98 (R98a), and a 3' extension of R102 (R102b). The similarity of ORF r4 to MCMV m20 suggested a substantial 5' extension of this ORF compared to the current annotation (see the supplemental material). Surprisingly, no AUG codon was found in the extended region of r4. Similarly, the strong conservation between r70.1 and MCMV M71 suggested that the coding region of r70.1 may be extended to a region 5' of the current annotation, which also lacked any AUG codons, and that this gene should be named R71 to emphasize its homology to the gene already published in MCMV. The lack of AUG codons in the extended regions of these ORFs suggests an alternative translation start mechanism, a multiexon gene structure, or the possibility of sequencing errors. We identified a homolog of MCMV M73.5e2 in RCMV that we designated R73.5e2. The short ORF r115.1 showed strong similarity to the C-terminal region of M116 from MCMV, which did not show significant similarity to R116. The newly annotated ORF r132e2 of RCMV was recognized as the second exon of annotated gene r133 for its high similarity to exon m132e2 in MCMV (see the supplemental material). This result suggests that r133 (homologous to m133e1) should be renamed r133e1. Seven newly annotated ORFs from RCMV (r38.5, r44.1, r44.3, r45.2, r106.1, r108.1, and r124.2) showed similarities to corresponding newly annotated ORFs from MCMV. Groups of conserved ORFs overlapping in the respective genomes are italicized in Table 1. 
Additional sequence features would be needed to distinguish the frame(s) of the coding sequence(s) among these ORFs (see the supplemental material for the cases, e.g., of m48.1 and m48.2).

    BLASTP analyses. We queried the products of all ORFs with lengths of at least 20 codons against the NCBI nonredundant database of 644,068 proteins from coding sequence translations of sequences in GenBank, the Protein Data Bank, Swiss-Prot, and PIR using BLASTP (1). Significant results (E-value, <0.001) from this analysis for ORFs in the MCMV and RCMV genomes not previously annotated are reported in supplemental Tables S3 and S4, respectively. Eleven of these matches (see the footnotes of Tables S3 and S4) involved sequences of low complexity and are likely spurious. Other matches confirmed ORFs M31b, M73.5e2, m120.1, r48.2, R73.5e2, and R102b, newly annotated based on SSPA analysis.

    BLASTP analysis identified four additional candidate genes in the MCMV genome, one with similarity to a region of RCMV r5, one overlapping M57 and similar to the single-stranded DNA-binding protein of primate CMVs, one similar to a hypothetical protein of the rhesus macaque CMV, and one similar to RCMV r95.1. In RCMV, 16 ORFs showed interesting BLASTP matches (boldfaced in supplemental Table S4). Three ORFs showed respective similarities to the arabinogalactan protein of maize, to the regulatory protein E2 from human papillomavirus, and to BHLF1 from EBV. ORF r169.1 (overlapping r169) showed extensive similarity to ORF r171, located immediately downstream in the RCMV genome (see the supplemental material). A notable feature in the RCMV genome evidenced by the BLASTP analysis was the existence of multiple similarities between ORFs overlapping in different frames the published genes r121.1, r121.2, and r125. These similarities corresponded to multiple exact repetitions of long DNA elements (supplemental Tables S5 and S6) duplicated in different frames within the same overlapping ORF. The lack of relatedness of these DNA structures to any coding frame suggests that these genome regions may not code for proteins at all.

    Three ORFs of RCMV, newly named r153e2, r153e3, and r153e4, showed significant similarity to a lectin-like glycoprotein first identified in the English isolate of RCMV (46), where the protein is encoded by five exons. Similarity analysis of this protein against our collection of peptides suggested that a homologous lectin-like protein is also encoded in the RCMV Maastricht genome, within the region including positions 217034 to 217816. By the alignment of the putative products of these ORFs to the protein identified in the English isolate (Fig. 5) and the identification of putative donor and acceptor sites in the RCMV genome, we suggest that this protein is encoded in RCMV Maastricht but employs four exons and has a total length of 186 aa.

    S-profile analysis. Figures 6 and 7 display frame-specific profiles of G+C contents (see Materials and Methods) and regions of conservation of the MCMV and RCMV genomes. For frame-specific analysis (5), the frequencies of G+C were calculated with respect to every third position using windows of 102 nt or 201 nt. The G+C contents determined over positions in frame with nt 1, 2, or 3 of the complete genomes are represented by the red, green, and blue curves (see Materials and Methods and the legend of Fig. 6). We refer to the shape and relationship of the three curves at various sequence positions as "S-profiles." Published genes (40, 45) are represented in Fig. 6 and 7 as colored with reference to the frame of the third base of their codons (see the legend of Fig. 6). The color and orientation of each gene also identify the G+C curves corresponding to the first and second base positions of its codons. For example, the G+C contents of codon base positions 1, 2, and 3 for gene M100, colored in green and encoded on the complementary strand, are shown by the red, blue, and green curves, respectively. For gene M102, colored in green but encoded on the direct strand, positions 1, 2, and 3 instead correspond to the blue, red, and green curves, respectively. Figures 6 and 7 also show all conserved regions identified by SSPA analysis (7), all nonannotated ORFs highlighted by SSPA analysis (from Table 1) or BLASTP analysis (from supplemental Tables S3 and S4) (1), and the frames of all regions of high coding potential, determined by S-profile biases and by the GeneMark procedure (6). Finally, Fig. 6 and 7 show all potential coding regions identified by our analysis of S-profiles and conservation among all ORFs with at least 20 codons from the two genomes (see the legend of Fig. 6).

    Previously annotated genes recognized by S-profile analysis. Within the region of high G+C contents of the MCMV genome and of the RCMV genome, there was a clear correspondence of S-profiles with most previously annotated ORFs (see also supplemental Tables S7 and S8). In this region, S-profile analysis recognized 74 of the 88 annotated ORFs from MCMV and 79 of the 92 annotated ORFs from RCMV. S-profiling was sensitive to G+C content such that in the MCMV regions of low G+C contents, only 50 of the 89 annotated ORFs were identified and in the low-G+C-content region of RCMV, 21 of the 75 annotated ORFs were recognized solely by this method. Considering the G+C compositions of individual ORFs, S-profiles recognized 219 of the 291 annotated ORFs with a G+C content of >50%, compared to 5 of 53 ORFs with a G+C content of <50%. These results indicated that S-profile analysis is a powerful approach to identifying coding sequences with G+C contents as low as 50%. Among the annotated genes of high G+C content (>50%) not identified by S-profiles, some were clearly contradicted by their S-profiles whereas others exhibited high G+C contents (often >60%) in all three codon positions (see below).

    To provide an objective means to identify regions in the MCMV and RCMV genomes where S-profiles would predict the presence of protein-coding sequences, we first identified all positions (centered in windows of 102 nt) where frame-specific G+C contents differed by more than 35% (corresponding to a random probability of 0.01). We then excluded all regions where these contrasts could be explained by the presence of previously annotated ORFs. The remaining regions of high frame-specific G+C contrasts, shown as shaded blocks in Fig. 6 and 7, suggest the existence of expressed genes.
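    The thresholding step can be sketched as follows, given three frame-specific G+C profiles of equal length expressed as fractions. The sketch reads "differed by more than 35%" as the spread between the highest and lowest of the three frame values exceeding 0.35, which is an interpretive assumption; the function names are hypothetical, and the subsequent exclusion of regions explained by annotated ORFs is not shown.

```python
def high_contrast_positions(profiles, threshold=0.35):
    """Return window indices where the spread (max - min) of the three
    frame-specific G+C fractions exceeds the threshold."""
    hits = []
    for j, values in enumerate(zip(*profiles)):
        if max(values) - min(values) > threshold:
            hits.append(j)
    return hits


def merge_runs(indices):
    """Collapse sorted indices into inclusive (first, last) runs of
    consecutive values, giving candidate high-contrast regions."""
    runs = []
    for j in indices:
        if runs and j == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], j)
        else:
            runs.append((j, j))
    return runs
```

    Merging consecutive hits into runs gives contiguous candidate regions that can then be compared against existing annotations.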

    Newly annotated ORFs. All ORFs whose expression was consistent with the observed high contrasts in S-profiles (>35%) were shown among the newly annotated ORFs in Fig. 6 and 7 and were listed in Table 2 (MCMV) and Table 3 (RCMV). In these tables we have indicated the genome positions of the annotated coding sequences, their G+C contents, and the newly assigned name for each ORF, and we have also indicated the published genes that overlapped each newly annotated ORF. For each ORF we have also identified the presence of a putative translation initiation codon, low compositional biases, conservation, and overlap to sequence regions of high coding potential measured by S-biases or measured by the GeneMark (6) procedure, as well as weak conservation or overlap only to short regions of high coding potential. The presence of an AUG codon did not apply when ORFs were interpreted as internal or last exons or as 3' frameshift extensions of a sequence in a different frame. Evidence from S-profiles was distinguished as extending through the full length of the ORF or as partially covering the length of the ORF. We included 33 ORFs in Tables 2 and 3 that were identified by SSPA and/or BLASTP similarity and had been retained after scrutiny through S-profile analysis. ORFs that were most reliably predicted as coding sequences (by the extension and strength of the S-profile signal and/or by strong conservation) are indicated. ORFs supported by conservation (through SSPA and/or BLASTP analysis) and by S-profiles are also indicated.

    A total of 126 ORFs from the MCMV genome and 105 ORFs from the RCMV genome were newly annotated as a result of this analysis. In 101 of these ORFs, we could identify a putative start codon; 87 ORFs showed conservation through SSPA or BLASTP analyses, and 77 ORFs were characterized by low compositional biases. We observed substantial concordance between regions of high coding potential identified by S-profiles or by the GeneMark procedure. In either viral genome, 175 ORFs included regions of high coding potential based on S-biases and a majority (140 ORFs) were also supported by the coding potential evaluated by GeneMark. A total of 61 ORFs coincided with the underlying contrasts in S-profiles, whereas 152 ORFs were characterized by partial S-profiles.

    S-profiles of ORFs identified by SSPA similarity. All annotated ORFs identified by SSPA similarity retained in Tables 2 and 3 are marked. Among the potential coding regions identified by SSPA similarity analysis, 2 newly annotated ORFs from MCMV (M31b and M73.5e2) and 10 newly annotated ORFs from RCMV (R23a, r25.3b, r38.5, r48.2, R73.5e2, R98a, R102b, r115.1, r153e3, and r153e1) were also confirmed by strong contrasts in S-profiles. S-profiles also distinguished the most likely reading frame within some of the groups of overlapping MCMV and RCMV ORFs with similarity to one another (Table 1). Within two of these groups, S-profiles favored as coding regions RCMV ORF r48.2, homologous to published ORF m48.2, and ORF r124.1, homologous to published ORF m124. In a third group of ORFs overlapping M116 and R116, S-profiles identified in MCMV an ORF borne on the direct strand (m116.1), whereas RCMV conservation and S-profiles favored an ORF on the complementary strand (r115.1). Seven ORFs from MCMV (m44.1, m44.3, m45.2, m106.1, m106.3, m120.1, and m123.1) and seven ORFs from RCMV (r44.1, r44.3, r45.2, r106.1, r108.1, r124.2, and r132e2), identified by extended SSPA similarity, were not recognized by S-profiles. Of these, ORF m120.1 from MCMV and ORF r132e2 from RCMV showed particularly strong conservation. All 14 ORFs are listed in Tables 2 and 3 as potential protein-coding sequences.

    S-profiles of ORFs identified by BLASTP similarity. All annotated ORFs identified by BLASTP analysis and retained in Tables 2 and 3 are indicated. Among four ORFs in MCMV and eight ORFs in RCMV that were identified as candidate genes by BLASTP analysis (Tables S3 and S4 in the supplemental material), three ORFs from RCMV, corresponding to two exons (r153e1 and r153e3) of the lectin-like gene and to a paralog (r169.1) of published ORF r171, were also supported by S-profiles. Exons 2 and 4 of r153 could not be confirmed by their S-profiles due to low G+C contents. Another ORF from RCMV (r58.1) was identified by BLASTP for its similarity to the regulatory protein E2 of human papillomavirus. Although S-profiles did not support the expression of this ORF over its entire length, the expression of the C-terminal portion (corresponding to the conserved region) was supported by extended GeneMark coding potentials and by a weak S-profile signal. Other BLASTP-identified coding regions were not supported by S-profiles, which strongly supported the authenticity of previously annotated ORFs in the same regions. These findings suggest that a reevaluation of other published proteins matching these ORFs (mostly hypothetical proteins from various herpesviruses [see supplemental Tables S3 and S4]) would be valuable.

    Nonconserved ORFs identified by S-profile analysis. Eight ORFs from MCMV (m20b, m116.1, m122.5, m122.6, m143b, m154.1, m154.2, and m163.1) and three ORFs from RCMV (r2.2, R27a, and r41.1), although not conserved or only poorly conserved, corresponded to strong contrasts in S-profiles and to extended regions of high coding potential. Among these, ORF m20b has been experimentally verified as a frameshift 3' extension of m20 (23). We also interpreted ORF m143b as a frameshift 3' extension (or possibly a second exon) of m143, consistent with the lack of an AUG codon. ORF R27a was interpreted as a frameshift 5' extension of R27 and terminated at the corresponding approximate position (see the supplemental material). ORF r41.1 was similar to m41.1 mostly in a region coincident with a corresponding region of conservation with the overlapping published genes r41 and m41. However, strong contrasts in S-profiles and the presence of a conserved initiation codon (AUG) strongly suggest that this ORF (and its MCMV homolog, m41.1) is expressed.

    S-profiles and overlapping ORFs. S-profiles yielded useful verification of the position of most previously annotated ORFs of high (>50%) G+C content (see above). However, among these ORFs we identified 99 sequences that were only partially matched by S-profiles. The S-profile evidence for these sequences was classified as "partial" in the "evidence" column of Tables S7 and S8, where it was also diagrammatically represented (e.g., for ORF m25.2 "++–" indicates that over approximately the first two thirds of the annotated sequence, S-profiles conform to the expression of this ORF but not over the last third). Many partial S-profile inconsistencies observed in previously annotated genes coincided with the overlap of newly annotated sequences. In MCMV, 35 previously annotated genes overlapped 58 newly identified ORFs (Table S7), and in RCMV, 24 previously annotated genes overlapped 35 newly identified ORFs (Table S8). Irregular S-profiles were observed in these regions of overlap. In 36 of these situations, the identification of a new ORF fully explained the irregularity (supplemental Tables S7 and S8). Irregular regions could be partly explained in 18 other cases.
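
    The "++–" notation above can be illustrated with a minimal sketch. The exact S-profile statistic is defined in the Methods and is not reproduced here; the code below assumes a simplified stand-in in which a segment "conforms" to a reading frame when G+C content at codon position 3 exceeds that at positions 1 and 2 by a margin, the pattern generally expected of coding sequence in G+C-rich genomes (5). The function names and the `min_contrast` threshold are hypothetical and chosen only for illustration.

```python
def codon_position_gc(seq):
    """Return the G+C fraction at codon positions 1, 2, and 3.

    `seq` is assumed to start at the first base of a codon; trailing
    bases that do not complete a codon are ignored.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    fractions = []
    for pos in range(3):
        bases = seq[pos:usable:3]
        gc = sum(1 for base in bases if base in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)


def thirds_summary(orf, min_contrast=0.1):
    """Summarize an ORF as three marks, one per third of its length.

    A third is marked '+' when position-3 G+C exceeds the larger of the
    position-1 and position-2 values by at least `min_contrast`
    (an assumed threshold, not the published criterion), else '-'.
    """
    n_codons = len(orf) // 3
    marks = []
    for i in range(3):
        start = (i * n_codons // 3) * 3
        end = ((i + 1) * n_codons // 3) * 3
        gc1, gc2, gc3 = codon_position_gc(orf[start:end])
        marks.append("+" if gc3 - max(gc1, gc2) >= min_contrast else "-")
    return "".join(marks)
```

    For a toy sequence whose first two thirds carry G+C only at the third codon position, `thirds_summary` returns "++-", mirroring the kind of partial support tabulated in Tables S7 and S8.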

    Alternative start of translation of previously annotated ORFs. The use of an alternative translation start site was suspected when consistent S-profiles failed to coincide with the most 5'-end-proximal AUG in annotated genes. Alternative initiation sites have already been characterized for some genes, such as MCMV m131, a short first exon of the mck gene, in which translation starts at the fourth AUG codon of the full-length ORF (27).

    Using S-profile analysis, we identified 25 ORF candidates in the MCMV genome that may employ alternative translation start sites located upstream of the previously annotated site (Table S7). A different initiation codon downstream of the previous annotation was predicted in 18 cases (m9, M25, m25.1, M31, M34, M43, M51, M53, M55, M69, M71, M72, M73, M77, M102, m119.1, m131, and m139) (Table S7). An upstream start site was suggested by S-profiles for ORF m16, although no AUG codon was found in this region.

    In RCMV we found 11 ORF candidates for alternative start sites (Table S8), 9 of these apparently starting downstream and 2 (r4 and r70.1) apparently starting upstream of previously suggested start sites. Six of the nine ORFs for which S-profiles suggested a downstream start of translation (R31, r41, r74, R91, R115, and r171.1) also encoded a putative initiation codon (AUG) in corresponding positions. For three ORFs (R77, R122e5, and r166) in which an alternative start site could not be predicted, overlap with other coding sequences was found to explain the observed S-profiles. In the case of R122e5, S-profiles were also consistent with an alternative exon structure. In the cases of r4 and r70.1, S-profiles strongly confirmed evidence from sequence conservation that the coding regions of these genes should be extended 5' of the original annotation (see also the section on SSPA similarity analysis and the supplemental material).
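
    Locating candidate downstream initiation codons within an annotated ORF, as described in this section, can be sketched as follows. This is only an illustration, not the procedure used in the analysis: it lists every in-frame ATG (the DNA equivalent of AUG) in an ORF so that its position can be compared with the point at which a coding signal begins. The function name is hypothetical.

```python
def in_frame_start_codons(orf, start_codon="ATG"):
    """Return the codon indices (0-based) of in-frame start codons.

    `orf` is a DNA string beginning at the first base of the annotated
    ORF. Only triplets in the annotated reading frame are examined, so
    out-of-frame ATG occurrences are not reported.
    """
    orf = orf.upper()
    return [
        i // 3
        for i in range(0, len(orf) - 2, 3)
        if orf[i:i + 3] == start_codon
    ]
```

    With this indexing, the fourth in-frame AUG of an ORF (as in the m131 example) would be the element at position 3 of the returned list, provided the ORF contains at least four such codons.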

    Other contributing evidence from S-profiles. S-profile inconsistencies were found in published ORFs of high G+C content from MCMV (M24, m25.2, M46, m48.2, M50, M69, M71, M87, M93, M112e1, m129, m131, m144, m159, m163, m165, and m170) and from RCMV (R43, r70.4, R77, r133, and r171) that could not be explained by overlapping sequences or alternative translation initiation. The most striking examples of these arrangements were found in genes M69, M87, and R77. Although the nature of these anomalous regions was unclear, in specific cases they might correspond to proteins of peculiar amino acid compositions or to the presence of introns.

    Annotated genes not evidenced by S-profiles. Fourteen previously annotated genes from MCMV and 42 from RCMV could not be confirmed by S-profiles due to low G+C content (indicated in Tables S7 and S8 as not applicable). Virtually all of these genes belonged to the regions of low G+C content of the corresponding genomes. The only exception was gene m74, which had uncharacteristically low G+C content despite its location in the high-G+C region of MCMV. Among annotated genes of high G+C content, 39 genes from MCMV and 28 genes from RCMV did not show the expected S-profile contrasts even though their G+C contents were often >60%. In the case of m19, m48.1, m108, M116, m119.5, and m134 from MCMV and r2.1, r4.1, r25.2, r95.1, and r167 from RCMV (indicated as "contradicted" in Tables S7 and S8), S-profiles largely contradicted their expression, providing evidence for expression of overlapped ORFs in different frames. The 56 other annotated genes (classified as "no evidence" in Tables S7 and S8) had high G+C contents in all three codon positions. The atypical S-profiles underlying these genes may be a consequence of corresponding gene products of atypical amino acid composition. It must be noted that most of these genes were not conserved between MCMV and RCMV and that their expression and functionality have not yet been characterized in any direct investigation.

    DISCUSSION

    The sequencing of genomes and the use of different annotation methods have brought considerable revisions to the definition of the gene complement of cytomegalovirus. In particular, the human CMV genome sequence has been updated with analyses of additional strains (9, 17, 33) and comparisons of cytomegaloviruses from related species (14, 32, 33). We have compared the gene contents of the closely related MCMV Smith strain and RCMV Maastricht strain genomes, emphasizing sequence conservation and frame-specific G+C content (5). As a consequence of this analysis, we propose substantial revision to the annotation of these genomes. In MCMV, 14 newly identified ORFs were convincingly characterized as protein coding and 113 other ORFs showed evidence of a protein-coding capacity. Our analyses suggest the reevaluation of the translation start site of at least 18 previously annotated protein-coding sequences and provide S-profile evidence that contradicts the expression of six previously annotated genes. In RCMV we found strong evidence of at least 20 newly identified protein-coding sequences and the potential for 85 other coding sequences. We propose alternative translation start sites for at least eight previously annotated genes and provide evidence that suggests that five others are not expressed. At least two (up to four) examples in MCMV genes and at least five (up to six) examples in RCMV genes were found where frameshifts extend previously annotated genes. Among these newly annotated genes we have included in the MCMV genome a protein identified (23, 42a) as an alternative spliced form of M73 (M73.5e2) and we have identified its homolog in RCMV, designated R73.5e2. In RCMV we have also identified a second exon to ORF r133 (r133e2) and a homolog to a lectin-like protein (46) encoded by four exons (r153e1 to e4).

    We have avoided conventional criteria of minimum ORF length (100 codons) and maximum ORF overlap (<60%) to prevent the a priori exclusion of a class of genes that has previously been found only by direct experimental investigation. We also have not required an AUG codon to be present in potential coding regions. This allowed us to uncover small and overlapping ORFs, mRNA splicing, use of alternative translation initiation sequences in the coding complement of herpesvirus genomes, and several frameshifts within coding sequences. In fact, two of the frameshift extensions identified by our analysis in the MCMV genome, m20b and M31b, have been experimentally verified as correct 3'-terminal sequences of genes m20 and M31 (23). Alternative translation initiation signals are apparently used in a bona fide complete gene (R71, renamed r70.1) and in ORF r4, whose conservation and S-profile signals clearly extend 5' of their first AUG codon. Other examples of herpesvirus coding regions not initiated by an AUG codon have been reported (46). A coding sequence lacking an initiation codon may also relate to a potential multiexonic structure of the corresponding genes or appear as a consequence of sequencing errors.
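
    The effect of dropping the minimum-length and AUG requirements can be made concrete with a short sketch. The code below is an assumption-laden illustration rather than the annotation pipeline itself: it enumerates stop-to-stop open reading frames in all six frames without requiring an initiation codon, and exposes the length cutoff as a parameter (`min_codons`, a hypothetical name) so that a conventional threshold such as 100 codons can be compared with no threshold at all. Segments that run off the end of the sequence without reaching a stop codon are ignored for simplicity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_to_stop_orfs(seq, min_codons=1):
    """List open reading frames delimited by stop codons in six frames.

    Each result is (strand, frame, start, stop_index), where `start` is
    the index of the first base after the previous stop (or the frame
    offset) and `stop_index` is the index of the terminating stop codon,
    both relative to the strand being read. No ATG is required.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        results.append((strand, frame, start, i))
                    start = i + 3
    return results
```

    Raising `min_codons` removes the shortest candidates first, which is the class of small genes that a fixed length cutoff would exclude a priori.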

    Our analyses revealed intriguing anomalies in the G+C distribution (S-profiles) within annotated genes. Within annotated genes showing N-terminal anomalies in S-profiles, AUG codons were often located near the ends of these regions. This suggested the possibility of an alternative start of translation. An interesting example is M25 from MCMV, encoding tegument protein pM25. This protein presents extensive low-complexity regions 5' of several possible alternative starts of translation identified by our analysis. Intriguingly, in viral preparations, pM25 is found in forms of different molecular masses, identified as a true late 130-kDa peptide (included in the tegument) and two early 105-kDa and 95-kDa peptides (47) and later also as a 200-kDa, 52-kDa, or 48-kDa peptide (23). Peptides translated from the AUG codons corresponding to the region of conservation between MCMV and RCMV and to consistent S-profiles have predicted molecular masses of 57.2 kDa and 45.2 kDa. The sizes of these peptides are consistent with the smaller peptides isolated from viral preparations before replication (the annotated gene has a predicted molecular mass of 103 kDa). We suggest that some peptides from M25 may result from alternative transcription and translation start sites rather than from posttranslational proteolysis.
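
    Predicted molecular masses of the kind quoted above can be approximated from an amino acid sequence alone. The sketch below is not the calculation used for the values reported here; it assumes standard average residue masses, adds one water molecule for the free termini, and ignores posttranslational modifications such as glycosylation or phosphorylation, which can shift the apparent mass of a peptide on a gel. The function name is hypothetical.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153


def predicted_mass_kda(protein):
    """Return the predicted average mass of an unmodified peptide in kDa.

    Raises KeyError if the sequence contains a symbol other than the
    20 standard one-letter amino acid codes.
    """
    protein = protein.upper()
    total = sum(RESIDUE_MASS[aa] for aa in protein) + WATER_MASS
    return total / 1000.0
```

    Applying such a function to the translation that begins at each candidate in-frame AUG gives a set of predicted masses that can be compared with the peptide sizes observed experimentally.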

    Interesting anomalies in G+C distribution were evident also in gene M55 (glycoprotein B) from MCMV. This gene shows striking differences in S-profiles between its 5'-terminal, central, and 3'-terminal parts (Fig. 6). The functional form of this protein is generated by cleavage in the central part of the protein (38, 44). The 3' part of the gene, corresponding to the region of highest S-profile contrasts, is preceded by an AUG codon and two putative TATA box sequences (see the supplemental material). This suggests that the C-terminal part of glycoprotein B may also be alternatively translated from a shortened transcript.

    Many anomalies in S-profiles involving the central or C-terminal parts of annotated sequences cannot be explained by alternative start codons. In many cases these coincide with parts of the protein that are not conserved and often include low-complexity sequences, as, for example, in the pairs of homologs M34/R34, M56/R56, M69/R69, M83/R83, and M105/R105. The hydrophilic amino acid composition and lack of sequence conservation of these regions suggest that they may function as flexible linkers between separate functional domains of a protein or that they may correspond to loops or, for terminal elements, to nonfunctional tails. In the case of MCMV m45.1, the entire sequence has an anomalous composition (see the supplemental material). It is possible that m45.1 evolved from a seemingly nonfunctional N-terminal sequence of M45, still present in the homologous sequence R45 from RCMV.

    The herpesvirus capsid limits the size of the genome that can be packaged. From this perspective, it seems unlikely that nonfunctional regions of DNA can be retained in a genome where genes tend to be densely packed. While noncoding regions may be involved as control elements in transcription or DNA replication, we speculate that the presence of regions of weak selection in herpesvirus proteins may allow these viruses to encode overlapped genes to a greater extent than presently described. Frame analysis of G+C content suggests that the MCMV and RCMV genomes contain ORFs of high coding potential that overlap.

    Annotation is a process of prediction and confirmation by methods that provide a working set of data for additional experimental studies. We believe that there is a need to relax the criteria used in conventional annotation methods in the study of eukaryotic viruses, where overlapping genes and posttranscriptional regulation, such as mRNA splicing and the use of nonconventional translation signals, are relevant biological processes. The increased ease of current experimental techniques in verifying the expression of coding sequences makes striving for coverage, perhaps with a reduction in specificity, a reasonable approach to gene prediction. Our application of different methods of sequence analysis identified a large number of candidate genes that are excluded by more conventional criteria of annotation, providing a more comprehensive picture of the coding potential of these genomes for experimental verification.

    ACKNOWLEDGMENTS

    This work was supported by PHS grant 2 RO1 GM010452 to S.K., PHS grants RO1A120211 and RO1AI30363 to E.S.M., and an Alfred Benzon Foundation fellowship to T.N.K.

    Supplemental material for this article may be found at http://jvi.asm.org/.

    REFERENCES

    Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

    Bahr, U., and G. Darai. 2001. Analysis and characterization of the complete genome of tupaia (tree shrew) herpesvirus. J. Virol. 75:4854-4870.

    Baldick, C. J., Jr., and T. Shenk. 1996. Proteins associated with purified human cytomegalovirus particles. J. Virol. 70:6097-6105.

    Besemer, J., A. Lomsadze, and M. Borodovsky. 2001. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29:2607-2618.

    Bibb, M. J., P. R. Findlay, and M. W. Johnson. 1984. The relationship between base composition and codon usage in bacterial genes and its use for the simple and reliable identification of protein-coding sequences. Gene 30:157-166.

    Borodovsky, M., and J. McIninch. 1993. GeneMark: parallel gene recognition for both DNA strands. Comput. Chem. 17:123-133.

    Brocchieri, L., and S. Karlin. 1998. A symmetric-iterated multiple alignment of protein sequences. J. Mol. Biol. 276:249-264.

    Bruggeman, C. A., H. Meijer, F. Bosman, and C. P. van Boven. 1985. Biology of rat cytomegalovirus infection. Intervirology 24:1-9.

    Cha, T. A., E. Tom, G. W. Kemble, G. M. Duke, E. S. Mocarski, and R. R. Spaete. 1996. Human cytomegalovirus clinical isolates carry at least 19 genes not found in laboratory strains. J. Virol. 70:78-83.

    Chee, M. S., A. T. Bankier, S. Beck, R. Bohni, C. M. Brown, R. Cerny, T. Horsnell, C. A. Hutchison III, T. Kouzarides, J. A. Martignetti, et al. 1990. Analysis of the protein-coding content of the sequence of human cytomegalovirus strain AD169. Curr. Top. Microbiol. Immunol. 154:125-169.

    Reference deleted.

    Dargan, D. J., F. E. Jamieson, J. MacLean, A. Dolan, C. Addison, and D. J. McGeoch. 1997. The published DNA sequence of human cytomegalovirus strain AD169 lacks 929 base pairs affecting genes UL42 and UL43. J. Virol. 71:9833-9836.

    Davison, A. J. 2002. Evolution of the herpesviruses. Vet. Microbiol. 86:69-88.

    Davison, A. J., A. Dolan, P. Akter, C. Addison, D. J. Dargan, D. J. Alcendor, D. J. McGeoch, and G. S. Hayward. 2003. The human cytomegalovirus genome revisited: comparison with the chimpanzee cytomegalovirus genome. J. Gen. Virol. 84:17-28.

    Davis-Poynter, N. J., D. M. Lynch, H. Vally, G. R. Shellam, W. D. Rawlinson, B. G. Barrell, and H. E. Farrell. 1997. Identification and characterization of a G protein-coupled receptor homolog encoded by murine cytomegalovirus. J. Virol. 71:1521-1529.

    Dolan, A., C. Cunningham, R. D. Hector, A. F. Hassan-Walker, L. Lee, C. Addison, D. J. Dargan, D. J. McGeoch, D. Gatherer, V. C. Emery, P. D. Griffiths, C. Sinzger, B. P. McSharry, G. W. Wilkinson, and A. J. Davison. 2004. Genetic content of wild-type human cytomegalovirus. J. Gen. Virol. 85:1301-1312.

    Dunn, W., C. Chou, H. Li, R. Hai, D. Patterson, V. Stolc, H. Zhu, and F. Liu. 2003. Functional profiling of a human cytomegalovirus genome. Proc. Natl. Acad. Sci. USA 100:14223-14228.

    Gibson, W., K. S. Clopper, W. J. Britt, and M. K. Baxter. 1996. Human cytomegalovirus (HCMV) smallest capsid protein identified as product of short open reading frame located between HCMV UL48 and UL49. J. Virol. 70:5680-5683.

    Gompels, U. A., J. Nicholas, G. Lawrence, M. Jones, B. J. Thomson, M. E. Martin, S. Efstathiou, M. Craxton, and H. A. Macaulay. 1995. The DNA sequence of human herpesvirus-6: structure, coding content, and genome evolution. Virology 209:29-51.

    Hansen, S. G., L. I. Strelow, D. C. Franchi, D. G. Anders, and S. W. Wong. 2003. Complete sequence and genomic analysis of rhesus cytomegalovirus. J. Virol. 77:6620-6636.

    Hudson, J. B. 1979. The murine cytomegalovirus as a model for the study of viral pathogenesis and persistent infections. Arch. Virol. 62:1-29.

    Karlin, S., and J. Mrazek. 2000. Predicted highly expressed genes of diverse prokaryotic genomes. J. Bacteriol. 182:5238-5250.

    Kattenhorn, L. M., R. Mills, M. Wagner, A. Lomsadze, V. Makeev, M. Borodovsky, H. L. Ploegh, and B. M. Kessler. 2004. Identification of proteins associated with murine cytomegalovirus virions. J. Virol. 78:11187-11197.

    Kotenko, S. V., S. Saccani, L. S. Izotova, O. V. Mirochnitchenko, and S. Pestka. 2000. Human cytomegalovirus harbors its own unique IL-10 homolog (cmvIL-10). Proc. Natl. Acad. Sci. USA 97:1695-1700.

    Krmpotic, A., I. Bubic, B. Polic, P. Lucin, and S. Jonjic. 2003. Pathogenesis of murine cytomegalovirus infection. Microbes Infect. 5:1263-1277.

    Lockridge, K. M., S. S. Zhou, R. H. Kravitz, J. L. Johnson, E. T. Sawai, E. L. Blewett, and P. A. Barry. 2000. Primate cytomegaloviruses encode and express an IL-10-like protein. Virology 268:272-280.

    MacDonald, M. R., M. W. Burney, S. B. Resnick, and H. W. I. Virgin. 1999. Spliced mRNA encoding the murine cytomegalovirus chemokine homolog predicts a beta chemokine of novel structure. J. Virol. 73:3682-3691.

    McCormick, A. L., A. Skaletskaya, P. A. Barry, E. S. Mocarski, and V. S. Goldmacher. 2003. Differential function and expression of the viral inhibitor of caspase 8-induced apoptosis (vICA) and the viral mitochondrial-localized inhibitor of apoptosis (vMIA) cell death suppressors conserved in primate and rodent cytomegaloviruses. Virology 316:221-233.

    Mocarski, E. S. 2004. Immune escape and exploitation strategies of cytomegaloviruses: impact on and imitation of the major histocompatibility system. Cell. Microbiol. 6:707-717.

    Mocarski, E. S. 2002. Immunomodulation by cytomegaloviruses: manipulative strategies beyond evasion. Trends Microbiol. 10:332-339.

    Mocarski, E. S., Jr., and C. T. Courcelle. 2001. Cytomegaloviruses and their replication, p. 2629-2673. In D. M. Knipe, P. M. Howley, D. E. Griffin, R. A. Lamb, M. A. Martin, B. Roizman, and S. E. Straus (ed.), Fields virology, 4th ed., vol. 2. Lippincott Williams & Wilkins, Philadelphia, Pa.

    Mocarski, E. S., M. N. Prichard, C. S. Tan, and J. M. Brown. 1997. Reassessing the organization of the UL42-UL43 region of the human cytomegalovirus strain AD169 genome. Virology 239:169-175.

    Murphy, E., I. Rigoutsos, T. Shibuya, and T. E. Shenk. 2003. Reevaluation of human cytomegalovirus coding potential. Proc. Natl. Acad. Sci. USA 100:13585-13590.

    Murphy, E., D. Yu, J. Grimwood, J. Schmutz, M. Dickson, M. A. Jarvis, G. Hahn, J. A. Nelson, R. M. Myers, and T. E. Shenk. 2003. Coding potential of laboratory and clinical strains of human cytomegalovirus. Proc. Natl. Acad. Sci. USA 100:14976-14981.

    Muto, A., and S. Osawa. 1987. The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl. Acad. Sci. USA 84:166-169.

    Neote, K., D. DiGregorio, J. Y. Mak, R. Horuk, and T. J. Schall. 1993. Molecular cloning, functional expression, and signaling characteristics of a C-C chemokine receptor. Cell 72:415-425.

    Nicholas, J. 1996. Determination and analysis of the complete nucleotide sequence of human herpesvirus 7. J. Virol. 70:5975-5989.

    Pass, R. F. 2001. Cytomegalovirus, p. 2675-2705. In D. M. Knipe, P. M. Howley, D. E. Griffin, R. A. Lamb, M. A. Martin, B. Roizman, and S. E. Straus (ed.), Fields virology, 4th ed., vol. 2. Lippincott Williams & Wilkins, Philadelphia, Pa.

    Rapp, M., M. Messerle, B. Buhler, M. Tannheimer, G. M. Keil, and U. H. Koszinowski. 1992. Identification of the murine cytomegalovirus glycoprotein B gene and its expression by recombinant vaccinia virus. J. Virol. 66:4399-4406.

    Rawlinson, W. D., and B. G. Barrell. 1993. Spliced transcripts of human cytomegalovirus. J. Virol. 67:5502-5513.

    Rawlinson, W. D., H. E. Farrell, and B. G. Barrell. 1996. Analysis of the complete DNA sequence of murine cytomegalovirus. J. Virol. 70:8833-8849.

    Reddehase, M. J. 2002. Antigens and immunoevasins: opponents in cytomegalovirus immune surveillance. Nat. Rev. Immunol. 2:831-844.

    Schleiss, M. R. 2002. Animal models of congenital cytomegalovirus infection: an overview of progress in the characterization of guinea pig cytomegalovirus (GPCMV). J. Clin. Virol. 25(Suppl. 2):S37-S49.

    Scalzo, A. A., P. B. Dallas, C. A. Forbes, A. S. J. Mikosza, P. Fleming, L. J. Lathbury, P. A. Lyons, S. Laferté, M. M. Craggs, and L. C. Loh. 2004. The murine cytomegalovirus M73.5 gene, a member of a 3' co-terminal alternatively spliced gene family, encodes the gp24 virion glycoprotein. Virology 329:234-250.

    Smith, J. A., and G. S. Pari. 1995. Human cytomegalovirus UL102 gene. J. Virol. 69:1734-1740.

    Spaete, R. R., R. M. Thayer, W. S. Probert, F. R. Masiarz, S. H. Chamberlain, L. Rasmussen, T. C. Merigan, and C. Pachl. 1988. Human cytomegalovirus strain Towne glycoprotein B is processed by proteolytic cleavage. Virology 167:207-225.

    Vink, C., E. Beuken, and C. A. Bruggeman. 2000. Complete DNA sequence of the rat cytomegalovirus genome. J. Virol. 74:7656-7665.

    Voigt, S., G. R. Sandford, L. Ding, and W. H. Burns. 2001. Identification and characterization of a spliced C-type lectin-like gene encoded by rat cytomegalovirus. J. Virol. 75:603-611.

    Wu, C. A., M. E. Carlson, S. C. Henry, and J. D. Shanley. 1999. The murine cytomegalovirus M25 open reading frame encodes a component of the tegument. Virology 262:265-276.

    (Luciano Brocchieri, Thoma)