Domain deletions and substitutions in the modular protein
evolution
January Weiner 3rd, Francois Beaussart and Erich Bornberg-Bauer
Division of Bioinformatics, School of Biological Sciences, The Westfalian Wilhelms University of Münster, Germany
Proteins are well known to evolve not only by point
mutations, but also by modular rearrangements [1–
3]. By and large, these rearrangements occur at the
level of domains, which are independent folding units
and have been proposed to represent the unit of
modular evolution [3,4]. Most domains always form
the same combinations; that is, they are always
found next to the same neighbours. For example,
domains found in ribosomal proteins are not found
elsewhere and are present always in the same con-
text. Also, it has been reported that many domains
appear in a very much conserved order (suprado-
mains) [5], and that the frequent occurrence of cer-
tain modular arrangements (arrangements of modules
along a sequence) across phyla is the result of conservation [6].
While few domains co-occur with many others at
least once in the same protein, most domains have few
partner domains, or are even always singletons [3,7–9].
Well-known examples of highly linked domains occur-
ring in many different combinations are the P-loop
nucleotide triphosphate hydrolase domain, the epider-
mal growth factor (EGF) domain, the SH3 domain,
the P-kinase domain and the domains involved in the
blood clotting cascade [1,10].
Keywords
domain loss; fission; fusion; protein domains; protein evolution

Correspondence
E. Bornberg-Bauer, Division of Bioinformatics, School of Biological
Sciences, The Westfalian Wilhelms University of Münster,
Schlossplatz 4, D48149 Münster, Germany
Fax: +49 251 8321631
Tel: +49 251 8321630
E-mail: ebb@uni-muenster.de

(Received 5 December 2005, revised 13 February 2006, accepted 9 March 2006)
doi:10.1111/j.1742-4658.2006.05220.x

The main mechanisms shaping the modular evolution of proteins are
gene duplication, fusion and fission, recombination and loss of
fragments. While a large body of research has focused on duplications
and fusions, we concentrated, in this study, on how domains are lost.
We investigated motif databases and introduced a measure of protein
similarity that is based on domain arrangements. Proteins are
represented as strings of domains and comparison was based on the
classic dynamic alignment scheme. We found that domain losses and
duplications were more frequent at the ends of proteins. We showed
that losses can be explained by the introduction of start and stop
codons which render the terminal domains nonfunctional, such that
further shortening, until the whole domain is lost, is not
evolutionarily selected against. We demonstrated that domains which
also occur as single-domain proteins are less likely to be lost at the
N terminus and in the middle than at the C terminus. We conclude that
fission/fusion events with single-domain proteins occur mostly at the
C terminus. We found that domain substitutions are rare, in particular
in the middle of proteins. We also showed that many cases of
substitutions or losses result from erroneous annotations, but we were
also able to find courses of evolutionary events where domains vanish
over time. This is explained by a case study on the bacterial formate
dehydrogenases.

Abbreviations
Domain ID, domain identification number; EGF, epidermal growth factor; FDHF, formate dehydrogenase H.

FEBS Journal 273 (2006) 2037–2047 © 2006 The Authors Journal compilation © 2006 FEBS

The phenomenon of differential arrangements has
often been termed domain mobility [11]. However,
this term may be misleading as it implies that single
modules or small arrangements are being transferred
from one protein to another. Considering that often
two modules or larger arrangements as such are
fused into one protein, it becomes difficult to define
which of the modules is ‘mobile’ and which is ‘static’.
Therefore, it has been suggested that the term
versatility should be used instead of domain mobility
[3,12]. Independently of the perspective taken, the
underlying mechanisms of modular rearrangements
are mostly gene fusion and domain loss and, probably
to a lesser extent, domain shuffling of exons
and recombination [13–17].
While the emergence of domain combinations is well
documented [4,6,7,18–21], relatively little is known
about domain losses.
In this article, we focus on how domains are lost.
Ultimately, domain loss is difficult to distinguish from the
recruitment of domains because, when comparing two
proteins, phylogenetic analysis is required to determine
whether a domain has been recruited in one protein or
lost in the other. To deal with this problem, we investigated
the possible genetic mechanisms that can cause a
domain to be lost or gained.
As usual in sequence analysis, information on the
history of evolution can only be inferred a posteriori,
meaning that disadvantageous mutations (frameshifts,
domain deletions, etc.) have been weeded out
by negative selection. Thus, we only observe events
of modular rearrangements that are either beneficial
or neutral. For the sake of comprehensiveness, we
used the ProDom database [22], which records
conserved sequence fragments. However, these fragments
are not always identical to structural domains. To conform
with the general definition of domains [3], all key
results were confirmed using Pfam, which largely
agrees with structural domain definitions [23].
In the following study we first investigated whether
the relative frequencies of deletions (or recruitments)
depend on whether a domain is at the end or in the middle of
a protein. Unless explicitly stated, we used the term
‘deletion’ as synonymous for deletions and recruitments.
We then investigated whether eliminations are
more frequently observed at the boundaries of
domains and whether or not domain substitutions are
frequent. For that purpose, we categorized and described
misannotations of domains to distinguish them
from real substitutions or deletions of domains. Next,
we studied whether some domains are more often lost
and whether frequencies of domain deletions depend
on domain versatility. Finally, we discussed the implications
of our results for a wider understanding of
modular protein evolution and the possibilities for generating
a model in which modular protein evolution is
formally described in terms of module edit operations
and cost functions.
Results and Discussion
Single domain deletions
The first question we asked was whether the probabil-
ity of a domain deletion is evenly distributed through-
out a protein. The null hypothesis was that genetic
mechanisms which lead to domain deletions (for exam-
ple, deletions and insertions of sequence fragments,
intron recombinations, etc.) do not depend on the
position within the sequence. However, two factors
could cause a bias. First, any point mutation that cre-
ates a premature stop codon will cause a C-terminal
deletion of a protein. Likewise, a mutation leading to
the emergence of an alternative transcription or trans-
lation start will cause an N-terminal deletion. Second,
a fission producing two genes from one will result in
the deletion of a terminal fragment from a protein or,
vice versa, a fusion of two smaller proteins into one
will result in the observed pattern.
We first grouped proteins by the number of domains
they have (see the Materials and methods). For each
protein, we searched for deletion events, that is, a pro-
tein which has exactly the same domain arrangement,
except for a single domain missing anywhere in the
arrangement. Then we calculated the frequency of the
deletion at each domain position within the group of
proteins containing a given number of domains.
We found that the domain deletions are more com-
mon at either of the protein termini, and that their
occurrence is slightly higher at one of the termini,
depending on the number of domains in the protein
and the database selected (Fig. 1). The prevalence of
terminal deletions did not depend on the number of
domains in proteins, and the results for Pfam and Pro-
Dom databases were similar. In only a few cases were
slightly increased frequencies of domain deletions
observed at a central position.
In line with our predictions, this suggests that the
genetic mechanism of domain deletions acts predominantly
on sequence termini. Therefore, we tentatively
propose that the insertions of new transcription start
and stop codons, as well as gene fusion and fission,
are more likely to occur than, for example, intron
mobility caused by exon shuffling.
Multiple domain deletions
We supported the previous findings by analysing cases
where one or more domains were deleted from a
protein. We considered only deletions in which at least
half of the domains of the full-length arrangement were
preserved, to ensure that homologous arrangements
were being compared. The results were similar to those
of single domain deletions, in that the terminal deletions
were prevalent (see the Supplementary Material).
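The filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: arrangements are given as sequences of domain identifiers, and the function names (`deleted_positions`, `is_comparable`, `deletion_site`) are hypothetical and not taken from the study.

```python
from typing import List, Optional, Sequence


def deleted_positions(full: Sequence[str], short: Sequence[str]) -> Optional[List[int]]:
    """1-based positions of `full` that are absent from `short`.

    Returns None when `short` cannot be obtained from `full` by removing
    domains only. Matching is greedy from the left, so with repeated domains
    the reported positions are one of several equally valid choices.
    """
    missing: List[int] = []
    j = 0
    for i, domain in enumerate(full, start=1):
        if j < len(short) and domain == short[j]:
            j += 1
        else:
            missing.append(i)
    return missing if j == len(short) else None


def is_comparable(full: Sequence[str], short: Sequence[str]) -> bool:
    """True if `short` is `full` minus one or more domains and at least
    half of the domains of the full-length arrangement are preserved."""
    missing = deleted_positions(full, short)
    return bool(missing) and 2 * len(short) >= len(full)


def deletion_site(full: Sequence[str], missing: List[int]) -> str:
    """Coarse label for a deletion; a deletion touching both ends is
    reported as N-terminal (a simplification of this sketch)."""
    if missing[0] == 1:
        return "N-terminal"
    if missing[-1] == len(full):
        return "C-terminal"
    return "internal"
```

For example, `deleted_positions("ABCDEF", "ABEF")` reports positions 3 and 4, which `deletion_site` labels as internal, whereas the pair `"ABCDEF"` and `"EF"` is rejected by `is_comparable` because fewer than half of the domains are preserved.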
In many cases, a deleted domain is part of a larger,
deleted fragment. We have found that fragments
deleted at either terminus are, in general, much longer
than fragments deleted within a protein sequence. The
deletions within the protein are much more often single
domain deletions (Fig. 2). The total number of deletions
that concern only a single domain is higher
for the positions between the termini. However, the
number of major deletions (deletions that span more
than one domain) is higher at terminal positions. This
supports the view that the deletions generally involve
the protein termini.
[Figure 1: four panels, one per protein length. x-axis: Position; y-axis: Proportion of domains deleted.]
Fig. 1. Statistics of single domain deletions in the whole SwissProt/TrEMBL set of proteins. The figure shows the relative proportion of
domain deletions at different positions within the proteins of length 4, 6, 10 and 11 domains. Dark grey, Pfam; light grey, ProDom.

[Figure 2: x-axis: Length of the deleted fragment (in domains); y-axis: Number of occurrences.]
Fig. 2. Number of occurrences of domain deletions as a function of
the length (in domains) of the deleted fragment. Diamonds, N-terminal
deletions; squares, deletions within the protein; circles, C-terminal
deletions. Single domain losses occur preferentially at one of
the middle positions, whereas longer fragments tend to be deleted
at the termini.

In-detail analysis of the deletion events
During our analyses, we noted that some of the apparent
domain deletions are actually just misannotations.
A lack of a domain identifier at a given position in a
protein annotation does not necessarily mean that the
corresponding domain is physically deleted. Likewise,
a different identifier does not necessarily signify a
physical substitution. To address this problem, we constructed
clusters of similar proteins that contained at
least six ProDom domains. We aligned the domain
arrangements within a cluster using a simple progressive
multiple alignment algorithm [24], based on
pairwise alignments generated using the Needleman-
Wunsch algorithm [25] (Supplementary material).
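As an illustration of the pairwise step, the following is a minimal Needleman-Wunsch sketch that treats each domain identifier as one symbol. The scoring values (match +1, mismatch −1, gap −1) and the function name `align_domains` are assumptions made for this example; the costs actually used in the study are not given in this section.

```python
from typing import List, Optional, Sequence, Tuple

Aligned = List[Optional[str]]


def align_domains(a: Sequence[str], b: Sequence[str],
                  match: int = 1, mismatch: int = -1,
                  gap: int = -1) -> Tuple[Aligned, Aligned, int]:
    """Global alignment of two domain arrangements (gaps are None)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner, preferring diagonal moves.
    out_a: Aligned = []
    out_b: Aligned = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None); i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]
```

Aligning the arrangements A-B-C-D-E-F and A-B-D-E-F with these scores places a single gap opposite domain C, which is how a missing domain shows up as a column in the cluster alignment.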
We were able to distinguish five types of phenom-
ena that resulted in an apparent deletion from the
domain arrangement (Table 1, Fig. 3). The first two
were real substitutions and physical deletions of
domains. In some cases, at the site where the domain
annotation was missing, there was, in fact, a sequence
similar to the sequence of this domain. However,
because of length or large evolutionary distance, this
sequence was not annotated by the automatic annota-
tion mechanism of ProDom (‘erosion’). In other
cases, if there is a high sequence variation between
the instances of the domains with a given identifica-
tion number (ID), homologous sequences can be
assigned different ProDom identifiers (‘camouflage’).
Yet, in other cases, although the annotation (ProDom
ID) of a given domain is missing, there is no physical
deletion or misannotation. Instead, the amino acid
sequence at this position is not similar to the given
ProDom domain; therefore, it is a case of a real
substitution. We call this case a ‘shadow domain’.

Table 1. Criteria used to distinguish between various types of sequence rearrangements and annotation artefacts that result in the
disappearance of a domain from the domain string of a protein.

Evolutionary events
  physical deletion: a domain is physically deleted from the protein sequence, and only a short (<20 amino acids) fragment can
    be found between the neighbouring domains
  substitution: a domain is replaced by another domain that bears no similarity to the original domain
  shadow domain: at a given position, in one protein there is a ProDom domain; at the same position in another protein
    there is an amino acid sequence which is not similar to the given domain and which does
    not correspond to a ProDom ID
Annotation artefacts
  camouflage: although there are two different ProDom domains at the same position in two proteins,
    they are significantly similar (E << 1)
  erosion: the domain is not annotated in ProDom, but there is a similar amino acid sequence at this position

[Figure 3: two groups of panels, 'Domain-wise evolutionary events' and 'Annotation artefacts'. Panels: (A) Substitution, (B) Shadow domain, (C) Physical deletion, (D) Camouflage, (E) Erosion. Panel annotations: E-value (B,D) ~ 1; E-value (B,seq) ~ 1; E-value (B,D) << 1; E-value (B,seq) << 1.]
Fig. 3. Classification of domain-wise events observed in the domain databases. Different evolutionary events (A, B, C) and annotation
artefacts (D, E) result in an apparent ‘deletion’ of a ProDom domain from a protein annotated in terms of ProDom domains. Domain and dot
plots can be found in the Supplementary material.
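The criteria of Table 1 can be read as a small decision procedure. The sketch below is an interpretation rather than the authors' code: the function name and parameters are hypothetical, and the numeric cut-off `similar_evalue` is an assumption, because the table only states E << 1 for a significant similarity.

```python
from typing import Optional


def classify_position(other_domain_id: Optional[str],
                      fragment_length: int,
                      evalue: Optional[float],
                      max_linker: int = 20,
                      similar_evalue: float = 0.01) -> str:
    """Classify an apparent deletion of a reference domain (after Table 1).

    other_domain_id: ProDom ID annotated at this position in the compared
        protein, or None if nothing is annotated there.
    fragment_length: amino acids found at this position in the compared protein.
    evalue: E-value between that sequence and the reference domain,
        or None if no hit was found.
    similar_evalue: ASSUMED cut-off standing in for 'E << 1'.
    """
    similar = evalue is not None and evalue <= similar_evalue
    if other_domain_id is not None:
        # A different ProDom domain occupies the position.
        return "camouflage" if similar else "substitution"
    if fragment_length < max_linker:
        # Only a short linker is left between the neighbouring domains.
        return "physical deletion"
    # An unannotated sequence occupies the position.
    return "erosion" if similar else "shadow domain"
```

The three evolutionary events and the two annotation artefacts then differ only in whether another ProDom identifier is present, how much sequence is left, and whether that sequence is still similar to the reference domain.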
For each of these events, we counted its occur-
rence in the constructed protein clusters (see the
Materials and methods for details), at each position
in each protein cluster, as follows. If a domain was
found to be deleted from an arrangement in a clus-
ter, the amino acid sequences occurring in all the
sequences of the cluster at the given position were
analysed. We have applied the criteria from Table 1
to distinguish between the three types of real evolu-
tionary events (physical domain deletion, substitution
and shadow domains) and two types of annotation
artefacts (camouflage and erosion). In the case of
physical deletions, shadow domains and erosions, the
numbers of these events were simply counted. How-
ever, in the case of substitutions and camouflage, it
is not reasonable to count the number of occur-
rences of such an event without inferring a direction
of the substitution. For example, if at a certain posi-
tion in a cluster, domain A occurs in two sequences,
and each of the domains B and C occurs five times,
then what frequency of the substitutions should be
assumed here? We have used the following routine:
all possible pairwise combinations of domains from
different proteins occurring at the same domain posi-
tion in a cluster were analysed. If the two domains
in a pair were different, then an event (substitution
or camouflage) was recorded. Therefore, the calcula-
ted numbers of substitution and camouflage events
cannot be used to infer any conclusions on the act-
ual substitution rate of domains; however, because at
all domain positions the number of camouflage and
substitution events have been calculated in the same
way, relative frequencies of the camouflage and sub-
stitution events at different positions can be inferred.
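The pairwise counting routine can be illustrated with the example from the text, in which domain A occurs in two sequences and domains B and C occur five times each at one position. This is a minimal sketch; the function name `count_differing_pairs` is hypothetical, and deciding whether a differing pair is a substitution or a camouflage would additionally require the similarity test of Table 1.

```python
from itertools import combinations
from typing import Sequence


def count_differing_pairs(column: Sequence[str]) -> int:
    """Number of protein pairs carrying different domains at one
    alignment position; each entry of `column` comes from one protein."""
    return sum(1 for x, y in combinations(column, 2) if x != y)


# Domain A in two sequences, B and C in five sequences each.
column = ["A"] * 2 + ["B"] * 5 + ["C"] * 5
events = count_differing_pairs(column)  # 2*5 + 2*5 + 5*5 = 45
```

Because the count grows with the number of sequences in a cluster, it says nothing about absolute substitution rates, but it is computed identically at every position, which is what makes the positions comparable.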
The relative frequencies of physical domain dele-
tions, substitutions and shadow domains are all
higher at the termini. The average domain deletion
frequency is 9% overall: 7% at nonterminal positions
and 20% at the termini (Table 3). This trend cannot
be seen in the case of annotation artefacts (Fig. 4,
Table 3). Furthermore, annotation artefacts are 10
times rarer than real, physical events (Table 3).
Therefore, our previous results for single-domain and
multiple deletions are scarcely affected by inaccur-
acies of the database annotations and reflect real
evolutionary events. This supports the aforemen-
tioned finding that the majority of deletions are
caused by the physical deletions of protein termini.
We repeated this analysis to test whether there are
differences between prokaryotes and eukaryotes; how-
ever, we did not find significant differences (see the
Supplementary material).
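The percentages quoted in this section follow directly from the absolute counts in Table 3. The short check below recomputes them; the counts are copied from Table 3, while the dictionary layout and the helper name `pct` are choices made for this example only.

```python
# Absolute counts from Table 3 (ProDom clusters).
TOTAL = {"all": 152105, "N": 14520, "middle": 123065, "C": 14520}
EVENTS = {
    "deletion":     {"all": 13925, "N": 2998, "middle": 8077, "C": 2850},
    "substitution": {"all": 3034,  "N": 546,  "middle": 2000, "C": 488},
    "shadow":       {"all": 8770,  "N": 1811, "middle": 5399, "C": 1560},
    "camouflage":   {"all": 1557,  "N": 110,  "middle": 1391, "C": 56},
    "erosion":      {"all": 1235,  "N": 82,   "middle": 1001, "C": 152},
}


def pct(event: str, position: str) -> float:
    """Events per 100 domains at the given position."""
    return 100.0 * EVENTS[event][position] / TOTAL[position]


real = sum(EVENTS[e]["all"] for e in ("deletion", "substitution", "shadow"))
artefacts = sum(EVENTS[e]["all"] for e in ("camouflage", "erosion"))
ratio = real / artefacts  # roughly 9, i.e. about one order of magnitude
```

With these counts, deletions amount to about 9.2% of all domains, 20.6% and 19.6% at the N and C termini, and 6.6% at middle positions, and real events outnumber annotation artefacts by a factor of roughly nine.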
Distribution of termini length in proteins
We have further pursued the question of whether the
terminal deletions can be regarded as truly modular
events; that is, to what extent evolution preserves
domain boundaries upon domain deletion. The null
hypothesis is that in the case of nearly neutral evolu-
tion, the domains are depleted gradually, and partially
deleted domain fragments are common. In such a case,
the evolution of proteins cannot be modelled by the
approximation of domains or modules. However, sev-
eral factors can make the situation different. First,
selection pressure could rapidly eliminate the truncated
fragments – unnecessary biosynthesis of the nonfunc-
tional protein fragments should reduce fitness. Second,
if domain deletions are caused by genetic mechanisms
preserving domain boundaries (such as gene fusions),
partial domains will be rare. If this is the case, amino
acid sequence deletions can be simplified to domain
deletion events, and thus protein evolution could be
abstracted to the level of modules.
[Figure 4: two panels, 'Evolutionary events' and 'Annotation artefacts'.]
Fig. 4. Results of the protein clusters analysis: relative percentages
of different evolutionary events and annotation artefacts at different
domain positions within the analysed sequences. Error bars indicate
the standard error of the calculated proportion. The values for
the ‘Middle position’ were averaged from the values for all
nonterminal positions.

We tackled this problem as follows. We constructed
clusters of proteins. Each cluster contained
proteins with the same domain arrangement, or with
an arrangement shortened by a terminal domain deletion,
either N terminal or C terminal. We recorded
the length of the N- or C-terminal amino acid
sequence and plotted the distribution of its length
(see the Materials and methods for details). The
lengths were normalized for every protein cluster and
then averaged for evaluation. A length of 0 corresponds
to the case when the terminal domain is completely
deleted, and 100 to the average length of the
terminal domain in the whole cluster. Furthermore,
we refined these results by counting only the protein
sequence fragments that are similar, at the amino acid
sequence level, to the remaining sequence of the deleted
domain, given one of two E-value thresholds.
The E-values between those fragments and the
intact domain were recorded and put in three bins,
each for a different range of E-values (any E-value;
0 ≤ E ≤ 0.01; 0 ≤ E ≤ 1 × 10⁻⁵).
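The normalization and the three E-value bins can be sketched as follows. The function names, the bin labels and the histogram bin width are illustrative assumptions; only the 0 to 100 scale and the thresholds 0.01 and 1 × 10⁻⁵ come from the text.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def relative_length(fragment_len: int, mean_domain_len: float) -> float:
    """Terminal fragment length as a percentage of the average length of
    the terminal domain in the cluster (0 = domain completely deleted)."""
    return 100.0 * fragment_len / mean_domain_len


def evalue_bins(evalue: Optional[float]) -> List[str]:
    """Bins a fragment falls into; the bins are nested, so a strong hit is
    counted in all three. None means no similarity was detected."""
    bins = ["any"]
    if evalue is not None and evalue <= 0.01:
        bins.append("E<=0.01")
    if evalue is not None and evalue <= 1e-5:
        bins.append("E<=1e-5")
    return bins


def histogram(rel_lengths: Iterable[float], width: int = 10) -> Dict[int, int]:
    """Counts per length class; `width` is an assumed bin width in percent."""
    return dict(Counter(int(x // width) * width for x in rel_lengths))
```

A fragment of 60 residues next to a terminal domain that averages 120 residues in its cluster is thus recorded at 50%, and it enters the stricter bins only if its E-value against the intact domain is low enough.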
The distributions of termini lengths are shown in
Fig. 5. The distributions show that complete domains
are much more likely to be present in proteins, and
that partial domains are rare at the terminal ends.
These distributions hold also for sets of data in which
sequences containing three or fewer domains were
removed, and also in the case of Pfam domains
(Fig. 5, bottom). If an E-value threshold was applied (only
fragments similar to the given domain were considered),
the shorter sequences with a completely lost terminal
fragment were eliminated from the histogram.
This was not necessarily because the fragments
were not homologous, but because the fragments were
too short to show any significant similarity. However,
the right part of the distribution, corresponding to
sequence fragments of > 50% of the average domain
length, did not change significantly (grey bars in
Fig. 5).
[Figure 5: four histograms: ProDom, N-terminus; ProDom, C-terminus; Pfam, N-terminus; Pfam, C-terminus. x-axis: Length of the N (or C) terminus in % of the deleted domain; y-axis: Number of occurrences.]
Fig. 5. Length distributions of the remaining
fragment from a terminal domain. Distribution
of the length of the terminal sequences
is based on comparison of domain arrangement
alignments. Left, distribution on
the N-termini; right, distribution on the
C-termini. The lengths are relative to the size
of the deleted domain (= 100%). White bars,
all terminal fragments; light grey, terminal
fragments similar to the deleted domain
(E < 0.01); dark grey, terminal fragments
significantly similar to the deleted domain
(E < 1 × 10⁻⁵). Top, results for the ProDom
database; bottom, results for the PfamA data
set.

Domain deletions and domain versatility
Finally, we investigated whether the domain deletion
events were connected to the properties of the deleted
domains themselves. Specifically, we wished to establish
whether the versatility of a domain plays a role in domain
deletions. Furthermore, we considered that domains
can, in general, fold autonomously. Therefore, we
recorded how often domains that are lost form
single-domain proteins.

Table 2. Deleted domains and domain versatility.

Position                Fraction as single   Fraction as single   Average NN           Average NN
                        for all (a)          for deleted (b)      for all (c)          for deleted (d)
Total for all domains   3.00% ± 0.04         1.82% ± 0.10         2.50 ± 0.02          2.40 ± 0.08
N-terminus              2.46% ± 0.07         1.81% ± 0.16         1.67 ± 0.02          2.13 ± 0.12
middle                  2.40% ± 0.05         0.96% ± 0.13         3.65 (1.83) ± 0.03   4.12 (2.06) ± 0.29
C-terminus              3.20% ± 0.09         3.64% ± 0.27         1.70 ± 0.02          2.72 ± 0.19

(a) Overall fraction of domains that were found to form single-domain proteins; (b) fraction of deleted domains that were found to form
single-domain proteins; (c) average number of neighbours for all domains in the protein clusters ± standard error; (d) average number of
neighbours for the deleted domains ± standard error. As each of the domains in a middle position has two neighbours, the values in
parentheses are the averages divided by two. The results are based on a dataset with proteins having 3 or more domains.
First, we calculated the fraction of domains that also
occur as single-domain genes in the sets of domains
that are deleted at an N-terminal, C-terminal or central
position. We found that the domains which also
occur as single-domain proteins are found two to four
times more frequently among domains deleted at the
termini than among those deleted at a central position,
and twice as frequently at the C terminus as at the
N terminus (Table 2). Surprisingly, the average fraction of
domains that also occur as single-domain genes is
lower for the domains that partake in deletion events
than the average for all domains.
The ability of a domain to form autonomous, sin-
gle-domain proteins may be related to its versatility.
We have therefore calculated the domain connectivity
and found that it is highest for the nonterminal
domains. However, as the domains at a nonterminal
position have, on average, two neighbours, whereas
the terminal domains have only one, the averages for
this type of domains must be halved. In that case,
the percentages of domains that form autonomous,
single-domain proteins are higher for domains that
undergo deletions at the termini, and lower for
domains that undergo deletions at a nonterminal
position (Table 2). Again, the numbers of domains
that form autonomous, single-domain proteins are
highest for the domains that are deleted at the
C terminus.
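The two quantities discussed here can be computed from a list of domain arrangements as sketched below. This assumes that connectivity (NN in Table 2) means the number of distinct domains found directly adjacent to a given domain, which is an interpretation and not a definition given in this section; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set


def neighbour_counts(arrangements: Iterable[Sequence[str]]) -> Dict[str, int]:
    """Number of distinct adjacent domains per domain (assumed meaning of
    connectivity). A tandem repeat counts the domain as its own neighbour."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for arr in arrangements:
        for domain in arr:
            neighbours[domain]  # make sure singletons get an entry
        for left, right in zip(arr, arr[1:]):
            neighbours[left].add(right)
            neighbours[right].add(left)
    return {domain: len(found) for domain, found in neighbours.items()}


def single_domain_fraction(domains: Sequence[str],
                           arrangements: Iterable[Sequence[str]]) -> float:
    """Fraction of `domains` that also occur as single-domain proteins."""
    singles = {arr[0] for arr in arrangements if len(arr) == 1}
    return sum(d in singles for d in domains) / len(domains)
```

Halving the middle-position averages in Table 2 (3.65 to 1.83 and 4.12 to 2.06) corrects for the fact that an internal domain has two adjacent slots whereas a terminal domain has only one.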
We conclude that the elevated rates of domain deletions
at the terminal regions are partly related to
domain versatility and the ability of domains to function
outside a multidomain protein (to form single-domain
proteins). The events involving domain acquisition/loss
are twice as frequent at the C terminus as at the
N terminus (Table 2).
Case study: bacterial formate
dehydrogenases
An exemplary cluster of bacterial formate dehydroge-
nase proteins is shown in Fig. 6. This cluster illustrates
several modular events, including domain deletion, a
substitution by a diverged sequence fragment, and ero-
sion (Fig. 6B). A multiple alignment of the protein
sequences can be found in the Supplementary material.
For some of the proteins the structure is known [26].
We analysed the phylogeny of the cluster, as derived
from whole protein sequences (Fig. 6C). The obtained
phylogenetic tree is consistent with the modifications of
the domain arrangements (Fig. 6D), and the revealed
events can be associated with the tree nodes. Significant
rearrangements take place on the sixth position of the
cluster where, in different proteins, we found two differ-
ent ProDom domains, shadow domains and, at one
position (in the protein O59078), a complete deletion.
Further rearrangements are found at the protein C terminus:
two proteins have two additional
domains. The shadow domains may be the result either
of a substitution by another sequence, or of such a high
accumulation of mutations in a domain that it is no
longer similar to the original sequence.
There are three variable regions in the domain
arrangement of the protein cluster. First, at position 6
in the arrangement, in some proteins there are similar
sequences that were not annotated in ProDom (‘ero-
sion’) or domains which were annotated differently
because of high sequence divergence (‘camouflage’).
Next, at position 8, there is a substitution in two of
the sequences. Finally, the C-terminal part is missing,
truncated or eroded in many sequences, for example in
the illustrated structure (Fig. 6A,B).
Conclusions
Our main conclusions are as follows: (a) domain deletion
events occur frequently at either of the termini;
(b) the deletions occur domain-wise, that is, in most of
the cases the whole domain is lost; (c) domain losses
correlate with domain versatility (i.e. the number of
different combinations in which a domain occurs); (d)
versatile domains are more frequently found at the
C terminus; and (e) clear definitions can be given to
distinguish misannotations from physical deletions.
Ultimately, the question ‘What is the probability of a
domain deletion?’ can only be answered using domain
phylogenies. However, our study shows that the deletion
events are quite frequent; in the collected protein
clusters, the frequencies of proteins in a cluster with a
domain deleted at either of the termini were ≈ 9%
(Table 3), which provides a rough estimate for the
frequency of deletion events in protein–protein
comparisons.

Table 3. Results of the analysis of protein clusters for the ProDom
database. Numbers in the table correspond to the absolute numbers
of events recorded (% of the events recorded is given in
parentheses).

Event                     Average (%)    N-terminus     middle         C-terminus
Total number of domains   152105         14520          123065         14520
Real events:
  Deletions               13925 (9.2)    2998 (20.6)    8077 (6.6)     2850 (19.6)
  Substitutions           3034 (2.0)     546 (3.8)      2000 (1.6)     488 (3.4)
  Shadow domains          8770 (5.8)     1811 (12.5)    5399 (4.4)     1560 (10.7)
Annotation artefacts:
  Camouflage              1557 (1.0)     110 (0.8)      1391 (1.1)     56 (0.4)
  Erosion                 1235 (0.8)     82 (0.6)       1001 (0.8)     152 (1.0)
The fact that the domain deletions are not uniformly
distributed along a protein, but instead
follow a distinct pattern, is an
important conclusion in the context of constructing
algorithms for sequence alignments that take into
account domain arrangements of proteins. It also provides
a biological justification for choosing a lower
end-gap penalty in sequence alignment algorithms, such as
ClustalW [27].
In conclusion, by analysing the versatility of deleted
domains and their ability to form single-domain proteins,
we have found that, while gene fusion and fission
indeed play a significant role in the deletion events at
the termini, the introduction of new start and stop
codons also plays a major role. The fraction of the deleted
domains that can be found as single-domain
proteins was twice as high at the C terminus (Table 2),
as was the connectivity of the C-terminally deleted
domains. This suggests that in a gene fusion or fission
event, the versatile, single-domain protein is more
likely to be found at the C terminus. This may be
explained by the fact that in a gene fusion/fission
event, or in the case of introduction of new start and
stop codons, the N-terminal part of the coding
sequence remains connected to its promoter region and
regulatory sites. Thus, a versatile domain that is fused
with the C terminus of a much larger protein will not
have an effect on the regulation of the whole protein,
because it will not modify the promoter region and
regulatory sites. Our results suggest such a selective
disequilibrium: the function (and regulation) of the
protein is connected to its N-terminal part, and therefore
the fusion/fission events involving smaller, versatile
domains will occur more frequently at the
C terminus.
Moreover, we have found that the event of domain
deletion occurs mostly in a modular manner. This can
have two explanations. First, the apparent domain
deletion can be caused by gene fusion or fission. Sec-
ond, a domain fragment truncated (e.g. by a nonsense
mutation) that is no longer functional may be rapidly
eliminated by natural selection. Either way, the
domain deletions effectively respect domain boundaries.
These results have further supported the emerging
view that, by and large, the modular evolution of pro-
teins is dominated by two major types of events:
fusion, on the one hand, and deletion and fission on
the other [3,4,21,28]. Exon shuffling and recombination
seem to be rare.
[Figure 6: panels A, B, C and D.]
Fig. 6. Cluster of the bacterial formate dehydrogenases.
(A,B) The structure of formate
dehydrogenase H (FDHF) from Escherichia
coli. (C) Phylogeny of the analysed proteins
obtained by the parsimony method with 100
bootstraps. (D) The corresponding domain
arrangements of the analysed proteins.
Colour code: (A) is coloured according to the
ProDom annotation, with one colour for
every domain. Colours and arrows on (B)
indicate events identified by analysis of a
cluster of related proteins and correspond
to the coloured arrows on (C) and (D). The
symbols on (C) show a possible attribution
of the events to tree nodes. sub, substitution;
del, deletion/insertion; colours of the
symbols correspond to the colours in (B). The
coloured boxes on (D) correspond to different
ProDom domains and are the same as
on (A). The black thin boxes on position 6
correspond to ‘shadow domains’.
Materials and methods
For the analyses, ProDom [22] version 2004.1 was used. The
main results were confirmed using Pfam release 17 [29].
Each database contains a number of domain arrangements,
that is, proteins annotated in terms of domains. All supplementary
materials can be found on our web page
(http://www.uni-muenster.de/Bioinformatics/services/domdel/).
Overall single deletion statistics
Proteins from the ProDom database and, separately, from
the Pfam database, were divided into sets according to the
number of domains. Each set contained all proteins with a
fixed number of domains, for example ‘set6’ contained pro-
teins with six domains.
Each protein from a given set containing proteins of
length N domains was compared with each protein from
the set containing proteins of length N − 1 domains. For
example, a protein with six domains was compared with all
proteins that have five domains. If the shorter arrangement
was identical to the longer one, with the exception of a sin-
gle, missing domain, a deletion was registered. The position
of the deletion within the domain arrangement was recor-
ded. For example, given the five-domain arrangement
ABDEF (where A to F are domains), it is identical to the
six-domain arrangement, ABCDEF, with the exception of
the deleted domain C.
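The comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `single_deletion_position` is hypothetical, arrangements are assumed to be sequences of domain identifiers, and positions are reported 1-based. When a domain is repeated, several positions can explain the same shorter arrangement; this sketch simply reports the first one.

```python
from typing import Optional, Sequence


def single_deletion_position(longer: Sequence[str],
                             shorter: Sequence[str]) -> Optional[int]:
    """Return the 1-based position of the single domain that is present in
    `longer` but missing from `shorter`, or None if the two arrangements do
    not differ by exactly one deleted domain.

    Assumption: for repeated domains (e.g. AAB vs AB) the first matching
    position is reported.
    """
    if len(longer) != len(shorter) + 1:
        return None
    target = list(shorter)
    for i in range(len(longer)):
        # Remove the domain at index i and compare with the shorter arrangement.
        if list(longer[:i]) + list(longer[i + 1:]) == target:
            return i + 1
    return None
```

For the example in the text, comparing ABCDEF with ABDEF registers a deletion at position 3 (domain C).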
The average deletion frequency was calculated as the
number of all deletion events divided by the total number
of domains in all the examined sequences. The relative
domain deletion frequency at a given domain position in a
set of proteins of a given length was defined as the number
of deletions at this position, divided by the total number of
deletions in this set.
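The two frequencies defined above amount to simple ratios. The sketch below assumes that deletion positions have already been recorded as 1-based integers for one set of proteins of a fixed length; the function names are hypothetical and chosen for illustration only.

```python
from collections import Counter
from typing import List, Sequence


def average_deletion_frequency(n_deletions: int,
                               arrangements: Sequence[Sequence[str]]) -> float:
    """Number of deletion events divided by the total number of domains
    in all examined arrangements."""
    total_domains = sum(len(a) for a in arrangements)
    return n_deletions / total_domains


def relative_deletion_frequencies(positions: Sequence[int],
                                  length: int) -> List[float]:
    """Relative deletion frequency at each domain position (1..length):
    deletions at that position divided by all deletions in the set."""
    counts = Counter(positions)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * length
    return [counts[k] / total for k in range(1, length + 1)]
```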
These investigations have been repeated with a nonredun-
dant data set, in which each arrangement was represented
only once. That is, from a set of proteins which had the same
domain arrangement, only one representative was kept.
Overall multiple deletion statistics
For each domain arrangement given, all other arrange-
ments that would be obtained by removal from the given
arrangement of one or more domains were considered. For
example, if A to F are domains, and ABCDEF is the given
arrangement, then we would consider the arrangements
ABCDE, BCD, ABEF, etc.
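As an illustration, the candidate arrangements for a given arrangement can be enumerated as all order-preserving subsequences. The sketch below makes two assumptions not stated in the text: the empty arrangement and the unchanged full arrangement are both excluded, and the function name `sub_arrangements` is hypothetical.

```python
from itertools import combinations
from typing import Sequence, Set, Tuple


def sub_arrangements(arrangement: Sequence[str]) -> Set[Tuple[str, ...]]:
    """All arrangements obtained by removing one or more domains while
    keeping the remaining domains in their original order."""
    n = len(arrangement)
    result: Set[Tuple[str, ...]] = set()
    # Keep between 1 and n-1 domains (assumption: at least one domain remains).
    for keep in range(1, n):
        for idx in combinations(range(n), keep):
            result.add(tuple(arrangement[i] for i in idx))
    return result
```

The number of candidates grows exponentially with the number of domains (2^n − 2 for n distinct domains), so in practice one would more likely test each shorter arrangement in the database for being a subsequence than enumerate every candidate.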
Similarity of protein arrangements
For the purpose of constructing multiple domain arrange-
ment alignments and domain arrangement-based phylo-
genies, we implemented the Needleman-Wunsch global
alignment algorithm [25] for protein domains, with the
parameters as defined previously [17]: match = 10,
mismatch = −5, gap = −1.
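A minimal sketch of a Needleman-Wunsch global alignment score over domain arrangements with these parameters is given below. It assumes a linear gap penalty (each gapped position costs −1) and treats every domain identifier as a single symbol; the function name `nw_score` is hypothetical, and only the score is computed, without traceback.

```python
from typing import Sequence


def nw_score(a: Sequence[str], b: Sequence[str],
             match: int = 10, mismatch: int = -5, gap: int = -1) -> int:
    """Global alignment score of two domain arrangements
    (Needleman-Wunsch dynamic programming, linear gap penalty assumed)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per domain.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]
```

With these values a mismatch (−5) is more expensive than a pair of gaps (−2), so the alignment favours placing a gap opposite a domain over aligning two different domains.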
Construction of protein clusters
We constructed clusters of proteins with similarity in their
domain arrangement of > 80%. Only clusters that had at
least six domains were considered. For each protein from
the ProDom database, all proteins were considered that had
one domain less than the given protein. If a given protein
matched the examined arrangement by all but one domain,
a deletion event was recorded. Starting with a single protein,
a number of hits was recorded and added to the cluster; fur-
thermore, these proteins were used to obtain the next set of
hits (i.e. proteins that have one domain less than the protein
that was used in the search). The procedure stopped for a
given cluster when no further similar domain arrangements
were found. Only clusters containing at least 10 proteins
and 10 ProDom domains were used for further analysis.
Additionally, the amino acid sequences of all the proteins
in the cluster were collected. The resulting clusters were sub-
sequently aligned with a simple multiple-domain arrange-
ment alignment algorithm (progressive alignment). The
length (in terms of domains) of a cluster was defined as the
length of the multiple-domain arrangement alignment.
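The iterative expansion of a cluster can be sketched as a breadth-first search over the "one domain fewer" relation. This is a simplified illustration under stated assumptions: arrangements are tuples of domain identifiers, the names `is_single_deletion` and `build_cluster` are hypothetical, and the similarity threshold and the size filters described above are not applied here.

```python
from collections import deque
from typing import Iterable, Set, Tuple

Arrangement = Tuple[str, ...]


def is_single_deletion(longer: Arrangement, shorter: Arrangement) -> bool:
    """True if `shorter` equals `longer` with exactly one domain removed."""
    if len(longer) != len(shorter) + 1:
        return False
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


def build_cluster(seed: Arrangement,
                  arrangements: Iterable[Arrangement]) -> Set[Arrangement]:
    """Start from `seed`, add every arrangement that is one deletion away,
    then search again from each new hit until nothing more is found."""
    pool = list(arrangements)
    cluster: Set[Arrangement] = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for candidate in pool:
            if candidate not in cluster and is_single_deletion(current, candidate):
                cluster.add(candidate)
                queue.append(candidate)
    return cluster
```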
Calculation of the relative event frequency at
different domain positions in protein clusters
For each of the events, e, and for each of the sets of clusters
of a given length, l, the frequency of the event at a
position, k, was defined as:
$$ f_{e,k} = \frac{n_{e,k}}{\sum_{i=1}^{l} n_{e,i}}, $$
where $n_{e,i}$ is the number of occurrences of the event e at the
domain position i. The average frequency at the middle
positions (that is, all domain positions except the N- and
C termini) was calculated as:
$$ n_{e,\mathrm{middle}} = \frac{\sum_{i=2}^{l-1} f_{e,i}}{l - 2}. $$
Finally, the N-terminal, C-terminal and central position
frequencies for each event were averaged for all sets of
clusters.
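The per-position frequency and the middle-position average can be computed as in the sketch below. It assumes that the event counts for one event type are given as a list indexed by domain position (first entry is the N terminus, last entry is the C terminus) and that the cluster length is at least three; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def position_frequencies(counts: Sequence[int]) -> List[float]:
    """f_{e,k}: count at each position divided by the sum over all positions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no events recorded")
    return [n / total for n in counts]


def terminal_and_middle(counts: Sequence[int]) -> Tuple[float, float, float]:
    """Return (N-terminal frequency, average middle frequency,
    C-terminal frequency) for one event type and one cluster length."""
    f = position_frequencies(counts)
    length = len(f)
    if length < 3:
        raise ValueError("need at least one middle position")
    # Average over positions 2 .. l-1, i.e. everything except the termini.
    middle = sum(f[1:-1]) / (length - 2)
    return f[0], middle, f[-1]
```

For example, counts of 4, 1, 1 and 2 events over four positions give an N-terminal frequency of 0.5, a middle average of 0.125 and a C-terminal frequency of 0.25.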
Distribution of amino acid sequence length
of the termini
For each of the databases ProDom and Pfam, two sets of
alignments were created: one for N-terminal deletions, and
one for C-terminal deletions. In each set, an alignment con-
tained sequences that had one of the two types of arrange-
ments: either a complete arrangement, or one in which a
terminal domain was missing from the ProDom description.
Alignments were constructed from the whole ProDom
database. Only alignments which contained at least one
complete sequence and one sequence with a missing domain
(depending on the set, either N- or C terminal) were consid-
ered.
For each alignment in each set, the average size of the
deleted domain was calculated for the proteins with the
complete arrangement. To take into account the variability
of the length of the complete domain, the length of the
N-terminal fragment was defined as the length of the
amino acid sequence preceding the next domain in
the arrangements, expressed as the percentage of the calcu-
lated average length of the deleted domain in this align-
ment. Finally, the distribution of these values throughout
all of the analysed alignments was calculated.
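The normalisation described above can be illustrated as follows. The sketch assumes that, for one alignment, the lengths (in amino acids) of the terminal domain in the complete proteins and the lengths of the remaining terminal fragments in the truncated proteins are already known; the function name and the input layout are hypothetical.

```python
from typing import List, Sequence


def fragment_percentages(complete_domain_lengths: Sequence[int],
                         fragment_lengths: Sequence[int]) -> List[float]:
    """Express each terminal fragment length as a percentage of the average
    length of the deleted domain in the complete arrangements."""
    if not complete_domain_lengths:
        raise ValueError("at least one complete sequence is required")
    average = sum(complete_domain_lengths) / len(complete_domain_lengths)
    return [100.0 * length / average for length in fragment_lengths]
```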
References
1 Patthy L (1999) Protein Evolution. Blackwell Science,
Oxford.
2 Liu J & Rost B (2004) CHOP: parsing proteins into
structural domains. Nucleic Acids Res 32, W569–W571.
3 Bornberg-Bauer E, Beaussart F, Kummerfeld S, Teich-
mann S & Weiner J 3rd (2005) The evolution of domain
arrangements in proteins and interaction networks. Cell
Mol Life Sci 62, 435–445.
4 Vogel C, Teichmann S & Pereira-Leal J (2005) The
relationship between domain duplication and recombi-
nation. J Mol Biol 346, 355–365.
5 Vogel C, Berzuini C, Bashton M, Gough J & Teichmann S
(2004) Supra-domains: evolutionary units larger
than single protein domains. J Mol Biol 336, 809–823.
6 Gough J (2005) Convergent evolution of domain archi-
tectures (is rare). Bioinformatics 21, 1464–1471.
7 Apic G, Gough J & Teichmann S (2001) An insight into
domain combinations. Bioinformatics 17 (Suppl. 1),
S83–S89.
8 Wuchty S (2001) Scale-free behavior in protein domain
networks. Mol Biol Evol 18, 1694–1702.
9 Bornberg-Bauer E (2002) Randomness, structural
uniqueness, modularity, and neutral evolution in
sequence space of model proteins. Z Phys Chem 216,
139–154.
10 Madera M, Vogel C, Kummerfeld S, Chothia C &
Gough J (2004) The SUPERFAMILY database in
2004: additions and improvements. Nucleic Acids Res
32, D235–D239.
11 Doolittle R & Bork P (1993) Evolutionarily mobile
modules in proteins. Sci Am 269, 50–56.
12 Apic G, Huber W & Teichmann S (2003) Multi-domain
protein families and domain pairs: comparison with
known structures and a random model of domain
recombination. J Struct Funct Genomics 4, 67–78.
13 Ponting C & Russell R (1995) Swaposins: circular
permutations within genes encoding saposin homologues.
Trends Biochem Sci 20, 179–180.
14 Uliel S, Fliess A & Unger R (2001) Naturally occurring
circular permutations in proteins. Prot Eng 14,
533–542.
15 Fliess A, Motro B & Unger R (2002) Swaps in protein
sequences. Proteins 48, 377–387.
16 Bujnicki J (2002) Sequence permutations in the molecu-
lar evolution of DNA methyltransferases. BMC Evol
Biol 2, 3.
17 Weiner J 3rd, Thomas G & Bornberg-Bauer E (2005)
Rapid motif-based prediction of circular permutations
in multi-domain proteins. Bioinformatics 21, 932–937.
18 Apic G, Gough J & Teichmann S (2001) Domain com-
binations in archaeal, eubacterial and eukaryotic pro-
teomes. J Mol Biol 310, 311–325.
19 Bashton M & Chothia C (2002) The geometry of
domain combination in proteins. J Mol Biol 315, 927–
939.
20 Vogel C, Bashton M, Kerrison N, Chothia C & Teich-
mann S (2004) Structure, function and evolution of
multidomain proteins. Curr Opin Struct Biol 14, 208–
216.
21 Kummerfeld S & Teichmann S (2005) Relative rates of
gene fusion and fission in multi-domain proteins. Trends
Genet 21, 25–30.
22 Corpet F, Servant F, Gouzy J & Kahn D (2000) Pro-
Dom and ProDom-CG: tools for protein domain analy-
sis and whole genome comparisons. Nucleic Acids Res
28, 267–269.
23 Zhang Y, Chandonia J, Ding C & Holbrook S (2005)
Comparative mapping of sequence-based and structure-
based protein domains. BMC Bioinformatics 6, 77.
24 Feng D & Doolittle R (1987) Progressive sequence
alignment as a prerequisite to correct phylogenetic trees.
J Mol Evol 25, 351–360.
25 Needleman S & Wunsch C (1970) A general method
applicable to the search for similarities in the amino
acid sequence of two proteins. J Mol Biol 48, 443–
453.
26 Boyington J, Gladyshev V, Khangulov S, Stadtman T
& Sun P (1997) Crystal structure of formate dehydro-
genase H: catalysis involving Mo, molybdopterin, sele-
nocysteine, and an Fe4S4 cluster. Science 275, 1305–
1308.
27 Thompson J, Higgins D & Gibson T (1994) CLUSTAL W:
improving the sensitivity of progressive multiple
sequence alignment through sequence weighting, posi-
tion-specific gap penalties and weight matrix choice.
Nucleic Acids Res 22, 4673–4680.
28 Weiner J 3rd & Bornberg-Bauer E (2006) Evolution of
circular permutations in multi-domain proteins. Mol
Biol Evol 23, 734–743.
29 Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L,
Eddy S, Griffiths-Jones S, Howe K, Marshall M &
Sonnhammer E (2002) The Pfam protein families data-
base. Nucleic Acids Res 30, 276–280.
Supplementary material
The following supplementary material is available
online:
Fig. S1. Statistics for single domain deletions.
Fig. S2. Statistics for multiple domain deletions.
Fig. S3. Detailed results of the cluster analysis.
Fig. S4. Results for the comparison of eukaryotes and
prokaryotes.
Fig. S5. Pairwise multiple alignment algorithm for
domain arrangements.
This material is available as part of the online article
from http://www.blackwell-synergy.com
