Despite the widespread use of recombination in laboratory evolution (for examples, see the various chapters in Arnold, Ed., 2000), our understanding of recombination as a fitness search strategy is still in its infancy. Methods of laboratory recombination that are annealing-based (Stemmer, 1994; Zhao et al., 1998) or that are sequence-independent and random (Ostermeier et al., 1999; Sieber et al., 2001) generate libraries of chimeras with unknown sequences and often limited diversity. This restricts our ability to evaluate how efficiently a recombination strategy traverses protein sequence space and contributes to functional evolution. Functional analyses of the resulting libraries have not provided data that can be used to systematically determine which chimeras are most likely to retain structure and function or to exhibit completely new functional properties.
We have developed computational tools (the SCHEMA energy function, RASPP) that use structural information to predict which fragments of proteins can be swapped without disrupting the integrity of the three-dimensional structure (Voigt et al., 2002; Silberg et al., 2003; Endelman et al., 2004). Using well-defined libraries of chimeras created by site-directed recombination (Hiraga and Arnold, 2003; Meyer et al., 2006), we have evaluated the quality of SCHEMA predictions and shown that SCHEMA disruption is valuable for anticipating which chimeras are most likely to retain function (Meyer et al., 2003; Meyer et al., 2006). We are constructing and screening libraries that were designed using these tools to be enriched in folded proteins, in order to explore how recombination contributes to functional evolution.
Based on the three-dimensional structure of the proteins, pairs of interacting residues are identified (those within 4.5 Å of each other), and a contact matrix that represents these interactions is constructed (see Figure 1). The contact matrix is then adjusted for the identity between the two proteins; contacts where one or both amino acids of the interacting pair are identical in the parental proteins cannot be broken by recombination. Generating a chimera by recombination breaks some number of the remaining contacts, and this number is designated E, the SCHEMA energy or disruption. The formal definition of the SCHEMA disruption is given by Voigt et al. (2002).
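Written out as a count of broken contacts, the description above corresponds to the following form. The symbol names C and Δ are assumptions chosen here for illustration and may differ from the notation of the cited paper.

```latex
\[
  E \;=\; \sum_{i} \sum_{j > i} C_{ij}\,\Delta_{ij}
\]
```

Here C_ij = 1 if residues i and j are within 4.5 Å of each other in the structure and 0 otherwise, and Δ_ij = 0 if the amino acids found at positions i and j of the chimera also occur together at those positions in at least one parent, and 1 otherwise. A contact in which one or both positions are identical in the parents always has Δ_ij = 0, which is the identity adjustment described in the text.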

To design a site-directed recombination library, one needs to choose parental proteins and crossover positions that enrich the library in correctly folded proteins. A certain level of diversity (amino acid sequence changes) is also needed for evolution. We developed an algorithm, RASPP, to select crossovers that minimize the average energy of the library, subject to constraints on the length of each fragment. It is the same algorithm used to find the shortest path between nodes in a network (Figure 2). The algorithm takes any pairwise energy function (such as SCHEMA) as input and calculates libraries that are optimal with regard to folding and diversity (RASPP curve; see Figure 4a and Endelman et al., 2004).
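The crossover search can be illustrated with a small dynamic program, which amounts to a shortest-path search over a layered graph of candidate crossover positions. This is a simplified sketch and not the published RASPP implementation. It assumes that the library-average disruption can be written as a sum of per-pair weights `w[i][j]` that count only when residues i and j fall in different fragments; the function names and the toy weight matrix are hypothetical.

```python
def within(w, a, b):
    """Sum of pair weights w[i][j] for residues a <= i < j < b (one fragment)."""
    return sum(w[i][j] for i in range(a, b) for j in range(i + 1, b))


def min_disruption_crossovers(w, n_crossovers, min_len):
    """Choose crossover positions that minimize the weight of separated pairs.

    w            : symmetric matrix of assumed per-pair average disruption
    n_crossovers : number of crossovers (fragments = n_crossovers + 1)
    min_len      : minimum fragment length in residues
    Returns (sorted crossover positions, minimized average disruption).
    """
    n = len(w)
    n_frag = n_crossovers + 1
    total = within(w, 0, n)
    neg_inf = float("-inf")
    # best[f][p]: largest within-fragment weight when residues 0..p-1
    # are split into exactly f fragments; back[f][p] is the start of fragment f.
    best = [[neg_inf] * (n + 1) for _ in range(n_frag + 1)]
    back = [[None] * (n + 1) for _ in range(n_frag + 1)]
    best[0][0] = 0.0
    for f in range(1, n_frag + 1):
        for p in range(f * min_len, n + 1):
            for q in range((f - 1) * min_len, p - min_len + 1):
                if best[f - 1][q] == neg_inf:
                    continue
                # Recomputed on every step for clarity; a real
                # implementation would cache these fragment sums.
                score = best[f - 1][q] + within(w, q, p)
                if score > best[f][p]:
                    best[f][p] = score
                    back[f][p] = q
    if best[n_frag][n] == neg_inf:
        raise ValueError("no library satisfies the fragment-length constraint")
    # Walk back from the last residue to recover the fragment start positions.
    starts = []
    p = n
    for f in range(n_frag, 0, -1):
        p = back[f][p]
        starts.append(p)
    crossovers = sorted(starts[:-1])  # drop the start of the first fragment (0)
    return crossovers, total - best[n_frag][n]


if __name__ == "__main__":
    # Toy 6-residue example: two tightly packed blocks joined by one weak contact.
    w = [[0.0] * 6 for _ in range(6)]
    for i, j, v in [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
                    (3, 4, 1.0), (4, 5, 1.0), (2, 3, 0.5)]:
        w[i][j] = w[j][i] = v
    print(min_disruption_crossovers(w, n_crossovers=1, min_len=2))
```

Repeating the search while increasing `min_len` would give a family of libraries with different average mutation and disruption, in the spirit of the RASPP curve.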

One of our early model systems for testing SCHEMA was the class A β-lactamases. PSE4 and TEM1 share 42% sequence identity and have only 1.3 Å RMSD over the backbone atoms. Additionally, a rapid screen using antibiotic selection can be applied to obtain functional and therefore folded sequences. To test the efficacy of the SCHEMA energy, we created a combinatorial library by allowing thirteen recombination points between TEM1 and PSE4, generating chimeras with a wide range of disruption values (Meyer et al., 2003). The library was selected on ampicillin to identify chimeras that retain function. We showed that chimeras containing the same level of mutation (Hamming distance to closest parent), but with lower disruption, have a higher probability of retaining function.
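The two quantities compared in this experiment, mutation level m and disruption E, can be computed for a single chimera as sketched below. This is a minimal illustration, assuming aligned sequences of equal length and a precomputed list of contacting residue pairs; the function names, sequences, and contact list are made up and are not lactamase data.

```python
def hamming_to_closest_parent(chimera, parents):
    """Mutation level m: fewest mismatches between the chimera and any parent."""
    return min(sum(a != b for a, b in zip(chimera, parent)) for parent in parents)


def schema_disruption(chimera, parents, contacts):
    """Disruption E: number of contacts whose residue pair occurs in no parent.

    contacts is an iterable of (i, j) index pairs, assumed to have been
    derived from the structure with the 4.5 Å cutoff described in the text.
    """
    broken = 0
    for i, j in contacts:
        pair = (chimera[i], chimera[j])
        # The contact is preserved if any parent has the same two residues
        # at positions i and j; otherwise recombination has broken it.
        if not any((parent[i], parent[j]) == pair for parent in parents):
            broken += 1
    return broken


if __name__ == "__main__":
    parents = ["ACDEF", "AGHKF"]   # toy aligned parents
    chimera = "ACHKF"              # residues 0-1 from parent 1, 2-4 from parent 2
    contacts = [(0, 2), (1, 3), (2, 3), (1, 4)]
    print(hamming_to_closest_parent(chimera, parents))
    print(schema_disruption(chimera, parents, contacts))
```

Positions that are identical in all parents never contribute to E under this test, so no separate identity adjustment of the contact list is needed in the sketch.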
We have since expanded upon these original experiments by using the RASPP algorithm to choose optimal recombination sites and adding a third class A lactamase, SED1, to make a new SCHEMA-designed library (Meyer et al., 2006). SED1 shares 35-40% identity with PSE4 and TEM1. We decided to use seven recombination points between the three genes (for experimental ease) and generated a RASPP curve starting with a minimum fragment size of 5 residues. RASPP generates a set of libraries at differing levels of average mutation (<m>) and average disruption (<E>) (Figure 3A). Upon examination of the recombination sites for each library, it is apparent that some occur frequently, while others are clearly forced into position by the increasing minimum fragment size (Figure 3B).


We divided the libraries into three groups based on their <m> and <E> (Figure 3B). The first (gray) group has very low <m>, and the recombination sites are clustered at the N and C termini. The second (pink) group has much higher <m> and about the same <E>; many of the recombination sites appear in multiple libraries in this group, and the members of this group are attractive for construction. The libraries in the last (turquoise) group have higher <E> but offer some gain in <m>. The recombination points are well dispersed over the length of the gene, but the high <E> indicates that many chimeras are probably unfolded.

We chose to construct the pink library with the highest average mutation <m>=67.5 and intermediate average SCHEMA disruption <E>=46.1. We modified two of the recombination points by a few amino acids to accommodate the construction technique (Meyer et al., 2006). Characterization of ~1,000 clones by DNA hybridization yielded ~550 unique sequences, 111 of which retained lactamase activity on ampicillin. The active sequences display significantly lower disruption than the inactive chimeras (Figure 4), further supporting our hypothesis that disruption is a good metric for predicting whether a chimera will retain function (Meyer et al., 2006).
Natural evolution has created a multitude of P450s with activities on an amazing variety of substrates. These homologs retain a common overall fold despite their low sequence identity. We would like to take advantage of the evolvability of this enzyme scaffold to study SCHEMA-directed recombination and functional evolution. By creating a small set of chimeric P450s, we calibrated the SCHEMA energy with the P450 scaffold and showed that P450 chimeras can have altered specificities (Otey et al., 2004). Using site-directed recombination guided by SCHEMA, we are now creating artificial protein families of cytochromes P450 to explore determinants of structure and function free from many of the constraints of natural selection. We have recombined three bacterial cytochromes P450, creating thousands of new and diverse P450s that differ from any known parent by up to 101 amino acid mutations (Otey et al., 2006; Figure 5).

We believe that libraries such as this one, composed of sequences largely uncoupled from natural selection, including sequences that fold and function as well as those that do not, will prove invaluable in the exploration of the sequence-structure-function landscape. Using high-throughput sequencing and folding assays, we generated a data set of 955 unique P450 sequences, 620 of which correctly incorporate the heme cofactor. We used logistic regression analysis (LRA), an analog of linear regression suitable for the binary data generated by our experiments, to analyze the multiple sequence alignments. Underlying our LRA model is the idea that individual fragments and interactions between fragment pairs contribute to whether a chimera will fold and thus bind heme. LRA fits an energy model containing one- and two-body terms, where the magnitude of each term reflects how strongly that variable affects the likelihood of folding and thus the structure of the molecule. We repeated this analysis to identify fragments or fragment pairs that affect the catalytic activity of the chimeras (Otey et al., 2006; Figure 6).
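The one-body part of such a model can be sketched with a plain gradient-descent logistic regression. This is an illustration under stated assumptions, not the analysis used in the cited work: chimeras are encoded as strings naming the parent ("1", "2" or "3") of each block, only one-body terms are fitted, the L2 penalty and learning rate are arbitrary choices, and the data set is synthetic (block 0 from parent 2 is made to prevent folding).

```python
import math
from itertools import product


def one_body_features(chimera, n_parents=3):
    """Indicator vector with one entry per (block position, parent) pair."""
    feats = []
    for parent_at_block in chimera:
        feats.extend(1.0 if parent_at_block == str(p + 1) else 0.0
                     for p in range(n_parents))
    return feats


def predict(w, b, x):
    """Modeled probability that a chimera with features x folds."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, n_iter=1000, l2=0.01):
    """Fit weights by batch gradient descent on the L2-penalized log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(n_iter):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, target in zip(X, y):
            err = predict(w, b, x) - target
            grad_b += err
            for k, xk in enumerate(x):
                grad_w[k] += err * xk
        b -= lr * grad_b / n
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
    return w, b


# Synthetic three-block library: every combination of three parents.
chimeras = ["".join(c) for c in product("123", repeat=3)]
X = [one_body_features(c) for c in chimeras]
y = [0.0 if c[0] == "2" else 1.0 for c in chimeras]  # 1.0 = folded (made-up rule)
w, b = fit_logistic(X, y)
```

In this toy fit the most negative weight belongs to the feature "block 0 from parent 2", which is how a strongly destabilizing fragment would show up. Two-body terms could be added as extra indicator columns for pairs of (block, parent) choices.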

Although recombination can make many mutations with relatively little structural disruption, we do not know the degree of functional diversity that is accessible to a process that explores only combinations of mutations already accepted during natural evolution. Enzymes of the CYP102 family are composed of a reductase domain and a heme domain connected by a flexible linker. With a single amino acid substitution, the heme domains can function alone as peroxygenases, catalyzing oxygen insertion in the presence of hydrogen peroxide (Cirino et al., 2002). The synthetic CYP102A family was constructed from parental sequences containing this mutation; all of the chimeric proteins can therefore potentially function as peroxygenases. We are also interested in their ability to be reconstituted into functional monooxygenases, utilizing NADPH and molecular oxygen for catalysis, by fusion to a reductase domain. Because the chimeric heme domains comprise sequences from three different parents, it is not obvious that fusion to a wild-type reductase will generate a catalytically active holoenzyme, nor is it clear which reductase, R1, R2 or R3, should be used. We therefore selected a set of 14 chimeric heme domains, reconstituted them with all three parental reductase domains, and determined peroxygenase and monooxygenase activities on the eleven substrates shown in Figure 7 (Landwehr et al., in press).

All of the chimeras were successfully reconstituted into functional monooxygenases. In fact, monooxygenase activity is consistently higher than peroxygenase activity. The key interactions at the interface of the heme and reductase domains are conserved, providing a possible explanation for the high monooxygenase activities observed in our experiments. This result implies that the heme and reductase domains can be independently optimized for stability, function and more, as long as these key interactions are conserved.
Nearly all the chimera fusions outperformed even the best parent holoenzyme, and chimeric peroxygenases consistently outperformed the parent peroxygenases. The best enzyme for each substrate is in fact always a chimera. The chimeric enzymes also exhibit distinct specificities and can be partitioned by K-means analysis into clusters based on those specificities. One cluster contains parent A1-R1 and all chimeras with A1-like profiles. Another cluster contains low-activity chimeras and includes all remaining parental sequences. The remaining clusters represent highly active chimeras that have acquired new specificities.
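The clustering step can be illustrated with a minimal K-means routine on activity profiles. This sketch assumes that each enzyme is represented by a vector of activities (one entry per substrate) and that initial centroids are supplied by the caller; the preprocessing, distance measure, and number of clusters used in the actual analysis are not stated here, and the profiles below are invented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two activity profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_centroids, n_iter=50):
    """Cluster profiles by alternating assignment and centroid updates."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each profile goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, new_labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


if __name__ == "__main__":
    # Invented profiles on three substrates: two specificity types
    # plus a pair of low-activity enzymes.
    profiles = [
        [1.0, 0.1, 0.0], [0.9, 0.2, 0.1],
        [0.0, 1.0, 0.9], [0.1, 0.9, 1.0],
        [0.05, 0.05, 0.0], [0.0, 0.1, 0.05],
    ]
    labels, _ = kmeans(profiles, [profiles[0], profiles[2], profiles[4]])
    print(labels)
```

Because K-means depends on its starting centroids, a practical analysis would typically repeat the run from several initializations and compare the resulting partitions.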
We were able to partition the substrates into groups based on the linear correlations of substrate pairs (Figure 7). Our results show that we can predict with reasonable accuracy the relative activities of a chimera on all the substrates in a group by testing activity on only one. Given the many important applications of P450s in medicine and biocatalysis, and the lack of high-throughput screens for many compounds of interest, an approach to screening that is based on carefully chosen ‘surrogate’ substrates could significantly enhance our ability to identify useful catalysts.
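The idea of grouping substrates by pairwise linear correlation can be sketched as follows. The greedy grouping rule, the 0.9 threshold, and the activity values are assumptions made for illustration; the source does not specify how the groups were formed. Each substrate joins the first group whose first member it correlates with, and that first member plays the role of the surrogate.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length activity lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def group_substrates(activity, threshold=0.9):
    """Greedily group substrates whose activities correlate across chimeras.

    activity maps a substrate name to a list of activities, one per chimera,
    with chimeras in the same order for every substrate.
    """
    groups = []
    for name, values in activity.items():
        for group in groups:
            surrogate = group[0]
            if pearson(activity[surrogate], values) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no match: start a new group
    return groups


if __name__ == "__main__":
    # Invented activities of four chimeras on three hypothetical substrates.
    activity = {
        "s1": [1.0, 2.0, 3.0, 4.0],
        "s2": [2.0, 4.0, 6.0, 8.5],
        "s3": [4.0, 1.0, 3.0, 2.0],
    }
    print(group_substrates(activity))
```

Within a group, a linear fit of each member against the surrogate would then give the predicted relative activities from a single measured substrate.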
Arnold, F. H. (Ed.), Advances in Protein Chemistry, Vol. 55, Academic Press, San Diego (2000).