Target Selection

Modeling families (MF) are clusters of sequences sharing 30% or more sequence identity. BIG families are large, structurally uncharacterized families. MEGA families are very large families which have one or more structural representatives but which are very structurally under-represented and generally have a high proportion of functionally diverse relatives with no structural information.

Identification of large, structurally uncharacterized (BIG) domain families

The selection and proposal of target domain families by the MCSG for inclusion into the PSI-2 target lists has initially focused on the use of Pfam-A defined families. We believe that the significant levels of manual curation used in identifying these families means that they offer one of the most reliable sources of domain family information for the genomes. Additionally, the use of hidden Markov models provided by Pfam (each of which is optimized through manual inspection) enables us to gain a high level of family coverage across the genomes.

Target domain families selected and proposed by the MCSG for inclusion into the PSI-2 target lists has continued to focus largely on the use of Pfam-A domain families. In addition, the MCSG has also proposed additional large, species-diverse BIG families from the TIGRFAM domain family resource and the NewFam domain family clustering derived from our own Gene3D database. As such, the MCSG currently has a target list covering 575 BIG families. All proposed BIG families were selected on the absence of any near or distant structural match, for transmembrane helix regions, significant levels of low complexity or disorder and containing at least 10 members (with a minimum of five prokaryotic members). Our work has also included the use of these same criteria to check BIG families proposed by other consortia in the BIG-4 and to identify and resolve cases where proposed families overlap the same regions of sequence space.

Identification of reagent genomes with maximal coverage of BIG families

The SESAMI target selection website has been continually updated to include target data for all the genomes used by MCSG in phase one of the PSI. The searchable target lists held on the website are dynamic, allowing the user to interactively rank and prioritize a given set of targets according to a series of criteria required for a given experiment. The SESAMI resource is closely linked to the Gene3D database, a new version of which has been recently released with a number of significant improvements, and currently holds annotations for over 300 completed genomes. Each protein sequence in SESAMI contains predictions of CATH and Pfam domains and NewFam families. In addition to these domain annotations we also have a comprehensive set of predictions detailing regions of low complexity, disorder, signal peptides, coiled-coils and transmembrane helices. This has enabled comprehensive family information and corresponding phylogenetic distribution for each MCSG target to be evaluated and allows us to compile a list of available orthologues in other MCSG target organisms. This detailed series of information informs our target selection through the SESAMI website.

Identification of BIG family targets

Our protocol identifies target sequences belonging to a given BIG family using a list of reagent genomes that are available for cloning and expression. Target proteins are subsequently assigned to an orthologous grouping (corresponding to target sequences found within 30% sequence identity bi-directional hit). Lower priority is given to paralogous or near-identical sequences identified within a common genome. We have developed our approach to provide a balanced target list in which we hope to provide the highest coverage of sequence space within each BIG family, providing the maximal grouping of orthologous sequences, whilst reducing target redundancy across our families.

Development of strategies for targeting MEGA families

Recently, the focus of our target selection strategies has moved toward the identification of targets within large structurally characterized protein superfamilies (MEGA families). This process will be carried out in parallel to the selection of targets from BIG families. Many of the largest domain superfamilies are highly expanded across the genomes encompassing a large diversity of sequence space, much of which is not within the reach of current homology modeling methods. Whilst these MEGA families may already hold a number of structurally characterized proteins, it is though that the characterization of additional structures in carefully chosen MEGA families will provide new insights into diverse regions of structure and function space. In order to provide a guide for the selection of MEGA families, the MCSG has set up a website (link) available to all PSI consortia in which 891 putative MEGA target superfamilies have been made available, together with associated raw data (sequences, hidden Markov models) to enable further analysis by the BIG-4 consortia. In addition, the web site has been constructed to present a range of data describing each MEGA family, including; MEGA family size, number of modeling families, current PDB coverage, functional diversity, and species distribution. The webpage and associated data has been used extensively by the BIG-4 to provide the basis of our MEGA family target selection. In addition, the MCSG has analyzed each MEGA family in terms of coverage of our MCSG regent genomes.

Homology modeling based on MCSG structures – the GeMMA website

GeMMA (link) is being updated with homology models of relatives of all structures solved by the MCSG. Relatives are derived from the complete UniProt depository. Full chain models are being constructed first, but the MCSG structures are being fast tracked through the new CATH domain chopping pipeline and domain-wise models will follow as soon as possible. Homology models are now being constructed using the latest release of Modeller, (Modeller 9v1) which includes the new DOPE-HR model assessment method. Together with Prosa II and GA341 the quality of the models are now being assessed using three different complementary methods.

Iterative target selection based on GeMMA models

  1. Comparison of secondary structure prediction. A scoring method has been developed and tested to compare the predicted secondary structure content (calculated using PSIPRED) of relatively distantly related protein domains.
  2. Comparison of functionally conserved residues. A scoring method is being developed to quantify the similarity in evolutionarily conserved residues in the active sites of relatively distantly related protein domains. A program has been written to use structural information to identify surface clusters of conserved residues that are likely to correspond to the active sites of functionally uncharacterized proteins.
  3. Comparison of electrostatic potential around protein domain structures and homology models. A method has been developed to compare electrostatic potential calculated by APBS on a grid of points. Protein domains are superposed and an intersection of skins is calculated that either surrounds their whole surface or is localized around the active site. A Hodgkin similarity index is calculated across all grid points within the intersection of the skins.
  4. Profile-profile comparison. PRC and MAFFT have been investigated. While useful for identifying distantly related protein folds these methods do not seem well suited to distinguish functional differences within a superfamily.

Target selection from human pathogens

Differences in organisms’ structural complexity, physiology, and lifestyle result in divergent evolution and the emergence of variants of molecular function, metabolic organization, and phenotypic features. Such differences may be utilized for the development of novel approaches to anti-microbial therapies and deepen our understanding of pathogenic processes. The Bioinformatics Group at MCS ANL has been developing the systems-level strategies for identification of protein targets of biomedical importance and potential antimicrobial drug targets for the needs of the MCSG. These strategies included:

  1. Identification and characterization of essential genes from the metabolic reconstructions from sequence data,
  2. Identification of taxonomy-specific metabolic pathways characteristic of pathogenic organisms,
  3. Identification and characterization of taxonomic and phenotypic variation of enzymes characteristic of pathogenic organisms,
  4. Identification and characterization of microbial signal transduction proteins as potential drug targets,
  5. Identification of pathogenic factors

Selected related publications:

  • Bray JE, Marsden RL, Rison SC, Savchenko A, Edwards AM, Thornton JM, Orengo CA (2004). A practical and robust sequence search strategy for structural genomics target selection. Bioinformatics, 20, 2288-95 Times cited: 4. [PubMed]
  • Lee D, Grant A, Marsden RL, Orengo C (2005). Identification and distribution of protein families in 120 completed genomes using Gene3D. Proteins, 59, 603-15 Times cited: 6. [PubMed]
  • Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA (2006). Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res, 34, 1066-80 Times cited: 5. [PubMed]
  • Watson JD, Todd AE, Bray J, Laskowski RA, Edwards A, Joachimiak A, Orengo CA, Thornton JM (2003). Target selection and determination of function in structural genomics. IUBMB Life, 55, 249-55 Times cited: 10. [PubMed]

For a more exhaustive list of publications see the MCSG publications website.