
computational chemogenomics



edited by

Edgar Jacoby

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2013 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20131121
International Standard Book Number-13: 978-981-4411-40-0 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher
cannot assume responsibility for the validity of all materials or the consequences of their use. The
authors and publishers have attempted to trace the copyright holders of all material reproduced in
this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.
copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been
granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
and the CRC Press Web site at


1. Chemogenomics Approaches for the Quantitative
Comparison of Biological Targets
Herbert Koeppen and Michael Bieler
Target-Based Similarity Methods
Ligand-Based Target Comparison
1.3.1 Chemocentric Approaches
1.3.2 Chemoprint Approaches

The Impact of Data Quality on Chemogenomics
1.4.1 Experimental Variations and Errors
1.4.2 Thresholds and Cutoff Values
1.4.3 The Dataset Composition
1.4.4 The Lack of Data Completeness
1.4.5 The Applicability Domain
2. Considerations on the Drug-Like Chemical Space
Jean-Louis Reymond, Lars Ruddigkeit, and
Mahendra Awale
The Chemical Space
2.2.1 Compound Databases
2.2.2 The Chemical Universe Database
2.2.3 Virtual Combinatorial Libraries
2.2.4 Evolutionary Algorithms
Drug Likeness
2.4.1 Small Is Beautiful!
2.4.2 Analysis of Compound Databases
Where to Find New Drugs?
Conclusion and Outlook
3. Chemogenomic Protein-Family Methods in Drug
Discovery: Profile-QSAR and Kinase-Kernel
Prasenjit Mukherjee and Eric Martin
Chemogenomics and Kinase Family Methods
Profile-QSAR and Other Affinity Fingerprint
3.3.1 Profile-QSAR Theory
3.3.2 Profile-QSAR Biochemical Models
3.3.3 Profile-QSAR Cellular Models
3.3.4 Profile-QSAR Selectivity Modeling
3.4.1 SAR Similarity vs. Sequence Similarity
3.4.2 Optimization of the Kinase-Kernel
3.4.3 Primary Interpolation Weights
Optimized by Mixture Design
3.4.4 Genetic Algorithm Optimization of
Binding Site Residue Selection and
Neighbor Count
3.4.5 Most Relevant Residues
3.4.6 Performance Comparison
3.4.7 External Validation
Protein-Family Virtual Screening
Summary and Conclusions
4. Virtual Screening and Target Fishing for Natural
Products Using 3D Pharmacophores
Gerhard Wolber and Judith M. Rollinger
4.1.1 Multitargeted Nature of
Natural-Based Compounds
4.1.2 Daunting Challenges in Handling
of Natural Products
4.1.3 Advent of Data Mining Tools and
in silico Methods
Natural-Product Characteristics Compared
to Synthetic Compounds
3D Pharmacophore Models for Virtual
Screening and Activity Profiling
Resources for in silico Discovery of
Natural Products (Databases)
Success Stories of Pharmacophore-Based
Virtual Screening for Natural-Product
Activity Profiling for Identifying Novel
Biological Activities of Natural Products

5. Computational Analysis of Ligand-Binding Pockets
Felix Reisen and Gisbert Schneider
Software for Binding Site Comparison
Mathematical Models
5.3.1 Alignment-Based Methods
5.3.2 Vector-Based Representations
Protein Structure Representation
5.4.1 Group I Methods
5.4.2 Group II Methods
5.4.3 Group III Methods
Comparison of Software Performance
6. Binding Site Similarity Search to Identify Novel
Target–Ligand Complexes
Didier Rognan
Why Focusing on Binding Sites
Sequence-Based Approaches
6.2.1 From Target Sequences to Ligands
6.2.2 From Target Sequences to
Pharmacophores to Ligands
6.2.3 Simultaneous Usage of Target
Sequence and Ligand Descriptors
Structure-Based Approaches
6.3.1 Pocket Detection and Comparisons
6.3.2 Drug Repurposing by Binding Site 3D
Similarity Search
7. ChemProt: A Disease Chemical Biology Database
Olivier Taboureau and Tudor I. Oprea
Bioactivity and Biological Repositories
7.2.1 Bioactivity Databases
7.2.2 Biological Databases
Prediction Methods
The ChemProt Server: Instructions and Output
Perspectives on ChemProt
8. Scientific Requirements for the Next-Generation
Semantic Web-Based Chemogenomics and Systems
Chemical Biology Molecular Information System OPS
Edgar Jacoby, Kamal Azzaoui, Stefan Senger,
Emiliano Cuadrado Rodríguez, Mabel Loza,
Antony J. Williams, Victor de la Torre, Jordi Mestres,
Olivier Taboureau, Matthias Rarey, Christine Chichester,
Niklas Blomberg, Lee Harland, Barbara Zdrazil,
Marta Pinto, and Gerhard Ecker
8.3.1 Provide All Oxidoreductase Inhibitors
Active <100 nM in Both Humans
and Mice?
8.3.2 Given Compound X, What Is Its
Predicted Secondary Pharmacology?
What Are the On- and Off-Target
Safety Concerns for a Compound?
What Is the Evidence and How Reliable
Is That Evidence (Journal Impact Factor,
KOL) for Findings Associated with a
8.3.3 Given a Target, Find Me All Actives
against That Target? Find/Predict the
Polypharmacology of the Actives?
Determine the ADMET Profile of
8.3.4 For a Given Interaction Profile, Give Me
Compounds Similar to It?
The Current Factor Xa Lead Series Is
Characterized by Substructure X.
Retrieve All Bioactivity Data in Serine
Protease Assays for Molecules That
Contain Substructure X?
Retrieve All Experimental and Clinical
Data for a Given List of Compounds
Defined by Their Chemical Structure
(with Options to Match
Stereochemistry or Not)?
A Project Is Considering PKCα as a
Target. What Are All the Compounds
Known to Modulate the Target Directly?
What Are the Compounds That May
Modulate Other Members of the PKC
Subfamily of Kinases? That Is, Return All
Compounds Active in Assays Where
the Resolution Is at least at the Level
of the Target Family, Both from
Structured Assay Databases and the
Give Me All Active Compounds on
a Given Target with the Relevant
Assay Data?
Give Me the Compound(s) Which
Hit Most Specifically the Multiple
Targets in a Given Pathway (Disease)?
Identify All Known Protein–Protein
Interaction Inhibitors?
For a Given Compound, Which Targets
Have Been Patented in the Context of
Alzheimer’s Disease?
For My Specific Target, Which Active
Compounds Have Been Reported in the
Literature? What Is Also Known about
Upstream and Downstream Targets?
What Ligands Have Been Described
for a Particular Target Associated with
Transthyretin-Related Amyloidosis,
What Is Their Affinity for That Target,
and How Far Are They Advanced into
Preclinical/Clinical Phases, with Links
to Publications/Patents Describing
These Interactions?
8.3.14 What Compounds Agonize Targets in
Pathway X Assayed in Only Functional
Assays with a Potency <1 μM?
8.3.15 Target Druggability: Compounds
Directed against Target X Have Been
Tested in What Indications? What New
Targets Have Appeared Recently in the
Patent Literature for a Disease? Has the
Target Been Screened against in the
Company Before? What Information
on in vitro or in vivo Screens Has
Already Been Performed on a
8.3.16 What Chemical Series Have Been
Shown To Be Active against Target X?
What New Targets Have Been
Associated with Disease Y? What
Companies Are Working on Target X or
Disease Y?
8.3.17 For a Given Disease/Indication, Give
Me All Targets in the Pathway and All
Active Compounds Hitting Them?
8.3.18 For a Given Compound, Give Me the
Interaction Profile with Targets?
8.3.19 For a Given Compound, Summarize
All “Similar Compounds” and Their
8.3.20 What Compounds Are Known To Be
Activators of Targets Which Relate to
Parkinson’s Disease or Alzheimer’s










In the last decade chemogenomics emerged as a new interdisciplinary biomedical research field that aims to identify systematically
all ligands and modulators for all gene products in order to allow
the accelerated study of their function and the discovery of new
Computational chemogenomics focuses on new in silico
applications, such as compound library design and virtual screening,
to expand the bioactive chemical space, target hopping of chemotypes
to identify synergies within related drug discovery projects, or to
repurpose known drugs, elucidation of the mechanism of action of
compounds, or identification of off-target effects by cross-reactivity
and profiling analysis. Both ligand-based and structure-based in
silico approaches, as reviewed in this book, play important roles in
all these applications.
Written by leading scientists involved in chemogenomics
research, this book provides a comprehensive overview of the
current state-of-the-art computational chemogenomics approaches.
Examples from both academic and pharmaceutical settings
are provided.
Chapter 1, contributed by Dr. Herbert Koeppen and Dr. Michael
Bieler from Boehringer Ingelheim, provides a general introduction
to chemogenomics approaches for the quantitative analysis of
biological targets.
Prof. Jean-Louis Reymond of the University of Berne provides in
chapter 2 a general introduction to chemical space, which is the first
dimension of the chemogenomics structure–activity relationship
(SAR) space.
In chapter 3, Dr. Prasenjit Mukherjee and Dr. Eric Martin from
Boehringer Ingelheim and Novartis, respectively, illustrate with the
Profile-QSAR and Kinase-Kernel approaches new virtual screening
technology applied to the kinase target family, which is the prime
target family of the last decade of drug discovery.
Prof. Gerhard Wolber from the Freie Universität Berlin and
Prof. Judith Rollinger from the University of Innsbruck provide in
chapter 4 virtual-screening and target-fishing applications of natural
product compounds using the LigandScout three-dimensional (3D)
pharmacophore-searching platform.
The groups of Prof. Gisbert Schneider from ETH Zürich and
Prof. Didier Rognan from the University of Strasbourg provide,
respectively, in chapters 5 and 6 an overview of the computational
analysis of ligand-binding pockets and the application of binding site
similarity search to identify novel target–ligand complexes.
In chapters 7 and 8, the teams of Prof. Olivier Taboureau from the
Danish Technical University and Prof. Tudor Oprea at the University
of New Mexico and the IMI OpenPhacts consortium describe,
respectively, the ChemProt and OpenPharmacologicalSpace
molecular information systems, which are prototypes for future
chemogenomics databases.
I gratefully acknowledge all chapter authors for their
excellent scientific contributions and their willingness to share their
insights and strategic viewpoints on chemogenomics, which make
this book especially interesting to read.
I also thank Archana Ziradkar, Sarabjeet Garcha, and Stanford
Chong from Pan Stanford Publishing for the invitation to edit this
review book and for their commitment in handling the entire
production work.
I’m delighted with this book and hope that you, the reader, will
find it both informative and enjoyable.
Edgar Jacoby
Beerse, September 2013

Chapter 1

Chemogenomics Approaches for the
Quantitative Comparison of Biological Targets

Herbert Koeppen and Michael Bieler
Boehringer Ingelheim Pharma GmbH & Co. KG, Lead Identification and
Optimization Support, Biberach/Riss, Germany



“Chemogenomics” is a term coined about 10 years ago [1]. Traditional
approaches for the identification of bioactive compounds use a
chemical library, a single target protein, and an assay, which allows
us to measure the activity of these compounds against the selected
target. In contrast, chemogenomics aims at the identification of the
bioactivity of all these compounds against multiple targets and even
beyond: in a very general sense the goal of chemogenomics is the
exploration of all possible ligand–target interactions, or in other
words the identification of bioactive compounds from the chemical
space for all targets of the biological space.

Computational Chemogenomics
Edited by Edgar Jacoby
Copyright © 2014 Pan Stanford Publishing Pte. Ltd.
ISBN 978-981-4411-39-4 (Hardcover), 978-981-4411-40-0 (eBook)



The compound–target matrix plays a central role in
chemogenomics. Its columns are formed by the set of all possible
targets encoded in the genes of organisms (not necessarily only
human genes), and the rows represent all the compounds that
span the huge chemical space of fragments and lead- or drug-like
compounds. The matrix elements describe the biological interaction,
for example, a classification as active/inactive or a quantitative
description by IC50/EC50 or raw % CTRL values. Each row of this
matrix displays the activity profile (the bioprint) of a compound, and
each column displays the compound-binding profile of a target (the
chemoprint).
Regarding experimental data the compound–target matrix
is and will remain extremely sparse. Given the huge size of the
relevant chemical space and tens of thousands of potential targets, it
is obviously impossible to fill the matrix with assay data. Hence, in
silico approaches are the alternative to complement the bioprints of
the compounds and the chemoprints of the targets.
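
The row and column view of the compound–target matrix can be illustrated with a minimal sketch. The compound names, target names, and pIC50-like values below are invented for illustration only; missing measurements are stored as NaN to mimic the sparsity discussed above.

```python
import numpy as np

# Hypothetical toy data: activity values (e.g., pIC50); NaN means "not measured".
compounds = ["cpd_A", "cpd_B", "cpd_C", "cpd_D"]
targets = ["T1", "T2", "T3"]
matrix = np.array([
    [7.2, np.nan, 5.1],
    [np.nan, np.nan, 6.4],
    [8.0, 6.9, np.nan],
    [np.nan, np.nan, np.nan],
])

def bioprint(compound):
    """Row of the matrix: activity profile of one compound across all targets."""
    return dict(zip(targets, matrix[compounds.index(compound)]))

def chemoprint(target):
    """Column of the matrix: compound-binding profile of one target."""
    return dict(zip(compounds, matrix[:, targets.index(target)]))

def fill_rate(m):
    """Fraction of matrix cells that hold an experimental value."""
    return float(np.count_nonzero(~np.isnan(m))) / m.size

print(bioprint("cpd_C"))
print(chemoprint("T1"))
print(f"fill rate: {fill_rate(matrix):.2f}")
```

Even in this toy case fewer than half of the cells are filled; in a realistic matrix the filled fraction would be far smaller, which is why the empty cells have to be complemented by prediction.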
Calculating the interaction strength of a wide diversity of
compounds and targets is a challenging goal, and
computational chemogenomics is still far from mature enough to
always provide reliable predictions. Despite this, it is a very attractive
goal for pharmaceutical research. The prediction of the biological
profile of compounds would allow the identification of potential
off-targets, which may cause unwanted side effects of a drug. This
information would help to prioritize the targets for the safety
profiling and could be used to optimize compounds toward reduced
side effects. Knowledge of the similarity between proteins can pave
the way to chemical starting points or tools for innovative targets.
There is also increasing evidence that most if not all drugs bind to a
variety of targets (called polypharmacology) with relevance for the
therapeutic action of the drugs and/or for the side effects [2]. The
knowledge of the target spectrum of a drug is crucial information
for the so-called drug repurposing where a known drug is applied to
a new disease. It can also help to get better insight into the
disease-relevant targets and pathways and to identify new and better
approaches to treat a disease, for example, by multitarget drugs [3,
4]. Moreover, in phenotypic screening the target is mostly unknown
and the activity profile of an active compound may be the key to the
identification of the relevant target(s) [5].


The basic assumption that guides all computational approaches
in chemogenomics is that similar compounds bind to similar targets
and therefore show a similar binding profile [6]. Conversely, targets
that bind similar ligands have similar binding sites. The fundamental
question in chemogenomics is how to measure and compare the
similarity of compounds and targets, respectively. It is worth
mentioning that compounds with a similar bioprofile may nevertheless
have dissimilar structures [7, 8], which is the reason why biological
fingerprints have a potential for scaffold hopping. The successful use
of these descriptors in virtual screening goes back to the 1990s and
was one of the earliest applications of the concept of chemogenomics
[9, 10].
An increasing wealth of experimental data about target–ligand
interactions is available in the public domain, which is at least
partly compiled in annotated chemical libraries. Therefore most
of the chemogenomics studies rely on this data even if its quality,
comparability, and completeness may be difficult to assess.
The term “chemogenomics” has been defined as the discovery
and description of all possible drugs to all possible drug targets [1].
The interaction of chemical and biological matter may be tackled
starting from either end. Target-based approaches [11] are grounded
on a similarity measure derived from the quantitative comparison of
sequence and/or three-dimensional (3D) structural information on
targets. The subsequent binding profile prediction is based on the
assumption that known ligands of a similar target may also bind to
the target of interest, which allows us to make predictions for orphans
without known ligands, too. Docking of compounds to experimental
or calculated protein structures is an alternative approach to use the
3D structure of targets [12, 13]. Whereas sequence data is available
for all interesting targets due to the mapping of the human genome,
3D information is still incomplete but rapidly growing [14].
Ligand-based approaches, which are limited to targets with at
least one known active compound, start from the other end and
can include available information about target–ligand interactions
in different ways. They all build upon existing knowledge about
bioactive compounds. Some approaches just categorize ligands
as active or inactive and compare targets based on the similarity
of their ligand sets. Others consider explicitly the strength of the
target–ligand interaction in the model-building process. Last but not
least both protein and compound information can be used in
target–ligand-based approaches. Some methods combine target similarity
with information about known ligands; others take the details of
the interaction on the atomic level into account to derive predictive
models with machine-learning techniques.
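
The simplest form of the first idea, comparing targets by the overlap of their sets of active compounds, can be sketched as follows. This is an illustrative toy, not one of the published methods: the target names and compound identifiers are invented, and the overlap is counted on identical compounds only, whereas real approaches usually compare ligand sets through chemical similarity between their members.

```python
from itertools import combinations

def tanimoto(set_a, set_b):
    """Tanimoto (Jaccard) coefficient of two sets: |A ∩ B| / |A ∪ B|."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Hypothetical annotated library: target -> identifiers of compounds labeled active.
actives = {
    "T1": {"c1", "c2", "c3", "c4"},
    "T2": {"c3", "c4", "c5"},
    "T3": {"c9"},
}

def pairwise_target_similarity(ligand_sets):
    """Similarity of every target pair, based only on shared active compounds."""
    return {
        (a, b): tanimoto(ligand_sets[a], ligand_sets[b])
        for a, b in combinations(sorted(ligand_sets), 2)
    }

for pair, score in pairwise_target_similarity(actives).items():
    print(pair, round(score, 2))
```

Note that a score of zero here only means that no compound is annotated as active on both targets; given the sparsity of the data it does not show that the targets are pharmacologically unrelated.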
The focus of this chapter is the quantitative comparison of targets
by chemogenomics where the similarity assessment of targets is
based upon their interaction with chemical matter. This provides a
pharmacologically relevant description that can be used to predict
potential off-targets or to use ligands of similar targets as chemical
starting points for a new target.
Section 1.2 presents an introduction to purely target-based
similarity measures, which are treated in more detail in other
chapters of this book. In Section 1.3 selected approaches are discussed
that compare targets based on ligand information. The discussed
examples encompass methods that categorize compounds into active
and inactive ones, as well as methods that use quantitative affinity
data. A further category of approaches called proteochemometrics
aims for the characterization of the interaction between ligands and
their targets on a very detailed level. This can be done by employing
ligand descriptors, protein descriptors, and cross terms in one single
model. These approaches are mostly used to predict compound–
target interactions. The quantitative comparison of targets is a very
rare application of proteochemometrics modeling. Therefore these
approaches are not further discussed in this chapter. The interested
reader is referred to a number of recent reviews [15–17].
Whenever experimental data is used to train models and to make
predictions for compounds or targets not yet seen by the models, the
data size, its diversity, and quality are of crucial importance. This is
true both for bioactivity data of ligands and for the structural data
of targets. As already mentioned the compound–target matrix is
extremely sparse, and data completeness becomes an issue for the
quality of predictions. Moreover the quality of the available data needs
to be critically evaluated. This aspect is discussed in Section 1.4.
Section 1.5 highlights some selected publications, which describe
prospective applications of target similarity assessments. A brief
summary in Section 1.6 closes this chapter.
The status of chemogenomics was reviewed in recent publications,
which can be found in Refs. [5, 13, 16–20].

1.2 Target-Based Similarity Methods

Similarity measures, which are grounded on target properties
alone, provide the most direct access to target comparison. Such an
approach does not need bioactive compounds to derive a similarity
relationship between targets. It therefore allows searching for
off-targets for a drug without limitation to targets with known ligands.
In the case of orphans, ligands of similar targets can serve as a
chemical starting point in deorphanization.
On the other hand ligand-based methods have a direct relationship
to the pharmacological action of chemical matter, while any kind of
similarity derived from sequence or structural data alone needs
to be translated into a pharmacologically relevant scale, which is a
critical and error-prone step [21, 22].
Methods and tools to compare and to hierarchically cluster
proteins based on their sequence similarity are well established,
and the so-called phylogenetic trees not only are the basis for the
reconstruction of evolutionary history of life [23] but also are
routinely employed for the identification of potential targets for
selectivity testing or safety profiling in drug research. It is, however,
well known that targets from different families with low sequence
similarity may nevertheless bind similar ligands [24]. One example
is the family of serotonin (5-HT) receptors. They are all activated
by the neurotransmitter serotonin and belong to the superfamily
of G protein–coupled receptors (GPCRs), with the exception of
5-HT3, which is an ion channel with very low overall similarity to
the other members [25]. The same phenomenon can be observed
with the nicotinic and muscarinic acetylcholine receptors. The
inducible cyclooxygenase-2 (COX-2) and the totally unrelated
carbonic anhydrase (CA) show affinity to the same inhibitors
celecoxib and valdecoxib [26]. Glycogen synthase kinase 3 beta
(GSK3β) and cyclin-dependent kinase 4 (CDK4) show very similar
structure–activity relationships despite a sequence identity of only
28% [27]. Another example is the frequently observed binding of
compounds to the human Ether-à-go-go-Related Gene (hERG) ion
channel. The primary target of the vast majority of these compounds
is phylogenetically unrelated to the hERG channel, yet persistent
binding to this antitarget is one of the major reasons for the early
termination of drug research projects [28].




It may therefore be misleading to compare the overall sequence of
targets. Instead, the focus should be on those parts that are relevant
for ligand binding. These binding sites can be described by their
(discontinuous) amino acid sequences or by the 3D properties of the
binding pockets. The latter approach requires either experimental
3D structures of proteins or reliable homology models. The accurate
identification of the binding sites is a prerequisite for meaningful
An example for the use of sequence data combined with limited
structural information regarding the location of the binding site
can be found in the work of Surgand et al. [29]. The authors aligned
the transmembrane domain sequences of 369 human GPCRs and
used the X-ray structure of the bovine rhodopsin-retinal complex to
identify 30 discontinuous amino acids that are most likely to form
the ligand-binding site. Clustering of the receptors based on these 30
amino acids yielded a phylogenetic tree that displays the relationship
between the receptors. Gloriam et al. [30] revisited this approach
later by considering the additional GPCR X-ray structures that had
been published in the meantime. They expanded the set of relevant
amino acids to 44, including all previously identified 30 residues.
Although the details differ, the clusterings published by
Surgand et al. and by Gloriam et al. both reflect very well
the results of the phylogenetic analysis by Fredriksson et al. [31],
which was based on the full sequence of 342 GPCRs. Interestingly,
a clustering that is based only on those residues out of the 44 that
most likely interact with the group of bioaminergic ligands yields
different results that better reflect the pharmacologically relevant
receptor similarity of the respective GPCRs [30]. Milletti et al. came
to a similar conclusion when they compared binding sites of kinases
[32]. They found that the prediction of the target-binding profile of a
compound requires the selection of the appropriate subpockets that
are actually occupied by the compound.
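
The effect of restricting a sequence comparison to binding-site residues can be illustrated with a minimal sketch. The two aligned sequences and the list of binding-site columns below are invented for illustration and do not correspond to any real receptor or to the residue sets of the cited studies.

```python
def identity(seq_a, seq_b, positions=None):
    """Fraction of identical residues between two aligned sequences.

    positions: optional alignment columns (0-based) to restrict the comparison
    to, for example the discontinuous residues lining a binding site.
    Columns containing a gap ('-') in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if positions is None:
        positions = range(len(seq_a))
    pairs = [
        (seq_a[i], seq_b[i])
        for i in positions
        if seq_a[i] != "-" and seq_b[i] != "-"
    ]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical aligned fragments of two receptors.
receptor_x = "MKTLDVSAYF"
receptor_y = "MRSLDVTAYW"
# Hypothetical alignment columns assumed to line the ligand-binding site.
binding_site_columns = [3, 4, 5, 8]

print("full-length identity: ", identity(receptor_x, receptor_y))
print("binding-site identity:", identity(receptor_x, receptor_y, binding_site_columns))
```

In this toy case the two sequences differ at four of ten positions overall but are identical at the assumed binding-site columns, so the two measures would rank the pair differently. A matrix of such pairwise values could then be passed to any hierarchical clustering routine.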
Even a close neighborhood of receptors measured in terms of the
binding-relevant amino acid sequence does not always result in a
similar ligand-binding profile. The bradykinin receptors B1 and B2
are closely related in all sequence-based analyses. The endogenous
peptidic ligand bradykinin is bound by B2 but not by B1, which
is caused by a single residue exchange between B1 and B2. A Ser
residue in the transmembrane helix 3 of B2 is replaced by Lys in B1.
The positive charge in the binding site of B1 repels the C-terminal
Arg in bradykinin. In contrast, the C-terminally truncated
des-Arg9-bradykinin is bound by B1 since the negative charge of the
C-terminus is attracted by Lys [33]. This example demonstrates
that seemingly minor changes in either binding sites or ligands may
drastically influence the binding profile. These “activity cliffs,” that is,
discontinuities in structure–activity relationships, are found when
looking both at targets and at ligands. Activity cliffs have been a
focus of interest for a number of years, and the interested reader
is referred to a recent review [34].
The comparison of targets using the detailed 3D structure
analysis of binding sites typically involves the following four steps:
(i) identification of binding pockets, (ii) conversion of the residues
lining the pockets into a simplified representation, (iii) alignment
of the patterns, and (iv) quantitative assessment of the similarity
between the patterns by a scoring function.
There are several programs available for the first step, the
identification of binding pockets, which were recently reviewed [35].
The authors conclude that encouragingly the programs perform
well and can tolerate deviations up to 2 Å (heavy atoms) between
ligand-unbound and ligand-bound protein structures. Limitations
are encountered if binding pockets are very narrow in the unbound
state. The authors also tested the ability of the programs to cope
with homology models and found that the quality of predictions was
comparable to the one found with native proteins in the tested cases.
They observed again that too narrow binding sites in the homology
model are a hurdle for the prediction and noticed that the overall
quality of the structure in terms of root mean square (RMS) deviation
does not correlate with the modeling quality of the binding site. It is
suggested to use molecular dynamics calculations to generate a more
realistic picture of the plasticity of the binding site. An alternative
could be to incorporate the ligand into the homology modeling
process, as described by Dalton et al. [36].
A number of recent reviews focus on steps 2–4 in the
above-mentioned binding site comparison process [12, 21, 24]. A multitude
of approaches were developed to tackle the problem, and examples
of the successful identification of similar binding sites not closely
related by sequence are presented for all these methods. In addition
to alignment-dependent approaches, alignment-free methods were
also published in recent years, which avoid the time-consuming and
critical alignment step [37–40].
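
The idea behind an alignment-free comparison can be illustrated with a generic sketch that is not meant to reproduce any of the cited methods. It assumes that a pocket has already been reduced to a set of 3D points (step ii), encodes the pocket as a histogram of all pairwise point distances, and scores two pockets by the overlap of their histograms. Because distances do not change under rotation or translation, no superposition is needed. The coordinates, bin width, and distance limit are arbitrary choices made for illustration.

```python
import numpy as np

def distance_fingerprint(points, bin_width=2.0, max_dist=20.0):
    """Histogram of all pairwise distances between pocket points (in Å).

    Distances above max_dist are ignored by the histogram.
    """
    pts = np.asarray(points, dtype=float)
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(pts), k=1)  # each pair counted once
    edges = np.arange(0.0, max_dist + bin_width, bin_width)
    hist, _ = np.histogram(dists[upper], bins=edges)
    return hist.astype(float)

def fingerprint_similarity(fp_a, fp_b):
    """Tanimoto-like score for count vectors: sum of minima / sum of maxima."""
    denom = np.maximum(fp_a, fp_b).sum()
    return float(np.minimum(fp_a, fp_b).sum() / denom) if denom > 0 else 0.0

# Hypothetical pocket, the same pocket rotated by 90° about z and shifted,
# and a much smaller, unrelated pocket.
pocket = np.array([[0, 0, 0], [3, 0, 0], [0, 5, 0], [0, 0, 7]], dtype=float)
rotation_z = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
moved_pocket = pocket @ rotation_z.T + np.array([10.0, -4.0, 2.5])
small_pocket = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)

fp = distance_fingerprint(pocket)
print(fingerprint_similarity(fp, distance_fingerprint(moved_pocket)))
print(fingerprint_similarity(fp, distance_fingerprint(small_pocket)))
```

A plain distance histogram discards the chemical nature of the points and cannot distinguish mirror images, so practical descriptors typically annotate the points with physicochemical properties. The binning also introduces the kind of fuzziness that tolerates small conformational changes, at the price of reduced accuracy.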




Despite all advances in this field there are still challenges. Most
methods are able to detect highly similar binding sites but may vary
in their performance if the binding sites are of medium similarity
[38]. The definition of which residues of a protein are involved in binding
to a particular ligand is not clear without referring to the X-ray
structure of the protein–ligand complex. It was already mentioned
that the degree of the sequence-based similarity between two targets
changes if only the residues that are relevant for binding are taken
into consideration. The same is true if the comparison is grounded
on the 3D properties of the binding site [32]. Ligands do not need to
fill a binding site completely to achieve sufficient affinity but instead
may interact only with subpockets. The ability of an algorithm to
identify similarity of targets on the subcavity level is therefore a
necessary and important feature [26, 40, 41].
Even similar binding sites may exhibit variations in shape upon
ligand binding due to the plasticity of the protein structure. Hence,
algorithms for binding site comparison need to show some degree of
fuzziness, while on the other hand a sufficient accuracy is required
[22, 42].
There is no generic and unambiguous similarity score threshold
that separates similar binding sites in terms of ligand-binding
properties from dissimilar ones. Similarity searches in the binding
pocket space only enrich true-positive results
among false positives and will miss some truly similar sites (false
negatives). Any cutoff
is a compromise between recall and precision and may be case
dependent, quite similar to searching for bioactive compounds by
virtual screening in the chemical space [43]. Moreover, one should
also be aware of cases where a target is predicted to bind a specific
ligand. A target–ligand complex may actually be formed in agreement
with the prediction, but the affinity of the ligand may be too weak
(e.g., >10 μM) to be recognized in the particular assay, and the target
will erroneously be considered as a false-positive hit.
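
The trade-off between recall and precision can be made concrete with a small sketch. The similarity scores and the labels marking which retrieved binding sites truly share ligands are invented for illustration.

```python
def precision_recall(scores, labels, cutoff):
    """Precision and recall when every score >= cutoff is called 'similar'.

    labels[i] is True if hit i is a truly similar binding site.
    Returns 0.0 for a quantity whose denominator would be zero.
    """
    predicted = [s >= cutoff for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical ranked hit list from a binding-site similarity search.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.20]
labels = [True, True, False, True, False, False]

for cutoff in (0.75, 0.50):
    p, r = precision_recall(scores, labels, cutoff)
    print(f"cutoff {cutoff:.2f}: precision {p:.2f}, recall {r:.2f}")
```

With the strict cutoff every retrieved site is a true positive but one similar site is missed; with the relaxed cutoff all similar sites are found at the cost of one false positive. The labels themselves depend on assay sensitivity, so a weak but real binder that the assay cannot detect would be counted here as a false positive.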
No binding site comparison approach can currently successfully
deal with a situation where a ligand binds to two different targets
but in different orientations that fit into binding sites that hardly
show any substantial similarity. In principle, docking could be an
alternative but requires a substantial computational effort and a
careful assessment of the docking poses due to the limitations of
current scoring functions. The interested reader is referred to recent
publications [12, 13, 44].

1.3 Ligand-Based Target Comparison

At first glance it might appear surprising that ligands, that is, small
molecules, are suited to deduce quantitative similarity information
on targets. The reason for this is the nature of the ligand–target
interaction. Emil Fischer was the first to postulate in 1894 that the
interaction of a protein with a small molecule can be described in
a simplified analogy by the lock-and-key principle. Although it is
known today that the binding process between ligand and target
is much more complex than Fischer supposed, a complementarity
between both partners is required. Therefore, a single ligand can be
considered as a negative imprint of its own specific interaction site
and the sum of all ligands of a target provides a negative imprint of
the complete binding site. The latter statement, however, is true only
if the known ligands of a target describe the ligand–target interaction
in its entirety.
Regarding the available knowledge that is applied to gain
information from the ligands, three types of target comparison
procedures can be distinguished. Chemocentric approaches use
the ligand sets of targets themselves to deduce knowledge on the
respective target/binding site similarities. Chemoprint comparisons
make use of biological activity data or comparable parameters that
indicate the strength of the relationship between ligand and target,
and proteochemometric approaches describe the interaction of each
ligand with its target on a very detailed level. All approaches require
known ligands for each target to be described and thus cannot be used
for orphan targets. Despite this limitation, the development and application
of ligand-based methods are the focus of many publications, and
substantial progress has been made in recent years.


Chemocentric Approaches

In 2007 Keiser et al. introduced the similarity ensemble approach
(SEA) [45]. SEA is an application of the “similarity principle,” which
states that structurally similar compounds have similar biological
activity, reflecting the long-standing experience of medicinal
chemists [6]. The first systematic studies regarding the validity of
the similarity principle were published in the mid-nineties [46]. In
the following years, different research groups reached different
conclusions. In a careful analysis of biological data collected at
Abbott, Martin et al. [47] found that there is only a 30% chance that a
compound with a Tanimoto similarity ≥0.85 to an active compound is
active itself. Although much better than random, this is a surprisingly
low value. The similarity principle with respect to the neighborhood
behavior within a combinatorial library was recently revisited by
Horvath et al. [48]. The authors basically confirmed the conclusions of
the former study and in particular found a strong and unpredictable
dependence of search results on the employed query, despite the
variety of descriptor spaces that were used. These results shed some
light on the limitations of the similarity principle. As pointed out by
Maggiora [49], rugged activity landscapes with activity cliffs are much
more frequent than assumed in the past. For this reason the similarity
principle should be taken not as a fundamental law but as
a valuable guideline with exceptions.
SEA goes beyond a simple similarity search. Inspired by the
bioinformatic method BLAST [50], SEA uses a statistical model to
derive an expectation value (E-value) for the comparison of compound
sets rather than using the Tanimoto similarity itself. This E-value
describes the significance of the pairwise similarity of compounds
or compound sets. Thus, if compound sets for two distinct targets
are compared, the E-value can also be considered as the strength of
the relationship between these two targets. To derive the underlying
statistical model, Keiser et al. analyzed the similarity of random
compound sets of different sizes from the MDL Drug Data Report
(MDDR, Accelrys, San Diego, CA, USA) using Daylight fingerprints and
the Tanimoto coefficient Tc. For each combination of compound sets
a raw score was calculated as the sum of all pairwise comparisons of
the compounds from set 1 with all compounds from set 2. Raw scores
were calculated for 300,000 pairs of random sets in a size range
between 10 and 1,000. The mean raw score, which depended linearly
on the product of the set sizes, and the standard deviation of the
raw scores were each fit against the product of the set
sizes, resulting in two functions for set-size-dependent expected
mean raw scores and raw score standard deviations. Based
on these expectation values and the individual raw scores, Z-scores
were calculated. The procedure was repeated
using Tc thresholds for the calculation of raw scores in the range
between 0.00 and 0.99, and the resulting Z-score distributions were
fit to an extreme value distribution.
The distribution derived from a threshold of 0.57 resulted in the best
fit. For this reason only Tc values ≥ 0.57 are used in SEA calculations.
Thus, if two ligand sets do not have a single pair of compounds with
a similarity of at least 0.57, the raw score is 0. It is important to
note that this threshold needs to be recalibrated if descriptors other
than Daylight fingerprints are used.
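The raw score described above can be sketched in a few lines. This is a minimal illustration, not the original SEA implementation: fingerprints are represented as plain Python sets of on-bit indices (an assumption made for simplicity), the bit indices are invented, and the function names `tanimoto` and `raw_score` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def raw_score(set_1, set_2, threshold=0.57):
    """Sum the Tc of all cross-set compound pairs with Tc >= threshold."""
    total = 0.0
    for fp_a in set_1:
        for fp_b in set_2:
            tc = tanimoto(fp_a, fp_b)
            if tc >= threshold:
                total += tc
    return total


# Toy fingerprints (invented bit indices, not real Daylight fingerprints).
ligands_target_1 = [{1, 2, 3, 4}, {1, 2, 3, 5}]
ligands_target_2 = [{1, 2, 3, 4, 6}, {7, 8, 9}]
score = raw_score(ligands_target_1, ligands_target_2)
```

In this toy case only one cross-set pair reaches the threshold (Tc = 4/5), so it alone contributes to the raw score; the pair with Tc = 3/6 is discarded.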
With the functions for the expectation values of the mean raw
score, raw score standard deviation, and the Z-score distribution, an
E-value can be calculated for every comparison of two ligand sets.
This E-value quantitatively describes the probability of obtaining the
same or a better raw score just by chance. Keiser et al. [45] used this
method to analyze a 246-receptor subset of the MDDR. The pairwise
comparison of all ligand sets as described above showed that
the majority of the compound sets had a similarity no better than
random. Only 5% of the calculated E-values could be interpreted as a
statistically significant similarity between the targets based on their
ligand sets; these were displayed as a cross-target similarity network. The
authors found that on average any given receptor was similar to 5.8
other receptors with an E-value < 10⁻¹⁰.
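The chain from raw score to Z-score to E-value might be sketched as follows. Several parts are assumptions and not taken from the original work: the two fit functions are passed in as arguments because their coefficients are not given here, the extreme value distribution is taken to be a Gumbel distribution rescaled to zero mean and unit variance, and the E-value follows the BLAST-like convention of multiplying the tail probability by the number of comparisons. All function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def z_score(raw, n_product, mean_fn, sd_fn):
    """Standardize a raw score using the set-size-dependent expectations."""
    return (raw - mean_fn(n_product)) / sd_fn(n_product)


def gumbel_tail(z):
    """P(Z >= z) for a Gumbel distribution with zero mean and unit variance."""
    beta = math.sqrt(6.0) / math.pi          # scale that gives unit variance
    x = math.exp(-z / beta - EULER_GAMMA)    # location chosen so the mean is zero
    return -math.expm1(-x)                   # 1 - exp(-x), accurate for tiny x


def e_value(z, n_comparisons):
    """Expected number of equally good or better scores among n_comparisons."""
    return gumbel_tail(z) * n_comparisons


# Placeholder fit functions with invented coefficients (illustration only).
def mean_fn(n_product):
    return 1.0e-3 * n_product


def sd_fn(n_product):
    return 5.0e-3 * math.sqrt(n_product)


z = z_score(raw=12.0, n_product=50 * 80, mean_fn=mean_fn, sd_fn=sd_fn)
e = e_value(z, n_comparisons=30135)
```

Passing the fit functions as parameters keeps the sketch independent of any particular fingerprint type, in line with the remark that the calibration must be repeated when the descriptors change.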
Moreover, the ligand-derived E-value from this statistical model
can directly be compared with a sequence-derived E-value from
a BLAST search. Hert et al. [51] selected 193 ligand sets from the
MDDR database with known sequences of the relevant targets. The
pairwise comparison of the ligand sets and targets using SEA and
BLAST, respectively, yielded a rank order of similarities, which were
analyzed by the Spearman rank-order correlation coefficient. The
authors found few examples (e.g., serine proteases) where ligand and
target sequence similarities agreed well, but such correspondences
were more the exception than the rule. In general there was no
correlation between the sequence- and the ligand-based similarity.
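For readers unfamiliar with the Spearman coefficient, the following sketch shows how two rank orders can be compared. It uses the textbook shortcut formula, which is valid only when there are no tied values (an assumption of this sketch); the similarity values are invented, and the function names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); no ties allowed."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Invented similarity values for the same five target pairs.
ligand_similarity = [0.9, 0.1, 0.5, 0.3, 0.7]
sequence_similarity = [0.2, 0.8, 0.4, 0.9, 0.1]
rho = spearman_rho(ligand_similarity, sequence_similarity)
```

A value near +1 would indicate that ligand-based and sequence-based rankings agree, a value near 0 that they are unrelated, and a negative value, as in this toy example, that they tend to disagree.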
To further support the validity of the statistical model and of the
SEA approach in a prospective manner, Keiser et al. [45] predicted
novel targets for the three known drugs methadone, emetine,
and loperamide. According to SEA, methadone should be an M3
receptor antagonist, emetine should antagonize the α2 receptor,
and loperamide was predicted to be an NK2 antagonist. All three
predictions could be confirmed by experiment (methadone: 1 μM
antagonist at M3; emetine and loperamide: micromolar antagonists
of the α2 and the NK2 receptor, respectively).
While SEA describes the similarity of ligand sets and thereby
the similarity of their targets by comparing the ligands themselves,
Bender et al. [52] presented a different approach, introducing the
“Bayes Affinity Fingerprint” (BAF). Rather than comparing two
compounds by their binary fingerprint, which indicates the presence
or absence of substructures, BAFs describe compounds by the scores
calculated by multiple activity-class-specific Bayesian models.
Bayesian models [53] are predicated on Bayes’s theorem, named
after the English mathematician and Presbyterian minister Thomas
Bayes (~1701–1761). In the implementation used by the authors,
the Bayesian model calculates the probability that any compound
containing a feature F from the ECFP_4 feature space belongs to an
activity class A, given the total number of compounds containing the
feature F and the number of compounds with feature F that belong
to activity class A. Bender et al. used the WOMBAT 2005.01 database
containing more than 100,000 bioactivity data points to train
1,003 activity-class-specific Bayesian models. The similarity of two
compounds is then expressed by the Pearson correlation coefficient
of their activity class scores.
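The final step, comparing two compounds through their activity class scores, can be illustrated as follows. The score vectors are invented and cover only four hypothetical activity classes instead of 1,003; how the scores are produced by the Bayesian models is outside the scope of this sketch, and the function name `pearson` is an illustrative choice.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented Bayesian scores of three compounds against four activity classes.
baf_a = [2.1, -0.5, 0.3, 1.8]
baf_b = [1.9, -0.7, 0.1, 2.0]
baf_c = [-1.0, 2.2, 1.5, -0.8]

similar = pearson(baf_a, baf_b)     # similar predicted activity profiles
dissimilar = pearson(baf_a, baf_c)  # opposite predicted activity profiles
```

Two compounds are thus judged similar when the models rank the activity classes in a similar way for both, even if the compounds share few substructural features.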
The approach used by Bender et al. is very similar to the
Prediction of Activity Spectra for Substances (PASS) [54]. However,
while PASS calculates the probability of a compound being active
at a given target, Bender et al. use the combined information from
1,003 targets as a compound descriptor to calculate intercompound
similarities. This is comparable to the approach of Kauvar et al. [55], who
used experimental assay data, and to the in silico–generated fingerprints
of Briem et al. [10], who employed the program DOCK. Applying the
BAFs as a similarity descriptor to a benchmark dataset [56], Bender
et al. improved the retrieval rates by about 24% in the top 5% of the
hit list compared to ECFP_4 as a descriptor set. These improved
retrieval rates indicate that the transformation from the graph-based
compound descriptor into a bioactivity space descriptor
incorporates some knowledge about the chemical space of the 1,003
targets.
Moreover, Bender et al. used this approach to compare the
targets themselves. The generated Bayesian models are composed
of a set of ECFP_4 features with positive and negative coefficients,
depending on the frequency of the feature in the active and inactive
compounds, respectively, of the training set. Model comparison
and the subsequent target comparison could be achieved by
comparing the coefficients of identical features for different activity
classes. Due to the large number of substructural features this
is a computationally very expensive task. Bender et al. therefore
