Handbook Of Molecular Descriptors

Molecular Descriptors For Chemoinformatics

Numerical characterization of molecular structure is a first step in many computational analyses of chemical structure data. These numerical representations, termed descriptors, come in many forms, ranging from simple atom counts and invariants of the molecular graph to distributions of properties, such as charge, across a molecular surface. In this article we first present a broad categorization of descriptors and then describe applications and toolkits that can be employed to evaluate them. We highlight a number of issues surrounding molecular descriptor calculations, such as versioning and reproducibility, and describe how some toolkits have attempted to address these problems.

1 Introduction

Computational methods play an important role in many chemical disciplines ranging from drug discovery to materials science. There is a plethora of techniques that differ in terms of computational complexity, time requirements and so on. However, the common requirement underlying all these methods is a formal description of the molecular structure.

There are many ways to “describe” a molecule. A common approach is to describe the connectivity, taking into account the types of atoms and bonds. In other words, one uses explicit representations of chemical structure such as SMILES, MDL/Symyx SD files and so on. While these descriptions are vital to modern chemical information systems, they do not necessarily allow computational techniques to be directly applied to them.

Methods that aim to predict chemical and biological properties generally require a numerical description of chemical structures. Such numerical forms range from a set of 3D coordinates which, coupled with appropriate atom types, is sufficient for methods such as quantum mechanical (QM) approaches and docking, to more abstract numerical descriptions derived from 2D or 3D representations, which can be useful in statistical approaches. It is now possible to evaluate thousands of numerical descriptors of chemical structure.

As will be discussed later, many of these descriptors are closely related or capture the same information, allowing one to be substituted for another. The selection of relevant descriptors is a well-known problem and, given a large collection of them, approaches to identify a suitable subset have been discussed extensively in the literature. The accompanying figure gives a summary depiction of the major types of descriptors and the form of molecular structure information that is required to compute them. The depiction is very general and focuses on small molecule descriptors.

As will be described in the following sections, molecular descriptors can be calculated for many chemical entities, not just small organic molecules.

Figure. A graphical summary of descriptor types and the type of input information required. As one goes from top to bottom the calculations become more intensive, but the results capture aspects of molecular structure more realistically.

In addition to there being many possible descriptors defined in the literature, there are also multiple implementations of a given descriptor. These implementations are available in the form of libraries (which require one to write a program to use them) or complete applications (graphical user interface (GUI) or command line).

As a result, not only must one choose one or more descriptors that are relevant to the problem at hand, but one must also be concerned about how they are calculated and whether such a calculation can be reproduced across different implementations of those descriptors. It is easy to understand why two implementations of the same descriptor can lead to different results. The primary reason is a difference in the chemistry model of the framework or toolkit used to implement the descriptor. For example, a descriptor that calculates the number of aromatic atoms may be implemented using two toolkits with differing aromaticity models, and hence it is possible that the values generated by the two implementations will differ. Other sources of differences include parameters that may be involved in the descriptor calculation and reference data values (such as atomic radii and electronegativity values) that are employed during descriptor calculation. While most implementations will employ the same data sources for standard concepts (e.g., atomic weights), slight differences in these types of input data can lead to differences in the final descriptor value. As a result, two implementations of a descriptor usually do not give exactly the same value, though the values are usually quite similar.

Explicitly explaining the differences may or may not be possible (it is usually more difficult for commercial implementations for which source code is not available).

2.1 Numerical descriptors

We have noted that a molecular structure can be characterized using many numerical descriptors. In this section we describe a categorization of descriptors. Admittedly, the grouping is somewhat arbitrary but serves as a useful guide. In addition, the categorization also takes into account the nature of the chemical structure being considered - some descriptors are only useful when applied to small molecules, whereas other descriptors are defined specifically for polymers or protein structures.

Broadly, one can classify descriptors based on the nature of structural information they require.

Constitutional descriptors only require atom and bond labels and usually represent counts of different types of atoms or bonds. While very simplistic, they can play a useful role in a variety of applications ranging from summaries of physicochemical properties to predictive modeling.

Topological descriptors take into account connectivity along with atom and bond labels. They consider the molecule as a labeled graph and characterize it using graph invariants. In addition to graph invariants, topological descriptors based on information theory, such as entropy, have also been defined. An advantage of these descriptors is that they do not require any intensive preprocessing steps such as 3D coordinate generation and conformational analysis. In addition, many graph invariants can be rapidly computed for graphs corresponding to small molecule structures (even though they may be computationally intractable on much larger graphs). An example of a topological descriptor is the Wiener index.

It is simply the sum of the edge counts in the shortest paths between all pairs of non-hydrogen atoms. From a physical standpoint, the descriptor characterizes the branching in a molecular structure. While this class of descriptors has been used extensively in predictive modeling applications (for example, the Wiener index described above has been useful in models of boiling point data), a key drawback is their general lack of interpretability. Given the abstract nature of many graph invariants and information theoretic descriptors, their use in predictive models makes it difficult to explain the model in simple physicochemical terms, though various attempts have been made at interpretations.

Geometric descriptors require a 3D conformation as input, and therefore the computations are not as fast as those for topological descriptors, which do not require the 3D coordinate generation step.
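As a concrete illustration of the Wiener index defined in this section, the following is a minimal sketch that sums shortest-path lengths over a hydrogen-suppressed molecular graph. The adjacency-dictionary input and the function name `wiener_index` are assumptions made for this example, not any toolkit's API, and the sketch assumes a single connected molecule.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path lengths (in bonds) over all unordered pairs of heavy atoms.

    `adjacency` maps each heavy-atom index to the indices of its bonded neighbours.
    The graph is assumed to be connected.
    """
    total = 0
    for source in adjacency:
        # Breadth-first search yields shortest path lengths in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in distance:
                    distance[neighbour] = distance[atom] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    # Every pair was visited twice, once from each end.
    return total // 2


# Hydrogen-suppressed carbon skeletons, written by hand for illustration.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The branched isomer gives the smaller value, which is the sense in which the index characterizes branching.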

Nearly all geometric descriptors work on a single conformation, and for cases where multiple conformations must be considered, an averaging procedure is usually employed. Examples of these descriptors include the gravitational index and moment of inertia descriptors. Geometric descriptors encompass many aspects of a molecular structure. For example, a number of them characterize the molecular surface. The simplest such example is the surface area descriptor (and the corresponding volume descriptor). A simple extension is to characterize the distribution of a physicochemical property over the molecular surface, as is done by the Charged Partial Surface Area (CPSA) descriptors, which characterize the partial charge distribution over the molecular surface.

Note that surface descriptor calculations based on analytical or tessellation algorithms can be slow, and that parameterized methods such as the Topological Polar Surface Area (TPSA) and MOE van der Waals Surface Area (VSA) descriptors exist as alternatives; these are based on precalculated surface area values derived from a list of functional groups.

It is important to note that there is a difference between the information a descriptor captures and how it is calculated. The above categories are based on input requirements. This gives rise to situations where the molecular volume property can be described both by geometrical descriptors, e.g., using analytical approaches, and by approximate, constitutional descriptors, such as the very fast algorithm used by the van der Waals Volume as a Sum of Atomic and Bond Contributions (VABC) approach, which is based on group contributions. The latter descriptor has been implemented in the Chemistry Development Kit (CDK) and correlates well with other approximations to the molecular volume.

2.2 Fingerprints

Besides the numerical descriptors outlined so far, another useful approach is that of fingerprints. Traditionally, fingerprint descriptors are represented in the form of bit strings.

These binary fingerprints can be divided into hashed fingerprints, where substructures (such as paths of length n) are converted to a string representation and then hashed to a pseudo-random bit position, and keyed fingerprints, where each bit position corresponds to a unique substructural feature. The nature of the features can range from simple functional groups (hydroxyl, carbonyl) and topological substructures (paths, chains, cycles) to atom environments and pharmacophoric elements. Depending on the definition, fingerprints can work with topology only or may require 3D conformations. The latter class of fingerprints is usually related to 3D pharmacophores, whereas most fingerprint definitions only require connectivity information.

Fingerprints are a compact representation of a chemical structure and are used in a variety of ways, ranging from database searches looking for similar structures or substructures to virtual screening studies. In many search applications, the binary nature of fingerprints allows for efficient algorithms to compare molecules and evaluate similarity, and a number of heuristics have been developed that allow for rapid similarity searches. In virtual screening scenarios, fingerprints can be used as features. In many cases, the entire fingerprint is employed as a set of independent variables.
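The hashed-fingerprint idea described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the scheme of any particular toolkit: paths are encoded only by atom symbols (bond orders are ignored), SHA-256 stands in for whatever hash a real implementation uses, the 64-bit length is chosen only to keep the example small, and similarity is measured with the Tanimoto coefficient, which the text does not name but which is a common choice for bit-string comparison.

```python
import hashlib

N_BITS = 64  # deliberately small for illustration


def linear_paths(atoms, bonds, max_bonds=3):
    """Enumerate simple linear paths, as strings of atom symbols, up to max_bonds bonds."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    paths = set()

    def extend(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        # A path and its reverse are the same substructure; keep one canonical form.
        paths.add(min(forward, backward))
        if len(path) - 1 < max_bonds:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return paths


def hashed_fingerprint(atoms, bonds, n_bits=N_BITS):
    """Return the set of bit positions switched on by the molecule's paths."""
    on_bits = set()
    for path in linear_paths(atoms, bonds):
        digest = hashlib.sha256(path.encode("utf-8")).digest()
        on_bits.add(int.from_bytes(digest[:4], "big") % n_bits)
    return on_bits


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of on-bits."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 1.0


# Heavy-atom graphs written by hand: ethanol (C-C-O) and 1-propanol (C-C-C-O).
ethanol = hashed_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = hashed_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
print(tanimoto(ethanol, propanol))
```

Because several paths can hash to the same position, a set bit does not identify a unique substructure; that is the trade-off relative to keyed fingerprints.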

Since using the entire fingerprint as a set of independent variables can lead to poor predictive models, some sort of feature selection is usually employed, and there have been studies that address the problem of identifying subsets of bits from a fingerprint that lead to optimal performance in predictive models (albeit usually in a target-class specific manner).

Descriptors for polymeric systems

These systems represent a challenge for descriptor development.

There have been reports that employed traditional small-molecule descriptors that consider the polymer in terms of the individual monomers. A number of approaches have been described that include both deterministic, geometry-derived descriptors and stochastic methods. For the case of polypeptides, the descriptors generally tend to be based on the individual amino acids.

Examples include the Vector of Hydrophobic, Steric, and Electronic (VHSE) family of descriptors described by Mei et al., a set of side chain descriptors from Collantes and Dunn, field-based descriptors, and the AAindex database, which allows rapid lookup of amino acid properties.

Descriptors for inorganic compounds

Inorganic materials are sufficiently different from small molecules that traditional descriptors are not very useful. A number of recently described descriptors for inorganic materials are very similar to constitutional descriptors - essentially counts of various crystal parameters such as accessible surface areas and volumes. Other descriptors characterize the periodic voids of porous materials via tessellation methods.

Other methods have been developed for crystal structures, for example using radial distribution functions, which capture the packing of organic crystal structures. Such approaches have been used to cluster crystal structures and compare properties.

Bioactivity as a descriptor

While the descriptors mentioned so far are derived from some form of chemical structure, an alternative approach is to employ observed biological activities of a molecule as descriptors of that molecule. This approach was taken by Sedykh et al., who hypothesized that data points from high throughput dose-response assays could be employed as biological descriptors.

They observed that when such descriptors were combined with traditional chemical structure descriptors, the predictive performance of models developed using the combined set was better than that of models developed using conventional approaches. A similar approach can be taken using the Prediction of Biological Activity Spectra for Chemical Substances (PASS) methodology, though this is an indirect method - one must predict the PASS profile from the chemical structure, which can then be used in subsequent predictive models.

3 What is a Useful Descriptor?

Given that we can generate thousands of descriptors, it is critical to ask: what makes a descriptor useful?

Fundamentally, a descriptor must correlate structural features with some physicochemical property and show minimal correlation with other descriptors. In addition, a generally useful descriptor will be applicable to a broad class of molecules. Furthermore, a descriptor should show minimal degeneracy. That is, the descriptor should generate different values for structurally different molecules, even if the structural differences are small.

An example is the case of stereochemistry. Most constitutional and topological descriptors that are not designed to take this into account will generate the same numerical values for all the stereoisomers of a given chiral molecule. We would consider such a descriptor to be degenerate. In addition to showing minimal degeneracy, a descriptor should be continuous, in the sense that small structural changes should lead to small changes in the value of the descriptor.

The requirements listed here ensure that subsequent analyses involving descriptors are robust. But there are other aspects of a descriptor that can make it more or less useful. For example, a descriptor that can be rapidly calculated and is not dependent on experimental properties can be considered more useful than one that is computationally intensive and/or dependent on experimental results.
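Two of the criteria above, low degeneracy and low correlation with other descriptors, are easy to check numerically. The sketch below is one possible way to do so and is not taken from the article: the function names, the 0.95 threshold and the greedy pruning order are all assumptions. The small data set (n-butane, isobutane, n-pentane, n-hexane) was worked out by hand for illustration.

```python
from collections import Counter
from math import sqrt


def degeneracy(values):
    """Fraction of molecules whose descriptor value is shared with another molecule."""
    counts = Counter(values)
    return sum(1 for v in values if counts[v] > 1) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Greedily keep descriptors whose |r| with every kept descriptor is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Rows: n-butane, isobutane, n-pentane, n-hexane.
columns = {
    "heavy_atoms": [4, 4, 5, 6],
    "mol_weight": [58.12, 58.12, 72.15, 86.18],
    "wiener": [10, 9, 20, 35],
    "branch_points": [0, 1, 0, 0],
}

print(degeneracy(columns["heavy_atoms"]))  # 0.5: the two C4 isomers collide
print(degeneracy(columns["wiener"]))       # 0.0
print(drop_correlated(columns))            # ['heavy_atoms', 'branch_points']
```

On this tiny set the molecular weight and Wiener index are both pruned as near-duplicates of the heavy atom count, even though the Wiener index is the less degenerate descriptor; which member of a correlated group to keep is a modeling decision that a simple greedy filter does not make for you.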

In addition, it is important in many scenarios that the descriptor have some form of physical interpretability. For example, a model that employs descriptors to predict cytotoxicity will be most useful if it is numerically accurate and also provides insight into which structural features confer toxicity. In such a scenario, abstract descriptors do not allow us to provide a physical explanation of the model's predictions. Another scenario is the ability to map descriptor values back to the structure for visualization purposes, such as the “Glowing Molecule” depictions of Segall et al. Such visualizations are practical only when descriptor values can be related to structural features. Examples include surface-based descriptors such as CPSAs as well as explicit structural descriptors such as paths, chains etc.

However, there are certain scenarios where some of the above characteristics are undesirable. One such scenario is the safe exchange of structures, such that two parties can perform analyses on the structure-derived data without having to know the structures explicitly.

In such cases, the ability to go from a set of descriptor values back to a chemical structure should be minimized. It has not been conclusively proven that such a “one-way” descriptor can be defined. For example, Masek et al. have shown that one can regenerate a target structure (or substructural features) or a set of closely related analogs given a combination of descriptors such as BCUTs. An approach to prevent reverse engineering has been suggested, but it avoids the use of descriptors and instead shares structures of related molecules, rather than the molecules of interest.

One could also employ degenerate descriptors, which would represent a large family of molecules, such that the target molecule would be hidden. This is likely unreliable as it would be difficult to perform analyses (e.g., predictive models) with such descriptors.

This discussion has focused on aspects of descriptors that make them useful in computational models. It is also relevant to note that the utility of a descriptor can be dependent on the problem being considered.

More specifically, certain endpoints may require specific molecular features to be taken into account. Descriptors that do not capture those features will not be useful in developing models for that endpoint. For example, an accurate boiling point model will require descriptors that capture information regarding molecular branching, hydrogen bonding potential and so on. Similarly, a model to predict activity against a cytoplasmic or nuclear target will likely require that molecular characteristics enabling passage across the cell membrane be taken into account.

In such a case, a descriptor characterizing the lipophilicity of the molecule will be useful. These points also apply to other types of computational approaches, such as clustering. In the end, the utility of a descriptor is determined by whether it encapsulates sufficient and relevant information. In general a single descriptor will not satisfy these requirements and thus one usually works with sets of descriptors.

Name | License | Toolkit | Application | Descriptors | Fingerprints
CDK | LGPL | Yes | Yes | Constitutional, topological, geometric, electronic and hybrid | MACCS, PubChem, Kier & Hall EState keys; hashed linear and circular
CODESSA | Commercial | No | Yes | Constitutional, topological, geometric, electronic, QM, thermodynamic | -
Dragon | Commercial | No | Yes | Extensive list of all classes of descriptor types; implements all those described in Ref. | -
JChem | Commercial | Yes | Yes | Constitutional, geometric, topological, electronic, hybrid | Hashed, circular, ECFP/FCFP
JOELib | GPL | Yes | Yes | - | Keys from Xue et al.
Molconn-Z | Commercial | No | Yes | Topological | -
OpenBabel | GPL | Yes | - | Constitutional | MACCS, hashed
PowerMV | GPL | - | Yes | Constitutional, atom pairs, Burden numbers, pharmacophoric | -
RDKit | New BSD | Yes | - | Constitutional, topological, EState, partial charges and VSA combining EState, partial charges and MR | MACCS, hashed, atom pairs, topological torsions

While library calls for descriptor evaluation are the most flexible approach, they are not useful for non-programmers or for users who simply need a descriptor matrix for modeling or searching. As a result, GUI or command line wrappers around descriptor APIs enable easier usage. The CDKDescUI application is a Java Swing application that exposes the CDK descriptor and fingerprint APIs. It allows users to specify a SMILES or SD file, evaluate a selected set of descriptors or fingerprints and export them to a text file.

The application also has a command line mode that can be run in server environments. A similar application is PaDEL, which includes the CDK descriptors and also implements a number of additional descriptors such as the extended topochemical atom descriptors, McGowan Volume, and the Klekota & Roth substructure counts.

There are also web interfaces wrapping descriptor calculation tools, and some of these interfaces have a programmable feature.

For example, AMBIT2 implements the OpenTox API, providing a REST-like interface to descriptor calculation (see the accompanying figure). Another online application programming interface for descriptor calculation is provided by the SADI platform, which, like AMBIT2, uses semantic web technologies for descriptor calculation.

Figure. Four tools to calculate descriptors: Bioclipse (top left), which allows selecting descriptor implementations from various independent tools; CDK-Taverna (top right), which is an extension to Taverna to calculate descriptors with the CDK; Ambit2 (bottom left), which implements the OpenTox API, providing a REST-based API for descriptor calculation, and wraps various descriptor calculation tools. The bottom right screenshot shows the Microsoft Excel plugin LICSS, which uses the CDK for 2D diagrams and descriptor calculation. This screenshot is provided by Kevin Lawson, Syngenta, UK, and reproduced with permission.

In addition to libraries and their wrappers there are various GUI applications that allow easy descriptor calculations. Examples include Bioclipse, MOE (Chemical Computing Group) and Maestro (Schrödinger, Inc.) as standalone tools, as well as LICSS, a plugin to Microsoft Excel (see the figure above). Workflow tools are also a useful class of applications for descriptor generation and allow one to easily generate descriptors from multiple sources (see the same figure). Given the focus of this article on Open Source implementations, we have noted the availability in the table above. While a number of implementations are not strictly Open Source according to OSI definitions, the fact that they provide free academic licenses does allow them to be used somewhat freely.

Due to our experience with the CDK project we provide a brief overview of the architecture of the descriptor calculations implemented by the CDK.

First, the toolkit explicitly differentiates between atom, bond and whole molecule descriptors. There are a total of 30, 6 and 51 descriptor algorithms for atoms, bonds and molecules respectively. Note that each individual descriptor algorithm may calculate multiple descriptors. For example, the CPSA descriptor evaluates 30 individual descriptor values.

Thus for a given molecule it is possible to generate approximately 300 descriptor values.

Every descriptor class implements an interface allowing it to return values in a uniform manner, viz., via a DescriptorValue object. This class represents the result of a descriptor calculation and includes both the descriptor values (via an IDescriptorResult object) and metadata on the descriptor itself via a DescriptorSpecification object (see below for more details). The descriptor classes are designed such that in the event of a calculation error (invalid, missing or nonsensical input), null values are returned rather than an exception being thrown. This is useful during automated descriptor calculation, where one desires a “rectangular” descriptor matrix, from which one could filter out the undefined descriptors.
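The "return an undefined value instead of throwing" design can be illustrated independently of the CDK. The following is a Python analogy, not the CDK's Java API: the two toy descriptor functions, the dictionary-based molecule representation and the use of NaN as the undefined marker are all assumptions made for this sketch.

```python
import math


def heavy_atom_count(mol):
    return float(len(mol["atoms"]))


def mean_atomic_mass(mol):
    # Tiny lookup table; unknown elements or empty molecules make this fail.
    masses = {"C": 12.011, "N": 14.007, "O": 15.999}
    return sum(masses[a] for a in mol["atoms"]) / len(mol["atoms"])


DESCRIPTORS = {"heavy_atoms": heavy_atom_count, "mean_mass": mean_atomic_mass}


def safe_evaluate(func, mol):
    """Return NaN instead of raising, so every molecule yields a full-width row."""
    try:
        return func(mol)
    except (KeyError, ZeroDivisionError, TypeError):
        return math.nan


def descriptor_matrix(mols):
    """One row per molecule, one column per descriptor, always rectangular."""
    return [[safe_evaluate(f, m) for f in DESCRIPTORS.values()] for m in mols]


def drop_undefined_columns(names, matrix):
    """Remove descriptors that are undefined for at least one molecule."""
    keep = [j for j in range(len(names)) if not any(math.isnan(row[j]) for row in matrix)]
    return [names[j] for j in keep], [[row[j] for j in keep] for row in matrix]


mols = [{"atoms": ["C", "C", "O"]}, {"atoms": ["C", "Si"]}, {"atoms": []}]
matrix = descriptor_matrix(mols)
kept_names, kept_matrix = drop_undefined_columns(list(DESCRIPTORS), matrix)
print(kept_names)   # ['heavy_atoms']
print(kept_matrix)  # [[3.0], [2.0], [0.0]]
```

Because a failed calculation never aborts the loop, a batch run over many molecules completes and the undefined columns can be dealt with afterwards in a single filtering step.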

While the user can pick and choose specific descriptors to calculate, the API provides a class (DescriptorEngine) to evaluate all or a subset of descriptors for a molecule in one go. Currently the descriptor implementations are not thread-safe, making them unsuitable for parallelization. This is a drawback since many of the descriptors can be slow. For example, the χ descriptors are implemented using SMARTS matching to identify the various paths and chains described by Kier et al., rather than explicitly walking the molecular graph.

4.1 Comparing implementations

Given the multiple implementations of many descriptors, it is natural to compare their performance. As noted previously, it is desirable, but unlikely, that implementations of the same descriptor using different cheminformatics toolkits will give identical results.

Gupta et al. employed SMARTS-based descriptors from the CDK and MOE to develop decision tree models to predict human liver microsomal metabolic stability. Their results indicated very similar performance between the two implementations. In general, one would expect a high degree of correlation between different implementations of well-defined descriptors.

Here “well defined” indicates that the underlying parametrization or chemistry model is completely specified, so that different implementations use them identically. For example, correct implementations of the TPSA descriptor are expected to correlate very well, and this is shown in the accompanying figure for the ACD Labs and CDK implementations.

Figure. A comparison of Topological Polar Surface Area values generated using the CDK and ACD Labs software, for 57,857 molecules taken from PubChem AID 1996.

A number of physicochemical descriptors commonly employed in predictive modeling are themselves models of an experimental property. For example, log P can be experimentally measured, and a number of algorithms have been developed to predict log P. Given the utility of log P in drug discovery, a number of implementations are available. Given that this descriptor is a surrogate for an actual experimental property, it is reasonable to compare calculated log P values from different implementations to the experimentally observed values, rather than to each other. The accompanying log P figure compares computed log P values from the CDK (specifically, an implementation of the XlogP method), ChemAxon and ACD Labs for a set of 10,000 molecules taken from the logPstar dataset.
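The two kinds of comparison described above, implementation against implementation and implementation against experiment, can be sketched as follows. All numbers here are made-up placeholders chosen only to exercise the code; they are not values from the CDK, ACD Labs, ChemAxon or the logPstar dataset, and the use of Pearson correlation and root-mean-square error as the summary statistics is an assumption of this sketch.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation, used to compare two implementations of one descriptor."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rmse(predicted, observed):
    """Root-mean-square error, used to compare a calculated property to experiment."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


# Hypothetical TPSA values from two implementations of the same well-defined descriptor.
tpsa_impl_a = [20.2, 37.3, 63.6, 40.5]
tpsa_impl_b = [20.2, 37.3, 63.3, 40.5]
print(pearson_r(tpsa_impl_a, tpsa_impl_b))

# Hypothetical log P values: each implementation is judged against experiment,
# not against the other implementation.
logp_experimental = [1.0, 2.0, 3.0]
logp_impl_a = [1.1, 1.9, 3.2]
logp_impl_b = [1.5, 2.5, 2.0]
print(rmse(logp_impl_a, logp_experimental))
print(rmse(logp_impl_b, logp_experimental))
```

A high correlation between two implementations says only that they agree with each other; for model-based descriptors such as log P, the error against measured values is the more informative figure.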

4.2 Descriptor Naming & Versioning

As noted above, different implementations of the same descriptor can generate slightly different values depending on the underlying chemistry model, differences in reference values and so on. Even within the same software implementation, different versions of the same descriptor can generate different values. Thus, to maintain reproducibility in studies that employ molecular descriptors, some form of versioning is crucial. The simplest form of versioning is to manually note the version of the implementation used for the calculation and report it along with the results. Though useful, this is not amenable to automation. In some cases, toolkits will provide an API call to retrieve the version information, which can then be associated with a descriptor calculation.

A topic closely related to versioning is descriptor naming.

The preceding sections have discussed a categorization of descriptors and have used a number of commonly used names. When performing calculations in an automated manner, or employing remote services for descriptor calculations, it is useful to be able to refer to descriptors via a standardized naming scheme.

Currently the QSAR modeling community lacks such a scheme, though the CDK, a Java library for cheminformatics, does implement one. The “descriptor specification” approach defines a set of classes that include information on the implementation title, identifier and vendor. The identifier may or may not include a version number. The actual details of a descriptor are listed in a dictionary, along with an optional namespace. The dictionary entry is referenced by the descriptor specification, and this design thus allows one to seamlessly work with multiple implementations of the same descriptor from different vendors and keep track of which version is employed throughout a calculation or workflow. In addition, the CDK provides a hierarchical annotation of descriptors, grouping them into different classes.

Since this annotation is based on the Resource Description Framework (RDF), it allows one to go beyond simple groupings and perform reasoning on descriptors. A simple example is to identify functionally equivalent descriptors (such as simple transformations of a base descriptor, for example the square and cube roots of the gravitational index) in an automated fashion. While this infrastructure is currently only supported by the CDK, adoption of it (or some similar framework) would enable easier interoperability between descriptor implementations and enhance reproducibility.
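To make the descriptor specification idea more tangible, here is a minimal Python sketch of a provenance record carrying the fields named above (dictionary reference, implementation title, identifier and vendor). It is loosely modelled on the description in the text and is not the CDK's Java class; the class names, field names, vendor names and the `example.org` dictionary URI are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DescriptorSpec:
    dictionary_ref: str  # reference to the dictionary entry that defines the descriptor
    title: str           # implementation title
    identifier: str      # implementation identifier; may or may not embed a version
    vendor: str          # who provided the implementation


@dataclass(frozen=True)
class DescriptorRecord:
    spec: DescriptorSpec
    names: tuple
    values: tuple


def same_descriptor(a, b):
    """Records describe the same descriptor if they point at the same dictionary entry."""
    return a.spec.dictionary_ref == b.spec.dictionary_ref


def same_implementation(a, b):
    """Records are directly comparable only if every provenance field matches."""
    return a.spec == b.spec


# Hypothetical dictionary entry and vendors, for illustration only.
TPSA_ENTRY = "http://example.org/descriptor-dictionary#tpsa"
record_a = DescriptorRecord(
    DescriptorSpec(TPSA_ENTRY, "TPSADescriptor", "toolkit-a 1.4.2", "Vendor A"),
    ("TPSA",), (63.6,),
)
record_b = DescriptorRecord(
    DescriptorSpec(TPSA_ENTRY, "PolarSurfaceArea", "toolkit-b 7.0", "Vendor B"),
    ("tpsa",), (63.3,),
)

print(same_descriptor(record_a, record_b))      # True
print(same_implementation(record_a, record_b))  # False
```

Keying on the dictionary reference rather than on the column name is what lets differently named outputs from two tools be recognized as the same descriptor, while the remaining fields record which implementation and version produced each value.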

The framework is, however, actively used by software using the CDK descriptors, such as Bioclipse and Ambit. Moreover, the ideas have been adopted by the CHEMINF ontology for cheminformatics, which is used by the CHESS framework, which also uses the CDK for descriptor calculations.

5 Summary

Molecular descriptors play a fundamental role in cheminformatic and chemometric analyses and, as we have described in this article, it is possible to evaluate thousands of descriptors using a variety of software. Though the bulk of descriptors are well defined in the literature, multiple implementations of the same descriptor can yield different results.

These differences arise from differences in the underlying chemistry models and reference data used by the implementations. As shown by Gupta et al., the performance of predictive models developed using commercial or Open Source descriptor implementations is very similar.

While different tools differ in the specific number and types of descriptors that they calculate, the fact that many descriptors are correlated with others suggests that it does not matter too much which set of descriptors is used in an application. Of course, this cannot be a general rule - specific applications may require a specific set of descriptors. For example, a predictive model for boiling points may not perform very well if branching and intermolecular forces are not taken into account by the descriptors in the model. This is not to say that all descriptor tools exhibit similar performance.

For example, it is clear from the log P comparison that the ACD Labs implementation of log P performs significantly better than other implementations. Though this is not a completely rigorous comparison (the methods underlying the ACD Labs, CDK and ChemAxon implementations are different), it does highlight that certain implementations of a descriptor can fare better than others - especially in those cases where the descriptor is based on a predictive model.

Given these challenges it is imperative that descriptor tools provide access to version information.

This allows the user to provide an exact specification of how the descriptor was calculated. However, one aspect that has not been standardized across different implementations is the naming scheme. In other words, tools that evaluate the same descriptor may name it in a similar but not exactly identical manner. This makes automated comparisons and merging of descriptors from different sources problematic. To address this, the CDK has developed the concept of a “descriptor specification”, which is associated with a descriptor value and includes information on the vendor, descriptor title and a reference to an entry in a descriptor dictionary that contains more details of the descriptor in RDF. While the extra descriptor metadata is currently used to implement a simple classification scheme, it is conceivable that in the future the specification approach could be adopted by multiple vendors, allowing for automated reasoning over descriptor implementations.

In summary, there is no dearth of tools to generate molecular descriptors. Many of them are available under liberal Open Source licenses, though these implementations do not necessarily cover all the descriptors described to date.

The tools range from toolkits (which implies that one must write a program to generate descriptors) to self-contained GUI or command line tools. Given the caveats regarding implementation specific differences, it is important to keep track of provenance information when calculating descriptors to ensure reproducibility.
