Quantitative Structure Activity Relationship Prediction of Oral Bioavailabilities Using Support Vector Machine

  • Received : 2014.07.16
  • Accepted : 2014.09.18
  • Published : 2014.12.20


A quantitative structure activity relationship (QSAR) study is performed for modeling and prediction of oral bioavailabilities of 216 diverse set of drugs. After calculation and screening of molecular descriptors, linear and nonlinear models were developed by using multiple linear regression (MLR), artificial neural network (ANN), support vector machine (SVM) and random forest (RF) techniques. Comparison between statistical parameters of these models indicates the suitability of SVM over other models. The root mean square errors of SVM model were 5.933 and 4.934 for training and test sets, respectively. Robustness and reliability of the developed SVM model was evaluated by performing of leave many out cross validation test, which produces the statistic of $Q^2_{SVM}=0.603$ and SPRESS = 7.902. Moreover, the chemical applicability domains of model were determined via leverage approach. The results of this study revealed the applicability of QSAR approach by using SVM in prediction of oral bioavailability of drugs.


Oral bioavailability is an important pharmacokinetic property, which is defined as the fraction of an administered dose of drug that reaches the systemic circulation and is a critical property to be considered during the early stages of drug design. Among the absorption, distribution, metabolism and elimination (ADME) properties of a chemical, unfavorable oral bioavailability is indeed an important reason for stopping further development of the drug candidates.1 The early evaluation of ADME properties in drug research has driven the need for large scale screening methods. In vitro and in vivo ADME assays are lengthy, complex and relatively expensive in terms of resources, reagents and detection techniques. Computational methods have emerged during the past decade as a powerful strategy for the prediction of human pharmacokinetics. In this regard, a variety of useful in silico models has been developed with different levels of complexity for the screening of large data sets of compounds, to creating tools that are faster, simpler, and more cost effective than traditional experimental procedures.2 Among theoretical methods, quantitative structure activity relationships (QSAR) approaches have been successfully established to predict the properties/ activities of chemicals from their structural features. The advantage of this approach over other methods lies in the fact that it requires only the knowledge of chemical structure. The main steps involved in QSAR are: data collection, molecular geometry optimization, molecular descriptor generation, descriptor selection, model development and finally evaluation of the model performance.3

There are some reports about QSAR prediction of oral bioavailabilities of chemicals. For example, Andrew et al. proposed a QSAR model that can achieve the coefficient of multiple correlation of R2 = 0.71 by using 85 molecular descriptors.2 In 2004, Tuner et al. developed QSAR model for prediction of oral bioavailabilities of some chemicals by using the artificial neural network (ANN).4 The training and testing correlation coefficients given by the model were 0.736 and 0.897, respectively. In 2007 Moda and the coworkers reports the application of the hologram quantitative structure activity relationships (HQSAR) technique to construct a prediction model for the prediction of human oral bioavailability.1 The correlation between experimental and calculated values of oral bioavailabilities by their model was 0.93 for training set. The main aim of the present work is developing of some QSAR models by using multiple linear regression (MLR), artificial neural network, support vector machine (SVM) and random forest (RF) as modeling techniques to developing better QSAR models.



Data Set

Data set consist the experimental bioavailabilities of 302 structurally divers chemicals that were reported in ref.1 Some of these chemicals (118) are enantiomers that had the same oral bioavailabilities value so one of each pair of enantiomer was deleted from the data set. Then the principle component analysis (PCA) was performed on the remaining 243 molecules to detecting the dissimilar (outlier) chemical. After this step remaining molecules (216) were considered for further investigations, which their names are shown in Table 1. In other words, the final data is a 216×9 matrix, which corresponds to 216 chemicals and 9 descriptors. The distribution of the human oral bioavailability data for the complete data set is presented in form of a histogram in Fig. 1. As can be seen in this figure values of bioavailability (%) are acceptably distributed across the range of values.

Table 1.In the above table the superscript of ((*)) indicate the test set compounds.

Figure 1.Histogram representation of the distribution of human oral bioavailability for the 216 data set compounds.

Data sets were splitted randomly into two separate groups; the training and test sets, which containing, 171 and 45 members, respectively. The training set was used for the generation of the model and the test set was used for evaluation of the predictive power of generated model. The distribution of training and test set among data set space was shown in Fig. 2, which indicates the structurally diverse molecules possessing oral bioavailability values of a wide range were included in both sets.

Figure 2.Data set, training set, and test set composition.

Descriptor Calculation and Selection

In order to developing a QSAR model, structural features of molecules were converted to the numerical code, which were named molecular descriptors. Molecular descriptor is a result of standardized numerical calculation from logical and mathematical interpretation of chemical information, such as chemical formula, molecular structure, interaction and etc., from a molecule.5 The molecular descriptors used to search for the best model were calculated by the Dragon program6 on the basis of the minimum energy molecular geometries optimized by the Hyperchem package 7 based on AM1 semi empirical method. In addition electronic descriptors were calculated by the MOPAC package (Ver.6).8 During developing of models, great care was taken in order to avoid inclusion of highly collinear molecular descriptors. The collinear descriptors encoded similar molecular information, therefore, it was vital to test descriptors and eliminate those with low variation and those which encoded similar information (descriptors with the absolute value of Pearson correlation coefficient above 0.9). Then the most significant descriptors were selected from the pool of remaining molecular descriptors by stepwise multi linear variable selection method. These descriptors were used as inputs independent variables for developing of QSAR models.

Support Vector Machine

Support vector machine (SVM) introduced by Vapnik9 to solve the classification and nonlinear regression problems. SVM, mapped input data into the higher dimensional feature space by the use of a kernel function then linear regression is performed in the feature space. The kernel functions that were used for nonlinear transformation of the input data can be linear, polynomial or radial basis function (RBF), which the later more commonly used in QSAR studies.The performance of SVM depends on the type of kernel function and parameters of C, ε and γ. The parameter of C is regularization constant that represent the tradeoff between minimizing the training set error and maximizing the margin.10 The parameter of ε is sensitive loss function and the parameters of γ are used to control the width of radial basis function. The optimization of these parameters was done by simultaneous changing of these parameters and monitoring of the root mean squared error (RMS) of SVM. The values of RMS calculate according to the following equation:

where yk and are the experimental and predicted response and ns is the number of compounds in data set. SVM was used as feature mapping techniques in some QSAR studies such as, estimation of selectivity coefficients of univalent anions for anion selective electrode11 and prediction of pharmacokinetic properties of QSAR Prediction of Oral Bioavailabilities Using SVM drugs,12 modeling of blood brain partitioning behavior of chemicals 13 and prediction of the retention of peptides in immobilized metal affinity chromatography.14




In this work, quantitative relationships between oral bioavailabilities of some drugs and their molecular structural descriptors were investigated by using linear and nonlinear feature mapping techniques. After calculations of descriptors and prescreening of them, the method of stepwise multiple linear regression was performed on the remaining descriptors to select the most important of them, which relate to the oral bioavailability of interested drugs. These descriptors are; number of circuts (nCIR), mean square distance index or Balaban index (MSD), number of Al−O−Ar or Ar−O−Ar or R−O−R or R−O−c=x substructure (where R is any group linked carbon, Al and Ar are aliphatic and aromatic groups, respectively and X is any electronegative atom) (O060), number of oxygen double bonds (O058), number of R−N−R or R−N−x groups (N075), 3D MORSE signal 10/weighted by atomic masses (Mor10m), Broto-Moreau autocorrelation of topological structure lage8 (ATS8m), leverage weighted autocorrelation of lage-6/ unweighted (HATS6u), leverage weighted autocorrelation of lage-8/weighted by atomic van der Waals volumes (HATS8v).

The methods for calculations of these descriptors and the meaning of them are explained in the Handbook of Molecular Descriptors by Todeschini and Consonni.5 The inter correlation among these descriptors are shown in Table 2. As can be seen in this table, there is not any high correlation between selected molecular descriptors. In this study sensitivity analysis approach is used to rank descriptors. 15 This method used to determine how different values of an independent variable will impact a particular dependent variable. According to the results of sensitivity analysis on SVM model, the importance order of descriptors was O058 > Mor10m > HATS8v > N075 > O060 > MSD > nCIR > HATS6u > ATS8m. Brief explanations of these descriptors that were utilized in developing of linear and nonlinear models are found in the following.

Table 2.Correlation matrix between selected descriptors.

The descriptor of nCIR is a constitutional descriptor that indicates the number of circuits in a molecule. Constitutional descriptors reflecting the molecular composition of a compound without connectivity and geometry information. Descriptors of HATS8v and HATS6u are geometry topology and atomic weight assembly (GETAWAY) type descriptors.16 These 3D descriptors encode geometrical information given by influence matrix, topological information given by molecular graph and chemical information from selected atomic properties. The next descriptor is Mor10m that is belonged to 3D MoRSE descriptors (3D molecule representation of structures based on electron diffraction). 3D MoRSE descriptors are derived from infrared spectra simulation using a generalized scattering function and mainly reflect the molecular size and 3D information. Three other descriptors in the model are N075, O060 and O058, which were belonged to atom centered fragment (ACF) descriptors. ACF descriptors represent of a single central atom surrounded by one or several atoms that separated from the central one by the same topological distance, which can describe some terms of element and bonding type in a molecule.The next molecular descriptor is mean square distance index or Balaban index. This descriptor introduce by Balaban in 1983 and encode the size and compactness of a molecule.17 The final descriptor is ATSC8m, which is a topological 2D auto correlation indices and calculate by summing the products of atom weights of the terminal atoms of all paths in the considered path length.18 All of these descriptors can encode topological and electronic aspects of molecules, which their variation effects significantly on the oral bioavailabilities of interested chemicals.


Selected molecular descriptors were used as independent variable to developing QSAR models. The methods of MLR, ANN, SVM and RF were used as feature mapping methods. In order to comparison these models, standard errors (SE) and coefficient of multiple determination (R) calculated from below equation.


where yi and were the experimental, predicted and average experimental values of oral bioavailability, respectively and n is the number of data.

The statistical parameters of these models are shown in Table 3. As can be seen in this table the SVM model is superior over other models in terms of standard errors (SE) and coefficient of multiple determination (R) on training and test set. Therefore this model is further explained in the following.

Table 3.Statistical parameters of MLR, ANN, SVM and RF models

In order to developing of a predictive QSAR based SVM model, SVM’s parameters must be optimized. These parameters are the kernel parameter (γ), sensitive loss function (ε) and regularized constant (C). To finding the optimal value for γ, it was varied in the range of 0.1 to 300 and examine the performance of SVM model to get the minimum of RMS.The value of ε is depended on the type of noise in the data. To find an optimal value for ε, the SE for the SVM models with different ε values (in the range of 0.1 to 1) were calculated and its optimum value was selected based on minimizing of SE. The other parameter is the regularized constant, C. If C is too large, the SVM model will over fit to the training set; likewise, if C is too small, deficient stress will be placed on fitting the training set and it will cause an under fit state on the training set. In order to find an optimal value for C, SE for the SVM mod-els with different C values were calculated and the best value for C was selected based on minimizing of SE. According to this procedure, the optimal values for these parameters were C = 10, γ = 7 and ε = 0.1. Then the developed SVM model was used for prediction of oral bioavailabilities of chemicals in test set as well as training set. These values are shown in Table 1. The standard error of this calculation are 5.70 and 10.46 for training and test set respectively. The predicted values of oral bioavailability were plotted versus their experimental values in Fig. 3 which indicate the good agreement between these values (R2train = 0.95, R2test = 0.86). The residuals of this calculation were plotted in Fig. 4. The random distribution of the residuals around the zero line indicated that there were not any systematic errors in this model.

Figure 3.Plot of predicted versus experimental oral bioavailability.

Figure 4.Plot of residuals versus experimental oral bioavailability.

In order to evaluate the robustness and predictive power of SVM model, the leave eleven out cross validation test was performed and the values of the cross validation correlation coefficient (Q2) and standardized predicted error sum of square (SPRESS) were calculated,19 according to the following formulas:


In the above equations k is the number of independent variables in regression equation. The obtained Q2 and SPRESS were 0.603 and 7.902, respectively, that indicated the robustness of developed SVM model. Y randomization test is another validation approach that must be used to investigate chance correlation among data matrix. In order to performing this test, dependent variable is randomly scrambled and a new model is developed using the original independent variables matrix. The average value of R after 30 times Y scrambling was 0.056, which revealed that the proposed model is well founded and not just the result of chance correlations.

Also the domain of applicability of developed QSAR model was investigated. The applicability domain (AD) indicate the scope of model and define the model limitations with respect to structural domain and response space.20 Through the leverage approach, it is possible to verify whether a new chemical will lie within the structural model domain (in this case predicted data can be considered as interpolated and with reduced uncertainty, hence reliable) or outside the domain (so predicted data are extrapolated by the model and must be considered to have increased uncertainty, hence unreliable). Leverage, hi, is defined as:

where xi is the descriptor row vector of the query compound and X is the n × k−1 matrix of k model descriptor values for n training set compounds. The superscript T refers to the transpose of the matrix/vector. To visualize the AD of QSAR models, the standardized residuals were plotted against leverages in Fig. 5 (William plot). In this plot, the horizontal and vertical lines delineate the limits of acceptable values; the former for the Y outliers (i.e. compounds with standardized residuals greater than 3 standard deviation units) and the latter for X outliers, respectively. The limit of X outliers is determined by their warning hat values (h*) calculated by 3p/n, where p is the number of model variable plus one (k+1), and n is the number of training compounds. As can be seen from this figure, all predictions were reliable, except for number 7, 14, 19, 191 and 202 in training set which is not within the cut off value of h* = 0.175 and the number of 12, 103, 173 and 212 in William plot seems to be outlier.

Figure 5.The plot of standardized residuals versus leverage (William plot).

Moreover, diversity analysis was done on the dataset to make sure the structures of the training or test sets could represent those of the whole ones.21 In this way, a database of n compounds generated from m highly correlated chemical descriptors was considered. Each compound, Xi, is represented as following vector:

where xi,j denotes the value of descriptor j of compound i. The collective data base is represented by an n × m matrix of X:

where the superscript T denotes the vector/matrix transpose. A distance score for two different compounds Xi and Xj (dij) can be measured by the Euclidean distance norm based on the compound descriptors:

The mean distances of one sample to the remaining ones were computed as follows:

Then the mean distances were normalized within the interval of zero to one and the resulting values were plotted against oral bioavailability (Fig. 6). As can be seen from this figure, the structures of the compounds are diverse for both training and test sets. Distribution of selected descriptors for SVM model are shown in the radar chart (Fig. 7). A radar chart is a graphical method that displays multivariate data in the form of a two dimensional chart with several quantitative variables represented on axis starting from the same point. The purpose of a radar chart is to compare m options across n parameters so that audience can be convinced that option A is better than say option B. Radar charts are often used when neighboring variables are unrelated, creating spurious connections so in this work instead of use a column chart for show independent variables, we illustrated them graphically because column chart might look cluttered. Each of the independent variables are in individual axes and each line is drawn connecting the data values for each drug.

Figure 6.The results of diversity test.

Figure 7.Radar plot of the distribution of selected descriptors used in QSAR modeling.

Comparison between statistical parameter of developed SVM model with these obtained by other researchers are shown in Table 4. As can be seen in this table over developed SVM model is superior over other models.

Table 4.abordered multi categorical classification method using the simplex technique; bgenetic algorithm; cconjugate gradient.



Multiple linear regression, artificial neural network, support vector machine and random forest methods were used as feature mapping techniques for modeling and prediction of oral bioavailabilities of some drugs from their molecular structural descriptors. The results show that the SVM model exhibit reliable statistical and prediction performance over other models. The structural descriptors resulted from the stepwise multi linear variable selection method can be used to guide chemists on how to design new drugs. The results of this study revealed that quantitative structure-activity relationship approach has a good applicability for accurate prediction of oral bioavailability.


Supported by : the Korean Chemical Society


  1. Moda, T. L.; Montanari, C. A.; Andricopulo, A. D. J. Bioorg. Med. Chem. 2007, 15, 7738.
  2. Andrews, C. W.; Bennett, L.; Lawrence, X. Y. J. Pharm. Res. 2000, 17, 639.
  3. Yasri, A.; Hartsough, D. J. Chem. Inf. Comput. Sci. 2001, 41, 1218.
  4. Turner, J. V.; Maddalena, D. J.; Agatonovic-Kustrin, S. J. Pharm. Res. 2004, 21, 68.
  5. Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; WILEY-VCH Verlag GmbH: 2000, Vol. 11, p 516.
  6. The Dragon Website.
  7. Hyper Chem Release 7.0 for windows; Hypercube, Inc., 2002.
  8. Stewart, J. P. P. MOPAC 6.0, Quantum Chemistry Program Exchange, vol. 455; India University: Bloomington, 1989.
  9. Cortes, C.; Vapnik, V. J. M. l. R. 1995, 20, 273.
  10. Bennett, K. P.; Campbell, C. J. ACM. SIGKDD. 2000, 2, 1.
  11. Fatemi, M. H.; Gharaghani, S.; Mohammadkhani, S.; Rezaie, Z. J. Electrochim. Acta. 2008, 53, 4276.
  12. Yang, S. Y.; Huang, Q.; Li, L. L.; Ma, C. Y.; Zhang, H.; Bai, R.; Teng, Q. Z.; Xiang, M. L.; Wei, Y. Q. J. Artif. Intell. Med. 2009, 46, 155.
  13. Golmohammadi, H.; Dashtbozorgi, Z.; Acree Jr., W. E. J. Pharm. Sci. 2012, 47, 421.
  14. Kermani, B.; Kozlov, I.; Melnyk, P.; Zhao, C.; Hachmann, J.; Barker, D.; Lebl, M. J. Sens. Actuators, B. 2007, 125, 149.
  15. Fatemi, M. H.; Dorostkar, F.; Ghorbannezhad, Z. J. Monatsh. Chem. Chem. Mon. 2011, 142, 1061.
  16. Consonni, V.; Todeschini, R.; Pavan, M. J. Chem. Inf. Comput. Sci. 2002, 42, 682.
  17. Balaban, A. T, 1983. J. Pure & Appl. Chem. 1983, 55, 199.
  18. Broto, P.; Moreau, G.; Vandycke, C. J. Med. Chem. 1984, 19, 66.
  19. Gramatica, P. J. QSAR. Comb. Sci. 2007, 26, 694.
  20. Tropsha, A.; Gramatica, P.; Gombar, V. K. J. QSAR. Comb. Sci. 2003, 22, 69.
  21. Maldonado, A. G.; Doucet, J.; Petitjean, M.; Fan, B.T. J. Mol. Diversity. 2006, 10, 39.
  22. Zhu, J., Wang, J.; Yu, H.; Li, Y.; Hou, T. Chem. High Throughput Screening. 2011, 14, 362.
  23. Turner, J. V.; Glass, B. D.; Agatonovic Kustrin, S. J. Anal. Chim. Acta. 2003, 485, 89.
  24. Wang, J.; Krudy, G.; Xie, X. Q.; Wu, C.; Holland, G. J. Chem. Inf. Model. 2006, 46, 2674.
  25. Kumar, R.; Sharma, A.; Varadwaj, P. K. J. Nat. Sci. Bio. Med. 2011, 2, 168.

Cited by

  1. Computational approaches to find the active binding sites of biological targets against busulfan vol.22, pp.6, 2016,