Quantitative structure-activity relationship, molecular docking, drug-likeness, and pharmacokinetic studies of some non-small cell lung cancer therapeutic agents

Lung cancer has been reported to be among the leading cancer cases in the world. It was also reported to have caused a lot of death every year and accounted for about one-third of the whole cancer deaths in the globe. The main subset of lung cancers that accounts for about 85% of the problems of lung cancer raised above was non-small cell lung cancer (NSCLC). The most common cause of NSCLCs that mostly affects women and cigarette smokers was recognized to be overexpression of epidermal growth factor receptor tyrosine kinase (EGFR TK). Five models on thirty five (35) NSCLC therapeutic agents were developed via quantitative structure-activity relationship (QSAR) technique. The best model among them was selected and reported due to its fitness statistically with the following validation parameters: R2 of 0.8764, R2adj of 0.8370, Qcv2 of 0.7655, R2test of 0.7024, and LOF of 0.3312. Molecular docking was used to elucidate the mode of binding interactions between the thirty five (35) NSCLC therapeutic agents and the binding pose of EGFR tyrosine kinase receptor (3IKA) in this research. Compound 29 was recognized to have the most excellent binding affinity of − 8.8 kcal/mol among others. The drug-likeness and pharmacokinetic properties of all the NSCLC therapeutic agents were predicted using SWISSADME, and none among the molecules under investigation violated more than the permissible limit of the conditions stated by Lipinski’s RO5 filters. Five hit compounds were identified using molecular docking virtual screening. The five (5) hit compounds were further screened and identified compound 16 and 27 as excellent among them using their pharmacokinetic profiles and drug-likeness properties. QSAR technique was used to build five models on thirty five (35) NSCLC therapeutic agents. The best model among them was reported because it is statistically significant with good validation parameters. The molecular docking result has identified five (5) hit compounds. The most common amino acid residues to all hit compounds under investigation were Glu762, Leu718, Lys745, and Val726 which might be responsible for the higher inhibitory activities/binding affinities of the compounds under investigation. Furthermore, these five (5) hit compounds were further subjected to drug-likeness and pharmacokinetic properties prediction to determine which among them have the best pharmacokinetic profile. Compounds 16 and 27 among the hit compounds were observed to have high chance of passive absorption by the gastrointestinal tract while the other three have low tendency of passive absorption. More so, only compounds 16 and 27 have higher bioavailability scores, and none of the two have more than one violation of the RO5 criteria. The cause of efficiency of compounds 16 and 27 might be as a result of good pharmacokinetic profiles and drug-likeness properties possessed by the molecules when compared to other hit compounds.


Background
Among the hurdles faced by medicinal chemists was the discovery of inhibitors for mutant-selective kinase and was among the primary interest for epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors [1]. The treatment of EGFR to control non-small cell lung cancers with the T790M resistance mutation prevails as a vital medical necessity [2].
Lung cancer has been reported to be among the leading cancer cases in the world. It was also reported to have caused a lot of death every year and accounted for about one-third of the whole cancer deaths. The main subset of lung cancers that accounted for about 85% of the problems of lung cancer was NSCLC [3]. The most common cause of NSCLCs that mostly affects women and cigarette smokers was recognized to be overexpression of EGFR tyrosine kinase. It was found in about 10-15% and 30-40% of the population of patients in Caucasia and Asia [3].
NSCLC therapeutic agents manifest a very high response rate in patients with stimulating changes of EGFR. NSCLC therapeutic agents are categorized into two different classes: the first class is reversible NSCLC therapeutic agents (firstgeneration EGFR inhibitors) which include gefitinib and erlotinib. The second class is referred to as irreversible inhibitors and consists of the second and third generation NSCL C therapeutic agents. The second and third generation NSCLC therapeutic agents include afatinib and osimertinib. All these classes of drugs mentioned were designed purposely for the treatment of NSCLC. Most especially, the first-generation reversible NSCLC therapeutic agents were designed to manage EGFR L858R mutations. The secondgeneration irreversible NSCLC therapeutic agents were designed for the treatment of EGFR T790M mutations. And the third-generation irreversible NSCLC therapeutic agents were designed for the medication of EGFR T790M/L790M double mutations [1,[4][5][6].
QSAR is a computer-aided molecular modeling technique which quantitatively relates experimentally determined biological activities (response variable) of a molecule and its physicochemical properties (molecular descriptors) [7]. In addition, QSAR modeling is used to develop a model which could be used to predict the activities of newly designed small molecules [8]. Molecular docking is an in silico virtual screening method applied in computer-aided drug design used to elucidate how ligand and receptor interact with one another using their individual 3D structures [9]. The drug-likeness and pharmacokinetic properties of a drug give an insight on how the body responds to the administration of this drug. Therefore, there is a need to study the druglikeness and pharmacokinetic properties of this drug before it reaches the final (clinical) stage [10].
The aim of this work is to develop a model with good predictive power using QSAR modeling technique, to screen and identify hit among the compounds under investigation (by elucidating the mode of binding interactions between the NSCLC therapeutic agents (ligands) and the EGFR tyrosine kinase enzyme) using molecular docking simulation, and also to predict their druglikeness and pharmacokinetic properties.
2.2 Stable structure generations and structure sketching Before stable structure generations, the sketching of the 2D structures of the studied NSCLC therapeutic agents must be done, and this was achieved using the Chemdraw software version 12.0.2 [13]. The Spartan 14 software was used to transform the 2D structures of the sketched NSCLC therapeutic agents to 3D structures before energy minimization (it is achieved by direct importation of the 2D structures to the interface of the software). Also, prior to stable structure generations, there is need to remove constrain from the generated 3D structures, and this was achieved via energy minimization [14]. Stable structure generation is a process of determining the optimum structure of a compound, and this was performed by utilizing the Spartan 14 software. The determination of the optimum structure of all the NSCLC therapeutic agents was achieved adopting density functional theory method at B3LYP/6-311G* level of theory [15].

Independent variable (descriptors) computation, removal of constant/redundant variables, and data separation
For the computation of the independent variables (descriptors), the most stable structures generated were saved in SDF, a file format that is recognized by the software used in computing descriptors (PaDEL descriptor tool kit) [16]. PaDEL descriptor tool kit was used to compute both fragment count descriptors, topological descriptors, and geometrical descriptors [17]. Pretreatment of data is very vital in QSAR modeling which helps in eliminating constant and redundant descriptors from the data before model development so as to allow GFA select most significant descriptors. In present study, data pre-treatment was performed manually. Another crucial point in QSAR modeling is development of model building (training) and validation (test) sets. As the name implies, model building set is used to develop the model, and the validation set is used in verifying the built model. Data division software retrieved from DTC lab was moreover utilized in splitting the data into model building set which contains 30 molecules and validation set of 5 molecules utilizing Kennard-Stone algorithm in this regard [18].

Building of the model
In developing the models, the actual pGI 50 was used as the response parameter while the descriptors were used as independent parameter. Variable selection is very important in building QSAR models. In view of this, the models were built by adopting multi-linear regression (MLR) analysis using genetic function approximation (GFA) method in which it creates an original population of descriptor sets and determines the most suitable set from it by utilizing evolutionary crossover and mutation speculators which generates a succeeding derivative population of descriptor sets [19]. One of the distinct characteristics of GFA is that it selects highly significant independent variables to generate thousands of models so as to choose the most significant among the generated models [20]. Equation 2 below presents the MLR-GFA equation for the model: where A's are the descriptors, b's are the coefficient of the corresponding descriptors, and C is the regression constant.

Validation of the model built
Validation of QSAR model is of utmost importance. This is why a QSAR model is not considered valid unless it undergoes so many assessment, which if it passes then it can be used. The parameters used in evaluating or validating the quality of a QSAR model were the squared coefficient of correlation for the training set (R 2 training ), adjusted , and squared coefficient of correlation for the test set (R 2 test ). The equations for the mentioned assessment parameters are given below [21]: where x obs. , x pred. , and x training represents the actual, estimated, and mean activities of the model building set. The R 2 value was established to rely on the number of descriptors in the model. Therefore, the R 2 value must be adjusted. The adjusted R 2 is computed utilizing Eq. 4 below: where b represents the number of descriptors used in the model and a represents the number of compounds in the model building set.
where Y exp: ; Y pred: ; and Y are trial, foretold, and the mean inhibition activity values of the training set compounds [22]. The generated model can then be validated externally to confirm its predictive power and reliability. It is achieved using the validation set compounds. The external predictive power of the model was estimated using the expression shown below [23]: where x pred: test and x exp: test are the estimated and actual activities of the validation set, and x Training is the mean of actual activity of the model building set compounds. Due to some reasons, the values of these parameters are okay and important but not enough to justify the reliability of a model [24]. In view of this, the model has to be subjected to other test such as applicability domain, variation inflation factor, and mean effect.
The multi-collinearity of all the independent variables in the reported model is ascertain by computing the variation inflation factors (VIF) for each. The VIF help in identifying whether these independent variables correlate with one another or not. There is no correlation between the descriptors if their estimated VIF values are equal to 1; there is high possibility of accepting the model if their estimated VIF values are between 1 and 5; and if their estimated VIF values are greater than 10, then the model is therefore rejected not accepted [25]. The VIF value can be calculated using the equation below: In order to evaluate the individual contribution and participation of each descriptor to the selected model, the mean effect (ME) of each descriptor is therefore calculated. The equation used in calculating the ME is shown below: where ME represents the mean effect of a descriptor j in a model, the coefficient of the descriptor J is represented by β j in the model and the value of the independent variables for each compound in the training set is d ij , n is the number of compounds in the training set, and m is the number of descriptor that appear in the model [26].

Evaluating of applicability domain
The domain of applicability is studied to ensure the reliability of the prediction of the built MLR model. It is also useful in identifying compounds that are distinct to the training set compounds (influential compounds) or response outliers (compounds with standardized residual outside the square area of the model). The method adopted in this research was the leverage approach which is the plot of the standardized residual against the leverages for both the training and test set compounds. The reported model was subjected to AD using the leverage approach [27].

Docking simulation
For the docking simulation, the virtual screening software used in this research were AutoDock Vina of Pyrex, Discovery studio, and UCSF Chimera on a Dell computer system Latitude E6520 to screen and identify hit compounds by elucidating the binding mode between the binding pose of the target receptor and the NSCLC therapeutic agents.

EGFR tyrosine kinase enzyme and ligand preparation for the docking simulation
The EGFR tyrosine kinase enzyme with pdb code: 3IKA in complex with WZ4002 was downloaded from the Protein Data Bank (https://www.rcsb.org) and used as the target receptor for the NSCLC therapeutic agents in this research. Discovery Studio Visualizer version 16.1.0.15350 was adopted in preparing the EGFR tyrosine kinase enzyme for the docking simulation. The preparation process of the target receptor started by adding hydrogen, then followed by the elimination of co-ligands, water molecule, and heteroatoms from the structure of the target receptor and saved in protein data bank file format. The prepared structure of the target receptor is shown in Fig. 1. The NSCLC therapeutic agents were prepared by saving the already determined optimum structures in 2.2 above saved in protein data bank file with the help of the Spartan'14 wave software [14]. The prepared structure of one NSCLC therapeutic agent among the dataset is shown in Fig. 2.

Execution of the docking simulation
The docking simulation of the NSCLC therapeutic agents into the binding site of the target receptor (Met793, Ala743, Met790, Leu844, Leu844, Leu718, and Val726, these binding sites were determined by visualizing the co-crystalline structure of WZ4002 in the binding site of the enzyme) was carried out using AutoDock Vina of Pyrex software [28]. Re-coupling of the complexes for further investigation was achieved with the help of the UCSF Chimera software [29]. For further investigation of the binding mode interactions of the complexes, a discovery studio visualizer software was used to elucidate the 2D structures of all the reported complexes [30,31].

Drug-likeness and pharmacokinetic property prediction
The drug-likeness and pharmacokinetic properties of these NSCLC therapeutic agents were predicted utilizing a free online web tool (SwissADME) (http://www. swissadme.ch/index.php) used in predicting druglikeness and pharmacokinetic properties of drugs [32]. The input file format for SwissADME is simplified molecular input line entry specification (SMILES) which contains a unit compound by line separated by a space with a title or without a title. The computation can be setup when the molecule is ready by clicking on the "Run" button [32].
Lipinski's rule of five filter is mostly used as the criterion to ascertain whether a molecule is impermeable or badly absorbed. A molecule is considered to be orally bioavailable if it does not violate more than 2 of the RO5 [33].

QSAR study
The results of the QSAR study are given in Tables 2, 3

Docking simulation
The results of the docking simulation are presented in Table 6 and Fig. 6.

Drug-likeness and pharmacokinetic property prediction
The results of the drug-likeness and pharmacokinetic property prediction are shown in Tables 7 and 8 and Figs. 7 and 8 respectively.

QSAR studies
Using the model building set, five (5) different models were built using MLR-GFA method. Among these five models, the best model was selected and reported since it has passed the minimum requirements for the evaluation of a valid QSAR model as reported by Veerasamy et al. as presented in Table 2 [23]. The descriptions of the descriptors contained in the reported model are shown in Table 3. The negative coefficients of MATS8s, GATS5p, and VR1_Dt descriptors clearly indicated their negative contribution to the inhibitory activities of the NSCLC therapeutic agents. It means that when the amount of these independent descriptors is reduced in the structures of these NSCLC therapeutic agents under investigation, there might be an improvement in the potency of these NSCLC therapeutic agents toward their target receptor (EGFR tyrosine kinase enzyme) and reverse is the case. On the other side, the positive coefficient of ATSC8c, minssCH2, RDF120m, and RDF125m descriptors in the model gave the positive contributions of these independent descriptors to the inhibitory activities of the NSCLC therapeutic agents under investigation. It means when the amount of these descriptors in the compositions/structures of these NSCLC therapeutic agents are increased, there might be an improvement in the potency of these NSCLC therapeutic agents toward their target receptor and vice versa.

Description of the descriptors that appear in the reported model
ATSC8c is an average centered Broto-Moreau autocorrelation; the ATS descriptor is a graph invariant describing how the property considered is distributed along the topological structure and can be seen as a special case in which other types of descriptors can also be derived from [34]. The recognized spatial autocorrelation on a molecular graph G is given as MATS8s is a Moran autocorrelation which is applied to a molecular graph. Moran coefficient usually takes value in the interval [− 1, + 1]. Positive autocorrelation corresponds to positive values of the coefficient whereas negative autocorrelation produces negative values [34]. It can be defined as Radial distribution function descriptors (RDF descriptors) were suggested based on a radial arrangement function distinct from that generally used to determine molecular changes I (s) (Hemmer, Steinhauer et al., 1999). The radial distribution function chosen here is that one frequently utilized for the description of the diffraction patterns gotten in powder X-ray diffraction experiments.
Ideally, the radial distribution function of a collection of atoms B may be described as the possible occurrence to obtain an atom in a spherical volume of radius R. The common mode of the radial distribution function is expressed by the equation below Table 4 shows the pGI 50 , predicted pGI 50 , and the residual values for all the molecules under investigation. The high predicted power of the reported model was confirmed by the low residual values observed between the experimental and predicted pGI 50 in the table (meaning that the reported model was reliable with high predicted power). Furthermore, Fig. 3 presents the plot of the predicted pGI 50 versus actual pGI 50 for the test and training sets compounds, the distribution of the predicted pGI 50 and the actual pGI 50 of the test and training set compounds throughout the line reaffirmed the reliability of the model. More so, the R 2 values of both the internal validation (0.8175) and that of the plot (0.8764) agreed with one another which further confirmed the stability and reliability of the reported model. On the other hand, Fig. 4 presents the scatter plot of the residuals against actual pGI 50 in which the unusual occurrence of these residuals of both sets on the upper and lower sides of zero on the plot confirm that the reported model was free from methodological error (systematic deviations).  The correlation statistical analysis on the independent descriptors in the reported model shown in Table 5 indicated that no relationship exists between the descriptors contained in the reported model. This clearly portrayed the high performance of the descriptors used in developing the reported model.
The variation inflation factor (VIF) values were further used to confirm if there is multi-collinearity problem or not in the descriptors of the training set used in building the model. The VIF of all descriptors in the training set were estimated and realized to be within the acceptable range presented in Table 5 (meaning the values are less than 10 for all the descriptors). This confirms the absence of multi-collinearity problem in the descriptors used in building the reported model.
The mean effect (ME) values for all the descriptors were computed to ascertain the participation and individual contribution of a descriptor in opposition to other ones in the selected model and presented in Table 5. The indicator for either increase or decrease in potency of the molecules is the sign of the coefficient of each descriptor in the model. If a descriptor in the model has a positive coefficient it means that an increase in such descriptor may increase the potency of the molecules. But when a descriptor has negative coefficient, it indicates that an increase in such descriptor may decrease the potency of the molecules. Whereas the coefficient of the descriptors indicate the degree of contribution of each descriptor in the model. It is observed that from the model and ME values (Table 5), ATSC8c descriptor gave the highest positive contribution both in the model and ME analysis with + 2.797519677 and + 0.796318. MATS8s gave the lowest negative contribution in both the model and ME analysis with − 1.977464485 and − 0.43128.
The applicability domain (AD) of the reported model was achieved by the plot of the standardized residuals against leverages of both the test and training sets (Williams' plot) as shown in Fig. 5. The AD is carried out to identify compounds with standardized residuals greater than three standard deviation unit (outliers) and compounds with leverage values greater than the warning leverage h* (influential) in the data used in building the model. Apart from that, it is also used to ascertain the   Table 1). Five hit compounds were identified by the virtual screening technique. Compound 29 was identified to have the highest binding affinity of − 8.8 kcal/mol ( Table 6). The compound was seen to have interacted with the binding pose of EGFR tyrosine kinase receptor through hydrophobic bond interaction with Phe795, Gly796, Ala743, Leu844, Leu718, Val726, Ala743, and Lys745 amino acid residues of the EGFR tyrosine kinase receptor. More so, it also interacted with EGFR tyrosine kinase receptor via electrostatic bond interaction with lys745 amino acid residue. Next compound in the trend with higher binding affinity (− 8.7 kcal/mol) was compound 12 as shown in Table 6. The interactions of the compound in the binding pose of the EGFR receptor were through hydrogen bond with UNK1 Arg841, Asp855, Glu762, and Glu762 amino acid residues with bond distances of 2.49587 (Å), 3.7987 (Å), 3.25863 (Å), 3.72606 (Å), and 3.72287 (Å). It also interacted with the active site of the EGFR receptor via hydrophobic interactions with Leu718, Leu718,   Table 6. Figure 6 showed the 2D structures of compounds 29, 12, and 16 in complex with the receptor (3IKA). Based on the molecular docking results, the most common amino acid residues to all of the hit compounds under investigation were Glu762, Leu718, Lys745, and Val726 (Table 6), and these important amino acid residues might be responsible for the higher binding affinity of the reported compounds.

Drug-likeness and pharmacokinetic property prediction
The drug-likeness and pharmacokinetic properties of all the NSCLC therapeutic agents were predicted and presented in Supplementary Table 2 and 3. Based on the results of the molecular docking, the drug-likeness properties of the hit compounds ware reported and presented in Tables 7 and 8. From the table, no molecule among these reported ones violated more than the permissible limit of the conditions stated by Lipinski's rule of five filters (MW˂ 500, HBD ≤ 5, HBA ≤ 10, Log p ≤ 5, and PSA ˂ 140 Å2). As such, these molecules are expected to be very active pharmacologically. The bioavailability radar plot of lipophilicity, size, polarity, solubility, saturation, and flexibility further reaffirmed the drug-likeness properties of all the reported molecules (Fig. 7). The painted pink area shows the range for each property (XLOGP3 between − 0.7 and + 5.0, MW between 150 and 500 g/mol, TPSA between 20 and 130 A 2 , log S not higher than 6, fraction of carbons in the sp3 hybridization not less than 0.25, and no more than 9 rotatable bonds). Based on the condition mentioned, all the molecules might be orally bioavailable even though they were all too flexible and lipophilic. Table 8 presents the pharmacokinetic properties of the reported molecules. From the table, only molecules 16 and 27 have high probability of passive absorption by the gastrointestinal tract while others have low tendency of passive absorption. None of the reported molecules was found to have high probability of brain penetration. Molecules 12, 27, and 29 were predicted to be actively effluxed by P-gp and the other two were predicted as non-substrate of P-gp. Also, only molecules 16 and 27 have higher bioavailability scores (this confirmed the oral bioavailability and permeability of these molecules among the reported molecules and also they have low toxicity level and good absorption properties). The boiled-egg plot (Fig. 8) of TPSA against WLOGP was used to portray the graphical presentation of the brain penetration and gastrointestinal absorption of the reported molecules. From Fig. 8, it can be clearly seen that all the NSCLC therapeutic agents were outside BBB region (yellow) but some were within the GI absorption region (white color) and some were predicted to be actively effluxed by P-gp (blue in color) and then some were predicted as non-substrate of P-gp (red color).

Conclusion
In conclusion, QSAR technique was used to build a model with a very high predictive power on some thirty five (35) NSCLC therapeutic agents. The reported model was found to be statistically fit by passing validation techniques employed on it with the validation parameters: R 2 of 0.8764, R 2 adj of 0.8370, Q cv 2 of 0.7655, R 2 test of 0.7024 and LOF of 0.3312 such as internal and external validations and AD. The molecular docking results showed that the most common amino acid residues to all of the reported complexes were Glu762, Leu718, Lys745, and Val726, and these important amino acid residues might be responsible for the higher inhibitory activities/binding affinity of the  reported compounds. More so, the drug-likeness and pharmacokinetic properties of all the NSCLC therapeutic agents were predicted using SwissADME and indicated that molecules 16 and 27 among the hit have high probability of passive absorption by the gastrointestinal tract while the other three have low tendency of passive absorption and also none of the reported molecules was found have high probability of brain penetration. Also, only molecules 16 and 27 have higher bioavailability scores. Based on this finding, it is suggested that when designing new NSCLC therapeutic agents these hit compounds with good binding affinity and pharmacokinetic profile should be considered for structural modifications. And also, in vivo and in vitro assay for the ADME properties should be validated experimentally.