Rational in silico drug design of HIV-RT inhibitors through G-QSAR and molecular docking study of 4-arylthio and 4-aryloxy-3-iodopyridine-2(1-H)-one derivative

Human immunodeficiency virus infection and acquired immune deficiency syndrome (HIV/AIDS) is a spectrum of conditions caused by infection with the human immunodeficiency virus (HIV). Antiretroviral therapy (ART) against HIV infection offers the promise of controlling disease progression and prolonging the survival of HIV-infected patients. Reverse transcriptase (RT) inhibitors remain the cornerstone of the drug regimen to treat AIDS. In this direction, by using group-based QSAR study (G-QSAR), identification of the structural need for the development of lead structure with reverse transcriptase inhibition on 97 reported structures was carried out. Docking analysis was performed further and suggested the structural properties required for binding affinity with the receptor. The molecules in the data set were fragmented into six (R1, R2, R3, R4, R5, and R6) by applying the fragmentation pattern. Three G-QSAR models were selected based on the statistical significance of the model. The molecular docking study was performed to explain the structural properties required for the design of potent HIV-RT inhibitors. The statistically validated QSAR models reveal the presence of higher hydrophobic groups containing single-bonded –Br atom, 2 aromatic bonded –NH group with less electronegativity, and entropic interaction fields at R2 essential for better anti-HIV activity. The presence of a lipophilic group at R3, oxygen and sulfur connected with two aromatic bonds at R4, and –CH3 group at R5 was fruitful for reverse transcriptase inhibition. Docking studies of the selected inhibitors with the active site of reverse transcriptase enzyme showed hydrogen bond, Van der Waal’s, charge, aromatic, and π–π interactions with residues present at the active site. The results of the generated models provide significant site-specific insight into the structural requirements for reverse transcriptase inhibition during the design and development of novel anti-HIV compounds. Molecular docking study revealed the binding interaction between the ligand and the receptor which gave insight towards the structure-based design for the discovery of more potent compounds with better activity against HIV infection.


Background
Human immunodeficiency virus (HIV) attacks the body's immune system, specifically the CD4 cells (T cells), which help the immune system fight off infections. HIV reduces the number of CD4 cells in the body, making the person more likely to get other infections or infection-related cancers [1,2]. Over time, HIV can destroy so many of these cells that the body cannot fight off infections and leads to acquired immunodeficiency syndrome (AIDS), the last stage of HIV infection [3]. It is one of the world's most significant public health challenges, particularly in low-and middle-income countries. In 2017, approximately 36.9 million people (35.1 million adults) were living with HIV and 1.8 million people became newly infected, globally. Nearly 1 million people died from AIDS-related illnesses in 2017 [4]. The medicine used to treat HIV is called antiretroviral therapy (ART) [5,6]. There are various antiretroviral drugs available in the market such as entry or fusion inhibitors, nucleoside or non-nucleoside reverse transcriptase inhibitors (NRTI/NNRTI), integrase inhibitors (IN), protease inhibitors (PI), and maturation inhibitors [7]. The resistance of the virus to the available antiretroviral drugs is the biggest challenge for ART, and the discovery of new anti-HIV agents to overcome this resistance is continually required [8]. Human immunodeficiency virus (HIV) is a retrovirus because of the presence of an enzyme reverse transcriptase (RT), which possesses both RNA-dependent DNA polymerase (RDDP) and ribonuclease-H (RNase H) activities that work in tandem to convert viral genomic single-stranded RNA to double-stranded DNA by the process reverse transcription and by retrotransposon, mobile genetic elements are then integrated into the DNA of the infected host cell to cause disease [9][10][11]. Hence, reverse transcriptase inhibitors both nucleoside reverse transcriptase inhibitor (NRTI's) and non-nucleoside reverse transcriptase inhibitor (NNRTI's) are active against HIV which inhibits the process reverse transcription that leads to inhibition of formation of DNA from viral RNA ( Fig. 1) for insertion into the host DNA sequence cause treatment of the disease [12,13].
Traditional drug development methods are based on random screening, which has a major disadvantage like a lengthy, expensive, and intellectual method. To overcome these disadvantages in the recent decade, the emergence of computer-aided drug development (CADD) processes are taken place [14]. These methods are run at a lower cost and give a viable option for the screening of potential drug candidates. The number of methodologies is involved in the CADD, out of this quantitative structure-activity relationship (QSAR) is one such efficient chemoinformatic tool, which aims to identify and quantify the relationship between molecular structure and certain physicochemical properties for the development of predictive models [15]. QSAR models like 2D and 3D are having their own merits and demerits. Generated 2D QSAR models are only correlating the physiochemical properties of the molecules with their biological activity. These models do not specify the site at which modification is required for enhancement of activity. For this purpose, 3D-QSAR models are suitable for the prediction of the activity of the compounds based on their 3D grid points generated around the aligned set of molecules. One of the major limitations of the 3D-QSAR method is its dependency on molecular alignment and conformers chosen for the alignment. This becomes crucial when the information of confirmation is absent or when the molecular framework is not rigid. Hence, it is clearly understood that there is a requirement of a QSAR method which will allow flexibility to study molecular sites of interest and not depend upon conformational analysis and alignment of the molecules to provide information about sites and nature of interaction required for activity. Fragment-based G-QSAR has shown promise in current drug discovery and lead optimization. This method involves the calculation and evaluation of descriptors only for the substituent groups or molecular fragments instead of the whole molecule [16]. The biggest advantage of this method is it can be applied for both congeneric and non-congeneric series. G-QSAR method is helpful over 2D-and 3D-QSAR methods because this method provides useful information about the significant substitution sites, their chemical nature, and overall interaction that affects the therapeutic activity of the compounds [17,18]. The focus of the present study was to develop fragmentbased G-QSAR model correlating the biological activity of the molecular fragments with the 2D fragment-based descriptors and molecular docking study to interpret the structural requirement and the mechanism of the binding interaction between the ligand and the receptor on a congeneric series of 97 compounds taken from reported literature. The information derived from such predictive models could be utilized for accurate prediction of biological activities (dependent variables) of molecular fragments based on their chemical structure and properties (independent variables) that can be utilized as the building blocks for designing of novel anti-HIV molecules before synthesis and experimental testing [19,20]. The result of the docking study can improve the binding process of ligands with its receptor and provide insights into the structural features related to the activities of the new drug compounds.

Data set
To perform the present computational study, a data set of 97 compounds having reported IC 50 values was taken from the literature [21]. The selected compounds for the study shared the same activity and assay procedure with significant variations in their structure and potency. Inhibitory potencies of the compounds in the data set have IC 50 value ranges from 0.63 to 19.5 nm which were further converted to pIC 50 by using the mathematical formula given as Eq. 1: The chemical structure of the congeneric dataset was prepared by using MarvinSketch. The conversion of 2D structures to 3D structures and energy minimization of 3D compounds were performed by the help of force field batch minimization modules of the VLifeMDS software [22]. Energy minimization is performed to optimize the molecules up to their lowest stable state of energy. A template which is the representative of the entire molecules of the dataset under study was drawn with the presence of a dummy atom at the substitution sites. The template has 6 substitution sites depicted as R 1 -R 6 in Fig. 2. The study was performed by using VLife MDS version 4.6 from VLife Sciences Technologies Pvt Ltd, Pune, India.

Calculation of descriptors for G-QSAR modeling
This step is performed by using the G-QSAR module of VLife MDS. The common scaffold (Fig. 2) was used as a template for the generation of fragment-based G-QSAR models. The optimized molecules were imported into the QSAR worksheet, and their pIC 50 values were incorporated manually, followed by the calculation of various physiochemical descriptors for the different substitution sites of the compounds [23]. A total of 321 physiochemical descriptors for classes like individual physiochemical descriptors like molecular weight, hydrogen bond donors and acceptors, retention index (chi), atomic valence connectivity index (chiv), chain path count, cluster, path cluster, kappa, estate numbers, estate contributors, and information theory index were calculated for the groups to present at the substitution site in each of 97 molecules presented in Table 1.

Data selection and building of G-QSAR model
The data set was divided into a training set of 75 compounds and a test set of 22 compounds based on Sphere Exclusion algorithms so that the activity of the selected test set is uniformly distributed throughout the activity column of the compounds [19]. Unicolumn statistics is performed for both training and test series to check the spread of data. The results of the unicolumn statistics study are presented in Table 2. From the result, the test set is interpolative, i.e., the activity of the test set is derived within the activity range of the training set. The mean and standard deviation of the training and test sets provide insight into the relative difference of mean and point density distribution of the two sets. The average value of the test set is higher than the training set shows the presence of relatively more active molecules as compared to the inactive ones.
For the building of G-QSAR models, a simulated annealing algorithm (SA) was utilized. After that, multiple G-QSAR models were generated using multiple linear regression (MLR), partial least squares regression (PLSR), and principal component regression (PCR).

Validation of the developed G-QSAR model
For validation of the developed G-QSAR models, the data set is divided into two sets as training and test sets. This division is based on the substitution groups and the inhibition of compounds. The training set is employed to produce the QSAR model, and the test set is used to validate the quality of the developed models [24]. The statistical parameters of the developed models and internal and external validations are adopted for testing the fitness, stability, and predictive ability of the QSAR models [17]. The models are validated by considering many statistical parameters such as the number of compounds in regression (n), the number of variables (k), degree of freedom, squared correlation coefficient (r 2 ), cross-validated correlation coefficient (q 2 ), Fisher's value (F test), and r 2 for the external test set, (pred_r2) for external validation. For the internal predictive ability of the model, leave-one-out (LOO) method is used, showed as the value of q 2 (cross-validated explained variance). External validation of the developed QSAR models is performed by measuring the predictive power of the current models on the external test set by calculating the pred_r2 value as given in Eq. 2, which gives the statistical correlation between predicted and actual activities of the test set compounds where y i , y _ i , and y mean are the actual, predicted activity of the ith molecule in the test set, and the average activity of all the molecules in the test set, respectively.
Internal validation of the developed QSAR models is performed by calculating the q 2 value as given in Eq. 3, which gives the statistical correlation between predicted and actual activities of the training set compounds.
where y i , , and y mean are the actual, predicted activity of the ith molecule in the training set, and the average  activity of all the molecules in the training set, respectively. The generated models were considered having a significant predictivity when the squared correlation coefficient (r 2 ) between physiochemical descriptors (independent variable) and activity (dependent variable) was over 0.6. The developed model possesses significant internal and external predictivity when the crossvalidated correlation coefficient of the leave-one-out method (q 2 ) > 0.6, the correlation coefficient of the training set (pred_r2) > 0.5, and higher value of F test.

Molecular docking study
A molecular docking study is a computational approach for searching a ligand that can fit both geometrically and energetically into the binding site of a target to show biological activity [25]. Docking study helps to predict drug/ligand or receptor/protein interactions by identifying the suitable active sites in the protein, getting the best geometry of ligandreceptor complex, and calculating the energy of interaction for different ligands to design more effective ligands with good binding affinity against RT enzyme [26].

Protein preparation
For protein preparation, three-dimensional crystallographic structures, and the coordinates of the target protein (PDB-ID: 1S6G) having resolution 3.0 Å is retrieved from the RCSB PDB database (https://www.rcsb. org). The protocol for protein preparation was performed by deleting the bounded ligand, inserting missing atoms in incomplete residues, deleting alternate conformations, and modeling the missing loop regions with the help of Biopredicta (homology modeling) modules of the VLife-MDS software [27]. The final 3D structure of RT was evaluated using Biopredicta modules; Ramachandran plot showed that 87.95% of residues presented in most favored regions (Fig. 3). Before performing docking bond, orders of the ligands are assigned, hydrogen atoms are added, and the water molecules which do not involve in the interaction are deleted.

Protein-ligand docking
The molecular docking study was performed by using the Biopredicta tool of the V-Life MDS software version 4.6. The optimized and symmetrical crystalline protein structure got from Protein Preparation Wizard was used for docking study. The energy minimization of the crystal structure was carried out to relieve the steric clashes among the residues due to the addition of hydrogen atoms [28]. Crystallographic water molecules (water molecules without H bonds) were deleted. Hydrogen bonds corresponding to pH 6.8 were added considering the appropriate ionization states for both the acidic and basic amino acid residues. A grid box was generated at the centroid of the active site for docking, and the active site was defined with a 10 Å radius around the ligand present in the crystal structure [29]. To test the docking parameters, all compounds were docked into the catalytic pocket of the RT enzyme (PDB-ID: 1S6Q). Finally, the best-docked structure was selected using the dock score function [30].

Development of G-QSAR model
In the present work, effective G-QSAR model for 97 molecules belonging to the congeneric series with viability in the biological activity is generated by calculating fragment-based molecular descriptors. By using the VLife MDS software, a total of 984 descriptors are calculated, and after removal of invariable columns, a total of 321 descriptors are used to generate G-QSAR models. The sphere exclusion method with a dissimilarity value of + 1 resulted in a training set of 75 and a test set of 22 compounds. The test set was selected as maximum pIC 50 value of the test set compounds if it was less or equal to that of the training set and the lowest pIC 50 value of the test set compound was more than or equal to that of the training set so that the test set has been derived from the maximum-minimum range of training set. Multiple G-QSAR models were built using a simulated annealing algorithm (SA) coupled with multiple  linear regression (MLR), partial least squares regression (PLS), and principal component regression (PCR). Several models were generated, and the best three out of them are selected based on the statistical values like r 2 , q 2 , pred_r 2 , F test, and standard error. The statistical parameters of each model are shown in Table 3.
The fitness plot between actual and predicted activity for training and test set compounds provides an idea about how well these models are trained and how well they predict the activity of the external test set. Further, the distribution curve of actual and predicted activity for training and test set compounds of the developed models is depicting closeness between the actual and predicted activity of the compounds for training and test sets. Descriptors like SK-hydrophobic area, SaaN-count, Chi-3-cluster, SscHE-index, SaaNE-index, Delta epsilon-C, SsBr count, SaaNH count, smr, SssScount, SaaOcount, SsCH3count, 1-path count, Id-average, Chi v3, and Chi v2 are showing contribution on the respective substitution sites.

Interpretation of model-I (SA-MLR)
The G-QSAR model-I was obtained by using a simulated annealing algorithm (SA) coupled with multiple linear regressions (MLR). The correlation equations between activity (pIC 50 ) and the selected parameters are expressed by Eq. 4; The generated model-I was statistically significant with predictivity, r 2 = 73.11%, internal (q 2 ), and external  (pred_r 2 ) validation revealed a predictive power of 68.74% and 73.25% respectively. The fitness plot between actual and predicted activity for training and test set compounds is given in Fig. 4a, which provides an idea about the predictivity of the model for training and test set compounds. Further, the distribution curve of actual and predicted activity for training and test set compounds for model-I are presented in Fig. 5a, b, depicting closeness between the actual and predicted activity of the compounds for training and test set. The contribution of different physiochemical descriptors towards activity is shown in Fig. 6a.

Interpretation of model-II (SA-PLS)
Model-II obtained by simulated annealing algorithm (SA) associated with partial least square regression (PLS) expressed as Eq. 5 explains an improved correlation coefficient of r 2 = 85.24%, internal (q 2 ) and external (pred_r 2 ) The fitness plot and the distribution curve of actual and predictive activity for training and test sets are given as Fig. 4b, and Fig. 5 c and d show that the minimum difference between the actual and predicted values of the compounds is a measure of the high quality of the model. The contribution curve of descriptors towards activity is presented as Fig. 6b.

Interpretation of model-III (SA-PCR)
The third model, model-III, was obtained by a simulated annealing algorithm (SA) associated with principal component regression (PCR) expressed as Eq. 6 explaining the biological activity as a function of some physiochemical descriptors like SKHydrophobi-cArea, SaaNcount, Chi3Cluster, SsCH3count, SaaScount, and SaaOcount at their respective fragmented substitution sites. The generated model has correlation coefficient r 2 = 71.24%, internal (q 2 ) and external (pred_r2) predictive ability of 64.21% and 70.25% respectively.
The fitness plot and the distribution curve of actual and predictive activity for training and test set (Fig. 4c and Fig. 5e, f) shows a correlation between actual and predicted activity. Contribution curve of descriptors towards activity is shown in Fig. 6c.

Molecular docking
The docking studies were carried out for 97 data set compounds into the catalytic pocket of the prepared target RT enzyme (PDB-ID: 1S6Q) by the V-Life MDS software. This study is useful to identify the binding potency and poses of active molecules that reveal the molecular mechanism of action. The compounds during docking showed several poses, orientation, and configurations. Each configuration is characterized by a combined score of van der Waal's forces, hydrogen bonding, pi interaction, charge interaction, halogen bond interaction, and salt bridge interaction. The docking scores of the listed compounds are presented in Table 4. The docking study revealed that interactions were dominated by the hydrophobicity or π-aromaticity due to the presence of aromatic and hetero atomic rings significant for stacking interactions. The docking interaction result of compound 51 given in Table 5 reveals that the interactions were dominated in the region of LYS-374, GLN-572, TRP-573, PRO-574, VAL-609, PHE-610, GLU-945, and GLU-948 amino acid residues due to the presence of the active site in the region (Fig. 7a). The 2D-docked pose of compound 51 in the active site of the target receptor is shown in Fig. 7b.

Discussion
All three models developed during the present study are more significant and found to have good predictivity. Model-I developed by SA-MLR is statistically significant and shows good predictivity for both training and test set of compounds. The contribution plot of descriptors towards pIC50 for model-I shows the positive contribution of SKHydrophopic area (34.03%), indicates the substitution of the higher hydrophobic group like substituted aromatic or heterocyclic rings at R2 increase activity, while the negative contribution of SssSHE-index (− 13.67%) and SaaNE-index(− 2.57%) at R2 position shows -NH2 connected to one bond and -SH group connected with over one single bond are conducive for anti-HIV activity. The positive contribution of H-acceptor count (27.95%) at R4 and XlogP (21.76%) at R3 indicates the presence of Hbond forming elements like oxygen and sulfur at R4, and higher lipophilic group at R3 increases drug activity by increasing its binding with the receptor. G-QSAR equation (model-II) developed by using SA-PLC algorithm shows improved statistical study results for both training and test set of compounds. The contribution plot shows the positive contribution of molecular weight (5.25%) at R1 which indicates that the substitution of the bulkier group at R1 increases drug activity. The positive contribution of descriptors like SKHydrophobic area (24.89%), SsBrcount (1.08%), SaaNHcount (2.38%), and 1Pathcount (4.98%) at R2 shows substitution of a higher hydrophobic group containing more -Br atom connected with one bond, the total number of -NH group connected with 2 aromatic bonds, and fragment R2 of first-order increase anti-HIV activity. Descriptors like DeltaEpsilonC (− 42.91%) and IdAverage (− 7.03%) showing significant negative contribution towards activity indicate a decrease in the contribution of electronegativity and entropic interaction fields of the fragment at R2 increase biological activity. The positive contribution of slogP (2.80%) at R3 suggesting the presence of the lipophilic group at R3 enhances inhibitory activity. Atomic valence connectivity index ChiV3 (− 2.93 %) and ChiV2 (− 2.77%) at R4 contributes negatively towards activity indicates the atomic connectivity index of fragment R4 of order 3 and 2 decrease activity. The descriptor smr (− 2.93 %) at R5 deleterious towards activity shows the presence of a group having low molecular refractivity increase enzyme inhibition. Model-III developed by SA-PCR is also significant in predicting the role of descriptors towards activity for both training and test set. The contribution plot for descriptors and pIC50 shows a positive contribution of SKhydrophobicArea (25.98%) and Chi3Cluster (28.02%), and negative contribution of descriptor SaaNcount (− 6.7%) at R2 means the presence of the group having higher hydrophobicity, 3rd order cluster chi index, and total nitrogen not connected with two aromatic bonds increase anti-HIV activity. The positive contribution of Subaccount (19.81%) and SaaOcount (7.17%) at R4 indicates the presence of oxygen and sulfur connected with two aromatic bonds at this site increases enzyme inhibition activity. Descriptor SsCH3count (12.29%) at R5 contribute positively towards activity, indicates the presence of -CH3 group increases drug activity by increasing its binding efficiency with the target site of the receptor. Molecular docking was performed for these compounds present in the data set with the active site of the target RT enzyme to derive the ligand-receptor interaction mechanism. The 2D-dock pose of compound 51 evident that the generated map of hydrophobic and hydrophilic fields, where benzene ring and heteroatomic (Pyridine and furan) rings are present in the chemical structure, is buried in the hydrophobic pocket. Further docking analysis shows the amide group of fragment R2 of the ligand forms a hydrogen bond between amino (-NH2)    interaction, and amino acid TRP-573 involved in both aromatic and charged interaction with the ligand molecules increase the stabilization of inhibitor at the active site of RT enzyme. The result of the G-QSAR and molecular docking study provided a molecular level understanding to infer that identified compounds are promiscuous and might be a potential inhibitor of RT enzyme.

Conclusion
In the present study, an attempt was made to generate novel fragment-based QSAR (G-QSAR) models for a congeneric series of 97, 4-arylthio, and 4-aryloxy-3-iodopyridine-2(1H)-one derivative with known anti-HIV activity. The whole data set of compounds were divided into training and test sets, and three G-QSAR models were developed by a simulated annealing algorithm (SA) coupled with multiple linear regression (MLR), partial least squares regression (PLS), and principal component regression (PCR). All generated models I, II, and III were statistically significant and provide site-specific clues for the design of new reverse transcriptase inhibitors. Model-II developed by SA-PLS was more significant among all; the result of the statistical parameters were r 2 = 85.24%, q 2 = 69.25%, pred_r2 = 74.21%, the higher degree of freedom (89.35), F test (55.037), and low standard error (pred_r2se = 0.3078) values during validation study for both training and test set compounds fulfilled the conditions for predictive and robustness. This model indicates the presence of higher hydrophobic substituent containing single-bonded -Br atom, 2 aromatic bonded -NH group with less electronegativity, and entropic interaction fields at fragment R2 essential for better anti-HIV activity. Similarly, the presence of a lipophilic group at R3, oxygen, and sulfur connected with two aromatic bonds at R4 and -CH 3 group at R5 increases inhibition activity by increasing binding efficiency with reverse transcriptase enzyme. G-QSAR method allows ease to interpretation, unlike conventional QSAR method which could only suggest important descriptors but does not reflect the site where it has to be optimized for further design of new compounds. Molecular docking results of the G-QSAR-generated compounds showed the interactions between the fragmented groups reported for these compounds and the residues located at the binding site. Dock pose of the selected compound reveals that the presence of a group at fragmented site R1, R4, R5, and R6 are responsible for hydrophobic interaction, and group at R2 is essential for both H-bond and hydrophobic interaction with the amino acid residues at the active site of the target enzyme. The findings got from G-QSAR and docking studies were utilized for designing newer RT-inhibitor anti-HIV agents. It is therefore concluded that the molecular manipulations at appropriate sites suggested by structure-activity relationship data will prove beneficial for identifying particular chemical variation at specific substitution sites and mathematical models for prediction of biological activity of newly designed molecules.