Molecular docking and QSAR theoretical model for prediction of phthalazinone derivatives as new class of potent dengue virus inhibitors

Dengue fever is a key public health unease in various tropical and sub-tropical regions. The improvement of existing agents that can inhibit the dengue virus is therefore of utmost importance. In this work, the QSAR study was carried out on 25 molecules of phthalazinone derivatives which have been reported to possess excellent dengue virus inhibitory activity. Density functional computational technique was used in the optimisation of the molecules with the basis set at theory level (B3LYP, 6-31G*) respectively. The multiple linear regression (MLR) model was built using genetic function approximation (GFA) in the material studio software package. Also, in this study, molecular docking simulation was carried between dengue virus serotype 2 protease (PDB CODE: 6mol) and some selected phthalazinone derivatives (compounds 1, 2, 7, 11, and 21). The model was robust as evidenced by validation and robustness statistical parameter which include predicted R2pred., adjusted R2adj., cross-validated Q2 and R2 regression coefficient, etc (R2pred. = 0.71922, R2adj. = 0.939699, Q2CV = 0.905909, R2 = 0.955567) respectively. The molecular docking studies conducted in this study have outlined the binding affinities of the selected compounds (1, 2, 7 11, and 21) which are all in good correlation with their respective pIC50 values. The free binding affinities of the selected compounds were found to be (− 8.7, − 8.8, − 8.7, − 8.3, and − 8.9 kcal/mol) respectively, compound 21 with the binding affinity of − 8.9 kcal/mol had the best binding free energy with the protease relative to other compounds under consideration. The MLR-GFA model study alongside with the molecular docking analysis has essentially provided a valuable and in-depth understanding as well as knowledge for the development of novel chemical compounds with enhanced inhibitory potential against the dengue virus serotype 2 (DNV-2). Hence, the developed model can be applicable in predicting the anti-dengue activity of a new set of chemical compounds that fall within its applicability domain.


Background
Dengue infection is a mosquito-borne infection caused by a virus called Dengue virus (DNV) a member of the Flavivirus found predominantly in tropical and subtropical areas around the world [1]. DNV spreads among humans by the infected Aedes aegypti or Aedes albopictus specifically female of the Aedes genus [2,3].
They are classified into four different but closely interrelated serotypes (DNV-1, DNV-2, DNV-3, and DNV-4). Infection with one serotype confers life-long immunity; however, secondary infection by a different serotype can increase the risk of developing severe dengue, because cross-immunity to the other serotypes is only partial and temporary [4].
Dengue virus serotype 2 (DNV-2) is responsible for the major infections and accounts for the largest death rate and hence considered as the virulent strain among other serotypes of the four [5]. The risk of developing dengue shock syndrome (DSS) and dengue hemorrhagic fever (DHF) is associated with infection by multiple serotypes as a result of antibody-dependent enhancement [6].
The dengue virus has about seven non-structural (NS) proteins, NS-1, NS-2A, NS-2B, NS-3, NS-4A, NS-4B, and NS-5. The Flavivirus proteases are evolutionally conserved and exceedingly stable. In NS-3, there is the presence of N-terminal serine protease domain, although inactive but becomes active upon complexation with NS-2B. Also, this protease can assume a structural conformation of either open or closed. In the closed state that is catalytically active, NS-2B is completely tied round NS3 and becomes a component of the active site. In the open and inactive conformation, NS-2B has partially bound to NS-3 and far away from the active site and hence inactive. The highly conserved Flavivirus NS-2B/NS-3 protease is necessary for viral replication and hence a druggable target [7].
In recent times, dengue infection has been reported in the Caribbean region, South America, and Europe [8]. Thus, DNV infection constitutes a serious threat globally. Approximately 40 to 100 million people are infected by DNV annually and more than 50% of the population of the world are at high risk of the infection by this virus [8]. These infections, in some persons, can progress into a more acute stage known as (DHF) and (DSS) [9][10][11], thus constitute a serious fatal threat in major dengue cases, around 2.5% from 500,000 clinical cases [9].
Despite these fatal consequences of DNV infection likewise the possible imminent outbreaks, there have been no antiviral drugs to prevent or treat DNV infections [12][13][14]. This problem is also worsened by the long-lasting dispersal of these viruses to diverse geographical regions as foretold more than 20 years ago [15]. The present certified dengue vaccine, Dengvaxia, has upraised alarms about the efficacy and increased danger of severe syndrome for seronegative persons at the phase of clinical trials [16].
Anti-dengue potentials of synthetic and medicinal plants have been described in the literature [17,18].
Searching of biochemical libraries of these compounds is a real stride in the right track for the design of potent drug candidates against these viruses.
Quantitative structure-activity relationship (QSAR) is essential in drug improvement as it investigates the properties of the drug through its models which characterize mathematical equations correlating the response of chemicals (i.e., biological activity) with their structural and physicochemical information in the form of numerical quantities named descriptors [19]. QSAR studies are directed at developing correspondence models through a response of chemicals and chemical information data in a statistical approach.
For the reliability of QSAR models, they are subjected to various authentication tests to check for the consistency of the developed correlation models. After its development, a QSAR model is usually verified by employing multiple statistical validation strategies giving an estimation of its predictive strength and stability [19,20].
QSAR analysis is an effective process for improving lead compounds and designing new drugs of the desired property. It is also used in predicting the biological activity of compounds based on the molecular descriptors of compounds recognized in the appropriate mathematical models.
The goal of this investigation is to obtain a model, to forecast the activity of the selected dataset and hopefully able to predict new compounds with improved activities capable of mitigating dengue viral replication.
A better understanding and insight of the structural necessities for the design of effective and specific inhibitors against flaviviral protease would contribute to the development of targeted therapies for infections by these viruses.

Data collection
The dataset used in this study was phthalazinone derivatives reported in a published literature to experimentally possess anti-dengue activity [21]. It is recommended that biological activities values such as IC 50 and EC 50 used for the building of any given QSAR model should be obtained from the same species using the related procedures [22,23].

Molecular geometry optimization
The two-dimensional (2-D) structures of the obtained compounds presented in Table 1, 2, 3 and 4 were drawn using the ChemDraw software [24]. The spatial conformations of the compounds were exported from 2-D structure to three-dimensional (3-D) structure using the Spartan 14 V1.1.4 by Wavefunction programming package. The 3-D structures were geometrically optimized by minimizing energy. In the process, the chemical structures were first of all minimized by a molecular mechanics force field to remove tension energy of the molecules' conformation. Density functional theory (DFT) technique was further employed using the Becke's three-parameter exchange functional (B3) hybrid alongside the Lee, Yang, and Parr correlation functional (LYP), termed as B3LYP hybrid functional, for thorough geometric optimization of the structures. The Spartan files of all the optimized molecules were then saved in SD file format, which is one the readable input format in PaDEL-Descriptor software [25].

Biological activities (pIC 50 )
The obtained biological activities of phthalazinone derivatives against cytoplasmic DNV-RNA replication measured in IC 50 (μM) were converted to the logarithm unit (pIC 50 ) using the Eq. (1) to increase the linearity of activity values and approach normal statistical distribution. The observed structures and the biological activities of these compounds are presented in Fig. 1a-d and Tables 1, 2, 3 and 4 respectively.

Molecular descriptor generation
Molecular descriptors which are the mathematical values describing the properties of a molecule we determined. Quantum chemical descriptors calculation for all the 25 molecules of phthalazinone derivatives were calculated using PaDEL-Descriptor software V2.20. A total number of about 1870 molecular descriptors were calculated and combined with those obtained from the 3-D structure by the Spartan program software.

Splitting of data-set into modeling train and external evaluation test sets
To build the QSAR models, the data set which is the chemical compound was separated into two sets in the ratio of 80:20, the train set and test set respectively. The train set is used for building the QSAR model; it contains 80% of the entire chemical compounds under consideration. While the test set which constitutes the remaining 20% of the total chemical compound data set was not involved in the building of the QSAR model but to ascertain the analytical quality of the built model [26].

MLR-GFA model building
Statistical analysis by genetic function approximation (GFA) techniques of the Material Studio software 8.0 version was used to build the models based on multiple linear regression (MLR). The MLR is used to establish a direct relationship between a dependent variable Y (pIC 50 ) and independent variable X (molecular descriptors). The model fits well such that sum of the square difference between the experimental and predicted pIC 50 values is lessen. In regression analysis, a contingent mean of dependent variable (pIC 50 ) Y relies on (Descriptors) X. MLR examination utilizes and also lengthens this idea to combine more multiple autonomous variables, and regression equation assumes the form: where Y is the dependent variable, "k"s are regression coefficients for corresponding "x"s (independent variables), and "C" is intercept or a regression constant. The GFA calculates a fitness function identified as a lack of fit (L.O.F). This fitness function is not used by the system for indicating equations that are the best model, rather estimate the superiority of the models previously built by the system thus helping in deciding the models to use based on quality. Quality of the model is inversely proportional to the LOF value and it is computed using the mathematical expression: In this equation, LSE is the least-squares error of the model, c is the number of descriptors in the model, d is the smoothing parameter (which has a default value of 1.0), p is the sum of all descriptors, and M is the total number of compounds involved in the model building [27].

Model quality assessment
Predictive capacity and the robustness of the developed model was appraised internally and externally using statistical parameters such as R 2 (square correlation coefficient), Q 2 CV (cross-validation coefficient), R 2 pred. (external test set correlation coefficient), cR 2 p (coefficient of determination for Y-randomization), etc. The statistical validation parameters were compared with the minimum value suggested for a generally satisfactory QSAR model [28] presented in Table 5.

Validation of the QSAR model
The authentication of a QSAR model is mainly accomplished based on the chemical compound used in model development. It comprises activity estimate of the studied compounds and subsequent estimation of some validation parameters for verifying the accuracy of model predictions capacity. To judge the quality and goodness-of-fit of the model, internal validation is an ideal technique. Internal validation, which is regularly used to select a better model among contending models, was done using the data that create the models. The following internal validation parameters were calculated: the cross-validated squared correlation coefficient (R 2 CV or Q 2 ): Y obs. and Y pred. are the experimental and predicted response values respectively and Y obs: is the average of the experimental biological activity value for the train set data. A satisfactory predictive model should have Q 2 value greater than 0.5 [29].
Also, another important parameter R 2 , known as the determination coefficient: is square of the correlation coefficient between the observed and predicted response values of the training set compounds. It is the most used parameter and may be computed based on the following expression: Given that experimental and predicted response values of biological activity have been designated as Y obs . and Y pred . respectively, while Y represents the average response value of the training set. R 2 measures the explanatory power of the model describing the variation in the activity value of molecules used in building the model. A perfect model has an R 2 value of unity (1) and as the value deviates from unity, the fit quality of the model declines. A good model is expected to have an R 2 value at least equal to the threshold value of greater than or equal to 0.5 [30].
R 2 adj , known as explained variance: It is an adjusted form of determination coefficient which accounts for the effect of new explanatory variables in the model, by incorporating a degree of freedom to the model [30]. To reflect the described variance in a better way, R 2 adj is the candidate of choice since the inclusion of either relevant or irrelevant independent variables in multiple regression analysis often produces non-decreasing R 2 value [31]. It may be computed with the following expression: where N gives the number of molecules in the data, R 2 is the determination coefficient, p is the number of descriptors in the model, and N-1-p is the degree of freedom [31].
The most essential consideration is the assessment of the generated model is external validation. Usually, prior to generating a QSAR model, the whole data set is shared into the train and test sets based on different algorithms The test set compounds are not involved in the training of the QSAR model and, hence, are used in external authentication procedure. The most recommended criteria for external validation are evaluated. In this, the biological activities of the test set compounds are predicted for determining the predictive power of the model. The most commonly used parameter for evaluating the predictive performance of the model is a coefficient of squared correlation (R 2 pred.) for the test set that is evaluated by the following expression: where Y obs. (test) , Y pred.(test) , and (train) are observed, predicted, and average values of biological responses for test and train sets, respectively. R 2 value varies from 0 to unity (1), and it is recommended that it should not be less than 0.6 [32,33].

Statistical Y-scrambling evaluation
In this evaluation, random MLR models are created by haphazardly shuffling the dependent variable while keeping the independent variables untouched. The new QSAR models are expected to have considerably low R 2 and Q 2 values for numerous trials, which confirm that the developed QSAR models are robust. In the process, a very important parameter, cRp 2 is likewise considered which should exceed a value of 0.5 for scaling through this test as recommended [29].
where R 2 is the square correlation coefficient for the regression analysis of non-randomized model and R 2 r is the average of the square correlation coefficient for the regression analysis of all randomization scores.

Evaluation of the applicability domain of the model
The built QSAR model was also appraised based on the applicability domain (AD) method to prove that the model is robust and reliable to predict the (pIC 50 ) of compounds [34]. The leverage method was involved in defining and describing the applicability domain of models built [28]. Leverage of a given chemical compound, hi, is defined by Eq. 9: where Z is the descriptor matrix and Z T is the transpose of Z, and standardized residual (SDR) was obtained as follows: where the experimental and predicted activity values for either of the datasets are represented by y andŷ respectively and m is the number of molecules in the set under consideration for each case. Also, model AD is defined by the boundary 0 < h i < h* and − 3 < SDR < 3. Meanwhile, h* indicates cautionary leverage value. The cautionary leverage (h*) is also the boundary of values for X outliers and is defined as: where g is the number of descriptors in the model and n is the number of compounds that are comprised of the train set used in building the model. A summary graphical evaluation of the model AD is the plot of SDR versus leverage h i called William's plot was made [35].

Multi-collinearity test
The existence of a high degree of correspondence between the descriptors contained in the best descriptors arrangement reported by GFA was calculated with the variance inflation factor (VIF) value for each descriptor: where R 2 ij is the correlation coefficient of the multiple regression between the descriptor i and the remaining j descriptors in the model [36].

Mean effect
MFj is defined as the mean effect for the considered molecular descriptor j, while bj is the coefficient of the descriptor j, Rij represents the values for the target descriptors of each molecule, and m is the total number of descriptors in the model. The ME values demonstrate the relative implication of a descriptor, associated with other descriptors in the model. Its sign shows the variation direction in the estimations of the model as an effect of the descriptor values.

Molecular docking studies
To gain a detailed understanding of the nature of the interaction of compounds with the DNV-2 NS2B-NS3 protease, molecular docking was accomplished with the help of Auto Dock Vina of PyRx v software tool. The binding energy determination and visual analysis of the docked compound were accomplished using AutoDock Vina of PyRx and Discovery Studio visualization software, respectively. The crystal structure of the DNV protease was obtained from the protein data bank (PDB Code 6mol). All the heteroatoms associated with the receptor were removed from the three-dimensional structure of the DNV-2 (NS2B-NS3) receptor (Fig. 2a) and its structure was minimized, protonated, and saved in PDBQT format. Also, the 3D structures of the optimized compounds were converted to PDBQT format with the aid of AutoDock 4.2 software. The protein-ligand interaction was analyzed and visualized with the aid of Discovery studio visualization software [37].  14)), it can be deduced that the five (5) most significant descriptors includes: ATS6e, AATSC6m, GATS2v, VR1_Dzv, and SpMax3_ Bhv.

QSAR model quality
The plot of predicted pIC50 against experimental pIC50 values is displayed in Figs. 3 and 4, which shows close agreement between the predicted activity of the test set and that of the train set. Table 6 gives a detail view of the numerical values of the train and the test sets as well as the respective predicted value which show minimal residual value thereby entailing good predictive strength of the model. Table 7 provides the statistical internal validation parameters of the model obtained from the material studio program package. Table 8, indicating a robust model evidenced by its parameters.

The result of the Y-randomization test is shown in
The domain within which the models can predict the biological activity (pIC 50 ) and the absence of outlier as well as influential compound is depicted by Fig. 5. Table 9 provides a detailed description of the descriptors in the model as well as their quality in terms of chance correlation and degree of contribution to the model. Table 10 summarizes the docking result presenting the binding scores, protease residue with interaction distance as well as interaction type. Figures 2a, b, 6a-d, and 7a,b showed prepared structure of the target (NS2B-NS3) and 3D structure of the prepared ligand 21, 2D interaction type for ligand 1, 2, 7, and 11 with different amino acids in the active site of protease and 2-D interaction type and Hbond molecular interaction between ligand 21 and the target respectively.

GA-MLR model (QSAR)
The GFA model was successfully built from 20 train set compounds of 25 and 5 descriptors were contained in the model. The built model was subsequently used to predict the biological activity values for both the train and test reported in Table 6. The multiple linear regression of genetic function algorithm (mlr-GFA) was used to produce three models; model (M1) was selected for its statistical significance as the best model with the following statistical parameters values (LOF = 0.088598, R 2 adj = 0.939699, R 2 pred = 0.71922, cR 2 p = 0.749517, Q 2 CV = 0.905909, and R 2 = 0.955567). Nevertheless, the statistical significance of this model is based on the suggested authentication standard as contained in Table 5. Though, based on the model parameters above, the model (M1) has satisfied    Fig. 4. It is also noticeably from Fig. 3 that the calculated values for the pIC 50 were in a pact with those of the experimental value, which entails the absence of error as observed in the model. A good correlation between experimental pIC 50 compared to the estimated pIC 50 of the compounds in the train set molecules was observed as demonstrated by Fig. 3, evidenced by the good correlation value (R 2 = 0.955) in Table 7 which is in agreement with the required validation threshold as suggested in Table 5 is an indication of the robustness of the built model [26,34].

Y-randomization test
The outcome of the Y-scrambling test is depicted in Table 8 in which the values of R 2 and Q 2 are within the standard recommend statistical values. Also, the recommended value of greater than 0.5 was obtained for cR 2 p which shows that the model has good predictive capacity.

Applicability domain
The applicability domain evaluation process as shown in Fig. 5 displays William's plot of the dataset, for which standardized residuals for both the train and test dataset were plotted against their respective leverage values identified no outlier for the compounds as all the data points were inside the limit of ± 3 domain. However, one outlier (compound 1) was observed to have exceeded the precautionary leverage of (h* = 0.9). Furthermore, a close assessment revealed that it was not an actual outlier as disregarding this compound did not result in any perfection of the model statistical parameters and predictive strength and as such, it was retained. Furthermore, the other reason for not eliminating this compound from the test set was to avoid the use of excessively slight sort of endpoint values.   Table 9 shows the list of descriptors, descriptions, classes, and other related statistical parameters (VIF) that possess a relevant influence on selected relevant descriptors. For all the five descriptors, the numerical values of the VIF were all less than 10 indicating that the specifications of the model were coronal, and the model's consistency is of great significant [19,[38][39][40].

Descriptors interpretation and mean effect
To  Table 9.
ATS6e is the Broto-Moreau autocorrelation-lag 6/ weighted by Sanderson electronegativities. It measures the strength of the relationship between relative electronegativity of two atoms in a molecule which are separated by 6 bonds; it has a positive correlation coefficient. Increment in its numerical value favors the increase in anti-dengue activity of the compounds. Also, these observations suggest that electronegativity of atoms that made up the compound had a substantial effect on the activity (pIC 50 ).
Also, AATSC6m is the centered Broto-Moreau autocorrelation-lag 6/weighted by mass. It measures the strength of the relationship between relative atomic mass of the atom pairs in a molecule separated by 6 bonds; it has a positive correlation coefficient. Therefore, increment in its numerical value would lead to increment in pIC 50 as well. The BrotoMoreau autocorrelation descriptors (ATS) are given by where n is the atom number, δ ij is the Kronecker delta function (if dij = d, zero otherwise, then δ ij =1), d is the considered topological distance (the lag in the autocorrelation parameters), and wi and wj are the normalized atomic properties for atoms i and j respectively. The normalized van der Waals volume, atomic mass, and electronegativity can be appropriated for the atomic property.
GATS2v which is the Geary autocorrelation-lag 2/ weighted by van der Waals volumes, which has a negative mean effect is suggested to contribute negatively to anti-dengue activity. It is evaluated or determined in the same way as the ATS but with the introduction of Geary coefficient; it measures the strength of the relationship between van der Waals volumes of two atoms in a molecule that are eight bond apart.
The Geary autocorrelation descriptors are given by where Ŵ represents the average coefficient of the considered property for the molecule and Δ is the number of vertex pairs from a distance equal to d.
Furthermore, VR1_Dzv is the Randic-like eigenvectorbased index from the Barysz matrix/weighted by van der Waals volumes. It is negatively correlated to the antidengue activity meaning that decrease in its value enhances the activity of the compounds. They are based on the coefficients eigenvector associated with the largest negative eigenvalue of the distance matrix of a molecule.
However, SpMax3_Bhv is the largest absolute eigenvalue of Burden modified matrix-n 3/weighted by relative van der Waals volumes. It is obtained from modified connectivity matrix whose diagonal element is replaced by relative van der Waals volume of the atoms in the molecules. It also quantifies the topology of the chemical structure on the basis of connectivity of atoms present in the structure. This descriptor contributed the largest in determining the inhibitory activity which suggests its significance in the model as evidenced by its positive mean effect value.  The descriptors with positive mean effect value is an indication that an increase in the value of such descriptor will lead to an increase in the DENV-2 inhibitory activity (pIC 50 ) while a negative value indicates negative influence and as such, decrease in such value will enhance the activity (pIC 50 ) also.

The docking studies
In this study, the molecular docking studies of the phthalazinone with the NS2B-NS3 protease (PDB CODE: 6MO1) was investigated using AutoDock Vina of PyRx and Discovery Studio visualization software for energy grid calculations and visual analysis of the docking pose respectively for a detailed understanding of the nature of the described interaction of inhibitors (compounds 1, 2, 7, 11, and 21) with the DNV-2 protease. The docking studies showed that these compounds docked well with the target and the binding affinity (− 8.7, − 8.8, − 8.7, − 8.3, and − 8.9 kcal/mol) of the 5 ligands under consideration with the target are all in close agreement with their respective pIC 50 values (6.193, 6.21, 6.00, 4.64, and 6.886) for the targets respectively. Amino acids, numerical data of interaction distances, and binding free energies (ΔG) between the compounds, and NS2B-NS3 protease are shown in Table 10. However, the result showed that compound 21 had the best binding interaction than the remaining 4. Compounds 1, 2, 7, and 21 were considered for the docking studies due to their high experimental activity (pIC 50 ) values while compound 11 was included to establish a basis for the variation in the observed biological activity with the ones with the best biological activity based on binding affinity due to its low pIC 50 . Furthermore, it can be seen that in all the compounds docked, their binding energy corresponds with their inhibitory activity which shows that these compounds have great potentials. It can be seen in Table 10 Table 10 also, it is likely to verify among the compounds that the increase in the amount of halogen bond interactions and the type would result in the lowering of binding free energy, which indicates a higher degree of the spontaneity of the interactions, which is also evidenced by the absence of halogen in compound 11 despite having two conventional hydrogen bond as (compound 1), the observed high binding affinity in compounds 1, 2, 7, and 21 could be attributed to fluorine. Also, compounds 2, 7, and 21 form key conventional hydrogen bond with the carbonyl of the carbamate core and the fluorine of the phenyl ring on the phthalazin core in which they both act as a hydrogen acceptor, and this entail relevance of such functional group. Figure 7 depicts the hydrogen bond interaction formed between the compound 21 and the target. The hydrogen bond interaction formed between ligand 21 with the highest binding affinity of − 8.9 kcal/mol suggests that the observed biological activity is not obtained by chance since it forms the most stable complex and as such, it can be utilized as a model compound for improving anti-dengue activity of the phthalazinone derivatives.

Conclusion
The present study targeted to produce a highly predictive MLR-GFA model capable of revealing the structural requirements for the experimental pIC 50 of phthalazinone derivatives against dengue virus; the results from the acceptably validated model showed that the pIC 50 of the studied molecules against dengue virus is determined by the descriptors ATS6e, AATSC6m, GATS2v, VR1_ Dzv, and SpMax3_Bhv. The molecular docking simulation study reveals that among the studied compounds, compound 21 had the best binding energy (− 8.9 kcal/ mol) and the binding energy of all the studied compounds correspond with their dengue virus inhibitory activity (pIC 50 ) and the most common interaction formed with the amino acid in all the studied compound with the receptor are hydrogen and hydrophobic interactions; the presence of fluorine has a significant effect as observed in those with better activity (pIC 50 ). The information provided by the QSAR model may simplify further design of novel and highly potent dengue virus inhibitors. The studies also revealed that the compounds docked well with the targets suggesting that the ligands are efficacious in the treatment of DNV-2 infection.