In silico QSAR and molecular docking simulation of some novel aryl sulfonamide derivatives as inhibitors of H5N1 influenza A virus subtype

This research provides a comprehensive analysis of QSAR modeling performed on 25 aryl sulfonamide derivatives to predict their effective concentration (EC50) against H5N1 influenza A virus by using some numerical information derived from structural and chemical features (descriptors) of the compounds to generate a statistically significant model. Subsequently, the molecular docking simulations were done so as to determine the binding modes of some potent ligands in the dataset with the M2 proton channel protein of the H5N1 influenza A virus as the target. In building the QSAR model, the genetic algorithm task was employed in the variable selection of the descriptors which are used to form the multi-linear regression equation. The model with descriptors, RDF100m, nO, and RDF45p, showed satisfactory internal and external validation parameters (R2train = 0.72963, R2adjusted = 0.67169, Q2cv = 0.598, Rpred2=\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {R}_{\mathrm{pred}}^2= $$\end{document} 0.67295, R2test = 0.6860) which passed the model criteria of acceptability. Docking simulation results of the more potent compounds (ligands 2, 3, and 8) revealed the formation of hydrophobic and hydrogen bonds with the binding pockets of M2 protein of influenza A virus. The results in this study can help to advance the research in designing (in silico design) and synthesis of more potent aryl sulfonamides derivatives against H5N1 influenza virus.


Background
Influenza infection is one of the most commonly known acute viral respiratory disease, which is usually spread by the influenza virus. The influenza infection is also called flu, and there are three (3) key types of the influenza virus namely, A, B, and C which are also classified into different subtypes. The circulating seasonal viruses presently are influenza A (H1N1, H3N2) and influenza B viruses [1]. The European Centre for Disease Prevention and Control (ECDPC) in 2019 published a surveillance report on influenza virus characterization which summarizes the percentage of influenza virus detections in the WHO European Region. The report shows that the cumulative detections have increased from 18,049 to 197,027, with type A (99.1%) prevailing over type B (0.9%) viruses, unlike the 2017-2018 season when type B predominated over type A at the beginning of the season [2]. The monthly risk evaluation of influenza at the human-animal boundary published by the World Health Organization in 2019 showed that several influenza A(H5Nx) subtypes continue to be detected in birds in Africa, Europe, and Asia. In addition, the WHO reported that there are over 860 reported cases of humans infected by the H5N1 virus, and 50% among them are dead since 2003, also establishing that H5N1 virus has higher death rate when compared with other influenza A virus subtypes [1]. There are two (2) main classes of anti-influenza which battles with influenza pandemic and epidemic. However, rimantadine and amantadine are M2 proton channel inhibitors that inhibit uncoating of virus-related ribonucleoprotein [3].
The development of more newly potent compounds is very expensive or high-price trial and it is timeconsuming. Computational chemistry techniques such as computer-aided drug design (CADD) might save time and reduce the cost of synthesis of the new potent drugs [4]. Ligand-based drug design used quantitative structureactivity relationship (QSAR) approach in designing new compounds from the best statistical model containing some structural features in numerical data (molecular descriptors) which predicts the properties of the compound such as activity, toxicity [5]. The molecular docking simulation is also commonly employed in structure-based drug design (SBDD) which predicts ligand's conformation and interactions with the active pocket of a protein or enzyme (receptor). This study focused on combining both molecular docking approach and QSAR modeling method to the assessment of 25 aryl sulfonamide derivatives as a novel of H5N1 inhibitors.

Computer hardware and software
Dell computer system, with processor properties of Intel ® Core i3-6100U CPU Dual @ 2.30 GHz, 12 GB (RAM), was used to carry out this computational study. The software packages installed includes Spartan 14 V 1.

QSAR analysis 2.2.1 Collection of dataset and optimization
Twenty-five (25) already synthesized aryl sulfonamides derivatives together with their tested activity concentrations on H5N1 virus were obtained from the literature [3]. The activity concentrations was measured as the concentration that effectively inhibited the virus plaque formation by 50% (EC 50 ) in μM. The H5N1 inhibitory concentration values of these molecules were further converted to the negative logarithmic scale (pEC 50 = EC 50 /10 6 ) in order to reduce skewness in the activity values [4]. The chemical structures of the aryl sulfonamide derivatives along with their observed logarithmic effective concentration were shown in Table 1.
The molecular structures of the aryl sulfonamide analogs showed above were properly drawn using Chem-Draw Ultra level software V12.0.2, then saved in (*cdx format). Subsequently, the structures were exported to Spartan 14 software so as to compute their equilibrium geometries at ground state with density functional (DFT/B3LYP/6-31G**) in vacuum, starting from the initial molecular geometry [6]. In principle, geometry optimization is an iterative process whereby the energy and its first derivative with respect to all geometrical coordinates are calculated from a guess geometry or starting geometry, then used the information to project new geometry. Thus, the process continues until the lowest energy or optimized structure of the molecule is achieved.

Descriptors computation and data normalization
The twenty-five (25) optimized structures from Spartan 14 were accordingly saved as SDF format, and then exported to PaDEL descriptors software which is a product of Pharmaceutical Data Exploration Laboratory, developed by Yap Chun Wai [6]. This software allows QSAR users to compute diverse molecular descriptors and fingerprints of a molecule, including electrostatic, topological, spatial, autocorrelation, geometrical, constitutional, and thermodynamic descriptors [7]. Data normalization is a technique often applied as part of data preparation and the goal of the normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values scaling from 0 to 1 [8].

Data pretreatment and division
PaDEL descriptor output in MS Excel sheet was subjected to the variable reduction method so as to eliminate constant and highly inter-correlated descriptors based on user-specified variance and correlation coefficient cutoff values using Data Pretreatment GUI 1.2, downloaded from Drug Theoretics and Cheminformatics (DTC) Laboratory website. In order to establish a rational selection of training set and test set, the Dataset Division GUI 1.0 software was used by engaging Kennard-Stone's algorithm division technique [4].

QSAR model generation
In this research, a Small Dataset Modeler version 1.0.0 tool downloaded from the DTC website; https:// dtclab.webs.com/software-tools was used to generate a multi-linear regression model (MLR) based on the genetic algorithm technique in variable selection. This tool is very helpful in performing QSAR modeling, especially for small datasets. It employs an exhaustive double cross-validation approach and a set of optimal model selection techniques including consensus predictions for performing the small dataset QSAR modeling [9].

Internal validation
The QSAR model generated was internally validated using cross-validation technique. In essence, this technique provides adequate information about the predictive reliability of the QSAR equation. The leave-one-out cross-validation technique was adopted in this research and the cross-validated Q 2 cv was evaluated based on expression (4) below [10,11].
where Y tr is the average observed the concentration of training set, Y is the observed concentration, and Y pred is the predicted concentration in the training set respectively.
The squared correlation coefficient (R 2 ) was determined for the comparison between the predicted concentration by the QSAR equation and observed concentrations from the experiment. Furthermore, the R 2 values are proportional to the number of descriptors in the model which is not reliable in determining the predictive response of the model. As such, R 2 is adjusted which is defined as in; where p is the number of the descriptor in the equation and n is number of compounds in the training set.
The numerical changes between the R 2 and R 2 adj are less than 0.3 which shows that the number of descriptors used in the QSAR model is acceptable and vice versa [12]. In addition, it is important to note that good R 2 values are not enough measures for validating the model. Therefore, more parameters must be established to point out the predictive capability of the models. The crossvalidated Q 2 cv is ordinarily smaller than the R 2 value of the QSAR model because of its diagnostic means of evaluating the predictive power of the model [13].

External validation
The optimal combination of the training set and test data (i.e., Compd ID, descriptor matrix, and response) was subjected to MLRplusValidation 1.3 program to internally and externally validate the model generated including the cross-validation method (leave-one-out) and test set validation based on model acceptable criteria. The following statistical features of the test set were proposed by Golbraikh and Tropsha for a robust QSAR model with good predictive potential [7].
where r 2 is the squared correlation coefficient between the observed and predicted activities, r 2 0 is the squared correlation coefficient between the predicted and observed activities, and k and k′ are the regression slopes passing through the origin.
where Ypred test and Y obs test are the predicted and observed activity of test set compounds respectively. Y training is the average values of observed activity of the training set compounds

Development of applicability domain (AD)
The applicability domain (AD) of the developed model is defined as the chemical space of compound structure and response where the model predictions are highly reliable. This method is used to detect structural and response outliers from the test set and training set respectively [10,14]. The leverage approach of determining the applicability domain also known as Williams plot was obtained by plotting a scatter plot of standardized residual and leverage values of both training set and test set. By definition, leverage value is determined based on Eq. (4) [15].
where X(i) represents the vector of descriptors of compound (i), X represents the descriptors matrix, and x T represents matrix transpose of X.
The threshold leverage (F*) is given by the equation: where p is the number of molecules in the training set and m is the number of molecular descriptors used in the model. In addition, compounds with higher leverage scores which are greater than threshold leverage (i > F*) tend to have unreliable predictions. However, compounds whose leverage scores are less than the threshold score (i < F*) and the standardized residuals are not greater than ± 3α (3 standard deviation units) are said to fall within the applicability domain. Similarly, the Euclidean approach of the applicability domain was also determined based on mean distance scores computed by the Euclidean distance. As such, the Uzairu plot was determined by plotting the standardized residuals against normalized mean distance scores whose ranges are from 0 to 1. The normalized mean distance score for training set ranges from 0 which is for least diverse, and 1 which is for the most diverse training set. However, the normalized mean distance score for test compounds with scores outside the range of 0 to 1 is regarded as outliers which imply that the compounds are outside the applicability domain [11].

In silico docking study
The optimized structures of compounds with the best activity were saved in PDB format (Protein Data Bank), and then docked with the NMR structure of H5N1 virus M2 protein with PDB ID: 2KQT. The docking simulation was carried out using Auto Dock wizard of PyRx virtual screening tool [16,17]. Subsequently, the docking results were visualized using the Discovery Studio Visualizer v16.1.0.15350 to study the kind of interactions in the stable complex formed. Figure 1 shows 3D structures of the ligand and target respectively.

In silico QSAR analysis
Based on the genetic algorithm of the descriptors, a multi-linear regression model was developed containing three (3) optimum descriptors using Small Dataset Modeler. The selected MLR-GA model is:  6)), it can be deduced that the three (3) most significant descriptors includes -number of oxygen atoms (nO), radial distribution function − 045/weighted by relative polarizabilities (RDF45p), and radial distribution function − 100/ weighted by relative mass (RDF100m). Table 2 provides a detailed description of the descriptors in the model. The value of R 2 train = 0.72963 and R 2 test = 0.6860 confirms the good extrapolation between the training set and test set respectively. In addition, the QSAR model is robust because of the small difference between R 2 and Q 2 cv (< 0.5%). The plot of predicted pEC 50 against experimental pEC 50 values is displayed in Figs. 2, 3, 4, and 5. Visibly, it could be seen that the values of the test sets are in close agreement with the training set values. Table 6 summarizes the docking output from PyRx virtual screening and discovery studio visualizer showing the binding affinity scores, distance, and interaction chemistry from the ligand to the receptor or vice versa. Figures 6a-d, 7a-d, 8a-d, and 9a-d showed ligand's conformation as well as the kind of interaction at molecular level including the hydrogen and hydrophobic interaction types with the active pockets of the M2 proteins of H5N1 influenza virus A in 3D and 2D respectively.

Discussion
The key step of building a statistically significant QSAR model is obtaining descriptors that are a portrayal of changes in the structural feature of the compounds. The structural descriptors of all compounds were generated using PaDEL software as stated earlier. A total sum of 1875 descriptors was generated in MS Excel (.csv) format, and the result was exported to a DTC lab software for the normalization and pretreatment process. In the pretreatment process, non-  Table 3. The regression statistics (Table 4) show p value and t values of the model which suggests that the coefficients of the descriptors are statistically significant at 95% confidence interval. Furthermore, the QSAR model was assessed based on the multi-collinearity among the descriptors by computing the variation inflation factor (VIF) of the three (3) descriptors, which can be computed using the equation: where R represents the correlation coefficient of the regression between variables in the model. VIF values corresponding to unity depict no inter-correlation among each variable, or if the VIF scores range between 1 and 5, the model is acceptable and stable. But, if the VIF scores are larger than 10, it means that the model in question is unstable and unacceptable [18]. Tables 5  and 6 show the correlation matrix and VIF scores of the descriptors used in forming the selected model. It could be observed that there is no inter-correlation among the descriptors since correlation coefficients are less than 0.5 and VIF scores are not greater than 1 for the three (3) descriptors in the model. The scatter plot of standardized residual versus experimental EC 50 (Fig. 3) revealed an unsystematic scattering of data points above and below the baseline of zero data point of standardized residual which signifies the non-existence of systematic error [13]. The scatter plot of standardized residuals against leverage scores also known as Williams plot revealed two (2) response outliers (compounds 2 and 23). This is because there leverage scores are greater than the threshold leverage score of (F*) of 0.66, which may be due to the changes in substitution arrangement of the substituent in the dataset. However, the remaining compounds whose leverage scores are less than the threshold score are said to be within the applicability domain of square area of ± 3. Also, the Uzairu plot showed that all compounds fall within the chemical space of the model which confirmed its predictive capabilities.
In silico docking was conducted on the most active compounds (Comp 2, 3, and 8) with solid-state NMR structure of H5N1 virus M2 protein (target) so as to study the structure-binding relationship of the complex formed. The result showed a binding score of − 5.55 kcal/mol for complex 2, − 5.35 kcal/mol for complex 3, and − 5.22 kcal/mol for complex 8. Ligand 2 formed four (4) hydrogen bond interactions where the thiophene moiety and cyano side chain fill into hydrogen pocket of M2 protein target. The sulfur in thiophene of ligand 2 formed H-bond with HIS37 (2.4406 A°), oxygen of sulfonyl (SO 2 ) serves as H-acceptor in forming H-bond with HIS37 at bond distance of 2.0507 A°, while C-terminal segment of 3081 amino acid residue formed two (2) H-bonds with nitrogen of cyano group and fluoro substituent of benzene ring at distances of 2.2368 A°and 2.4815 A°respectively. Also, the sulfur in sulfonyl group of ligand 2 formed two (2) Π-sulfur interaction with C-terminal segment of HIS37 (4.53254 A°) and D-terminal of HIS37 (5.88837 A°), while sulfur of thiophene moiety also formed Π-sulfur interaction with A-terminal segment of HIS137 at bond distance of 4.42545 A°. Amide-π stacked hydrophobic interaction type was formed between the amide group and delocalized electrons of thiophene ring (ligand 2) at distance 4.2498 A°, also π-orbitals of thiophene fill into

Conclusion
The present research attempted to use quantitative structure-activity relationship (QSAR) and molecular docking simulations as in silico modeling techniques. The MLR model with molecular descriptors, RDF100m, nO, and RDF45p, obtained from the genetic algorithm task was accepted based on the QSAR model acceptability criteria. The statistical metrics of the descriptors showed that the model is robust and statistically significant. Molecular docking results further revealed the binding modes existing in the stable   complex formed between the best ligands with higher activity and the solid-state NMR structure of H5N1 virus M2 protein as the receptor. In addition, the binding affinity scores − 5.55 kcal/mol for complex 2, − 5.35 kcal/mol for complex 3, and − 5.22 kcal/mol for complex 8 are higher than standard drug (amantadine). The outcome of this investigation can help to further the research and development of more potent derivatives of aryl sulfonamides as H5N1 influenza virus inhibitors