2.1 Dataset
In this work, a series of 37 compounds was used to derive the relationship between the structural features of the compounds and their antifungal activity. These 37 novel pyrazole-furan and pyrazole-pyrrole carboxamide compounds were obtained from previous work [24]. The negative logarithm of the measured EC50 (μM), pEC50 (pEC50 = −log EC50 = log 1/EC50), was taken as the dependent variable so that the data could be linearly correlated with the independent variables (descriptors) [10]. The dataset is presented in Table 1 and Fig. 1.
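The conversion from EC50 to pEC50 can be sketched as follows. This is a minimal illustration that assumes the usual definition pEC50 = −log10(EC50); the function name and the EC50 values are hypothetical and are not taken from Table 1. The concentration unit is passed through unchanged, so a different unit only shifts every pEC50 by a constant.

```python
import math


def pec50(ec50: float) -> float:
    """Return pEC50 = -log10(EC50) for a positive EC50 value."""
    if ec50 <= 0:
        raise ValueError("EC50 must be positive")
    return -math.log10(ec50)


# Hypothetical EC50 values (illustration only, not from the dataset).
for value in (0.01, 1.0, 10.0):
    print(f"EC50 = {value:>5}  ->  pEC50 = {pec50(value):.2f}")
```

A smaller EC50 (a more potent compound) gives a larger pEC50, which is why the transformed value is convenient as a dependent variable.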
2.2 Optimization
The structures in the dataset were optimized at the density functional theory (DFT) level using Becke's three-parameter Lee-Yang-Parr hybrid functional (B3LYP) together with the 6-31G* basis set in Spartan14 software [6]. The graphical user interface of Spartan14 was used to draw the 2D molecular structures of the dataset, which were then exported as 3D structures. The optimized structures were passed to PaDEL-Descriptor software to generate the molecular descriptors [25].
2.3 Molecular descriptors calculations
Molecular descriptors are properties of a molecule expressed as numerical/mathematical values. PaDEL-Descriptor software was used to calculate the descriptors of the low-energy conformers obtained from the optimization, and a total of 1875 descriptors were calculated [1].
2.4 Data division
To obtain a validated model, the dataset was divided into a training set and a test set (roughly 3:1). The division followed the Kennard-Stone algorithm, so that the compounds of the training set (about 70% of the data) and of the test set (about 30% of the data) were distributed across the entire descriptor space occupied by the complete dataset [7].
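The selection rule behind a Kennard-Stone split can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes Euclidean distance in descriptor space and the classic variant that starts from the two most distant compounds. The function name and the two-descriptor data are hypothetical.

```python
import math


def kennard_stone(points, n_train):
    """Return indices of n_train points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of points")
    # Pairwise Euclidean distances in descriptor space.
    dist = [[math.dist(a, b) for b in points] for a in points]
    # Start with the two most distant points.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_train:
        # Add the candidate whose nearest selected point is farthest away.
        best = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical compounds described by two (already scaled) descriptors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
          (0.5, 0.5), (0.4, 0.6), (0.9, 0.1), (0.1, 0.2)]
train_idx = kennard_stone(points, n_train=6)
test_idx = [i for i in range(len(points)) if i not in train_idx]
print("training:", sorted(train_idx), "test:", test_idx)
```

Because each new training compound is the one farthest from those already chosen, the training set spans the descriptor space and the test compounds fall inside it. In practice the descriptors would normally be scaled before distances are computed.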
2.5 Model building and validation
The generated molecular descriptors were subjected to regression analysis, with the experimental activities as the dependent variable and the molecular descriptors as the independent variables. The training set compounds were used to develop the QSAR model with the Genetic Function Approximation (GFA) method incorporated in Materials Studio 2017 software [9]. Four QSAR models were built, and the best model was chosen as the one with the lowest lack-of-fit (LOF) score, given as follows:
$$ \mathrm{LOF}=\frac{\mathrm{SSE}}{{\left(1-\frac{c+ dp}{M}\right)}^2} $$
(i)
where SSE is the sum of squares of errors, d is a user-defined smoothing parameter, c is the number of terms in the model other than the constant term, M is the number of samples in the training set, and p is the total number of descriptors contained in all terms of the model, excluding the constant term [11].
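The LOF score can be illustrated numerically. The sketch below assumes Friedman's form of the measure, in which SSE is divided by the squared penalty term, so that a model with more terms is penalized for the same SSE. The function name and all numbers are hypothetical.

```python
def lack_of_fit(sse: float, c: int, d: float, p: int, m: int) -> float:
    """Friedman lack-of-fit: SSE / (1 - (c + d*p) / M)**2."""
    penalty = 1.0 - (c + d * p) / m
    if penalty <= 0:
        raise ValueError("model is too large for the number of samples")
    return sse / penalty ** 2


# Hypothetical comparison: same SSE, different model sizes, M = 24 samples.
small = lack_of_fit(sse=2.0, c=4, d=0.5, p=4, m=24)
large = lack_of_fit(sse=2.0, c=8, d=0.5, p=8, m=24)
print(f"LOF (4 terms) = {small:.3f}, LOF (8 terms) = {large:.3f}")
```

With equal SSE, the larger model receives the higher LOF, which is how the score discourages overfitting when models are ranked.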
2.5.1 Internal validation
The generated model was validated internally by the following parameters:
a. The squared correlation coefficient (R2): this describes the fraction of the total variation in the data that is explained by the built model. Accepted values of R2 lie between 0.5 and 1, and the closer R2 is to 1.0, the better the model. R2 is the most common internal validation indicator, although a model must pass other tests before it can be considered good. It is expressed as follows:
$$ {R}^2=1-\frac{\sum {\left({Y}_{\mathrm{expt}}-{Y}_{\mathrm{predt}}\right)}^2}{\sum {\left({Y}_{\mathrm{expt}}-{\overline{Y}}_{\mathrm{train}}\right)}^2} $$
(ii)
where Yexpt, Ypredt, and \( \overline{Y} \)train represent the experimental, predicted, and average activities of the training set [3].
b. Adjusted R2: R2 alone is not a reliable measure of the power of the built model, because it does not account for the number of descriptors in the model; thus, R2 is adjusted for the number of descriptors. The adjusted R2 is defined in Eq. iii as:
$$ {R}_{\mathrm{adj}}^2=1-\left(1-{R}^2\right)\frac{n-1}{n-p-1}=\frac{\left(n-1\right){R}^2-p}{n-p-1} $$
(iii)
where p is the number of descriptors in the model and n is the number of training set compounds [14].
c. Cross-validated R2: The validity of the models was assessed by a cross-validation test, measured by the predictive Q2cv. In leave-one-out (LOO) cross-validation, one data point is removed (left out) from the set, the model is refitted, and the predicted value of the removed data point is compared with its actual value. This is repeated until each data point has been removed once. Q2cv is then calculated from the sum of squares of these left-out residuals, as in the equation below:
$$ {Q}_{cv}^2=1-\frac{\sum {\left({Y}_{\mathrm{predt}}-{Y}_{\mathrm{expt}}\right)}^2}{\sum {\left({Y}_{\mathrm{expt}}-{\overline{Y}}_{\mathrm{train}}\right)}^2} $$
(iv)
where Yexpt, Ypredt, and \( \overline{Y} \)train represent the experimental, LOO-predicted, and average activities of the training set [2].
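The three internal validation parameters above can be computed together, as sketched below. This is an illustration under stated assumptions: the model is taken to be an ordinary multiple linear regression fitted by least squares with NumPy, the adjusted R2 uses the standard form 1 − (1 − R2)(n − 1)/(n − p − 1), and the function names and data are hypothetical.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares coefficients of y = b0 + X @ b, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Predict activities for one row or a matrix of descriptor values."""
    return coef[0] + np.asarray(X) @ coef[1:]


def r_squared(y_exp, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken about the mean of y_exp."""
    ss_res = np.sum((y_exp - y_pred) ** 2)
    ss_tot = np.sum((y_exp - np.mean(y_exp)) ** 2)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjust R2 for p descriptors and n training compounds."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def q2_loo(X, y):
    """Leave-one-out Q2: refit without compound i, then predict it."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        preds[i] = predict_mlr(fit_mlr(X[keep], y[keep]), X[i])
    return r_squared(y, preds)


# Hypothetical training set: 6 compounds, 2 descriptors.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.8])

r2 = r_squared(y, predict_mlr(fit_mlr(X, y), X))
r2_adj = adjusted_r_squared(r2, n=len(y), p=X.shape[1])
q2 = q2_loo(X, y)
print(f"R2 = {r2:.3f}, R2adj = {r2_adj:.3f}, Q2cv = {q2:.3f}")
```

For a least-squares model, Q2cv cannot exceed R2, because each left-out residual is at least as large as the corresponding fitted residual; a large gap between the two is a common sign of overfitting.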
2.5.2 External validation
The predictive capacity of the model was examined by external validation, that is, by its ability to predict the activity values of the test set compounds. This was quantified by the predicted R2 (R2pred), calculated according to the equation below:
$$ {R}_{\mathrm{pred}}^2=1-\frac{\sum {\left({Y}_{\mathrm{predt}}-{Y}_{\mathrm{expt}}\right)}^2}{\sum {\left({Y}_{\mathrm{expt}}-{\overline{Y}}_{\mathrm{train}}\right)}^2} $$
(v)
where Ypredt and Yexpt are the predicted and experimental activities of the test set, while \( \overline{Y} \)train is the average activity of the training set [5].
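A short numerical sketch of the external validation statistic follows. It assumes, as in the definition above, that the sums run over the test set compounds while the mean is that of the training set. The function name and values are hypothetical.

```python
def r2_pred(y_test_exp, y_test_pred, y_train_mean):
    """External R2: test-set residuals against spread about the training mean."""
    press = sum((p - e) ** 2 for e, p in zip(y_test_exp, y_test_pred))
    spread = sum((e - y_train_mean) ** 2 for e in y_test_exp)
    return 1.0 - press / spread


# Hypothetical test set of three compounds; training-set mean pEC50 = 5.0.
value = r2_pred([4.0, 5.0, 6.0], [4.5, 5.0, 5.5], y_train_mean=5.0)
print(f"R2pred = {value:.2f}")
```

Here the residual sum of squares is 0.5 and the spread about the training mean is 2.0, giving R2pred = 0.75.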
2.6 Statistical analysis of the descriptors
2.6.1 Variance inflation factor (VIF)
The VIF is a measure of multicollinearity amongst the independent variables (i.e., descriptors). It quantifies the extent of correlation between one descriptor and the other descriptors in a model.
$$ VIF=\frac{1}{\left(1-{R}^2\right)} $$
(vi)
where R2 is the squared multiple correlation coefficient obtained by regressing one descriptor on the remaining descriptors in the model. A VIF equal to 1 means that there is no inter-correlation between the descriptors, and a VIF between 1 and 5 is considered suitable and acceptable. A VIF greater than 10 indicates that the model is unstable and needs to be reexamined [16, 19].
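The VIF of each descriptor can be obtained by regressing it on the remaining descriptors, as sketched below. This is an illustration with NumPy least squares rather than the procedure of any particular package; the function name and the data are hypothetical.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (compounds x descriptors)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Regress descriptor j on an intercept plus all other descriptors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return vifs


# Hypothetical descriptors: the first pair is uncorrelated, the second is not.
uncorrelated = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
correlated = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
print("uncorrelated:", variance_inflation_factors(uncorrelated))
print("correlated:  ", variance_inflation_factors(correlated))
```

Uncorrelated descriptors give VIF values of 1, while nearly collinear descriptors give values far above the threshold of 10 mentioned above.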
2.6.2 Mean effect (ME)
The mean effect (average effect) relates the influence of each molecular descriptor in the model to the activity of the compounds. The sign of the mean effect shows the direction in which a descriptor shifts the activity: depending on that sign, either an increase or a decrease in the value of the descriptor will improve the activity of the compounds. The mean effect is defined as follows:
$$ \mathrm{Mean}\ \mathrm{effect}=\frac{B_j{\sum}_{i=1}^n{D}_{ij}}{\sum_{j=1}^m\left({B}_j{\sum}_{i=1}^n{D}_{ij}\right)} $$
(vii)
where Bj is the coefficient of descriptor j in the model and Dij is the value of descriptor j for compound i of the training set, while m and n stand for the number of molecular descriptors and the number of compounds in the training set. To evaluate the relative importance of the descriptors in the model, the ME of all the descriptors was calculated [11].
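The mean effect calculation can be sketched as follows, assuming the descriptor values are arranged with one row per training compound and one column per descriptor. The function name, coefficients, and values are hypothetical.

```python
def mean_effects(coefficients, descriptor_rows):
    """Return the mean effect of each descriptor.

    coefficients: model coefficient B_j of each descriptor.
    descriptor_rows: one row per training compound, one value per descriptor.
    """
    column_sums = [sum(column) for column in zip(*descriptor_rows)]
    contributions = [b * s for b, s in zip(coefficients, column_sums)]
    total = sum(contributions)
    if total == 0:
        raise ValueError("contributions sum to zero; mean effect undefined")
    return [c / total for c in contributions]


# Hypothetical model with two descriptors and two training compounds.
effects = mean_effects([2.0, -1.0], [[1.0, 2.0], [3.0, 4.0]])
print(effects)
```

In this example the column sums are 4 and 6, the contributions are 8 and −6, and the mean effects are 4 and −3. By construction, the mean effects of all descriptors in a model sum to 1.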
2.6.3 Applicability domain
To confirm the reliability of the model and to identify outliers and influential compounds, it was important to evaluate its domain of applicability. The applicability domain estimates the uncertainty in the prediction for a compound from its similarity to the compounds used in building the model and from the distance between the training and test set compounds. This was achieved with the Williams plot, in which standardized residuals are plotted against leverages. The leverage of a particular compound is given as:
$$ {h}_{\mathrm{i}}={\mathrm{Z}}_{\mathrm{i}}{\left({\mathrm{Z}}^{\mathrm{T}}\ \mathrm{Z}\right)}^{-1}\ {{\mathrm{Z}}_{\mathrm{i}}}^{\mathrm{T}} $$
(viii)
where hi is the leverage of compound i, Zi is the descriptor row vector of compound i, Z is the n × k descriptor matrix of the training set compounds, and ZT is the transpose of Z. The warning leverage (h*), which is the limit of normal leverage values used to flag outliers in the descriptor (Z) space, is given by:
$$ {h}^{\ast }=3\frac{\left(p+1\right)}{n} $$
(ix)
where n is the number of compounds in the training set and p is the number of descriptors in the model [15].
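The leverage and warning leverage can be computed as sketched below. The sketch follows the definitions given here, with Z holding only the descriptor columns of the training set; whether an intercept column is added to Z is an implementation choice that is not specified and is left out. The function names and data are hypothetical.

```python
import numpy as np


def leverages(Z_train, Z_query):
    """h_i = z_i (Z^T Z)^-1 z_i^T for each row z_i of Z_query."""
    Z_train = np.asarray(Z_train, dtype=float)
    Z_query = np.asarray(Z_query, dtype=float)
    core = np.linalg.inv(Z_train.T @ Z_train)
    # Row-wise quadratic form z_i @ core @ z_i.
    return np.einsum("ij,jk,ik->i", Z_query, core, Z_query)


def warning_leverage(n, p):
    """h* = 3 (p + 1) / n for n training compounds and p descriptors."""
    return 3.0 * (p + 1) / n


# Hypothetical training set (5 compounds, 2 descriptors) and one test compound.
Z_train = np.array([[1.0, 0.2], [0.8, 0.5], [0.3, 0.9], [0.6, 0.4], [0.1, 0.7]])
Z_test = np.array([[2.5, 2.0]])

h_train = leverages(Z_train, Z_train)
h_test = leverages(Z_train, Z_test)
h_star = warning_leverage(n=26, p=4)  # hypothetical model size
print("training leverages:", np.round(h_train, 3))
print("test leverage:", np.round(h_test, 3), "warning leverage:", round(h_star, 3))
```

A compound whose leverage exceeds h* lies outside the region of descriptor space covered by the training set, so its prediction is an extrapolation. In a Williams plot such compounds fall to the right of the vertical line at h*.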
2.7 Ligand and receptor preparation
The receptor structure was downloaded in PDB format from the RCSB PDB (www.rcsb.org). It was then prepared in Discovery Studio by removing all extraneous species associated with the downloaded receptor (such as the bound ligand, water molecules, and other traces). The ligands (the optimized compounds), which were in SDF format, were converted into PDB format. The prepared structure of S. sclerotiorum and the prepared compounds were docked using AutoDock Vina 4.2 [22]. Discovery Studio Visualizer was also used to visualize the docking results. Figures 2 and 3 show the prepared receptor and ligand [15].
2.8 Optimization method of structure-based design
This refers to the optimization of a known molecule by evaluating its proposed analogs within the binding cavity [17]. Discovery Studio was used to visualize the receptor-ligand interactions, and the different interactions formed between compound 7 and the receptor (PDB ID: 2x2s), such as hydrogen bonds and hydrophobic interactions, were studied. Based on these interactions, new compounds were designed; they were drawn, optimized, converted to PDB format, and then docked with the receptor to assess their potency.