### 2.1 Dataset collection

Thirty-nine (39) EGFR^{WT} inhibitors, together with their inhibitory activities (IC_{50}) in nanomolar, were retrieved from [17] and used in this research. The IC_{50} values of these molecules were then converted to their negative logarithms (pIC_{50}) using Eq. 1 [3].

$$ {\mathrm{pIC}}_{50}=-\log_{10}\left({\mathrm{IC}}_{50}\times {10}^{-9}\right) $$

(1)
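As a minimal sketch of Eq. 1, the conversion can be written in a few lines of Python. The function name `pic50_from_nanomolar` and the example IC_{50} values are illustrative assumptions and are not taken from the dataset of [17].

```python
import math


def pic50_from_nanomolar(ic50_nm: float) -> float:
    """Convert an IC50 expressed in nanomolar to pIC50 (Eq. 1)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive number")
    ic50_molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(ic50_molar)


if __name__ == "__main__":
    # Illustrative values only, not measured activities
    for ic50 in (1.0, 10.0, 250.0):
        print(f"IC50 = {ic50:g} nM -> pIC50 = {pic50_from_nanomolar(ic50):.2f}")
```

A lower IC_{50} gives a higher pIC_{50}, so more potent inhibitors receive larger response values in the model.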

### 2.2 Structure generation and stable geometry calculations

The first step in a QSAR modeling study after data collection is drawing the structures of the studied molecules. The structures of all the studied molecules were therefore generated using the ChemDraw software [10]. The structures were then energy-minimized to relieve strain, and their most stable geometries on the potential energy surface (PES) were located using the Spartan 14 software. Density functional theory (DFT) at the B3LYP/6-311G* level of theory was used to search for the global minimum of each studied molecule on the PES [15].

### 2.3 1D, 2D, and 3D descriptors generation, data pre-treatment, and dataset splitting

To generate the independent variables (descriptors), the most stable structures obtained in Section 2.2 were saved in SDF format, which is recognized by the PaDEL-Descriptor toolkit [26], the software used to compute the descriptors.

The dataset was pre-treated manually to eliminate redundant and constant descriptors. After pre-treatment, the Data Division software was used to divide the data into a training set and a test set with the Kennard-Stone algorithm [14]. The training (model-building) set was used to generate the models, and the test (validation) set was used to assess the generated models [11].
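The pre-treatment and splitting steps can be illustrated with a short Python sketch. It is not the Data Division software: the function names, the use of Euclidean distance on unscaled descriptors, and the zero-variance tolerance are assumptions made for illustration, and only constant (not redundant) descriptors are removed here.

```python
import numpy as np


def drop_constant_columns(X, tol=1e-12):
    """Remove descriptor columns whose variance is numerically zero.

    Returns the reduced matrix and the indices of the columns that were kept.
    """
    X = np.asarray(X, dtype=float)
    keep = np.where(X.var(axis=0) > tol)[0]
    return X[:, keep], keep


def kennard_stone(X, n_train):
    """Select n_train rows with the Kennard-Stone max-min distance rule.

    Returns (training indices, test indices).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of rows")

    # Pairwise Euclidean distances between all molecules
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Start with the two most distant molecules
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]

    # Repeatedly add the molecule farthest from its nearest selected neighbour
    while len(selected) < n_train:
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)

    return selected, remaining
```

Because the rule always picks the molecule farthest from those already chosen, the training set tends to span the descriptor space, while the test set falls inside it.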

### 2.4 Model development

The models were developed using the genetic function approximation (GFA) method, with the actual pIC_{50} as the response variable and the descriptors as independent variables. For variable selection, GFA picks the most highly correlated descriptors and, as one of its distinctive characteristics, generates multiple candidate models rather than a single one.

### 2.5 Validation of the selected model

The most widely used assessment terms for QSAR models are the squared correlation coefficient of the training set (*R*^{2}_{training}), the adjusted *R*^{2} (*R*^{2}_{adj}), the cross-validation coefficient (*Q*_{cv}^{2}), and the squared correlation coefficient of the test set (*R*^{2}_{test}). High values of these parameters are necessary but not sufficient [21].
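The text above does not give formulas for these statistics, so the following Python sketch uses commonly quoted definitions as an assumption: *R*^{2} = 1 − SS_res/SS_tot, *R*^{2}_{adj} = 1 − (1 − *R*^{2})(*n* − 1)/(*n* − *p* − 1), and a leave-one-out *Q*^{2} = 1 − PRESS/SS_tot. The ordinary least-squares fit stands in for whichever regression the modeling software performs, and all function names are illustrative.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least-squares fit with an intercept (intercept is coef[0])."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Predict responses for the rows of X from fitted coefficients."""
    return coef[0] + np.asarray(X, dtype=float) @ coef[1:]


def r_squared(y, y_pred):
    """1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    ss_res = ((y - y_pred) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


def r_squared_adjusted(r2, n, p):
    """Adjust R^2 for n molecules and p descriptors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def q_squared_loo(X, y):
    """Leave-one-out cross-validated Q^2 = 1 - PRESS / SS_tot."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i          # leave molecule i out
        coef = fit_mlr(X[mask], y[mask])  # refit on the rest
        press += (y[i] - predict_mlr(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / ((y - y.mean()) ** 2).sum()
```

*R*^{2}_{test} can be obtained by calling `r_squared` on the test-set activities and their predictions, although some authors use the training-set mean in the denominator instead.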

In view of this, the inter-correlation between descriptors can be detected using their variance inflation factors (VIF), which show whether the descriptors are highly correlated with one another. A VIF equal to 1 means there is no inter-correlation between the descriptors; a value between 1 and 5 means the model can be accepted; and a value higher than 10 means the model cannot be accepted. The VIF is calculated using the equation below:

$$ \mathrm{VIF}=\frac{1}{1-{R}^2} $$

(2)

where *R*^{2} is the coefficient of determination obtained when a given descriptor is regressed on the other descriptors in the selected model [5].
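Eq. 2 can be sketched in Python as follows, under the assumption that each descriptor is regressed (with an intercept) on the remaining descriptors of the model and the resulting *R*^{2} is inserted into Eq. 2. The function name is illustrative.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per descriptor column of X (Eq. 2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress descriptor j on the other descriptors plus an intercept
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)
```

Descriptors that are mutually uncorrelated give values of 1, whereas two nearly collinear descriptors give values well above the cutoff of 10.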

The significance and contribution of each descriptor to the selected model are evaluated using the mean effect of that descriptor. The mean effect is defined by the equation below:

$$ {\mathrm{MF}}_{j}=\frac{\beta_j{\sum}_{i=1}^{n}{d}_{ij}}{\sum_{j=1}^{m}\beta_j{\sum}_{i=1}^{n}{d}_{ij}} $$

(3)

where MF_{j} is the mean effect of descriptor *j* in the model, *β*_{j} is the coefficient of descriptor *j* in that model, *d*_{ij} is the value of descriptor *j* in the data matrix for molecule *i* of the model building set, *m* is the number of descriptors that appear in the model, and *n* is the number of molecules in the model building set [4].
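A small Python sketch of Eq. 3 is given below. The function name and the toy coefficients are assumptions for illustration; the intercept of the model is not part of the calculation.

```python
def mean_effects(coefficients, descriptor_matrix):
    """Mean effect of each descriptor (Eq. 3).

    coefficients: the m regression coefficients, intercept excluded.
    descriptor_matrix: n rows (molecules) by m columns (descriptors).
    """
    m = len(coefficients)
    # Sum of each descriptor over all molecules of the model building set
    column_sums = [sum(row[j] for row in descriptor_matrix) for j in range(m)]
    weighted = [b * s for b, s in zip(coefficients, column_sums)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("the denominator of Eq. 3 is zero")
    return [w / total for w in weighted]


if __name__ == "__main__":
    # Toy numbers only, not the descriptors of the reported model
    print(mean_effects([2.0, -1.0], [[1.0, 2.0], [3.0, 4.0]]))
```

The mean effects sum to 1 by construction, and the sign of each value indicates the direction in which the descriptor shifts the predicted activity.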

To ensure the robustness of the QSAR model and to confirm that it was not obtained by chance correlation, a Y-scrambling test was performed. The actual activities are reshuffled while the descriptors are kept unchanged, and new QSAR models are built over several trials; these new models are expected to give low *Q*^{2} and *R*^{2} values. The validation parameter for this test is cR^{2}_{p}, which should exceed 0.5 (cR^{2}_{p} > 0.5) [12].
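The Y-scrambling procedure can be sketched in Python as shown below. The formula for cR^{2}_{p} is not given in the text; the sketch assumes the commonly used form cR^{2}_{p} = *R* × √(*R*^{2} − mean *R*^{2} of the scrambled models). The function names, the number of trials, and the random seed are likewise assumptions.

```python
import numpy as np


def r2_of_fit(X, y):
    """R^2 of an ordinary least-squares fit of y on X with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()


def y_scrambling(X, y, n_trials=50, seed=0):
    """Return (R^2 of the true model, mean R^2 of scrambled models, cR^2_p)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    r2_true = r2_of_fit(X, y)
    # Shuffle the activities, keep the descriptors fixed, and refit each time
    r2_random = np.array(
        [r2_of_fit(X, rng.permutation(y)) for _ in range(n_trials)]
    )
    mean_r2_random = r2_random.mean()
    # Assumed definition: cR^2_p = R * sqrt(R^2 - mean random R^2)
    gap = max(r2_true - mean_r2_random, 0.0)
    c_r2p = np.sqrt(max(r2_true, 0.0)) * np.sqrt(gap)
    return r2_true, mean_r2_random, c_r2p
```

A model that captures a real structure-activity relationship keeps a large gap between its own *R*^{2} and the average *R*^{2} of the scrambled models, which pushes cR^{2}_{p} above the 0.5 threshold.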

### 2.6 Applicability domain

A QSAR model is considered valid only if, when subjected to applicability domain (AD) analysis, it is found to make reliable predictions for the training and test molecules. The model is therefore subjected to AD analysis to find out whether there are influential molecules or outliers among the studied ones [22]. One of the methods used in assessing the AD is the leverage approach, in which the leverage *h*_{i} of a molecule is given as:

$$ {h}_i={x}_i{\left({X}^TX\right)}^{-1}{x}_i^T\kern1em \left(i=1,\dots, \kern0.5em n\right) $$

(4)

where *x*_{i} is the descriptor row vector of molecule *i*, *X* is the *n × k* descriptor matrix of the training set used in generating the model, and *X*^{T} is the transpose of *X*. The threshold for the value of *h*_{i} is the warning leverage (*h*^{*}), which is given by the equation below:

$$ {h}^{\ast }=3\left(k+1\right)/n $$

(5)

where *n* is the number of molecules in the model building set and *k* is the number of descriptors in the model under evaluation.
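The leverage calculation of Eq. 4 and the warning leverage of Eq. 5 (three times the number of descriptors plus one, divided by the number of training molecules) can be sketched in Python as follows. The function names are illustrative assumptions, and the descriptor matrix is used as given, without adding an intercept column or scaling.

```python
import numpy as np


def leverages(X_train, X_query=None):
    """Leverage h_i = x_i (X^T X)^{-1} x_i^T for each row of X_query (Eq. 4).

    If X_query is omitted, leverages of the training molecules are returned.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_query = X_train if X_query is None else np.asarray(X_query, dtype=float)
    xtx_inv = np.linalg.inv(X_train.T @ X_train)
    # Row-wise quadratic form x_i * (X^T X)^-1 * x_i^T
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)


def warning_leverage(n_train, n_descriptors):
    """Warning leverage of Eq. 5."""
    return 3.0 * (n_descriptors + 1) / n_train


def outside_domain(X_train, X_query):
    """Boolean flags for query molecules whose leverage exceeds the threshold."""
    n_train, n_descriptors = np.asarray(X_train).shape
    h = leverages(X_train, X_query)
    return h > warning_leverage(n_train, n_descriptors)
```

Test-set molecules are passed as `X_query` so that their leverages are computed against the training-set matrix; a molecule whose leverage exceeds the warning value lies outside the applicability domain.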

### 2.7 Molecular docking

A Dell Latitude E6520 computer system with the following specifications: Intel® Core™ i7 Dual CPU, M330 @ 2.75 GHz, 8 GB of RAM, was used to explore the nature of the interactions between the active site of the EGFR enzyme and the five most active EGFR^{WT} inhibitors (ligands) with the help of the PyRx virtual screening software, Chimera, PyMOL, and Discovery Studio.

Before the docking analysis, the ligands were prepared from the optimized structures in Section 2.2 and saved in PDB file format using Spartan 14 [1]. The 3D structure of the EGFR enzyme was downloaded from the Protein Data Bank (PDB ID: 4ZAU). The enzyme was prepared for docking with Discovery Studio Visualizer: hydrogen atoms were added, and water molecules, heteroatoms, and co-ligands were removed from the crystal structure, which was then saved as a PDB file.

The ligands were docked to the active site of the EGFR enzyme with AutoDock Vina, run through the PyRx software [11]. After docking, the ligand-receptor complexes were re-formed for further investigation using the Chimera software. Discovery Studio Visualizer and PyMOL were used to investigate the interactions in the complexes.

### 2.8 Pharmacokinetics

Pharmacokinetic studies of the five (5) most active compounds in the dataset were carried out using SwissADME, a free web tool for evaluating the ADME and drug-likeness properties of small molecules [8]. Lipinski's rule of five is useful at the pre-clinical stage of drug discovery; it states that if a chemical violates more than 2 of these criteria (molecular weight < 500, number of hydrogen bond donors ≤ 5, number of hydrogen bond acceptors ≤ 10, calculated log *P* ≤ 5, and polar surface area (PSA) < 140 Å^{2}), the chemical is said to be impermeable or poorly absorbed [13].
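The screening criteria listed above can be expressed as a short Python check. This is an illustrative sketch rather than part of SwissADME: the class and function names are assumptions, the property values would in practice be read from the SwissADME output, and the cutoff of more than 2 violations follows the statement in this section.

```python
from dataclasses import dataclass


@dataclass
class MoleculeProperties:
    """Properties used by the criteria in this section (hypothetical container)."""
    molecular_weight: float   # g/mol
    h_bond_donors: int
    h_bond_acceptors: int
    log_p: float              # calculated log P
    polar_surface_area: float # square angstroms


def count_violations(p: MoleculeProperties) -> int:
    """Count how many of the five criteria are not met."""
    criteria_met = [
        p.molecular_weight < 500,
        p.h_bond_donors <= 5,
        p.h_bond_acceptors <= 10,
        p.log_p <= 5,
        p.polar_surface_area < 140,
    ]
    return sum(not met for met in criteria_met)


def is_poorly_absorbed(p: MoleculeProperties, max_violations: int = 2) -> bool:
    """Flag a molecule that violates more than max_violations criteria."""
    return count_violations(p) > max_violations


if __name__ == "__main__":
    # Made-up property values for demonstration only
    example = MoleculeProperties(450.0, 2, 6, 3.5, 90.0)
    print(count_violations(example), is_poorly_absorbed(example))
```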