Characterization of squalene synthase gene from Gymnema sylvestre R. Br.

Squalene synthase (SQS) is a rate-limiting enzyme necessary to produce pentacyclic triterpenes in plants. It produces the squalene molecules required by the steroidal and triterpenoid biosynthesis pathways, which operate in a competitive inhibition mode. Information on the SQS gene is available for several plants, but detailed information on the SQS gene in Gymnema sylvestre R. Br. is not. G. sylvestre is a valuable, rare vine of the central eco-region known for its medicinally important triterpenoids. Our work aims to characterize the GS-SQS gene in this high-value medicinal plant. A coding DNA sequence (CDS) of 1245 bp representing the GS-SQS gene, predicted from G. sylvestre transcriptome data, was used for further characterization. The protein structure modeled with SWISS-MODEL for the GS-SQS amino acid sequence had a MolProbity score of 1.44 and a clash score of 3.86. The quality estimates and the statistical scores of the Ramachandran plot analysis indicated that the homology model was reliable. For full-length amplification of the gene, primers designed from the flanking regions of the CDS encoding GS-SQS were used with genomic DNA as template, which yielded a single-band product of approximately 6.2 kb. Sequencing of this product by NGS generated 2.32 Gb of data and 3347 scaffolds with an N50 value of 457 bp. These scaffolds were compared for similarity with other SQS genes as well as with the GS-SQS sequences of the transcriptome. Scaffold_3347, representing the GS-SQS gene, harbored two introns of 101 and 164 bp. Both intronic regions were validated with primers designed from the regions adjoining the introns on the scaffold representing the GS-SQS gene. Amplification occurred when the template was genomic DNA and failed when the template was cDNA, which confirmed the presence of two introns in the GS-SQS gene of Gymnema sylvestre R. Br.
This study shows that the GS-SQS gene is very closely related to the SQS genes of Coffea arabica and Gardenia jasminoides and that it harbors two introns of 101 and 164 bp.


Background
Gymnema sylvestre R. Br., traditionally known as madhunashini, is one of the important medicinal plants of India. It belongs to the family Asclepiadaceae (milkweed). The economic plant part is the leaf. The phytochemicals present in the leaf cause a loss of the taste of sweetness, which is why the plant is called madhunashini, meaning "killer of sugar." It grows in the tropical forests of India and has been used for more than 2000 years in traditional systems of medicine [1]. This plant has also found application in pharmaceuticals. The whole plant is rich in secondary metabolites, which give it its medicinal uses. The ethanolic extract of the leaves contains tannins, gum, flavonoids, proteins, and saponins. The principal constituent, gymnemic acid, is found in the aqueous leaf extract of Gymnema. It is one of the common plants used in the Indian system of medicine [2]. Various parts of the plant are used in the treatment of different diseases such as skin problems, bronchitis, eye disease, cancer, and diabetes. It has digestive, diuretic, emetic, expectorant, laxative, stimulant, and stomachic properties. It is also used for its antifungal properties and in urinogenital infection [3]. It has anti-diabetic, anti-sweetener, and anti-inflammatory activity [4]. Despite this potential medical importance, little is known about the molecular biology of triterpene biosynthesis in G. sylvestre. Recently, the biosynthetic pathways of gymnemic acid [5] and polyoxypregnane glycoside [6], putative lncRNA and genes regulating the terpenoid biosynthesis pathway [7], and 13 potential miRNAs [8] have been reported in G. sylvestre. In all plants, triterpenes are synthesized via the mevalonate pathway, which involves the sequential conversion of farnesyl diphosphate (FPP) to squalene and then to 2,3-oxidosqualene, followed by a series of cyclization, oxidation, and reduction reactions [9].
Squalene synthase (SQS) and squalene epoxidase (SQE), both rate-limiting enzymes, are necessary to produce pentacyclic triterpenes. There is a positive correlation between the level of SQS expression and the quantity of triterpenoids produced [10]. SQS is a membrane-bound, bifunctional enzyme that catalyzes the condensation of two molecules of the C15 allylic farnesyl pyrophosphate to form squalene, a linear C30 compound that is a precursor of both sterols and triterpenoids. The process occurs in two steps. The first step is the formation of pre-squalene diphosphate by a head-to-head condensation of two FPP molecules. In the second step, this intermediate is reduced to squalene in an NADPH-dependent manner, and this step requires divalent cations [11]. Thus, SQS is an especially important enzyme, producing the squalene molecules required by the steroidal and triterpenoid biosynthesis pathways, which operate in a competitive inhibition mode [12]. Squalene molecules are precursors of many important secondary metabolites known for their medicinal, chemical, and pharmaceutical value, and squalene synthase is the key enzyme responsible for producing them. The secondary products formed from squalene in G. sylvestre include saponins, triterpenoids, and polyoxypregnanes. The triterpenoids in leaves of G. sylvestre include oleanane and dammarane types. The oleanane saponins include gymnemic acids as well as gymnemasaponins, and the dammarane saponins include gymnemasides [13,14]. These terpenoids, besides their role in plant defense, are also associated with various clinical properties such as anti-viral, anti-tumor, anti-inflammatory, immune-activating, and cholesterol-lowering activity. Thus, SQS is a crucial enzyme in regulating triterpenoid biosynthesis [12].
Information on the SQS gene is available for several plants, namely, Arabidopsis thaliana, Glycine max, Magnolia officinalis, Panax ginseng, Panax notoginseng, Salvia miltiorrhiza Bunge, Chimonanthus zhejiangensis, Tripterygium wilfordii, and Taraxacum kok-saghyz [15,16]. However, no such information is available for the SQS gene in G. sylvestre. Gene sequence information of this kind may make it possible to clone the gene and overexpress squalene synthase in order to harvest higher quantities of the phytocompounds whose precursor is squalene. Therefore, an attempt was made to characterize the GS-SQS gene in this high-value medicinal plant.

Transcriptome profiling and validation of SQS
Transcriptome analysis followed the procedure of a recent transcriptome study in Gymnema sylvestre [3]. The consensus CDS representing GS-SQS predicted from this transcriptome analysis of leaf, flower, and fruit was selected for further characterization.

Prediction of protein model and phylogenetic analysis
The protein model was generated with the SWISS-MODEL online module, and the MolProbity score, QMEAN and Cβ values, and Ramachandran plot analysis were recorded [17]. GS-SQS protein sequences were obtained from published reports. After the sequences were aligned and configured for the highest accuracy, phylogenetic trees were constructed with the PHYLIP method [18]. Bootstrapping was used to assess the reliability of internal branches.

Validation of GS-SQS gene
To obtain sequence-level information on the GS-SQS gene at the genomic DNA level, primers were designed from the flanking regions of the CDS sequences representing the GS-SQS gene (Table 1). Amplification was carried out with a long-range PCR master mix (Biolabs, Inc.) using genomic DNA as template. Genomic DNA was extracted from fresh leaves of the DGS 22 genotype of G. sylvestre. A total of 0.1 g of leaf material was used, and DNA was isolated using the DNeasy Plant Mini Kit (QIAGEN, Hilden, Germany) following the manufacturer's instructions. The DNA concentration was estimated by 0.8% agarose gel electrophoresis using a DNA standard. Quantification of DNA was done with a Nanodrop-2000 spectrophotometer.

Long-range PCR to amplify full-length SQS gene
For PCR amplification, 10-μl reaction mixtures containing 20 ng of template DNA, long-range PCR master mix (Biolabs, Inc.), and 0.25 μM of each primer were used. Thermal cycling was carried out on a Bio-Rad thermal cycler. The PCR program consisted of pre-denaturation (95°C for 5 min) followed by 35 cycles of denaturation (95°C for 30 s), annealing (55-60°C for 45 s), and extension (72°C for 45 s), and a final extension at 72°C for 10 min.
Amplified PCR products were visualized on a 0.8% agarose gel to check the product size. The amplified product of the GS-SQS gene gave a single band, which was compared with a long-range DNA ladder, and the size of the product was estimated to be approximately 6.2 kb. This PCR product was purified using 1X Agencourt AMPure XP (Beckman Coulter Genomics: A63882) DNA beads to remove primer dimers and the enzymes of the PCR reaction. The purified PCR product was analyzed on a 0.8% agarose gel (3 μl loaded) to confirm a single intact band. The gel was run at 110 V for 30 min. One microliter of the sample was used to determine the concentration with a Qubit® 2.0 fluorometer.

Sequencing, library, and de novo assembly preparation
The paired-end sequencing library was prepared using the NEBNext Ultra DNA Library Prep Kit for Illumina. The PCR product sample was mechanically sheared into smaller fragments with a Covaris instrument, followed by a continuous end-repair step in which an 'A' is added to the 3′ ends, making the DNA fragments ready for adapter ligation. Both ends of the DNA fragments were ligated with adapters suitable for the Illumina platform. A high-fidelity amplification step using HiFi PCR Master Mix ensured maximum yield from the initially limited product. The amplified library was analyzed on a Bioanalyzer 2100 (Agilent Technologies) using a High Sensitivity (HS) DNA chip. After obtaining the Qubit concentration for the library and the mean peak size from the Bioanalyzer profile, the library was loaded onto the Illumina platform for cluster generation and paired-end sequencing.
High-quality paired-end data were assembled with CLC Genomics Workbench 6 using the option to map reads back to contigs, with a minimum contig length of 200 bp and mismatch, insertion, and deletion costs of 2, 3, and 3, respectively. The length fraction and similarity fraction were 0.5 and 0.8, respectively. Gap closure was then applied to fill the gaps remaining in the CLC assembly, which improved the assembly.

Identification of scaffold representing squalene synthase and confirmation
To identify the scaffolds representing the GS-SQS gene, a similarity search of the scaffolds was carried out against NCBI's non-redundant (NR) protein database using the blastX algorithm. To identify intronic gaps, the scaffold representing the GS-SQS gene was aligned with the CDS representing GS-SQS. To confirm and validate the introns, two sets of specific primers were designed as detailed in Table 1. Genomic DNA and cDNA were used as templates to amplify the introns. Total RNA was isolated from the leaves using a total RNA purification kit (Sigma-Aldrich). cDNA was synthesized using a first-strand cDNA synthesis kit (Thermo Scientific). The first set of primers was designed to amplify the first intron with a product size of 152 bp. Likewise, the second intron was amplified with a product size of 217 bp. Both introns together were also amplified, with a product size of 415 bp. Thermal cycling was carried out on a Bio-Rad thermal cycler. The PCR program consisted of pre-denaturation (95°C for 5 min) followed by 35 cycles of denaturation (95°C for 20 s), annealing (55°C for 30 s), and extension (72°C for 40 s), and a final extension at 72°C for 10 min. Amplified PCR products were visualized on a 0.8% agarose gel to check the product size.
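The size logic behind an intron-spanning PCR assay can be sketched as simple arithmetic. The snippet below is a minimal illustration, not the authors' procedure: it assumes that both primers anneal in exon sequence flanking the intron, so that a spliced template would differ from the genomic one by exactly the intron length. The function name is hypothetical; the product sizes (152, 217, and 415 bp) are those given above, and the intron lengths (101 and 164 bp) are the values reported in this paper.

```python
def spliced_product_size(genomic_product_bp, intron_lengths_bp):
    """Size of the product once the listed introns are removed.

    Assumption (not from the source): both primers sit in exon sequence,
    so the only difference between templates is the intron length.
    """
    spliced = genomic_product_bp - sum(intron_lengths_bp)
    if spliced <= 0:
        raise ValueError("introns cannot be longer than the genomic product")
    return spliced


# Genomic product sizes and the intron(s) each primer pair spans.
assays = {
    "intron 1": (152, [101]),
    "intron 2": (217, [164]),
    "introns 1 + 2": (415, [101, 164]),
}

for name, (genomic_bp, introns) in assays.items():
    print(name, genomic_bp, spliced_product_size(genomic_bp, introns))
```

These numbers are only the arithmetic under the stated assumption; the study reports that no product was obtained from cDNA.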

Protein model
A total of three CDS representing GS-SQS, predicted from the transcriptome analysis of leaf (L2), flower (F1), and fruit (F2), were deposited in NCBI GenBank under the names GS-SQSCDS_10191-L2 (leaf sample), GS-SQSCDS_64527-F1 (flower sample), and GS-SQSCDS_35012-F2 (fruit sample), with GenBank accession numbers MT812194, MT812195, and MT812196, respectively. All these CDS had the same sequence of 1245 bp, including start and stop codons. The protein sequence encoded by the consensus CDS of GS-SQS was used to predict the protein model with the SWISS-MODEL online tool (Fig. 2). MolProbity is a structure-validation web service that provides a broad-spectrum, solidly based evaluation of model quality at both the global and local levels for both proteins and nucleic acids. The MolProbity score for the generated model was 1.44 and the clash score was 3.86. The quality estimates showed QMEAN and Cβ values of −2.53 and −1.79, respectively. The Ramachandran plot analysis (Fig. 3) showed that 96.27% of residues fell within the most favored φ, ψ regions, whereas Ramachandran outliers were 0.31%, which indicates that the homology model is reliable.

Comparison of GS-SQS with SQS sequences from other sources
Innovations in molecular biology and protein sequencing techniques have made it possible to characterize the proteome of numerous organisms. Similarly, computational biology methods are now used routinely to analyze protein sequences and structures in detail at the molecular level [19][20][21][22]. In the present study, an attempt has been made to investigate sequence similarities between Gymnema sylvestre and other plant species using bioinformatics methods. Phylogenetic studies of the protein sequences of Gymnema sylvestre have provided valuable evidence about their taxonomy, protein makeup, plant systematics, DNA barcoding, and common ancestor. For SQS proteins, phylogenetic trees were constructed, and their reliability was checked by assessing the bootstrap values. The Maximum Likelihood (ML) consensus trees are shown in Fig. 4. The phylogenetic analysis of the SQS gene shows close similarity to Coffea arabica and Gardenia jasminoides.

Quality check and quantification of PCR product
Quality of the PCR purified product was checked on 0.8% agarose gel. The size of the product was approximately 6.2 kb (Fig. 5). Quantification of the sample was done using a Qubit fluorometer. Concentration of the product was 96.6 ng/μl.

NGS data generation and de novo assembly preparation
The library was prepared from the culture sample with the NEBNext Ultra DNA Library Prep Kit. The average size of the library was 355 bp. The library was sequenced on the Illumina platform (2 × 150 bp chemistry), and 2.32 GB of data was generated. The raw data are available at the NCBI data repository with SRA accession number SRR10829862 under project ID PRJNA599051. The statistics of the data and assembly elements, derived using an in-house Perl script, are provided in Table 2. The number of scaffolds identified through NGS was 3347. The mean scaffold size and N50 value of the elements were 461 and 497 bp, respectively. The maximum scaffold size was 23,583 bp, and the total scaffold size was 1,543,977 bp. The scaffold identified as representing the SQS gene was searched for similarity at the domain level using the CDD database. During the blast search, the E value threshold was fixed at 0.01, and the maximum number of hits was 500. The results, along with intervals and E values, are given in Fig. 6. The sequence of scaffold_3347 representing the GS-SQS gene is deposited at the NCBI gene repository with submission ID BankIt2375151 and accession MT892813.
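Assembly statistics of this kind can be derived from the list of scaffold lengths alone. The following is a minimal Python sketch, not the authors' in-house Perl script; the function names and the toy input are hypothetical. N50 is taken here in its usual sense: the length of the scaffold at which the cumulative length, counted from the longest scaffold downward, first reaches half of the total assembly size.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L cover half the assembly."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def assembly_stats(lengths):
    """Summary statistics comparable to those listed for the assembly."""
    total = sum(lengths)
    return {
        "scaffolds": len(lengths),
        "total_bp": total,
        "mean_bp": total / len(lengths),
        "max_bp": max(lengths),
        "n50_bp": n50(lengths),
    }


# Toy example with made-up lengths, only to show the call.
print(assembly_stats([200, 300, 400, 500, 600]))
```

As a quick check on the reported values, 1,543,977 bp divided by 3347 scaffolds gives about 461 bp, which agrees with the mean scaffold size stated above.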

Alignment of scaffold and CDS representing GS-SQS and analysis of the data
After aligning the scaffold and the CDS representing GS-SQS, the UTR sequences on both sides of the scaffold were removed, and the remaining scaffold sequence was compared with the CDS encoding GS-SQS. The alignment showed a total of six mismatch sites on the scaffold side (four of one nucleotide and two of two nucleotides), which appear to be sequencing errors. It also revealed that the GS-SQS gene had two predicted intronic regions longer than 100 bp. The first intron was 101 bp long (657 to 758 bp of the CDS) and the second 164 bp (864 to 1028 bp of the CDS) (Fig. 7). Amplification occurred with genomic DNA as template and not with cDNA as template, which confirmed the presence of two introns of 101 and 164 bp, respectively, in the GS-SQS gene of Gymnema sylvestre R. Br. (Fig. 8).

Discussion
The results showed that scaffold_3347, 3926 bp in length, was similar to the Panax ginseng protein AJK30629.1 of 444 amino acids. In addition, a blastN search of the de novo assembled scaffolds was carried out against NCBI's non-redundant nucleotide (NT) database. Scaffold_3346, 1244 bp in length, was similar to the Olea europaea var. sylvestris squalene synthase-like (LOC111412627) mRNA of 1204 bp. The CLC gap-closed de novo assembly was searched for similarity against the transcriptome data. Of the 3347 scaffolds, a total of three showed significant alignments against the CDS representing the GS-SQS gene. Scaffold_3347 and scaffold_3346 had bit scores of 281 and 171 and E values of 3e-76 and 6e-43, respectively. Although the scaffold did not cover the complete CDS, the middle portion of the CDS was covered by scaffold_3347. A domain search revealed that the sequence of scaffold_3347 representing GS-SQS contains a farnesyl-diphosphate farnesyltransferase domain. There were a total of twelve conserved sites representing five different genes. The maximum number of conserved sites fell under the farnesyl-diphosphate farnesyltransferase gene, represented by accession TIGR01559. This family is related to phytoene synthases. The predicted C-terminal transmembrane region, which is absent in archaeal homologs, is not included in this model [23]. The scaffold had three conserved sites for the gene Trans-Isoprenyl Diphosphate Synthases (Trans_IPPS), represented by accession cd00683. The head-to-head (HH) (1′-1) condensation is carried out by Trans_IPPS. This conserved domain encompasses two genes, viz., squalene synthases and phytoene synthases [23]. These residues mediate the binding of prenyl phosphates. The enzymatic production of squalene is a two-step reaction. A stable intermediate, cyclopropylcarbinyl diphosphate, is formed by squalene synthase from two molecules of FPP.
Squalene is then produced from this intermediate through heterolysis, isomerization, and reduction with NADPH. Phytoene, a precursor of beta-carotene, is produced by phytoene synthase (CrtB) through the condensation of two molecules of geranylgeranyl diphosphate. These enzymes, which are widely present across eukaryotes, bacteria, and archaea, are responsible for the biosynthesis of many triterpene and tetraterpene precursors. A chain of these enzymes produces the triterpenes and tetraterpenes in plants.
Triterpenoid alkaloids and steroids are further produced from these triterpenes and tetraterpenes. Another two conserved sites belong to squalene/phytoene synthase, represented by pfam00494, and to a phytoene/squalene synthetase, represented by ERG9 COG1562. After aligning the scaffold and the CDS representing GS-SQS, analysis of the nucleotide composition of the predicted introns revealed an A+T content of more than 63%. Both introns had the AG dinucleotide at the 3′ splice site, which facilitates the second step of the splicing event. The initial dinucleotide was GT in the first intron, whereas it was TT in the second intron. Thus, both introns had a common conserved branch point fitting the requirement of the spliceosome to act upon. To amplify the intronic regions, primers were designed from the regions flanking the introns on the scaffold encoding GS-SQS. The housekeeping gene EFTU was successfully amplified with cDNA as well as genomic DNA as template. However, with the primers designed from the regions adjoining the introns on the scaffold encoding GS-SQS, amplification of the intronic regions occurred only when genomic DNA was used as template and not when cDNA was the template, which confirmed the presence of these two introns in the GS-SQS gene of Gymnema sylvestre R. Br. Introns are important features of eukaryotic genes; they are usually non-coding sequences and are removed from pre-mRNA [24]. In general, the boundary sequences of introns are conserved, with GU at the 5′ end and AG at the 3′ end, because these may be important for intron splicing in pre-mRNA [25]. Introns are classified into several types. The genes of chloroplasts, mitochondria, and bacteria are reported to have introns [26,27]. The type I intron is the most common type of intron reported in the majority of eukaryotic nuclear genes. Since introns are preserved [24,28], they may have functions in the cell, such as regulation of gene expression and increasing protein diversity through alternative splicing [25,29]. The sequence of the whole intronic region is not conserved, and therefore mutations accumulate more easily in such regions [30]. Intron size varies widely, from longer than dozens of kilobase pairs (kbp) to shorter than 10 bp. In Arabidopsis, as revealed by the Arabidopsis Genome Initiative 2000, the majority of introns are small, with a size of a few hundred bp. The smallest exon in Arabidopsis was found to be 1 bp [31].
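The intron features discussed above (A+T content and the boundary dinucleotides) are straightforward to compute from an intron sequence. The sketch below is a minimal Python illustration under stated assumptions: the function names are hypothetical, the example sequences are invented toy strings rather than the GS-SQS introns, and "canonical" is used only in the sense of a GT (GU in RNA) start and an AG end.

```python
def at_content(seq):
    """Percentage of A and T (or U) bases in a nucleotide sequence."""
    s = seq.upper().replace("U", "T")
    if not s:
        raise ValueError("empty sequence")
    return 100.0 * (s.count("A") + s.count("T")) / len(s)


def splice_site_summary(intron):
    """Report the boundary dinucleotides and A+T content of one intron."""
    s = intron.upper().replace("U", "T")
    if len(s) < 4:
        raise ValueError("intron too short to have two boundary dinucleotides")
    donor, acceptor = s[:2], s[-2:]
    return {
        "donor": donor,          # first two bases (5' splice site)
        "acceptor": acceptor,    # last two bases (3' splice site)
        "canonical": donor == "GT" and acceptor == "AG",
        "at_percent": at_content(s),
    }


# Invented toy sequences, not data from the study.
print(splice_site_summary("GTAAATTTCAG"))
print(splice_site_summary("TTAAAGAG"))
```

Applied to real intron sequences extracted from the scaffold, the same summary would show at a glance whether each intron starts with GT and ends with AG and whether its A+T content exceeds a chosen threshold.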
Chlorophytum borivilianum, Euphorbia tirucalli, Euphorbia pekinensis, Lotus japonicus, Oryza sativa, and Taxus cuspidata are plants in which a single SQS gene exists. Two paralogs exist in Arabidopsis thaliana, Glycyrrhiza glabra, Glycine max, Malus domestica, Nicotiana tabacum, Salvia miltiorrhiza Bunge, and Withania somnifera. Two SQSs, SQS1 and SQS2, are reported in A. thaliana. SQS1 was found to be broadly expressed in all tissues involved in plant development, whereas SQS2 was strongly expressed in the hypocotyl of seedlings as well as in the vascular tissue of the cotyledon and leaf petiole. Recombinant SQS2 did not synthesize squalene from FPP even in the presence of NADPH and Mg2+ or Mn2+, whereas SQS1, under the same conditions and with an equivalent preparation, was able to generate squalene; hence, SQS1 can be regarded as the functional SQS in Arabidopsis thaliana. Three SQS paralogs exist in Panax ginseng [32,33]. The three SQS genes found in P. ginseng, SS1, SS2, and SS3, were capable of converting the yeast erg9 mutant to ergosterol prototrophy despite their sequence divergence from yeast. Similarly, in Glycine max, which possesses two SQSs, GmSQS1 and GmSQS2 were capable of converting the sterol-auxotrophic yeast erg9 mutant to sterol prototrophy. Sterol levels were also found to be raised in Arabidopsis seed owing to overexpression of Glycine max GmSQS1. Similar observations were made for W. somnifera, which possesses two SQSs, WsSQS1 and WsSQS2, for which cDNA analysis was performed and preliminary enzyme activity as well as recombinant expression were reported [16,33]. It is also noted that the accumulation of phytosterol and triterpenoid compounds in Bupleurum falcatum, Eleutherococcus senticosus, Panax ginseng, Solanum chacoense, and Withania somnifera was elevated by overexpression of SQS genes [16].

Conclusion
G. sylvestre is one of the most important medicinal plants producing triterpenes. A CDS of 1245 bp encoding the GS-SQS gene was predicted from transcriptome data, and its reliability was assessed through the quality scores of a protein structure modeled with SWISS-MODEL. Long-range PCR with primers designed from the flanking regions of this CDS successfully amplified a 6.2-kb product from genomic DNA as template. NGS of the purified PCR product generated a total of 2.32 Gb of data and 3347 scaffolds with an N50 value of 457 bp. The alignment of scaffold_3347 representing the GS-SQS gene revealed that the gene harbors two introns of 101 and 164 bp. Primers designed from the regions adjoining the introns on the scaffold gave amplification when the template was genomic DNA and failed when the template was cDNA, which confirms the presence of these two introns in the GS-SQS gene of Gymnema sylvestre R. Br.