Comparative analysis of Rosetta stone events in Klebsiella pneumoniae and Streptococcus pneumoniae for drug target identification

Drug target identification is a fast-growing field of research in many human diseases. Many strategies have been devised in the post-genomic era to identify new drug targets for infectious diseases. Analysis of protein sequences from different organisms often reveals cases of exon/ORF shuffling in a genome. This shuffling results in the fusion of proteins/domains, either in the same genome or in that of another organism; the fused sequences are termed Rosetta stone sequences. They help link disparate proteins together, describing local and global relationships among proteomes. The functional role of a protein is determined mainly by its domain-domain interactions, which lead to the corresponding signaling mechanism. Putative proteins can be identified as drug targets by re-annotating their functional role through domain-based strategies. This study used a bioinformatics approach to identify putative proteins that are ideal drug targets for pneumonia infection by re-annotating the proteins through position-specific iterated searches (PSI-BLAST). The putative proteomes of two pneumonia-causing pathogens were analyzed to identify protein domain abundance and versatility. Domains common to both pathogens were identified, and putative proteins containing these domains were re-annotated. Among many druggable protein targets, the re-annotation of EJJ83173 (which contains the GFO_IDH_MocA domain) showed that its probable function is glucose-fructose oxidoreduction. This protein was found to have sufficient interactor proteins and a homolog in both pathogens but no homolog in the host (human), indicating that it is an ideal drug target. 3D modeling of the protein showed promising model parameters. The model was used for virtual screening, which revealed several ligands with predicted inhibitory activity. These ligands included molecules documented in traditional Chinese medicine and currently marketed drugs.
This strategy of drug target identification through domain-based re-annotation of putative proteins offers a prospect for validating the proposed drug target and establishing its utility as a common protein target in both pneumonia-causing species studied here.


Background
A protein domain is a well-defined region within a protein that performs a specific function. Thus, the duplication of a protein domain may enhance the function of the protein. The fact that numerous proteins contain duplicated domains indicates that many present-day proteins have evolved from simpler proteins mainly through domain duplication. Recombination, as a cause of domain duplication and domain shuffling, is arguably the most important force driving protein evolution, culminating in the complex proteome. Gene duplications and duplications of domain-coding exons have increased the abundance of domains in the proteome, while domain shuffling increases versatility, which is the number of distinct contexts in which a domain can occur [38].
If two polypeptides in one organism have homologs that are expressed as a single polypeptide in another organism, the two polypeptides are likely to interact; the fused polypeptide is called a Rosetta stone protein. Such events help link different proteins together, leading to functional interactions between the linked proteins, which may be the reason for local and global relationships within the proteome. These relationships help us understand the role of proteins within the context of their associations and facilitate the assignment of functions to uncharacterized proteins based on their linkages with proteins of known function. Every genome project aims to annotate the proteins encoded by the genome under investigation. However, once a genome sequencing project is completed and released into the public domain, researchers take a second look at the original annotation of proteins and curate them using various annotation methods. This is referred to as "re-annotation" [2]. The re-annotation of putative proteins has been attempted previously and has resulted in the identification of novel drug targets for many infectious diseases [26].
Pneumonia is caused by many classes of microbes, and each microbial manifestation of the disease is attributed to some specific protein interaction. Streptococcus species and Klebsiella species are known to cause the highest proportion of pneumonia cases. Pneumonia is an inflammatory condition of the lung that primarily affects the alveoli. It affects approximately 450 million people globally (7% of the population) and results in about 4 million deaths per year [20,27]. In this study, we considered two such species whose patterns of pathogenesis may vary but whose similar protein repertoires can shed light on a specific drug target usable in both species. We considered the strains KPNIH11 and SPD39 for our study. Streptococcus pneumoniae is isolated in nearly 50% of cases of community-acquired pneumonia (CAP) [1,30]. Klebsiella pneumoniae accounts for hospital-acquired pneumonia infections [12].
Putative proteins are conceptually translated amino acid sequences from open reading frames (ORFs) with no known protein/peptide evidence. Only putative protein data were considered in this study, since putative proteins may be expressed in natural biological systems and may therefore serve as novel potential drug targets. Domains are well-known functional modules of proteins, making them ideal candidates for studying protein-specific functions rather than targeting the entire protein.
KPNIH11 and SPD39 had, respectively, 18.64% (1006 out of 5397 proteins) and 5.9% (256 out of 4366 proteins) putative proteins in their total proteomes. The structure and function of putative proteins are often poorly understood because there is no known evidence of their translational expression. A possible reason for the larger number of putative proteins in Klebsiella pneumoniae compared to Streptococcus pneumoniae is that the latter has been studied more extensively owing to its well-known pathogenesis. Putative proteins are re-annotated to determine their possible function.
Drug target identification for an infectious disease like pneumonia has been attempted in many previous studies using comparative genomics, metabolic network modeling and simulation, multi-omics approaches, or subtractive genomics, and these have identified a significant number of potential drug targets with considerable success. The present study focuses on comparing Rosetta stone events, followed by domain-based re-annotation of the unannotated putative protein population, in the two most common pneumonia-causing pathogens, culminating in potential drug target identification.

Data collection and protein domain repertoire analysis
The protein sequences of Klebsiella pneumoniae subsp. pneumoniae KPNIH11 (KPNIH11) and Streptococcus pneumoniae strain D39 (SPD39) were retrieved from the NCBI protein database (https://www.ncbi.nlm.nih.gov/protein/). Because re-annotation of putative protein datasets is the focus of this work, as outlined in the Background, only putative proteins were selected from the total proteins. The domain repertoire was cataloged using the National Center for Biotechnology Information (NCBI) Conserved Domain Database (CDD) Batch search, and domain architecture was cross-verified using the SMART database [19]. The domain lists were uploaded into the Venny graphical tool version 2.1.0 (https://bioinfogp.cnb.csic.es/tools/venny/index.html), which gave the number and names of the domains shared between the two species of bacteria. Putative proteins containing the shared domains were separated and re-annotated using Position-Specific Iterated BLAST (PSI-BLAST), and annotated proteins were searched for homology against the human genome using Domain Enhanced Lookup Time Accelerated BLAST (DELTA-BLAST) according to Telkar et al. [36].
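The shared-domain step amounts to a set intersection of the two domain lists. The following is a minimal sketch of that operation, not the Venny tool itself; the function name `shared_domains` and the abbreviated domain lists are illustrative assumptions and do not reproduce the study's data.

```python
def shared_domains(domains_a, domains_b):
    """Return the sorted names of domains present in both collections."""
    return sorted(set(domains_a) & set(domains_b))


# Hypothetical, abbreviated domain lists for illustration only.
kp_domains = ["GFO_IDH_MocA", "MFS", "HTH_XRE", "rve", "DeoRC"]
sp_domains = ["GFO_IDH_MocA", "HTH_XRE", "Transketolase_C", "rve"]

common = shared_domains(kp_domains, sp_domains)
print(len(common), common)  # 3 ['GFO_IDH_MocA', 'HTH_XRE', 'rve']
```

Duplicated names within one list are collapsed by the conversion to a set, so the result counts each shared domain once regardless of how often it occurs.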

Druggability prediction, protein modeling, and virtual screening
Proteins from KPNIH11 and SPD39 that did not show any homology to human proteins were evaluated for druggability scores using the EMBL-EBI DrugEBIlity tool [24]. Druggable proteins were modeled using the SWISS-MODEL server [17]. Model quality was checked with a Ramachandran plot generated using the PROCHECK server [18]. The models thus generated were then energy minimized according to Pawan et al. [14] before further analysis. The structures of inositol 2-dehydrogenase orthologs from both genomes were superimposed using the HOMSTRAD database, according to Khazanov et al. [16], and showed very little deviation and a high degree of structural alignment (Fig. 5). The superimposed structure was visualized using the UCSF Chimera software [25].
The active pocket of each modeled protein was determined using the CASTp server [37], which gives a list of amino acids lining the possible active pocket. For virtual screening, the supercomputer facility of the TACC server was used; by default, it screens the ZINC database [15] and the TCM database [5] against the given protein structure and reports possible drugs interacting with the active site of the modeled protein [9]. The reported molecules were assessed for ADMET properties using the DataWarrior software [29].

Domain composition analysis
We found 22 domains common to the two pathogens (Table 1). During evolution, different organisms tend to gain or lose genome and proteome homology by exchanging DNA sequences through processes such as duplication and recombination. These changes can arise from adaptations in response to environmental changes or the immune response of the host. As a result of their rapid doubling time and large population sizes, bacteria can evolve rapidly. The domains shared between the KPNIH11 and SPD39 proteomes are evidence that these organisms have undergone domain sharing and shuffling during their evolution. These common domains were contained in 123 proteins in KPNIH11 and 34 proteins in SPD39. Supplementary Tables 1 and 2 show the accession number, name of the common domain, and their tethering pattern in putative proteins of the two species. The results show that some domains, such as GFO_IDH_MocA, DeoRC, HATPase_c, rve, and Transketolase_C, tend to tether with the same domain each time they form a protein, which is attributable to a specific function. For example, GFO_IDH_MocA domains tether with the GFO_IDH_MocA_C domain, whose function is attributed to utilizing nicotinamide adenine dinucleotide phosphate (NADP) or nicotinamide adenine dinucleotide (NAD) for glucose-fructose redox reactions [34]. Using such tethering data, each domain can be assigned versatility and abundance scores. Versatility is the number of different tethering combinations a domain can form, and abundance is the number of times a domain is identified in a set of proteins. In the current study, we considered all the putative proteins that contain the common domains in the two organisms. The domain versatility and abundance of these common domains in the putative proteins are shown in Table 1. Furthermore, when proteins containing shared domains were re-annotated using PSI-BLAST, 27 proteins in KPNIH11 and 11 proteins in SPD39 were annotated (Supplementary tables 3 and 4).
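The two scores defined above can be computed directly from a table of domain architectures. The sketch below is one possible reading of those definitions, not the procedure used to build Table 1: it assumes that a "tethering combination" is the set of domains found together in one protein, and the protein identifiers and architectures are hypothetical.

```python
from collections import Counter, defaultdict


def abundance_and_versatility(architectures):
    """Compute per-domain abundance and versatility.

    architectures maps a protein identifier to the list of domains it contains.
    Abundance counts every occurrence of a domain across the protein set.
    Versatility counts the distinct domain combinations a domain appears in
    (an assumed interpretation of "tethering combination").
    """
    abundance = Counter()
    contexts = defaultdict(set)
    for domains in architectures.values():
        abundance.update(domains)
        combination = frozenset(domains)
        for domain in combination:
            contexts[domain].add(combination)
    versatility = {domain: len(combos) for domain, combos in contexts.items()}
    return dict(abundance), versatility


# Hypothetical architectures for illustration only.
toy_architectures = {
    "protein_1": ["GFO_IDH_MocA", "GFO_IDH_MocA_C"],
    "protein_2": ["GFO_IDH_MocA", "GFO_IDH_MocA_C"],
    "protein_3": ["HTH_XRE"],
    "protein_4": ["HTH_XRE", "rve"],
}

abundance, versatility = abundance_and_versatility(toy_architectures)
print(abundance["GFO_IDH_MocA"], versatility["GFO_IDH_MocA"])  # 2 1
print(abundance["HTH_XRE"], versatility["HTH_XRE"])            # 2 2
```

In this toy set, GFO_IDH_MocA always pairs with the same partner and so has a versatility of 1 despite an abundance of 2, which mirrors the observation that some domains tether with the same domain each time.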
Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) uses protein-protein BLAST to derive a position-specific scoring matrix (PSSM) from the significant hits of an initial search. This matrix is then used to search the database for new matches, and the process can be iterated. Thus, PSI-BLAST provides a means of detecting distant relationships between proteins. Here, domain-based re-annotation with PSI-BLAST annotated a set of proteins in the putative protein data set.
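To make the idea of a PSSM concrete, the following is a deliberately simplified sketch that builds log-odds scores from a small gapless alignment. It is not the PSI-BLAST algorithm: it assumes a uniform background frequency and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and substitution-matrix-based pseudocounts. The function names and the toy alignment are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length, gapless aligned sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for pos in range(len(aligned_seqs[0])):
        column = [seq[pos] for seq in aligned_seqs]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of a sequence of matching length."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Hypothetical toy alignment for illustration only.
alignment = ["GAVG", "GAIG", "GSVG"]
pssm = build_pssm(alignment)
print(score_sequence(pssm, "GAVG") > score_sequence(pssm, "WWWW"))  # True
```

Residues that are conserved in a column receive positive scores and unobserved residues receive negative scores, which is why a profile can detect distant relatives that a single query sequence would miss.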

Identification of druggable proteins
DELTA-BLAST was used to find homology between bacterial proteins and the human proteome. The results showed that none of the 27 annotated proteins in KPNIH11 had homology with the human proteome. However, two of the 11 annotated SPD39 proteins (ABJ54307 and ABJ55037) had homology with human proteins, leaving nine SPD39 proteins available for further analysis. For a protein to be used as a drug target, it must be checked for druggability. This was carried out using the EMBL-EBI DrugEBIlity database. All short-listed proteins were found to be druggable. The results indicated that all 27 annotated proteins in KPNIH11 and nine proteins in SPD39 could be considered as drug targets.

Comparative modeling
All 36 druggable proteins were subjected to 3D modeling using the SWISS-MODEL server [13]; only two proteins in each organism could be modeled, resulting in four structures available for further analysis. Of the models generated, EJJ83173 and EJJ80284 were complete protein models, and the others were truncated (Fig. 1). The CASTp server was used to determine the active pockets of the protein models. Binding sites and active sites of proteins are often associated with structural pockets and cavities (Fig. 2). This analysis identified and measured accessible surface pockets and inaccessible interior cavities of the proteins.

Protein interaction analysis for the protein EJJ83173
The protein-protein interactions of the modeled protein were studied using the STRING database [33] (Fig. 3). Putative protein EJJ83173 showed interaction with eight different proteins in a biclique architecture where all proteins interact. From the interaction data, it is evident that targeting this protein may affect inositol metabolism in the microbes under study. Note that this annotation is based on the domain composition of the protein; hence, targeting the protein effectively means targeting the GFO_IDH_MocA (oxidoreductase family, NAD-binding Rossmann fold) domain, which is common to both organisms.

Model evaluation of EJJ83173
The 3D model generated by the SWISS-MODEL server showed a QMEAN value of −2.89 (QMEAN is a scoring function that estimates global and local model quality; higher numbers indicate higher reliability of the residues), 57.14% sequence identity, and 0.47 sequence similarity. The modeling was done using PDB ID: 3NT5 as a template (Fig. 1). The Ramachandran plot constructed for the EJJ83173 protein (Fig. 4) using PROCHECK in the PDBsum database gave the following result: 1094 residues (92.3%) are found in the most favored region, 84 residues (7.1%) in the additionally allowed region, and only eight residues (0.7%) in the disallowed region, which is negligible. Having over 90% of residues in the most favored region indicates a good-quality protein model.

Structure superimposition study
The structures of inositol 2-dehydrogenase orthologs from both genomes were superimposed using the HOMSTRAD database, according to Khazanov et al. [16], and showed very little deviation and a high degree of structural alignment (Fig. 5). The superimposed structure was rendered in the UCSF Chimera visualizer [25]. Figure 5 depicts the inositol 2-dehydrogenase from KPNIH11 (red) and SPD39 (green). The orthologs have an RMSD of 2.837 Å on Cα atoms.
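For reference, the RMSD reported above is the square root of the mean squared distance between paired Cα atoms after superposition. The sketch below computes that quantity for coordinates that are assumed to be already superimposed and paired; it does not perform the superposition itself, and the coordinates shown are hypothetical rather than taken from the models.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of paired (x, y, z) coordinates.

    Assumes the structures are already superimposed and that the i-th atom
    in coords_a corresponds to the i-th atom in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical paired C-alpha coordinates (in angstroms) for illustration only.
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model_a, model_b))  # 1.0
```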

Virtual screening
The 3D model for EJJ83173 was subjected to pocket prediction (section 3.3), which identified a probable active pocket that was subsequently used for molecular docking studies to find potential inhibitors. For virtual screening, a grid box (Fig. 6) was set (with size x-10, y-18, z-12 and grid center x-11.739, y-30.608, z-18.306) using AutoDockTools 1.5.6, and docking was performed against natural ligands available through the drugdiscovery@TACC server.
Virtual screening was carried out against two ligand databases (the ZINC database of commercially available drugs and the TCM database of Traditional Chinese Medicine) to check for specific interaction with protein EJJ83173. These two databases were set as default ligand libraries in the server. The results obtained for the ZINC and TCM databases report the binding energy, drug-likeness, tumorigenic property, mutagenic property, reproductive effect, and irritation properties of the ligands (Supplementary tables 5 and 6). The most negative binding energy was shown by the ligand with TCM ID: 45055 (−9.8) in the TCM database and by the ligand with ZINC ID: ZINC68563949 (−9.5) in the ZINC database (Fig. 7). The top 15 results from each of the TCM and ZINC databases are tabulated; these ligands are candidate inhibitors of the inositol dehydrogenase enzyme.
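Ranking of this kind can be sketched as sorting ligands by binding energy (most negative first) after removing those with predicted toxicity flags. The code below is an illustrative assumption about how such a shortlist could be built, not the server's or DataWarrior's procedure; the `Ligand` class, the `shortlist` function, and all ligand records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ligand:
    ligand_id: str
    binding_energy: float  # more negative means stronger predicted binding
    mutagenic: bool = False
    tumorigenic: bool = False
    irritant: bool = False
    reproductive_effect: bool = False


def shortlist(ligands, top_n=15):
    """Drop ligands with any predicted toxicity flag, then rank by binding energy."""
    clean = [
        lig for lig in ligands
        if not (lig.mutagenic or lig.tumorigenic or lig.irritant or lig.reproductive_effect)
    ]
    return sorted(clean, key=lambda lig: lig.binding_energy)[:top_n]


# Hypothetical ligand records for illustration only.
ligands = [
    Ligand("LIG_A", -9.8),
    Ligand("LIG_B", -9.5, mutagenic=True),
    Ligand("LIG_C", -8.1),
    Ligand("LIG_D", -9.1),
]

best = shortlist(ligands, top_n=2)
print([lig.ligand_id for lig in best])  # ['LIG_A', 'LIG_D']
```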

Discussion
Many reports argue that the complexity of an organism is only loosely linked to its number of genes, citing observations that flies have fewer genes than nematodes and humans have fewer than rice; attention has therefore centered on the complexity of proteins and their architecture. Even the simplest bacterial genome is the product of extensive gene duplication and recombination [7]. Hence, the increase in protein repertoire is due to (i) duplication of domain-coding sequences; (ii) divergence and modification of duplicated genes through mutations, deletions, and insertions; and (iii) gene recombination. These mechanisms are believed to be the origin of the diverse proteome [21]. Several protein-protein interactions are facilitated by autonomously folding modular domains. Proteome-wide efforts to model protein-protein interaction or "interactome" networks have largely ignored this modular organization of proteins [3]. Protein-protein interactions are, in turn, domain-domain interactions. Hence, the complexity of domain architecture may also increase the complexity of protein interaction. This study was undertaken to identify the domains and perform domain-based re-annotation of the putative protein datasets of the two pneumonia-causing bacteria K. pneumoniae and S. pneumoniae. Since putative proteins are annotated based on similarity but not experimentally validated, re-annotation may help find previously unreported drug targets [2, 4, 31]. The putative protein sequence datasets from both organisms were scanned for their domain composition using the CDD database. Twenty-two domains were found to be common to the two putative protein datasets. The study then focused explicitly on the putative proteins containing these common domains.
Focusing on common domains helps discover orthologs, which may have structural homology; this is advantageous if they are druggable proteins because, hypothetically, one ligand may inhibit both orthologs [8].
The protein domain repertoire shows different levels of abundance and versatility for each of the proteomes. It appears that K. pneumoniae has more versatility than S. pneumoniae. The MFS domain is abundant in K. pneumoniae [34]. Reserved domains contribute to abundance but not to versatility, and hence the biochemical versatility of the proteome may diminish. Our study re-annotated the proteins using PSI-BLAST, with subsequent verification by reciprocal BLAST against the related organism. Among the re-annotated proteins, S. pneumoniae comprised more functionally different proteins than K. pneumoniae. In the present work, protein domain repertoire analysis, leading to the identification of Rosetta stone events, helped identify the more abundant and more versatile protein domains. Only the proteins with these selected domains were used further for the identification of potential drug targets through DELTA-BLAST and DrugEBIlity analysis. For infectious diseases, a potential drug target is a protein that has no homology with the host genome [23]. Genomes have long been investigated for non-homologous proteins in the search for potential drug targets [10]. DELTA-BLAST searches a database of pre-constructed position-specific scoring matrices before searching a protein sequence database, to yield better homology detection. For its position-specific scoring matrices (PSSMs), DELTA-BLAST employs a subset of NCBI's CDD. This implies that DELTA-BLAST performance depends directly on the CDD, which contains information on conserved domains. Hence, DELTA-BLAST is well suited to domain-based homology searches. The results showed that 27 proteins in KPNIH11 and nine proteins in SPD39 had no homology with the human proteome.
In principle, all these proteins could be used as drug targets because they are specific to the microbes. Theoretically, any drug administered against these proteins should not interact with human proteins to alter human physiology.
The main aim of understanding common domain fusions in two different pathogens is ultimately to identify an ideal druggable target common to both organisms. To this end, the EMBL-EBI DrugEBIlity tool was used to assess whether the proteins could serve as druggable targets. This database predicts the druggability of a given protein by comparing it to existing protein models through a BLAST search [39]. The result suggested that 27 proteins in KPNIH11 and nine proteins in SPD39 can be used as drug targets based on their druggability and tractability scores. The DrugEBIlity server has been used in many previous studies and has proven to be a reliable method for evaluating the druggability of proteins [6,11].
For any drug design and development effort, a 3D structure of the target protein is essential. If no structure is available, the protein should at least be amenable to 3D modeling with a proper template. Most druggable proteins did not find templates when a sequence search was performed against the PDB database, except for two proteins: EJJ83173 (inositol 2-dehydrogenase with the GFO_IDH_MocA domain) and EJJ80284 (antitoxin with the HTH_XRE domain). Furthermore, the protein should be a hub in the protein interaction network, with several interacting proteins. The higher its degree, the more crucial the protein, since knocking it down is expected to disrupt the protein interaction network around it. The putative protein EJJ83173 has an advantage over EJJ80284 for three reasons. First, EJJ83173 was found to have more interacting proteins than EJJ80284. Second, EJJ83173 has greater biological relevance for the survival of the organism because it is part of carbohydrate catabolism, whereas EJJ80284 is predicted to be an antitoxin molecule for which no significant relevance to survival was found. Finally, no suitable active pockets were predicted for EJJ80284.
The putative protein EJJ83173, identified here as inositol 2-dehydrogenase, has been described as an attractive drug target in the previous reports cited in this manuscript. Furthermore, this enzyme is reported to be an integral part of myo-inositol catabolism; myo-inositol can serve as the sole carbon source for many bacteria, including Legionella pneumophila, Bacillus subtilis, Lactobacillus casei, Salmonella enterica, and Sinorhizobium meliloti [22]. Therefore, this protein may serve as a good drug target, since inhibiting it may disrupt carbon utilization by the bacteria under study. EJJ83173 can serve as a promising drug target because it has a confirmed ortholog in S. pneumoniae, it is predicted to be a druggable protein, it has no homologous domain/protein in the human proteome as indicated by DELTA-BLAST, and it has a higher degree of interacting proteins. The model generated for EJJ83173 using SWISS-MODEL has 92.3% of residues in the most favored region; hence, it is considered a good model. Furthermore, the structure superimposition of the inositol 2-dehydrogenase orthologs shows a root mean squared deviation (RMSD) of only 2.8 Å, suggesting topological and geometrical similarity between the orthologs. Hence, it can be hypothesized that a ligand that binds to EJJ83173 can also bind to its counterpart in S. pneumoniae.
Many servers are available for ligand screening. Among them, we chose the drugdiscovery@TACC server, which we consider one of the most robust. It is a web resource that provides controlled access to molecular docking software running on the Lonestar 5 supercomputer at TACC. The server is linked to other databases containing sets of natural and synthetic ligands. In our study, we used two such datasets, namely, the ZINC database and the TCM database. The ZINC database is a curated collection of commercially available chemical compounds created with virtual screening as its primary objective. The TCM (Traditional Chinese Medicine) database contains traditionally used medicines and their three-dimensional structure data ready for virtual screening. This server has previously been used for virtual screening and discovery of novel inhibitors [35]. The results show that a good number of candidate inhibitors are available for the protein. The top ligand from the ZINC database showed a good number of physical interactions along with a binding energy of −9.5 and no predicted side effects. Ligands that show similarly strong predicted affinity and no predicted toxicity can be taken further in the development process.
It is reported that the Gfo_Idh_MocA protein family contains many different proteins, which almost exclusively consist of NAD(P)-dependent oxidoreductases that have a diverse set of substrates, typically pyranoses. The members of this protein family have a two-domain structure consisting of an N-terminal nucleotide-binding domain and a C-terminal α/β-domain. The C-terminal domain contributes to substrate binding and catalysis and contains a βα-motif with a central α-helix carrying a common essential amino acid residue. The β-sheet of the α/β-domain contributes to oligomerization in most of these proteins [34]. Domain-based annotation of EJJ83173 (putative NADH-dependent dehydrogenase of KPNIH11) has not been reported previously. Our study has re-annotated this putative protein as an inositol 2-dehydrogenase. Taken together, this information suggests that protein EJJ83173 can serve as an attractive drug target. Inosine monophosphate dehydrogenase was targeted in previous studies in the case of tuberculosis [32]. This protein is an attractive drug target [28]. Therefore, the study supports the potential of the EJJ83173 protein as a probable drug target. The ligands obtained from virtual screening may be taken forward for further testing against both KPNIH11 and SPD39. This study provides a rich source for further experiments to elucidate the role of putative protein EJJ83173 as a drug target for pneumonia infection.

Conclusion
This study focused on analyzing the domain repertoire of two pneumonia-causing pathogens to identify Rosetta stone events. Putative proteins of these pathogens were selected in particular to analyze any missing links in protein domain shuffling. Analysis of domain tethering and domain sharing showed that the two species share many domains. The re-annotation of putative proteins using position-specific iterated searches provided several druggable protein candidates. However, an attempt to build 3D models of these re-annotated putative proteins left only one protein (EJJ83173) as the most likely drug target with good model parameters. Re-annotation of protein EJJ83173 (which contains the GFO_IDH_MocA domain) showed that its probable function is glucose-fructose oxidoreduction. This protein also has sufficient interactor proteins and a homolog in both pathogens but no homolog in the host (human), indicating that it is an ideal drug target. Through virtual screening, several traditional medicines and existing marketed drugs were predicted to interact effectively with the protein. This study provides a model for drug target identification through domain-based protein re-annotation. However, protein/peptide evidence for this protein target has to be obtained to validate these findings and to assess the usability of this protein as a reliable drug target.
Additional file 1:
Supplementary table 1. Proteins containing common domains and their domain repertoire. These proteins of SPD39 contain one or more of the common domains between the two microbes. Their tethering patterns with other domains account for the domain versatility, while the number of occurrences accounts for abundance.
Supplementary table 2. Proteins containing common domains and their domain repertoire. These proteins of KPNIH11 contain one or more of the common domains between the two microbes. Their tethering patterns with other domains account for the domain versatility, while the number of occurrences accounts for abundance.
Supplementary table 3. Protein annotation using PSI-BLAST. These proteins of SPD39 have been annotated using PSI-BLAST and checked for homology with the human proteome using DELTA-BLAST. All the proteins except ABJ54307 and ABJ55037 were found to be non-homologous to the human proteome, making them ideal drug targets.
Supplementary table 4. Protein annotation using PSI-BLAST. These proteins of KPNIH11 have been annotated using PSI-BLAST and checked for homology with the human proteome using DELTA-BLAST. All the proteins were found to be non-homologous to the human proteome, making them ideal drug targets.
Supplementary table 5. Virtual screening results of TCM database ligands. This list of ligands showed affinity towards the EJJ83173 protein during docking. These are Traditional Chinese Medicine compounds that can be used to interfere with the function of EJJ83173.
Supplementary table 6. Virtual screening results of ZINC database ligands. This list of ligands showed affinity towards the EJJ83173 protein during docking. These are commercially available drugs that can be used to interfere with the function of EJJ83173.