Protein Annotation of Breast-cancer-related Proteins with Machine-learning Tools

One of the primary contributors to the mortality of women is breast cancer. Several approaches are used to cure it, but recurrence occurs in 79% of the cases because the underlying mechanism of the protein molecules is not carefully examined. The goal of this research was to use machine-learning tools to elucidate conserved regions and to obtain functional annotations of breast-cancer-related proteins. The sequences of five breast-cancer-related proteins (BRCA2, BCAR1, BCAR3, BCAR4, and BRMS1) and their annotations were retrieved from the UniProt and TCGA databases, respectively. Conserved regions were extracted using CLUSTALX. We constructed a phylogenetic tree using MEGA 7.0 and used the SUPERFAMILY database to obtain fine-grained domain annotation. The tree revealed that the BRCA2 and BCAR4 protein sequences are located in the same clade, which indicates that they have overlapping functions. Several protein domains were identified, including the SH2 and Ras GEF domains in BCAR3, the SH3 domain in BCAR1, and the helical domain, nucleic-acid-binding protein domain, and tower domain in BRCA2. We found that no protein domains could be annotated for BCAR4 or BRMS1, which may indicate the presence of a disordered protein state. We suggest that each protein has distinct functionalities that are complementary in regulating the progression of breast cancer, although further study is necessary for confirmation. This protein-domain annotation project could be leveraged by the complete integration of mapping with respect to gene and disease ontology. This type of leverage is vital for obtaining biochemical insights regarding breast cancer.


Introduction
As reported by the WHO, breast cancer is considered to be one of the most dangerous diseases for women [1]. In the United States, breast cancer is reported to comprise 15.2% of all cancer cases and almost 7% of the mortality rate [2]. A survey by the Ministry of Health of Indonesia found that symptoms of breast cancer are detected in 2.6 of every 1000 breast lesion samples [3]. This disease has long been treated by conventional means, i.e., surgery, radiation, and chemotherapy, which are undertaken only after the cancer has entered a late stage and has metastasized [4].
However, because the biomolecular mechanism of breast cancer has been determined, the development of more modern diagnostic approaches and therapeutics is now feasible. The field of molecular medicine has employed the machine-learning approach to analyze the lesion images, metabolomics, genomics, medical informatics predictors, and structural bioinformatics of breast cancer, which significantly facilitates the development of highly accurate diagnostics [5]. In addition, machine-learning models have been employed to annotate medical informatics and pharmacogenomics data to diagnose and provide expert judgments regarding the progression of breast cancer [6,7]. Based on recent research, several mutations have been found in particular genes that could play some part in the progression of breast cancer. The most well-known genes are BRCA1 and BRCA2, as mutations of those genes have been identified in breast-cancer cell-line samples [5,6]. Moreover, a transcriptomics biomarker is being developed to detect the progression of triple-negative breast cancer (TNBC) by leveraging its non-coding (nc)RNA signatures [9].
The development of transcriptomics-based diagnostics and therapy is recognized as challenging, especially with respect to clinical trials. In this regard, studying the proteomics expression of breast cancer is expected to provide significant relevant information. Several proteomics-based breast-cancer drugs have been approved by the FDA, including Anastrozole®, which targets hormone-receptor-positive (HR+) breast cancer, Trastuzumab® for HER2-positive (HER2+) breast cancer, and Fulvestrant (Faslodex®) for HR+ and HER2-negative breast cancers with no hormone therapy [8][9][10]. An exhaustive list is available online in the FDA database [13]. As noted above, combining molecular medicine with machine-learning methods provides an interesting approach for obtaining information about the -omics repertoire of breast cancer. In this regard, a shortcoming of the machine-learning method is the possibility of introducing bias into the data annotations, which could introduce redundancy in reports [14]. However, the advantages of the machine-learning method for making gene or protein annotations regarding breast cancer far outweigh its shortcomings, mainly its ability to predict a fine-grained molecular mechanism, its extensibility to broad-range -omics studies, and the provision of data annotations for breast cancer survivors [15][16][17].
June 2020  Vol. 24  No. 2
With respect to molecular proteomics, the protein domain is widely recognized as an independent molecular evolutionary unit that plays an important role in cancer progression. An approach to structural bioinformatics research has been devised that simulates the molecular mutations in the TP53 and ER proteins directly related to breast cancer [14,15]. The role of disordered proteins and their phylogeny are also computed. Moreover, to obtain complete population-size samples of cancer patients, a dedicated database has been developed, namely The Cancer Genome Atlas (TCGA) [20]. The availability of this dedicated database and the growing interest in proteomics research have enabled the growth of machine-learning-based tools for annotating the occurrence of protein domains that have a role in the progression of breast cancer [19][20]. This method employs all the latest protein annotation or proteomics tools. Notable proteomics databases such as SUPERFAMILY, SCOP, and PFAM incorporate the hidden Markov model method for predicting previously unidentified protein domains and folds [21][22][23]. Moreover, the STRING database employs a graph algorithm with a statistical probability model for devising a repertoire of nodes and edges [26]. In this respect, the SUPERFAMILY protein domain database was developed to provide annotations regarding the progression of disease [27]. Based on the SCOP classification that incorporates the machine-learning approach of the SUPERFAMILY database, protein domains are classified into families, superfamilies, and folds. Protein families comprise domains with similar sequences but different features, superfamilies comprise domains with common ancestors, and folds comprise domains with similar structures [26,27]. The SCOP classification is the industry standard for protein domain annotations.
Several software packages and pipelines have been developed for making extensive domain annotations, and these have attracted the interest of the scientific community [30,31]. This effort has been strengthened by the more fine-grained annotation of protein-protein interaction networks of cancer-related proteins [30][31][32]. To this end, these efforts should be solidified as a basis for developing a blueprint for constraining the menace of breast cancer and developing drug designs for breast cancer [34,35]. The development of proteomics information pipelines will facilitate the examination of the protein domain repertoire in the progression of breast cancer. The objective of this research is to use machine-learning tools to annotate the protein domains responsible for the progression of breast cancer. These annotations are expected to provide fine-grained information for use in constructing a blueprint for the development of drugs based on a complete analysis of proteomics data regarding functional and conserved regions, with annotations for TCGA-database-based proteins that represent a significant risk of breast cancer as the drug target. Protein sequences were examined rather than genes because of the availability of 3D structural data that could shed light on functional annotations.

Materials and Methods
Our choice of research methodology was inspired by existing pipelines that have undergone significant modifications with respect to existing indicators and parameters [37,38]. This research was conducted using a standard MacBook Pro laptop with MacOSX version 10.13.6, 512 GB of HDD, and 16 GB of RAM. The employment of a Mac-based laptop was crucial for leveraging a graphics subsystem with proven scientific computation performance [39][40][41][42]. To navigate and search for associations between genes and breast cancer, the phrase 'breast cancer' was used to search for the appropriate genes in the TCGA database (https://portal.gdc.cancer.gov/). After identifying the names of the genes, the associated protein sequences were downloaded from the UNIPROT database (https://www.uniprot.org/). All available sequence retrieval procedures were utilized to obtain a consensus regarding the sequences. The evolutionary history was inferred using the maximum likelihood (ML) method based on the Jones-Taylor-Thornton (JTT) matrix-based model in the MEGA7 package, along with its default parameters. The ML method was chosen for its ability to iterate over many different evolutionary models to improve its reliability as a general statistical model. Moreover, the ML method has been proven to be the most accurate phylogenetic method for estimating branch length and other parameters [64,65]. The tree with the highest log likelihood (1709.1737) was also identified. The initial tree(s) for the heuristic search were obtained automatically by applying the Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the JTT model and then selecting the topology with the highest log-likelihood value. The tree is drawn to scale, with branch lengths measured as the number of substitutions per site. The analysis involved five amino-acid sequences. Positions with less than 95% site coverage were eliminated.
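The site-coverage rule stated above can be illustrated with a minimal Python sketch. It is not the MEGA7 implementation. It assumes that a position counts as covered when the residue is neither a gap ('-') nor a missing-data symbol ('?'), and the toy alignment below is made up purely for illustration (the strings are not the real protein sequences). With five sequences, a 95% cutoff means that a column survives only if all five sequences have a residue there.

```python
from typing import Dict

# Assumption: these characters mark a gap or missing data in the alignment.
GAP_CHARS = {"-", "?"}


def filter_by_site_coverage(alignment: Dict[str, str],
                            min_coverage: float = 0.95) -> Dict[str, str]:
    """Drop alignment columns whose fraction of non-gap residues is below min_coverage."""
    sequences = list(alignment.values())
    if not sequences:
        return {}
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    kept_columns = []
    for col in range(length):
        covered = sum(1 for seq in sequences if seq[col] not in GAP_CHARS)
        if covered / len(sequences) >= min_coverage:
            kept_columns.append(col)

    return {name: "".join(seq[col] for col in kept_columns)
            for name, seq in alignment.items()}


# Hypothetical toy alignment; only the labels are taken from the text.
toy_alignment = {
    "BRCA2": "MPIG-KERT",
    "BCAR1": "MNHLNVLAK",
    "BCAR3": "MAAG-FRWL",
    "BCAR4": "M--LLVLQS",
    "BRMS1": "MPVQPPSKD",
}

filtered = filter_by_site_coverage(toy_alignment)
for name, seq in filtered.items():
    print(name, seq)
```

In this toy case, three of the nine columns contain at least one gap and are removed, leaving six positions per sequence.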
After the construction of the tree, SUPERFAMILY and STRING database searches were performed to annotate the protein domains and their respective interactions. For the SUPERFAMILY database search (http://supfam.org/), the HMMSCAN E-value threshold for significant hits was 0.03, whereas the E-value threshold for reported hits was 1; both are the default values. Then, as part of the SUPERFAMILY database project, annotations of disordered proteins were searched using the D2P2 database (http://d2p2.pro/), which only provides hits with 100% identity. The minimum required interaction score for the STRING database (https://string-db.org/) was set to medium confidence (0.400, the default parameter) [41][42][43][44][45][46][47][48][49]. The downloaded Protein Data Bank (PDB) files linked from the PFAM database (http://pfam.xfam.org/) were visualized using Chimera version 1.13.1 [54].
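To make the role of these thresholds concrete, the following Python sketch applies the two E-value cutoffs and the STRING medium-confidence cutoff to a small list of records. It is only an illustration under stated assumptions: the function names are hypothetical, the comparisons are assumed to be inclusive (a hit at exactly the threshold passes), and the last interaction record is made up to show a rejected pair. The other four scores are the ones reported in the Results.

```python
from typing import List, Tuple

# Thresholds quoted in the Methods text.
SIGNIFICANT_EVALUE = 0.03        # HMMSCAN cutoff for significant hits
REPORTED_EVALUE = 1.0            # HMMSCAN cutoff for reported hits
STRING_MEDIUM_CONFIDENCE = 0.400  # STRING minimum required interaction score

Interaction = Tuple[str, str, float]


def classify_hit(evalue: float) -> str:
    """Label a domain hit by its E-value (assumed inclusive comparisons)."""
    if evalue <= SIGNIFICANT_EVALUE:
        return "significant"
    if evalue <= REPORTED_EVALUE:
        return "reported"
    return "discarded"


def filter_interactions(interactions: List[Interaction],
                        min_score: float = STRING_MEDIUM_CONFIDENCE) -> List[Interaction]:
    """Keep only protein pairs whose interaction score reaches min_score."""
    return [(a, b, score) for a, b, score in interactions if score >= min_score]


interactions = [
    ("BRCA2", "PALB2", 0.99),
    ("BCAR1", "PXN", 0.999),
    ("BCAR3", "BCAR1", 0.983),
    ("BRMS1", "ARID4A", 0.996),
    ("BCAR3", "HYPOTHETICAL_PARTNER", 0.150),  # made-up pair, below the cutoff
]

kept = filter_interactions(interactions)
for a, b, score in kept:
    print(f"{a} - {b}: {score}")
```

All four reported pairs lie far above the 0.400 cutoff, so the choice of the default medium-confidence setting does not affect which of them are retained.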

Results
The results are divided into six subsections, with their respective calculation times for each cycle given in parentheses, i.e., protein sequence retrieval (10 minutes), protein phylogeny (15 minutes), SUPERFAMILY domain annotations (5 minutes), PFAM 3D annotations (5 minutes), STRING annotations (5 minutes), and the disordered protein annotations (5 minutes), the last four of which were obtained using machine-learning tools.
Protein sequence retrieval. The search for entries regarding genes associated with breast cancer in the TCGA database identified five genes with supporting references that are associated with a significant risk of breast cancer and serve as drug targets, namely BRMS1, BRCA2, BCAR1, BCAR3, and BCAR4 [55][56][57][58]. As the TCGA mainly annotates genes with complete population data, this does not mean that other genes have no association with breast cancer. They are simply not annotated because there is insufficient population data for establishing a strong association with the progression of breast cancer. To retrieve the protein sequences, the gene names were simply entered into the search box of the UNIPROT database (Table 1). As all of the proteins are associated with the phrase "breast cancer" due to their importance in the progression of this disease and their extensive annotations in the TCGA database, it is interesting to examine whether or not these proteins have a common ancestor.
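The retrieval step described above was performed through the UNIPROT search box. As a hedged illustration of how the same lookup might be scripted, the Python sketch below builds a search URL and parses FASTA text. The endpoint and the query fields (gene_exact, organism_id) are assumptions based on the public UniProt REST interface and should be checked against its current documentation. The restriction to human entries (organism 9606) is also an assumption, and the function names are hypothetical. The network call is defined but not executed here.

```python
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: public UniProt REST search endpoint.
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"

# Gene names reported in the text.
GENES = ["BRMS1", "BRCA2", "BCAR1", "BCAR3", "BCAR4"]


def build_query_url(gene: str, organism_id: int = 9606) -> str:
    """Build a FASTA search URL for one gene name (human entries assumed)."""
    query = f"gene_exact:{gene} AND organism_id:{organism_id}"
    return f"{UNIPROT_SEARCH_URL}?{urlencode({'query': query, 'format': 'fasta'})}"


def parse_fasta(text: str) -> Dict[str, str]:
    """Map the first token of each FASTA header to its concatenated sequence."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def fetch_sequences(gene: str) -> Dict[str, str]:
    """Download and parse the FASTA records for one gene (requires network access)."""
    with urlopen(build_query_url(gene), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))


if __name__ == "__main__":
    for gene in GENES:
        print(gene, build_query_url(gene))
```

Keying the records by the first header token keeps the parser independent of the exact header layout, which varies between databases.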
Protein phylogeny. Based on the UNIPROT database, the respective protein function can be obtained by accessing its annotations. The BRMS1 protein functions as a transcriptional repressor that regulates antiapoptotic genes and inhibits the metastatic stage of cancer. The function of the BRCA2 protein is to regulate homologous recombination, avoid genomic instability, and maintain the integrity of DNA repair [59]. The function of the BCAR1 protein is to perform docking for cell adhesion and migration, as well as to mediate anti-estrogen resistance [60]. The function of the BCAR3 protein is to regulate the signaling pathway during the proliferation of breast cancer. Lastly, the BCAR4 protein functions as an oncoprotein that induces tamoxifen resistance in breast cancer. The aberration of these genes has a significant influence on the progression of breast cancer. To shed light on the clustering of these proteins, we computed a phylogenetic tree of the proteins using MEGA7, as shown in Figure 1. The results of our phylogeny analysis indicated that the BRCA2 protein was in the same cluster as the BCAR4 protein, although it was evident that the BCAR1 protein forms a distinct and unique cluster relatively unrelated to the other proteins. The phylogeny analysis revealed an interesting domain cluster consisting of BRCA2 and BCAR4, both of which are subjects of continuous drug development for breast cancer. It could be true that these proteins were clustered due to their extensive annotations in molecular research, although a common molecular evolutionary history may also play a part. It is also feasible that the development of diagnostics and therapeutic agents involving these two protein domains is aligned to some extent.

SUPERFAMILY domain annotation.
To annotate the distribution of protein domains, we utilized the SUPERFAMILY database. Figure 2 shows the domain distribution of the five annotated breast-cancer-related proteins. The BRCA2 protein was found to have two annotated domains (Figure 2a). In the figure, we can see that there are no annotated domains for either the BCAR4 or BRMS1 protein, which are shown as straight lines in the protein model (Figures 2d and 2e). The annotated domains have E-values close to zero (0). This means that the query results are very similar to the template in the database, which indicates that the designated protein domain has a high probability of existing. The descriptions in the SCOP database regarding the function of these protein domains are limited. Therefore, these outputs were extrapolated to the PFAM database, which provides more detailed descriptions. Moreover, the BCAR4 and BRMS1 proteins merit more attention to address the lack of domain annotations. To provide structural and functional annotations for a protein domain, 3D structural data in the protein domain must be accessed from both the PFAM and RCSB databases.

PFAM 3D annotation.
The PFAM database provides access to 3D-visualized annotations of the protein domain, whereas SUPERFAMILY with the SCOP database lacks this feature. The PDB file-based visualization was obtained from the RCSB website, which is directly linked with the PFAM database.
Homologous recombination repair, an important process for cancer avoidance, is the main role of the BRCA2 protein in humans. Both the under- and overexpression of the BRCA2 protein are found in sporadic breast-cancer cases. The main feature of the BRCA2 helical domain, which is the backbone of the BRCA2 protein, is its helical and beta-hairpin structure (Figure 3). The BCAR3 protein comprises two domains, SH2 and RAS GEF. The SH2 domain mainly regulates the signal transduction and expression of oncoproteins, with assistance from tyrosine kinases (Figure 4a). The RAS GEF domain is a molecular switch that catalyzes the exchange of guanosine diphosphate (GDP) for guanosine triphosphate (GTP), which ensures the balance of these biochemical species in the cell for the activation of the GTPase enzyme (Figure 4b).
The primary feature of the SH3 domain is the interaction between adaptor proteins and tyrosine kinases; the domain also acts as a mediator in the assembly of protein complexes (Figure 5).
In the PDB, the BRMS1 protein does not have a complete structure, with only fragments of the N-terminal region provided (http://www.rcsb.org/pdb/results/results.do?tabtoshow=Current&qrid=6E9CAA53). The BCAR4 protein also lacks a complete structure, or even any fragments, in the PDB. However, the UNIPROT database provides annotations of homologous information about these protein domains, with slightly more detail at the domain level. The next step was to identify protein-protein interactions to better understand the protein features.

STRING database annotations for protein-protein interaction.
Protein-protein interactions (PPIs) are annotated in the STRING database. Figure 6 shows the PPIs for the BRCA2, BCAR1, BCAR3, and BRMS1 proteins. The interaction intensity is expressed as a score with a maximum of 1, which indicates the most intense interaction. In Figure 6a, we can see that the protein that interacts with BRCA2 is PALB2, with an interaction score of 0.99. PALB2 is an auxiliary protein for BRCA2 that assists with homologous repair [61]. In Figure 6b, we can see that the protein that interacts with BCAR1 is PXN, with an interaction score of 0.999. The function of PXN is to assist with the membrane attachment of the cytoskeleton protein. Figure 6c shows that intense interaction has occurred between the BCAR3 and BCAR1 proteins, with an interaction score of 0.983. Figure 6d shows interaction between BRMS1 and ARID4A, with an interaction score of 0.996. The function of ARID4A is to support interaction with the retinoblastoma protein. There are no annotations for the BCAR4 protein in the STRING database. As such, we could not validate the occurrence of PPI between BCAR4 and BRCA2. However, interaction may occur, as in the phylogeny study they were found to share the same protein cluster. Subtle annotations could be the result of a disordered protein feature, which is described in the next section. We note that the proteins that interact with BCAR3, BRCA2, BCAR1, and BRMS1 may have potential for being leveraged for further study toward a rational drug design because the latest biomedical research indicates that they play important roles in the progression of breast cancer [57,62,63,64,65,66,67]. In future work, consideration should be given to whether genes are up- or downregulated in breast cancers.

Figure 6. Protein-protein Interactions

Disordered protein annotation.
An explanation of the disorder predictors is provided in the published protocol [68]. Moreover, the top of Figure 7 shows three available isoforms for the BRMS1 protein. BRMS1 is considered to be a disordered protein with no annotated SUPERFAMILY domain, although it shows hits in the PFAM domain, as indicated by the red block in Figure 7. There is at least a 75% agreement among different predictors, as shown by the green block with the label 'Predicted Disorder Agreement'. There are also five post-translational modifications of the protein domain, as indicated by the 'PTM sites' label. These results confirm that the BRMS1 protein is indeed disordered.
Although the structural disorders of the BCAR4 and BRMS1 proteins make their domain information unavailable, we note that the 3-D structures of the proteins are intact and can still be used to facilitate drug design. However, the missing domain information could hamper our understanding of the chemical bonds and our ability to obtain a more-detailed resolution of the 3-D structure to identify what detracts from the overall stability of the protein [69]. One possible solution to this problem would be to wait for the next update to the structural disorders in the SUPERFAMILY database.
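The 'Predicted Disorder Agreement' criterion mentioned for BRMS1 can be sketched in Python as a per-residue vote across predictors. This is a minimal illustration, not the D2P2 implementation. It assumes that each predictor output is a string of equal length in which 'D' marks a residue predicted as disordered and '-' marks an ordered residue. The predictor names and the toy predictions are hypothetical.

```python
from typing import Dict


def disorder_agreement(predictions: Dict[str, str],
                       min_agreement: float = 0.75) -> str:
    """Return a consensus track: 'D' where at least min_agreement of the
    predictors call the residue disordered, '-' elsewhere."""
    tracks = list(predictions.values())
    if not tracks:
        return ""
    length = len(tracks[0])
    if any(len(track) != length for track in tracks):
        raise ValueError("all predictor tracks must have the same length")

    consensus = []
    for pos in range(length):
        votes = sum(1 for track in tracks if track[pos] == "D")
        consensus.append("D" if votes / len(tracks) >= min_agreement else "-")
    return "".join(consensus)


# Hypothetical per-residue calls from four made-up predictors.
toy_predictions = {
    "predictor_A": "DDDD--DD",
    "predictor_B": "DDD---DD",
    "predictor_C": "DD-D--D-",
    "predictor_D": "D-----DD",
}

consensus = disorder_agreement(toy_predictions)
print(consensus)
```

With four predictors, the 75% cutoff requires at least three of them to agree, so positions supported by only two predictors are left unmarked in the consensus track.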

Discussion
Due to the incompleteness of the domain annotations, some possible strategies must be devised to address this problem. A gene prediction package could be utilized to leverage the existence of protein domains that provide the function and the structure of these genes. Homology modeling and ab-initio methods could be used to predict the structures of the BCAR4 and BRMS1 proteins, which are feasible within the boundaries of available computational power. In our computational study, we did not leverage any of the predicted data to determine their degree of alignment with information generated by wet laboratory research groups.
The indicators of protein domain annotation, structural annotation, PPI, and disordered proteins can shed light on both BCAR4 and BRMS1. More research is needed in this area. Moreover, using structural bioinformatics tools, the feasibility of using the PPI partners of the annotated breast-cancer-related proteins as drug candidate targets could be considered. Further research should be conducted on proteins with unannotated domains as possible signatures of a disordered protein. For proteins without a clearly defined tertiary structure, the domain repertoire should be annotated using a different approach. This effort will be much more effective when the SUPERFAMILY database has successfully integrated its D2P2 disordered protein database into one integrated platform. On this platform, strategies could be developed to establish a disease ontology of the protein domain. To this end, the relation between protein domains, genes, and disease annotation could be elucidated in finer detail.
Phylogenetic tree and PPI studies have shown that the current molecular simulation protocol is insufficient for effective drug design, as proteome data have been used without consideration of the networking context of the interacting proteins. In the current approach, structural bioinformatics studies are conducted with very limited knowledge of PPIs or gene networks. To some extent, solid knowledge about the protein domain repertoire is subtly handled in structural annotations. Only indicators that provide fine-grained annotations can be secured, mainly from sequence retrieval and phylogenetic tree studies. Annotations from other indicators should be strengthened. To do so, advanced machine-learning-based tools or even deep-learning tools could be used to facilitate completion of these annotations. Future structural bioinformatics studies should incorporate a more holistic approach to increase the success rate of drug designs. The complexity of proteomics studies remains daunting, as protein domain rearrangement or co-occurrence must be incorporated as the basis of biomedical informatics studies and no method has yet been developed to incorporate these data into the structural bioinformatics approach.

Conclusion
Based on the results of this phylogenetic study, the BCAR4 and BRCA2 proteins can be inferred to be in the same cluster. Furthermore, the BCAR4 protein was found to have no annotations in the SUPERFAMILY, STRING, PDB, or D2P2 databases. As such, to develop a blueprint for drug design, determination of the correlation and interaction between the BCAR4 and BRCA2 proteins will require further in-depth annotations. Moreover, the absence of SUPERFAMILY annotations for the BRMS1 protein indicates that it may be in a disordered state. However, STRING studies offer hope regarding the availability of other proteins that could serve as targets for drug design. This drug design effort would be made more effective by incorporating the results of protein-domain rearrangement and co-occurrence studies using more advanced machine-learning-based annotation tools such as the gene prediction pipeline.