An Integrated Machine Learning Model To Spot Peptide Binding Pockets in 3D Protein Screening
Siragusa, Lydia; Cruciani, Gabriele
2022
Abstract
The prediction of peptide-protein binding sites is of utmost importance for tackling the onset of severe neurodegenerative diseases and cancer. In this work, we detail a novel machine learning model based on Linear Discriminant Analysis (LDA) that proves highly predictive in detecting the putative protein binding regions of small peptides. Starting from 439 high-quality pockets derived from peptide-protein crystallographic complexes, three sets of well-established peptide-binding regions were first selected through a Partitioning Around Medoids (PAM) clustering algorithm based on morphological and energetic 3D GRID-MIF molecular descriptors. Next, the best combination of all the putative interacting peptide pockets and related GRID-MIF scores was automatically explored using the LDA-based protocol implemented in BioGPS. This approach successfully distinguished the actual interacting peptide regions from all the other pockets of the protein (AUC = 0.86 and partial ROC enrichment at 5% of 0.48). Validated on two external collection sets, including 445 and 347 crystallographic peptide-protein complexes, our LDA-based model could be effectively used in further peptide-protein virtual screening campaigns.
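The two ranking metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes each pocket receives a single score (for example, an LDA discriminant value) and a binary label marking whether it is the true peptide-binding pocket. The function names `roc_auc` and `tpr_at_fpr` and the toy data are hypothetical and are not part of BioGPS. Reading "partial ROC enrichment at 5%" as the true-positive rate reached at a false-positive rate of at most 5% is an assumption; the definition used in the paper may differ.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a randomly chosen positive
    outranks a randomly chosen negative (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Largest true-positive rate reachable by a score threshold whose
    false-positive rate does not exceed max_fpr.
    ASSUMPTION: one possible reading of 'enrichment at 5%'."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    best = 0.0
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


# Made-up pocket scores (higher = more likely peptide-binding) and labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]

print(roc_auc(toy_scores, toy_labels))
print(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.05))
```

The pairwise AUC loop is quadratic in the number of pockets, which is acceptable for a few hundred pockets; a rank-based formula would be preferable for larger sets.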