Featured Publications

For questions on any specific publication, feel free to email me.

Data Size and Quality Matter: Generating Physically-Realistic Distance Maps of Protein Tertiary Structures (Featured as Title Story for 60 days, Editor’s Choice Article 2022) [Cite]

Fardina Fathmiul Alam and Amarda Shehu

Type:  Journal article

Published in: Biomolecules 2022, 12, 908.

https://doi.org/10.3390/biom12070908

Abstract: With the debut of AlphaFold2, we can now get a highly-accurate view of a reasonable equilibrium tertiary structure of a protein molecule. Yet, a single-structure view is insufficient and does not account for the high structural plasticity of protein molecules. Obtaining a multi-structure view of a protein molecule continues to be an outstanding challenge in computational structural biology. In tandem with methods formulated under the umbrella of stochastic optimization, we are now seeing rapid advances in the capabilities of methods based on deep learning. In recent work, we advance the capability of these models to learn from experimentally-available tertiary structures of protein molecules of varying lengths. In this work, we elucidate the important role of the composition of the training dataset on the neural network’s ability to learn key local and distal patterns in tertiary structures. To make such patterns visible to the network, we utilize a contact map-based representation of protein tertiary structure. We show interesting relationships between data size, quality, and composition on the ability of latent variable models to learn key patterns of tertiary structure. In addition, we present a disentangled latent variable model which improves upon the state-of-the-art variational autoencoder-based model in key, physically-realistic structural patterns. We believe this work opens up further avenues of research on deep learning-based models for computing multi-structure views of protein molecules.
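As a rough illustration of the distance-map and contact-map representations mentioned in the abstract above, the sketch below builds both from a list of 3D coordinates. It is not the code used in the paper. The function names, the toy coordinates, the use of one point per residue (for example, the alpha-carbon), and the 8 Å cutoff are all assumptions made for illustration.

```python
import math

def distance_map(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(coords)
    dmap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            dmap[i][j] = dmap[j][i] = d
    return dmap

def contact_map(dmap, threshold=8.0):
    """Binarize a distance map: 1 where two residues lie within the cutoff.

    The 8.0 default is an assumed cutoff in angstroms, not a value
    taken from the paper.
    """
    return [[1 if d <= threshold else 0 for d in row] for row in dmap]

# Hypothetical toy chain: four residues spaced 3.8 angstroms apart on a line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
dmap = distance_map(coords)
cmap = contact_map(dmap)
```

Both matrices are symmetric with respect to the residue indices, and neither depends on how the structure is rotated or translated in space, which is one reason such representations are convenient inputs for convolutional models.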

Generating Physically-Realistic Tertiary Protein Structures with Deep Latent Variable Models Learning Over Experimentally-available Structures  [Cite]

Fardina Fathmiul Alam and Amarda Shehu


Type:  Conference article

Published in: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), CSBW Workshop.

Abstract: Sophisticated deep neural networks have significantly advanced our ability to predict a native structure of a protein amino-acid sequence. However, going beyond a single-structure view remains challenging. While rapid advances are being made, fundamental questions on the ability of generative deep modeling to learn to generate physically-realistic tertiary structures remain. This paper makes two key contributions. It first extends deep convolutional variational autoencoder networks to be able to learn from experimentally-available tertiary structures of proteins of variable lengths. The presented models learn over distance matrix representations of tertiary structures. A systematic and detailed analysis demonstrates that the design of the training data is of primary importance to the ability of the proposed models to learn key characteristics of tertiary structures. The second contribution this paper makes is a careful analysis along several metrics that measure the physical realism of generated tertiary structures. The presented results are promising and show that once seeded with sufficient, physically-realistic structures, variational autoencoders are efficient models for generating physically-realistic tertiary structures.

CSBW2021-SPP-CVAE-Presentation.pptx

Equivariant Encoding based GVAE (EqEn-GVAE) for Protein Tertiary Structure Generation [Cite]

T. Rahman*, Fardina Fathmiul Alam*, and Amarda Shehu

(*: Equal Contributions)

Type:  Conference workshop article

Published in: 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), CSBW Workshop.

Abstract: Extensive research on deep neural networks shows that complex deep learning models have considerably improved our ability to predict the native structure of a protein amino acid sequence. With the release of AlphaFold2, an entirely data-driven approach to machine learning, we can now predict the native tertiary structure of a given protein sequence with great precision. On the other hand, research into deep learning frameworks that can take into account protein structural plasticity is in its early stages. Obtaining a multi-structure view of a protein molecule remains an outstanding challenge in computational structural biology. In this paper, we make two key contributions. We first propose a novel end-to-end generative model framework, a new formulation under Equivariant Graph Neural Networks (EGNN) based encoding and Graph Variational Autoencoder (GVAE), advancing our ability to generate realistic tertiary structures. Most existing models rely on 2D convolution, with protein structures represented by contact maps or distance matrices. In contrast, our presented model learns over both 3D coordinates of protein structure and sequence directly. The second contribution of this paper is control of tertiary structure realism. We show that through the loss function, we can control properties of generated tertiary structures. We suggest different terms in the loss function and analyze how those terms allow us to recover realistic patterns, such as backbone, short-range, and long-range contacts that we find in tertiary structures. Additionally, we conduct a careful analysis along several metrics that measure the physical realism of generated tertiary structures and show that EGNN encoding-based GVAEs are effective models for generating physically-realistic structures.

Variational Autoencoders for Protein Structure Prediction [Cite]

Fardina Fathmiul Alam and Amarda Shehu


Type:  Conference article

Published in: 2020 ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB).

Abstract: The universe of protein structures contains many dark regions beyond the reach of experimental techniques. Yet, knowledge of the tertiary structure(s) that a protein employs to interact with partners in the cell is critical to understanding its biological function(s) and dysfunction(s). Great progress has been made in silico by methods that generate structures as part of an optimization. Recently, generative models based on neural networks are being debuted for generating protein structures. Their evaluation is typically limited to showing that some generated structures are credible. In this paper, we go beyond this objective. We design variational autoencoders and evaluate whether they can replace existing, established methods. We evaluate various architectures via rigorous metrics in comparison with the popular Rosetta framework. The presented results are promising and show that once seeded with sufficient, physically-realistic structures, variational autoencoders are efficient models for generating realistic tertiary structures.

ACM_BCB_AlamShehu_Dr.pptx

Unsupervised multi-instance learning for protein structure determination [Cite]

Fardina Fathmiul Alam and Amarda Shehu


Type:  Journal article

Published in: 2021 Journal of Bioinformatics and Computational Biology Vol. 19, No. 01, 2140002 (2021) Special Issue.

Abstract: Many regions of the protein universe remain inaccessible by wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions in silico relates to the ability to discriminate relevant structure(s) among many structures/decoys computed for a protein of interest, a problem known as decoy selection. Clustering decoys based on geometric similarity remains popular. However, it is unclear how exactly to exploit the groups of decoys revealed via clustering to select individual structures for prediction. In this paper, we provide an intuitive formulation of the decoy selection problem as an instance of unsupervised multi-instance learning. We address the problem in three stages, first organizing given decoys of a protein molecule into bags, then identifying relevant bags, and finally drawing individual instances from these bags to offer as a prediction. We propose both non-parametric and parametric algorithms for drawing individual instances. Our evaluation utilizes two datasets, one benchmark dataset of ensembles of decoys for a varied list of protein molecules, and a dataset of decoy ensembles for targets drawn from recent CASP competitions. A comparative analysis with state-of-the-art methods reveals that the proposed approach outperforms existing methods, thus warranting further investigation of multi-instance learning to advance our treatment of decoy selection.
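The three stages described in the abstract above (organize decoys into bags, identify relevant bags, draw an instance) can be pictured with the toy sketch below. It does not reproduce the algorithms from the paper. The greedy leader clustering, the "largest bag" criterion, the medoid selection, the 2D feature vectors standing in for decoys, and all function names are assumptions chosen only to make the pipeline concrete.

```python
import math

def make_bags(decoys, radius):
    """Stage 1 (assumed): greedy leader clustering into bags of decoy indices."""
    leaders, bags = [], []
    for idx, d in enumerate(decoys):
        for b, lead in enumerate(leaders):
            if math.dist(d, decoys[lead]) <= radius:
                bags[b].append(idx)
                break
        else:
            # No existing leader is close enough: this decoy starts a new bag.
            leaders.append(idx)
            bags.append([idx])
    return bags

def pick_bag(bags):
    """Stage 2 (assumed): treat the most populated bag as the relevant one."""
    return max(bags, key=len)

def draw_instance(decoys, bag):
    """Stage 3 (assumed): return the medoid, the member with the smallest
    total distance to the other members of the bag."""
    return min(bag, key=lambda i: sum(math.dist(decoys[i], decoys[j]) for j in bag))

# Hypothetical decoys, each reduced to a 2D feature vector for illustration.
decoys = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 5.0)]
bags = make_bags(decoys, radius=1.0)
best = pick_bag(bags)
choice = draw_instance(decoys, best)
```

In practice the distance between decoys would be a structural measure rather than a Euclidean distance over two features, and the bag-selection and instance-drawing rules are where the non-parametric and parametric variants mentioned in the abstract would differ.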

Learning reduced latent representations of protein structure data [Cite]

Fardina Fathmiul Alam, T. Rahman, and Amarda Shehu


Type:  Conference workshop article

Published in: 2019 ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB), CSBW Workshop.

Abstract: The protein modeling community has long been interested in dimensionality reduction of structure data. Motivated by rapid progress in neural network research, we investigate autoencoders of various architectures on reducing the dimensionality of protein structure data generated by template-free protein structure prediction methods. We show that autoencoders that model nonlinear relationships among variables outperform linear dimensionality reduction. We evaluate various architectures and propose a better-performing one. We further show that the learned, low-dimensional latent representations capture inherent information useful for structure prediction. Given the ease with which open-source neural network libraries, such as Keras, which we employ here, allow constructing, training, and evaluating neural networks, we believe that autoencoders will gain in popularity in the structural biology community and open up further avenues of research.

NLP_presentation (5).pptx