In a world where data drive effective decision-making, bioinformatics and health science researchers often encounter difficulties managing data efficiently. In these fields, data are typically diverse in format and subject. Consequently, challenges in storing, tracking, and responsibly sharing valuable data have become increasingly evident over the past decades. To address these complexities, some approaches have relied on standard strategies, such as non-relational databases and data warehouses. However, these approaches often fall short of providing the flexibility and scalability required for complex projects. While the data lake paradigm has emerged to offer flexibility and handle large volumes of diverse data, it lacks robust data governance and organization. The data lakehouse is a new paradigm that combines the flexibility of a data lake with the governance of a data warehouse, offering a promising solution for managing heterogeneous data in bioinformatics. However, the lakehouse model remains largely unexplored in bioinformatics, with limited discussion in the current literature. In this study, we review strategies and tools for developing a data lakehouse infrastructure tailored to bioinformatics research. We summarize key concepts and assess available open-source and commercial solutions for managing data in bioinformatics.
The enzymatic hydrolysis of inulin, a fructose-rich polysaccharide from plants like Agave spp., is crucial for bioethanol production. Fungal glycoside hydrolase family 32 (GH32) enzymes, especially inulinases, are central to this process, yet no dedicated database existed. To fill this gap, we developed FUNIN, a cloud-based, non-relational database for cataloging and analyzing fungal GH32 enzymes relevant to inulin hydrolysis. Built with MongoDB and hosted on AWS, FUNIN integrates enzyme sequences, taxonomic data, physicochemical properties, and annotations from UniProt and InterPro via an automated ELT pipeline. Tools like CLEAN and ProtParam were used for EC number prediction and sequence characterization. The database currently includes 3420 GH32 enzymes, with strong representation from Ascomycota (91.2%) and key genera such as Fusarium, Aspergillus, and Penicillium. Exo-inulinases (43.9%), endo-inulinases (33.4%), and invertases (21.6%) dominate the dataset. These enzymes share conserved domains (PF00251 and PF08244), acidic pI values, and moderate hydrophobicity. A network similarity analysis revealed structural conservation among exo-inulinases. FUNIN includes an automated monthly update via the InterPro API, keeping the data current. Publicly accessible at http://funindb.lbqc.org, FUNIN enables rapid data retrieval and supports the development of optimized enzyme cocktails for Agave-based bioethanol production.
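As an illustration of the kind of sequence characterization mentioned above, the following is a minimal sketch of a hydrophobicity calculation. It assumes that "hydrophobicity" refers to the grand average of hydropathy (GRAVY) on the Kyte-Doolittle scale, which ProtParam reports; the abstract does not state this explicitly. The function name `gravy` and the toy peptide are hypothetical and are not taken from FUNIN.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the mean Kyte-Doolittle hydropathy over all residues."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    # Reject residues outside the 20 standard amino acids (e.g. X, B, Z).
    unknown = set(seq).difference(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    toy_peptide = "MKTAYIAKQR"  # hypothetical sequence, not a FUNIN entry
    print(f"GRAVY = {gravy(toy_peptide):.3f}")
```

Negative values indicate a predominantly hydrophilic sequence and positive values a hydrophobic one, so a score near zero would correspond to the moderate hydrophobicity described.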
The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-based methods, such as BLAST, are increasingly overwhelmed by the scale of contemporary datasets due to their high computational demands for classification. This study evaluates alignment-free (AF) methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.
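To make the alignment-free idea concrete, the sketch below represents each sequence as a vector of k-mer counts and assigns a query to the reference with the smallest cosine distance. This is one common AF approach chosen for illustration; the abstract does not say which AF methods the study evaluated. The function names, the default k = 3, and the toy sequences are all assumptions.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    valid = set("ACGT")
    return Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity between two k-mer count vectors."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("sequence too short for the chosen k")
    return 1.0 - dot / (norm_a * norm_b)


def classify(query: str, references: dict, k: int = 3) -> str:
    """Return the label of the reference sequence nearest to the query."""
    q = kmer_counts(query, k)
    return min(
        references,
        key=lambda label: cosine_distance(q, kmer_counts(references[label], k)),
    )


if __name__ == "__main__":
    # Toy reference sequences; labels and sequences are made up.
    refs = {"class_x": "ACGTACGTACGT", "class_y": "TTTTGGGGTTTT"}
    print(classify("ACGTACGTAC", refs))
```

Because each sequence is reduced to a fixed-length count vector in a single pass, the cost of a comparison does not depend on computing an alignment, which is the property that makes such methods attractive for very large datasets.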
During the coronavirus disease 2019 (COVID-19) pandemic, the number and types of dashboards produced increased to convey complex information using digestible visualizations. The pandemic saw a notable increase in genomic surveillance data, which genomic epidemiology dashboards presented in an easily interpretable manner. These dashboards have the potential to increase transparency between the scientists producing pathogen genomic data and policymakers, public health stakeholders, and the public. This scoping review discusses the data presented, the functional and visual features, and the computational architecture of six publicly available SARS-CoV-2 genomic epidemiology dashboards. We found three main types of genomic epidemiology dashboards: phylogenetic, genomic surveillance, and mutational. We found that data were sourced from different databases, such as GISAID, GenBank, and specific country databases, and that these dashboards were produced for specific geographic locations. The key performance indicators and visualizations used were specific to the type of genomic epidemiology dashboard. The computational architecture of the dashboards was designed according to the needs of the end user. The genomic surveillance of pathogens is set to become a more common tool used to track ongoing and future outbreaks, and genomic epidemiology dashboards are powerful and adaptable resources that can be used in the public health response.
The SARS-CoV-2 Africa dashboard is an interactive tool that enables visualization of SARS-CoV-2 genomic information in African countries. The customizable app allows users to visualize the number of sequences deposited in each country, and the variants circulating over time. Our dashboard enables near real-time exploration of public data that can inform policymakers, healthcare professionals and the public about the ongoing pandemic.
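The two views described above, sequences per country and variants over time, amount to simple aggregations over sequence metadata. The sketch below shows one way to compute them; it is not the dashboard's actual implementation. The record fields (`country`, `date`, `variant`) and all sample records are invented for illustration.

```python
from collections import Counter, defaultdict

# Made-up metadata records; field names and values are assumptions.
RECORDS = [
    {"country": "South Africa", "date": "2021-11-20", "variant": "Omicron"},
    {"country": "South Africa", "date": "2021-07-03", "variant": "Delta"},
    {"country": "Kenya", "date": "2021-07-15", "variant": "Delta"},
    {"country": "Nigeria", "date": "2021-11-28", "variant": "Omicron"},
]


def sequences_per_country(records) -> Counter:
    """Count how many sequences were deposited for each country."""
    return Counter(r["country"] for r in records)


def variants_per_month(records) -> dict:
    """Count variants per calendar month, keyed by 'YYYY-MM' in sorted order."""
    monthly = defaultdict(Counter)
    for r in records:
        month = r["date"][:7]  # ISO date 'YYYY-MM-DD' -> 'YYYY-MM'
        monthly[month][r["variant"]] += 1
    return {month: dict(counts) for month, counts in sorted(monthly.items())}


if __name__ == "__main__":
    print(sequences_per_country(RECORDS))
    print(variants_per_month(RECORDS))
```

Re-running such aggregations whenever new public records arrive is one straightforward way to approach near real-time exploration.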
Proteins are intricate, dynamic structures, and small changes in their amino acid sequences can lead to large effects on their folding, stability and dynamics. To facilitate the further development and evaluation of methods to predict these changes, we have developed ThermoMutDB, a manually curated database containing >14,669 experimental measurements of thermodynamic parameters for wild-type and mutant proteins. This represents an increase of 83% in unique mutations over previous databases and includes thermodynamic information on 204 new proteins. During manual curation we have also corrected annotation errors in previously curated entries. For each entry, we have included information on the change in unfolding Gibbs free energy and melting temperature, and have linked entries to available experimental structural information. ThermoMutDB allows users to contribute new data points and supports programmatic access to the database via a RESTful API.
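The two quantities recorded for each entry can be illustrated with a short calculation. The sketch below assumes the convention ΔΔG = ΔG_unfolding(mutant) − ΔG_unfolding(wild type), under which negative values indicate destabilization; the abstract does not state which sign convention ThermoMutDB uses, and the opposite convention is also common in the literature. The class, field names, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StabilityMeasurement:
    """Hypothetical container for one experimental measurement."""
    dg_unfolding_kcal_mol: float  # unfolding Gibbs free energy
    tm_celsius: float             # melting temperature


def stability_change(wild_type: StabilityMeasurement,
                     mutant: StabilityMeasurement) -> tuple:
    """Return (ddG, dTm), each computed as mutant minus wild type.

    Assumed convention: negative values mean the mutation is destabilizing.
    """
    ddg = mutant.dg_unfolding_kcal_mol - wild_type.dg_unfolding_kcal_mol
    dtm = mutant.tm_celsius - wild_type.tm_celsius
    return ddg, dtm


if __name__ == "__main__":
    # Invented values for illustration only.
    wt = StabilityMeasurement(dg_unfolding_kcal_mol=5.0, tm_celsius=60.0)
    mut = StabilityMeasurement(dg_unfolding_kcal_mol=3.5, tm_celsius=55.0)
    ddg, dtm = stability_change(wt, mut)
    print(f"ddG = {ddg:+.1f} kcal/mol, dTm = {dtm:+.1f} degrees C")
```

When combining values from several sources, it is worth checking the sign convention of each before comparing ΔΔG values.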
PathoTrack unites scientists to utilize data on various diseases for genomic surveillance and pandemic prevention. We develop dashboards offering crucial information for decision-makers, enhancing disease surveillance and response. Our mission is to promote global health security through advanced data analytics.
joicymara@ita.br
joicy@sun.ac.za