Model Organism Databases are more than just repositories
Posted by Mansi Srivastava, on 18 September 2024
This post is co-authored by Beatrice L. Milnes and Mansi Srivastava, who participated in SciCommConnect organised by the Node, preLights and FocalPlane.
The lighthearted acknowledgments at the end of many talks often thank not just human collaborators but also the model organisms themselves, which are of phenomenal importance for scientific research. To many who work with them, Model Organism Databases (MODs) are indispensable, so much so that imagining one day in a lab without access to them is horrifying. MODs curate information generated from years of experimental research, providing a scaffold for sharing data in a consistent and structured way, thereby enhancing our understanding of both the organisms and the work they are being used for. With modern technologies like nucleic acid sequencing, gene editing, and structural protein biology, the quantum of data is increasing exponentially. MODs categorise information, annotate the quirks of gene naming, and compile references that capture the growth curve of the field. Databases assemble the available informational pieces of the jigsaw puzzle and at the same time highlight the gaps in our understanding; they often serve as handbooks for researchers to delve into new aspects of their system. As biology increasingly utilises ‘omic’ data, integrating big data across species is more important than ever. Modern biology has placed a large emphasis on addressing fundamental questions and assessing the translatability of the findings to human welfare. This often requires relating data not only across scientific disciplines but also between the various model systems employed. Groups like the Alliance of Genome Resources have already begun some of this work by building a cross-species consortium. The utility of integrating data across species has resulted in a call for scientists to assist in this process by including information in their publications that helps with indexing in MODs.
What does it take to build a model organism database?
As active experimentalists, we rarely get a peek into the process of database consolidation despite near-daily usage of the end product. Many scientists do not interface with the management that puts great effort into intentionally curating these MODs. It is important to acknowledge that the level of data organisation and layering that databases achieve is in itself as important as the experimentation they inform. The web design contributes greatly to the utility of the sites and is responsible for visualising the available tools, which differ notably between species. MODs also facilitate the exchange of information between two researchers from different corners of the globe without the need for direct communication. In this regard, databases are directly responsible for the accelerated pace of research in the last few decades. Experimentalists, as the primary consumers and beneficiaries of MODs, should be aware and supportive of the resources needed to build the databases they rely so heavily on.
Managing and integrating scientific data has become an essential skill that needs to be inculcated in the next generation of researchers. To this end, the databases maintained by organisations like the National Institutes of Health (NIH), European Molecular Biology Laboratory (EMBL), and DNA Data Bank of Japan (DDBJ) can be used as benchmark examples of highly structured databases. Whether a new database is created by a lab, within an institute, in a community space, or available through open access, it should have core structural components that accommodate its growth in the future. Reporting detailed data accurately and including experimental commonalities that could benefit the larger community (e.g. plasmid names or standardised phenotyping) should make generating a database far more seamless. Much of this data could already be published and locally available through alternative resources, but many larger trends can only be validated once multiple sources have been compiled by curators. Perhaps one of the largest barriers to this work is that, as end-users, we do not know much about how MODs are generated.
Funding support for MODs
Generating large-scale databases is a skill- and time-intensive task. For example, the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) is curating a central repository for genome/transcriptome sequencing datasets across model organisms. Unfortunately, such endeavours have limited funding and grant opportunities. For example, the NCBI is supported by the US government alone despite being utilised as a global resource. Several MODs that serve a smaller research community, for example, Axobase, do not have long-term sustenance funding. Model organisms that are evolutionarily closer to humans, such as rodents, as well as disease-causing pathogens and food crops, are heavily researched. On the other hand, some model organisms are used by a smaller research community aiming to crack open a new area of biology. Unsurprisingly, the nature and quantum of information available varies widely between model organisms. Databases are crucial for both novices entering a field and experts looking for the next big question. For upcoming MODs, the lack of seed funding limits their conception in the first place. Securing the future of MODs must be prioritised by the scientific community, perhaps through diversification of funding sources.
What’s the future for MODs?
There is an open question of what measures need to be taken to ensure the sustenance and growth of all common and upcoming MODs. As more emerging model systems come into common use, new areas of comparative biology can be catapulted forward by the presence of a MOD. Perhaps we can support the generation of databases for new species by using an existing MOD as a template; for example, Echinobase has been built using Xenbase as a reference. The first and foremost step would be to have ample funding opportunities for data curation. The funding bodies of existing MODs can begin by allocating annual funds for database development in addition to maintenance of current databases. To generate revenue, the users of databases may be charged a nominal amount for storage and curation of their data. The databases of model organisms that are of national importance should be supported by multiple government schemes. To ensure sustainability, database building can become an integral component of scientific training and can be developed as a career opportunity. More and more early career researchers investing their creativity and skill into science communication platforms such as MODs would ensure their growth. In the present scenario, scientific information is a commodity of high value, and its true potential can be harnessed only after compilation into these databases.
The progress of research is directly correlated with the development of databases. MODs are not just scientific repositories; they are now a direct measure of our scientific prowess. On a longer timescale, they are also the knowledge that we will pass on to the next generation of researchers. The responsibility of supporting MODs lies with the scientific community as a whole. All of us as stakeholders should prioritise, strategise, and contribute towards both old and new MODs.