Biocuration in the Era of Big Data
Zhang, Zhang


CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China

With the rapid advancement of high-throughput sequencing technologies, biology has entered the era of big data. Many databases developed to manage biological data have traditionally relied on expert curation, that is, curation conducted manually by dedicated experts. However, with the burgeoning volume of biological data and an increasingly diverse and densely informative published literature, expert curation is becoming ever more laborious and time-consuming. It increasingly lags behind knowledge creation or, worse, is not done at all in fields where insufficient funds can be allocated to curation. Although expert-curated databases have proven important for biological studies, they are struggling with the flood of knowledge and therefore need many more people to become involved in curation, that is, community curation, which exploits the full power of the scientific community for knowledge integration. A case in point that harnesses community intelligence in knowledge integration is Wikipedia. Wikipedia is an online encyclopedia that allows anyone to create or edit any content, and it features collaborative knowledge curation, up-to-date content, huge coverage, and low maintenance cost. Despite fears that open editing could lead to the incorporation of significantly flawed content, Wikipedia has been reported to rival traditional encyclopedias in accuracy. Owing to the extraordinary success of Wikipedia, it has been advocated that biological databases go wiki. As a consequence, more than a dozen biological wikis (bio-wikis) have been constructed to call on community intelligence in knowledge curation.
To date, however, there has been no community-curated resource for rice, even though rice is the most important staple food, feeding a large part of the world population, and building expert-curated rice reference genomes with comprehensive and accurate annotations remains a formidable challenge. Moreover, one of the major limitations of bio-wikis is insufficient participation from the scientific community, which stems fundamentally from the lack of explicit authorship and thus of credit for community-curated contributions. To increase community curation in bio-wikis, we developed AuthorReward (Bioinformatics 2013), which rewards community-curated efforts through contribution quantification and explicit authorship. AuthorReward quantifies researchers' contributions by factoring in both edit quantity and edit quality, and it yields automated explicit authorship according to those quantitative contributions. AuthorReward thus provides bio-wikis with an authorship metric that can help increase community participation and achieve community curation of massive biological knowledge. We also developed RiceWiki (Nucleic Acids Research 2014), a wiki-based, publicly editable, and open-content platform for community curation of rice genes. To test the functionality of AuthorReward, we installed it in RiceWiki. A live example is the entry for the rice semi-dwarfing gene (sd1), which was collaboratively curated by 9 researchers, producing 89 versions as of August 1, 2013. As demonstrated in RiceWiki, AuthorReward is capable of yielding sensible quantitative contributions and providing automated explicit authorship, consistent with the perceptions of all participating contributors. Additionally, given the importance of rice, RiceWiki serves as a critical community-curated knowledgebase for the rice research community.
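As a rough illustration of how edit quantity and edit quality might be combined into an authorship ranking, the following Python sketch scores each contributor and orders authors by their share of the total. It is not AuthorReward's actual formula, which is not given here. The `Edit` record, the `quantity * quality` weighting, and the function names are all assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """One wiki edit (hypothetical record, not an AuthorReward data structure)."""
    author: str
    quantity: int    # assumed measure: amount of text changed by the edit
    quality: float   # assumed measure in [0, 1], e.g. share of the edit that survives


def contribution_scores(edits):
    """Sum quantity * quality per author (an assumed weighting)."""
    scores = defaultdict(float)
    for edit in edits:
        if not 0.0 <= edit.quality <= 1.0:
            raise ValueError(f"quality must be in [0, 1], got {edit.quality}")
        scores[edit.author] += edit.quantity * edit.quality
    return dict(scores)


def ranked_authorship(edits):
    """Return (author, share) pairs, largest share first; ties are broken by name."""
    scores = contribution_scores(edits)
    total = sum(scores.values())
    if total == 0:
        return []
    shares = ((author, score / total) for author, score in scores.items())
    return sorted(shares, key=lambda pair: (-pair[1], pair[0]))


# Usage with made-up contributors and numbers:
edits = [
    Edit("alice", 400, 0.5),
    Edit("bob", 100, 1.0),
    Edit("alice", 100, 1.0),
]
print(ranked_authorship(edits))  # [('alice', 0.75), ('bob', 0.25)]
```

Under this assumed weighting, a large edit that is mostly overwritten later counts for less than a smaller edit that survives intact, which is one simple way to keep edit volume alone from determining authorship order.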
Considering the growing volume of rice-related data and, by contrast, the small number of expert curators working on rice, RiceWiki has the potential to enable a rice encyclopedia built by and for the scientific community, one that harnesses collective intelligence for collaborative knowledge curation, covers all aspects of biological knowledge, and keeps evolving as new knowledge emerges.

