Discovery of research data
Eugene Barsky
eugene.barsky@ubc.ca
http://researchdata.library.ubc.ca/
Fall 2018
Image - https://www.flickr.com/photos/kenfagerdotcom/

Outline
● Background:
  ○ Definitions
  ○ Tri-Agency directions in RDM
● How to make research data findable and discoverable
  ○ Principles and best practice
● We do it too:
  ○ Abacus Dataverse
  ○ Federated Research Data Repository (FRDR)
Image by http://epicgraphic.com/metaphors/

Data rich
Soccer clubs, like Arsenal, record on average 10 data points per second for every player on the field, or about 1.4 million data points per game.
Image - https://www.flickr.com/photos/kevlar/
Source - https://www.forbes.com/sites/bernardmarr/2015/03/25/big-data-the-winning-formula-in-sports/#2a9791e234de

Define research data
“Data that are used as primary sources to support technical or scientific enquiry, research, scholarship, or artistic activity, and that are used as evidence in the research process and/or are commonly accepted in the research community as necessary to validate research findings and results.”
Source - CASRAI Glossary - http://dictionary.casrai.org/Research_data
Image - https://www.flickr.com/photos/34547181@N00/

Why data management
● In the USA
* From Developing data services: a tale from two Oregon universities - http://www.slideshare.net/amandawhitmire/20140618-rml-rendezvousfinal

Timeline
● Tri-Council introduced a draft RDM policy in June 2018 - http://www.science.gc.ca/eic/site/063.nsf/eng/h_97610.html
● Public consultation for a period of two to three months.
● Six months after the policy has been publicly available, institutions will be expected to enact RDM policies.
● Realistic timeline - Fall 2019 for compliance.
Image - https://www.flickr.com/photos/pamilne/

1. Data Management Plans
“All grant proposals submitted to the agencies should include methodologies that reflect best practices in research data management. The agencies encourage grant applicants to complete data management plans (DMPs) as an essential step in research project design. For specific funding opportunities, the agencies may require DMPs to be submitted to the appropriate agency at time of application; in these cases, the DMPs may be considered in the adjudication process.”

2. Data Repositories and Discovery
“Grant recipients are required to deposit into a recognized digital repository all digital research data, metadata and code that directly support the research conclusions in journal publications, pre-prints, and other research outputs that arise from agency-supported research. The repository will ensure safe storage, preservation, and curation of the data. The agencies encourage researchers to provide access to the data where ethical, legal, and commercial requirements allow, and in accordance with the standards of their disciplines”

3. Institutional Data Policy
“Each institution administering Tri-Agency funds is required to create an institutional research data management strategy. The strategy will outline how the institution will provide its researchers with an environment that enables and supports world-class research data management practices”

Focus on Data Deposit for Discovery
Set of Principles:
❏ Common metadata
❏ Persistent identification
❏ Open access
❏ Common licensing
❏ Collaboration (coexistence in the scholarly ecosystem)
White papers released in 2016/17:
● Fenner, M., Crosas, M., Grethe, J., Kennedy, D., Hermjakob, H., Rocca-Serra, P., ... & Clark, T. (2017). A data citation roadmap for scholarly data repositories. bioRxiv, 097196.
● Barsky, E., Brosz, J., & Leahey, A. (2016, July 31). Research Data Discovery and the Scholarly Ecosystem in Canada: A White Paper. doi:http://dx.doi.org/10.14288/1.0307548
● Leggott, Mark, Shearer, Kathleen, Ridsdale, Chantel, Barsky, Eugene, & Baker, David. (2016, September 9). Unique Identifiers: Current Landscape and Future Trends.
Zenodo. http://doi.org/10.5281/zenodo.557106

Practical principles for Discovery: Metadata
● Use common and established metadata schemas - Dublin Core, DDI, DataCite...
  ○ For instance, Google's new Dataset Search - https://toolbox.google.com/datasetsearch - uses the Schema.org metadata standard
● Dataset landing page -
  ○ Metadata need to be embedded in the dataset landing page so that indexers/harvesters can find them
  ○ Google and Google Dataset Search also ask repositories to provide a sitemap file listing the URLs of all dataset landing pages (https://www.sitemaps.org/) - machine-readable metadata
● A search engine is only as good as the metadata that goes into it!
Image - https://www.flickr.com/photos/wakingtiger

Practical principles for Discovery: Persistent Identifiers
● All datasets intended for discovery should have a globally unique persistent identifier (PID) that can be expressed as an unambiguous URL (e.g. DOI, ARK or Handle)
● PIDs should be embedded in the landing page in a machine-readable format
● The persistent identifier, expressed as a URL, must resolve to a landing page specific to that dataset
● Persistent identifiers for datasets should support multiple levels of granularity, where appropriate (e.g. DOIs for individual files in a study dataset)
Illustration by Jørgen Stamp CC BY 2.5 Denmark

Practical principles for Discovery: Open Access and APIs
● A repository should provide an API or at least support the OAI-PMH protocol
● OAI-PMH provides consistent, structured, and interoperable formats for metadata exchange
● Caveat: Harvesting metadata doesn’t address issues or concerns about metadata quality, completeness, or a common metadata schema across repository systems
Image - https://www.flickr.com/photos/centralasian/

Practical principles for Discovery: Licensing
● We believe that nobody has yet solved all the complexities of making data openly available and reusable
● We prefer applying the CC0 license to open data (same as Dryad, BioMed Central, Europeana, and others).
See more:
  ○ Einhorn, David, et al. "Post-Publication Sharing of Data and Tools." Nature, vol. 461, no. 7261, 2009, pp. 171-173.
Image - https://www.flickr.com/photos/jwyg/

Data Repositories and Discovery - FRDR
● We have worked to create the Federated Research Data Repository (FRDR), a national discovery layer for research data - https://www.frdr.ca/
● Data Sources - https://www.frdr.ca/discover/html/repository-list.html

Data Repositories and Discovery - FRDR
FRDR’s harvester indexes data repositories across Canada to make research data held in many repositories discoverable from a single platform
Currently supports OAI-PMH, CKAN, CSW, and MarkLogic standards, with plans to add more
Goals:
● supplement existing repository sites
● improve discovery
● break down repository silos
● avoid being “just another repository”

Dataverse Repositories
- UBC Abacus Dataverse - http://dvn.library.ubc.ca/dvn/ - has more than 38,000 data files under management
- We mint DOIs and expose research metadata to:
  - Summon
  - Google
  - Bing
  - DataCite
  - Google Scholar and more...

Dataverse Repositories
- Work is underway to create one Dataverse for Canada - Dataverse North, a collaborative platform based at Scholars Portal (UofT)
- Digital preservation of research data is to come next!
Image - https://www.flickr.com/photos/wwworks/

Questions?
Image - https://www.flickr.com/photos/debord/
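
Appendix: embedding metadata in a landing page (sketch)
Returning to the metadata and persistent-identifier principles, the following is a minimal sketch of how a repository might generate Schema.org metadata for a dataset landing page. It is an illustration under stated assumptions, not how Abacus Dataverse or FRDR actually do it: the function names, the sample title, the creator name, and the DOI are all hypothetical placeholders. The property names (name, description, identifier, url, creator, license) come from the Schema.org Dataset type, and the DOI is expressed as a URL so that it can resolve to the landing page.

```python
import json


def dataset_jsonld(title, description, doi, creators, license_url):
    """Build a Schema.org Dataset record with the DOI expressed as a URL."""
    doi_url = f"https://doi.org/{doi}"
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": title,
        "description": description,
        "identifier": doi_url,  # persistent identifier, machine-readable
        "url": doi_url,         # should resolve to this dataset's landing page
        "creator": [{"@type": "Person", "name": name} for name in creators],
        "license": license_url,
    }


def landing_page_snippet(record):
    """Wrap the record in a JSON-LD script tag for the landing page's HTML."""
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical values for illustration only; the DOI below is a placeholder.
example = dataset_jsonld(
    title="Example survey dataset",
    description="Placeholder description of an example dataset.",
    doi="10.1234/example-dataset",
    creators=["A. Researcher"],
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
)
print(landing_page_snippet(example))
```

The printed script element would go into the HTML of the dataset's landing page, which is also the page a sitemap file would list. The license URL shown is the CC0 public domain dedication, matching the licensing preference stated earlier.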