UBC Theses and Dissertations


Augmenting metadata tags in open data tables using schema matching in a pay-as-you-go fashion

Yu, Haoran

Abstract

Metadata helps users understand the contents of a table. Metadata tags describe those contents and allow a user to easily browse, search, and filter data. However, metadata is less useful when it is heterogeneous and incomplete. It is difficult to find all tables related to a given table by examining the tags alone, because the user is typically looking for tags shared by two or more tables, and heterogeneous metadata contains no such overlaps. We use Open Data tables in a case study and develop strategies to augment the tags in table metadata so as to increase the number of tag overlaps among the metadata of different tables. As an initialization step, we perform semantic enrichment of the words in the attributes of a table schema and in its tags, and perform schema matching between the attributes and tags of a table to create a semantic labeling, in which an attribute is labeled with zero or more tags. Given one base table, we search for tables using this semantic labeling to quickly find related tables. We integrate the table searching step and a schema matching step into an iterative framework, which incrementally adds tags to the metadata of all tables related to the base table. The added tags are discovered through semantic overlap during the schema matching step of the iterative framework, based on a composite score that combines evidence from multiple pairwise value comparison criteria. We evaluate two approaches using a gold standard we created, and compare the accuracy of the augmented tags and the runtime with those of two baseline approaches. We show that the augmented tags have relatively high accuracy and that the runtime of our iterative approach is reasonable. We argue that an approach that creates approximate matchings in a pay-as-you-go fashion has good precision and recall, and is the more realistic option in a real-world scenario.
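The iterative idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the two similarity criteria (token Jaccard and a character-level ratio), the equal weights, the 0.6 threshold, the table layout, and the function names `composite_score` and `augment_tags` are all assumptions chosen for illustration, and semantic enrichment is omitted entirely.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Token-set overlap between two labels (assumed criterion)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def edit_similarity(a, b):
    """Character-level similarity ratio in [0, 1] (assumed criterion)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def composite_score(a, b, weights=(0.5, 0.5)):
    """Weighted combination of the pairwise criteria; weights are assumed."""
    return weights[0] * jaccard(a, b) + weights[1] * edit_similarity(a, b)


def augment_tags(base, tables, threshold=0.6, max_rounds=5):
    """Iteratively propagate tags to tables related to the base table.

    Each table is a dict with 'attributes' (list of str) and 'tags' (set of str).
    A candidate counts as related when one of its attributes matches an
    attribute of an already-related table; it then receives that table's tags.
    Candidate tables are modified in place.
    """
    related = [base]
    remaining = list(tables)
    for _ in range(max_rounds):
        newly_related = []
        for cand in remaining:
            for src in related:
                matched = any(
                    composite_score(x, y) >= threshold
                    for x in src["attributes"]
                    for y in cand["attributes"]
                )
                if matched:
                    cand["tags"] |= src["tags"]  # augment the candidate's metadata
                    newly_related.append(cand)
                    break
        if not newly_related:
            break  # no new related tables were found in this round
        related.extend(newly_related)
        remaining = [t for t in remaining if not any(t is n for n in newly_related)]
    return related
```

Each round only compares against tables already known to be related, so tags accumulate incrementally and the loop can be stopped at any point with a usable, if partial, result, which is the sense of "pay-as-you-go" this sketch tries to convey.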


Rights

Attribution-NonCommercial-NoDerivatives 4.0 International