UBC Theses and Dissertations

Using semantic web technologies to implement flexible information management systems Saghafi, Arash 2012

Using Semantic Web Technologies to Implement Flexible Information Management Systems

by Arash Saghafi

BSc, Sharif University of Technology, 2008
MM, University of British Columbia, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Business Administration)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2012

©Arash Saghafi, 2012

Abstract

Two main phenomena have recently become prominent with respect to information management. First, there has been a rapid increase in the number and variety of sources of information on the World Wide Web and in the role of users as content providers. In parallel, there has been an evolution of web technologies to support the creation, storage, and sharing of information. These developments have led to new paradigms such as cloud computing, crowdsourcing, and social collaboration. The second phenomenon is the increasing need of organizations to benefit effectively and efficiently from massive data, termed “big data” (a term also linked recently to cloud computing and to social network data). These two phenomena both indicate new needs and new technological opportunities.

Traditionally, information management has been based on storing data in structured databases, where the structure reflects some expected uses, and managed with (at least some) central control (even for distributed data). These assumptions, however, do not fit the new paradigms, where flexible information management is needed to support different views of multiple users, unknown future uses, no central control, and new unexpected sources.

This thesis explores an approach to information management intended to provide the flexibility to support multiple, varied, and emerging sources where uses of information may not be known in advance. The approach employs three principles.
First, data should be stored independently of any preconceived “containers” that reflect anticipated uses (classes, tables). Second, reconciliation of the meaning of data can be done by abstraction of the properties the data represent. Third, classification can be created as needed depending on the application, based on some usefulness considerations.

The thesis has two objectives. The first is to suggest how to apply these principles in the implementation of a flexible information management system. The second is to demonstrate how semantic web technologies can be used to implement this approach. These technologies include triplestores (storing data in the Resource Description Framework), related query languages (SPARQL), and formal ontologies (Web Ontology Language). The thesis describes a prototype implementation, demonstrates it on a case study, and discusses its advantages compared to traditional database systems.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1. Introduction
2. Principles of the Solution
2.1 Separation of instances from classes
2.2 Abstraction of properties
2.3 Class structure
2.4 Chapter conclusions
3. Semantic Web and Related Technologies
3.1 Resource description framework
3.2 Triplestore
3.2.1 Query language
3.3 Web Ontology Language
3.3.1 Representation of the Web Ontology Language
3.3.2 Reasoning and inference
3.4 Jena framework
4. Principles of Representation and the Implementation
4.1 Property lattice
4.1.1 OWL property types
4.2 Class definitions and hierarchy
4.3 Repository of instances and properties
4.4 Making queries
4.5 Chapter summary and conclusion
5. Case Study
5.1 Making queries
5.2 Evolving the property lattice
5.3 Changing the class definitions and hierarchy
5.4 Conclusion and summary
6. Analysis and Discussion
6.1 Repercussions of adapting the instance-based paradigm
6.2 Discussing the implementation
6.2.1 Another implementation of the instance-based model
6.3 A semantic web database with inherent classification
6.4 Considerations and limitations in the proposed implementation
6.5 Contribution of this research
6.6 Summary and conclusion
7. Summary and Future Vision
References
Appendices
Appendix A: SPARQL
Appendix B: Reasoning in a Triplestore, using Jena API
Appendix C: A portion of Instances that were used to Populate the Triplestore for the Purpose of the Case Study (Chapter 5)
Appendix D: Linkages of Open Data Datasets

List of Tables

Table 1: Semantic web domain and requirements of the solution
Table 2: An example of the XML format
Table 3: SPARQL code of a query that lists female employees of Google
Table 4: Select construct from Bunge–Wand–Weber ontology (Wand et al. 1999)
Table 5: Mapping the principles of the solution to the concepts of the semantic web domain
Table 6: Creating an ontology and adding an object property
Table 7: Creating classes in the ontology
Table 8: Creating existential restrictions in the ontology
Table 9: Creating hasValue restrictions
Table 10: Listing all the instances that have an employer
Table 11: All the RDF subjects and predicates that have “Er1” as their object
Table 12: Retrieval operations needed for the semantic web implementation of the instance-based database (adapted from Parsons and Wand (1997))
Table 13: SPARQL code that lists John Smith’s grandchildren
Table 14: A query that lists all the nephews of a certain individual
Table 15: A query that lists all the fathers that exist in the domain of the case study
Table 16: SPARQL code of a query that lists siblings within a certain vicinity
Table 17: SPARQL code lists senior grandparents and their minor female grandchildren
Table 18: Possible implementations that were investigated in this work
Table 19: Running a selection query on a triplestore
Table 20: Applications in which the flexible paradigm can be beneficial
Table A-1: General SPARQL Commands
Table A-2: The structure of a general SPARQL query
Table C-1: Instances in the triplestore (RDF format)

List of Figures

Figure 1: Lemma 1, adapted from Parsons and Wand (2003)
Figure 2: Lemma 2, adapted from Parsons and Wand (2003)
Figure 3: An RDF graph: (a) general structure, (b) an example.
Figure 4: An example using a blank node
Figure 5: OWL language family (from W3C (2012))
Figure 6: A property hierarchy
Figure 7: RDF statement: <John, hasEmployer, Google>
Figure 8: Property hierarchy in Protégé
Figure 9: Class hierarchy pane
Figure 10: Defining the “Employee” class
Figure 11: An “Employee” with its necessary and sufficient condition
Figure 12: Loading the ontology in the triplestore
Figure 13: Adding instances to the triplestore without classifying them
Figure 14: Running the reasoner in the triplestore
Figure 15: Property hierarchy vs. class hierarchy in the “Generation” ontology
Figure 16: Defining the “Father” class
Figure 17: Query that identifies grandchildren of a certain individual
Figure 18: Query that finds siblings within a certain vicinity
Figure 19: Query to find senior grandparents and minor female grandchildren
Figure 20: An example of retrieving and updating properties
Figure 21: Defining filters in Allegrograph
Figure D-1: Linkages of open data datasets (adapted from lod-cloud.net)

Acknowledgements

I would like to express my appreciation and gratitude to my supervisor, Dr. Yair Wand, who provided me with this incredible opportunity to do research under his tutelage.
His support, patience, and constant guidance helped me to overcome many of the challenges that I faced during my studies. Also, I would like to acknowledge the academic and financial support that I have received from Dr. Carson Woo. As well, I greatly appreciate the many constructive comments that he provided as one of my supervisory committee members. My gratitude is also extended to Dr. Ken Takagaki, my external examiner, for his insightful comments. His feedback led to better highlighting of the contributions of the thesis.

My dear friends, Bahman Razmpa, Amir Mehdi Dehkhoda, and Ali Elahimehr, have supported me whenever I needed them, and I am very grateful to them.

Last but not least, I would like to thank my parents, Maryam and Mojtaba, whose love, support, and encouragement have been beyond words throughout my life. Without them, I would not be where I am today.

1. Introduction

The new technologies of the World Wide Web have facilitated the growth and availability of user-generated content through new facilities such as social media (O’Reilly 2007) and cloud computing. These sources of information are growing at an increasing pace. Moreover, the past few years have been known as the era of big data, because in the United States “companies with more than 1,000 employees store, on average, over 235 terabytes of data” (Brown et al. 2011), which is more than the volume of data in the US Library of Congress (Brown et al. 2011). Organizational big data may be found in internal repositories within the companies or on the web.

At the user level, the growth of information is exemplified by phenomena such as virtual communities, crowdsourcing, and citizen science, which are the focus of this work. These phenomena create a need for cooperation and information sharing among multitudes of users, in ways that often cannot be predicted.
However, such collaboration can only happen within systems that can support communication, sharing, and networking among users of the information (Brabham 2008). In particular, to share information effectively, individuals need to share the meaning of the information.

In the traditional data management paradigm, data are stored in well-defined and structured databases, and there is some central authority that determines the initial structure, meaning, and classification of the data. Also, it is typically assumed that the application and uses of data are known in advance.

In recently emerged phenomena such as social collaboration, multiple users interact to achieve individual or shared objectives, whether to solve complex problems or to capture and analyze data (e.g., citizen science), without the benefit of structured databases. Common characteristics of such projects can be summarized as follows: (1) the collected data are usually unstructured or semi-structured; (2) there is no central control of data; (3) users (content providers) have different levels of domain knowledge; and (4) uses of the data are often not known in advance but may emerge over time. Moreover, it might not even be known who the users of the data will be. This implies that (5) the same data might be used and, hence, viewed in completely different and unanticipated ways by multiple levels of users, with multiple views, who provide unstructured or semi-structured data.

In such situations there would still be a need for some structure, such as abstraction mechanisms (e.g., categorization methods), to access the data. However, even for structured databases (relational being the most common), it has been observed that problems may arise owing to the assumption that instances of data belong to well-defined classes of types (e.g., in tables). In other words, classification of the data is implicitly and explicitly assumed in the way data are stored (Parsons and Wand 2000).
Even when structured data are considered, binding the instances to classes can give rise to issues related to schema design and database operation. These issues will be further described in the coming chapters.

Having unstructured data means that the information does not have a pre-defined data model. In unstructured data, since there is no schema (Buneman et al. 1996), each component or data object is interpreted dynamically and may be linked to other components in an arbitrary fashion (Buneman et al. 1996). The semantic understanding of unstructured data would be more difficult than the semantic understanding of data stored in a structured database with well-defined fields or of semantically tagged information (such as an XML-coded document). In semi-structured data, the schema is contained within the data; for example, an HTML web page has some degree of structure, as the data are embedded within HTML tags (e.g., <table> or <script> tags), but the information written between the tags does not fit into a structured database. In other words, in semi-structured data, there are only “loose constraints on the data” (Buneman 1997).

In crowdsourcing and social collaboration projects, the user-generated data are not necessarily constrained by a schema, and it is desirable to exchange the data among disparate sources, users, and applications more flexibly. The content generators are not homogeneous, and they do not have the same level of knowledge. A common case is when users with different levels of domain knowledge deal with the same data. This would be the case in crowdsourcing (notably, citizen science), where some users might know much more than those who generate the data. Having a single view of the data prevents the users with more extensive domain knowledge from fully utilizing their collaborative effort.
For example, in a citizen science project run by Cornell University,[1] anyone who watches birds can report his/her sighting on the website and help researchers build a database of birds and their habits. In this case, users could be from varying backgrounds: one might be a high school student and another might be a professional ornithologist. Therefore, it is not desirable to discourage regular citizen scientists by exposing them to complicated and advanced views.

[1] http://www.birds.cornell.edu/citsci/ accessed on 15/12/2011.

As another example, there are crowdsourcing projects in Canada[2] and the United States[3] in which users can report if they feel an earthquake. The reported data lack any structure, and even users who have observed the same earthquake within a region might submit different information regarding the same phenomenon. As a hypothetical situation, one could think of a scenario in which the Canadian and American earthquake databases are integrated. In that case, there would be a need for an information system that is able to deal with unstructured data, without binding them to a specific structure in the form of classes or categories, and to support heterogeneous users with multiple views, without imposing a central control of the data.

In short, not only have users’ needs evolved over time, but new technological opportunities have also emerged which have facilitated the growth of information sources and virtual communities by allowing users to become content providers. At the same time, the new technologies could provide opportunities to implement an information management system that is able to flexibly handle unstructured and semi-structured data.

This work proposes to combine three principles that can offer a potentially better solution than the traditional approach to managing data flexibly.
The first principle suggests keeping the “stable” part of the data (the instances) without “locking” them into predetermined categories; that is, avoiding binding the instances to specific assumptions about data semantics (Parsons and Wand 2000). The second principle concerns property abstraction, which facilitates the semantic reconciliation of different sources of data (Parsons and Wand 2003) as well as making inferences regarding class membership. And, finally, the third principle provides guidelines for creating useful class structures (Parsons and Wand 1997).

[2] http://earthquakescanada.nrcan.gc.ca/index-eng.php accessed on 20/12/2011.
[3] http://earthquake.usgs.gov/earthquakes/dyfi/ accessed on 20/12/2011.

The combined principles are studied within the instance-based data management paradigm, which is described in terms of a triangular view: instances, properties, and classes. Instances and their properties (the stable parts) are stored in a repository. Property abstraction can be defined via “property hierarchies”,[4] which can be shared for different purposes if needed. As well, class definitions can be created for certain domains, applications, or even a specific need. Indeed, even the creation of “on demand” queries can be considered as the creation of an ad hoc class. As an example, if information about graduate students who act as teaching assistants is queried in order to assign more responsibilities to them, this can conceptually be considered as the creation of a new class (e.g., a “senior TA” class).

This thesis proposes an implementation of these principles using semantic web technologies. The semantic web technologies, which are sometimes referred to as Web 3.0, are becoming more common and more important, and the amount of data in semantic web format is increasing (Lassila and Hendler 2007).
In addition, the available standards of the semantic web technologies can provide an appropriate foundation for implementing the aforementioned principles (refer to chapter 3).

[4] For example, in such a hierarchy “hasParent” is a specialized case (a “manifestation”) of the “hasAncestor” property.

In the proposed implementation, ontologies created with the Web Ontology Language (OWL) are used to define the classes by their properties. As well, the ontology axioms are used to infer the data structure using abstraction mechanisms (e.g., an axiom can state the necessary and sufficient condition for membership in a certain class). From the abstraction mechanisms, additional properties that an instance of a class inherits can also be inferred; these are described in more depth in later chapters. Thus, the ontology would constitute the property and class layers in the triangular view. Finally, the instances along with their properties (i.e., the instance layer or the stable parts) are stored in a repository using the resource description framework (RDF)[5] data model.

As mentioned earlier, this work is focused on user-generated data on the web; however, the issues raised with the traditional paradigm, as well as the proposed alternative approach, are also applicable to organizational big data.

The objective of this thesis is to show that combining the principles of the instance-based paradigm with chosen semantic web technologies can create a flexible information management system. This information system would be able to separate instances from semantics as much as possible, support abstraction of properties (via a property hierarchy), and flexibly manage data (e.g., to create/use classes on the fly, change the property hierarchy, or run complex queries), while avoiding the problems that are associated with the traditional class-based paradigms. After the possibility of this implementation is demonstrated, the implications are discussed.
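The triangular view described above (instances, properties, and classes) can be illustrated with a minimal sketch. This is not the thesis's Jena/OWL implementation; it is a toy in-memory stand-in written in Python, and every identifier in it (the instance names, the property names other than hasParent and hasAncestor, and the helper functions) is a hypothetical chosen for illustration. It stores <subject, predicate, object> triples without any class label, records one link of a property hierarchy, and treats the "senior TA" query as an ad hoc class.

```python
# Hypothetical in-memory stand-in for a triplestore; all names are illustrative.
# Instances are stored only with their properties, never with a class label.
triples = {
    ("alice", "isGraduateStudent", True),
    ("alice", "assistsCourse", "COMM101"),
    ("alice", "hasParent", "carol"),
    ("bob", "isGraduateStudent", True),
}

# Property hierarchy: each property maps to the more general property it manifests.
property_parent = {"hasParent": "hasAncestor"}


def generalizations(prop):
    """Yield prop followed by every more general property above it."""
    while prop is not None:
        yield prop
        prop = property_parent.get(prop)


def has_property(subject, prop):
    """True if the subject has prop directly or through a more specific property."""
    return any(
        s == subject and prop in generalizations(p)
        for (s, p, _o) in triples
    )


def members(required_props):
    """An ad hoc class: every subject that possesses all the required properties."""
    subjects = {s for (s, _p, _o) in triples}
    return {s for s in subjects if all(has_property(s, p) for p in required_props)}


# The "senior TA" query from the text, expressed as a class created on demand.
senior_ta = members({"isGraduateStudent", "assistsCourse"})
```

In this sketch, changing the set passed to members() defines a different class without touching the stored triples, which is the kind of flexibility the thesis sets out to obtain with OWL class definitions and a reasoner.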
[5] RDF is a data model that is used to express statements in the form of <subject, predicate, object> triples. RDF will be further investigated in chapter 3.

Chapter 2 studies the problems arising from inherent classification (i.e., class-based data management) as well as the principles of the instance-based data management paradigm. Chapter 3 introduces some of the semantic web technologies and investigates their compliance with the aforementioned principles. In chapter 4, the details of the representation of the principles and the implementation are described. The flexibility of the proposed implementation is demonstrated in chapter 5, through the analysis of the application and functionality of the semantic web implementation of the instance-based paradigm in a sample crowdsourcing project. It is shown that complex queries can be supported with ease. Also, the evolution of properties or classes (i.e., schema evolution) will not be problematic in the proposed approach; many examples will be shown that support the points made regarding a successful implementation. In chapter 6, the implications of the proposed approach as well as some other alternative implementations are studied. Finally, in chapter 7, possibilities for future research are introduced.

2. Principles of the Solution

In the introduction (chapter 1) the common characteristics of the new phenomena emerging over the World Wide Web, such as crowdsourcing and social collaboration, were investigated. In such phenomena the uses of data are not anticipated well, and as mentioned before, there would be a need for some structure. Creating some structure over the data would cause problems if the users decided to use the data for different purposes, since different uses would require different data semantics.
For example, imagine having a database of a sample population; if one decides to study the distribution of subjects over the city, he/she can impose that all subjects need to have a property representing their postal code. If another researcher decides to study the ethnicity of the individuals in the database, the previous structure would be irrelevant, and it might deprive the researcher of some information regarding individuals who do not have a fixed address (e.g., homeless individuals).

The consequences of structure imposition, or inherent classification, are not limited to the new phenomena. In the traditional paradigm as well, inherent classification can lead to problems related to conceptual schema design and database operations (Parsons and Wand 2000).

The schema design problems surface in different situations; one is when an instance belongs to multiple classes that are not generalization/specializations of each other in the class hierarchy (i.e., the multiple classification problem). In such cases the traditional database management paradigm does not “usually provide good mechanisms to support multiple classification” (Parsons and Wand 2000), and having disparate classes across the hierarchy might force the duplication of the instance’s information in the hierarchy. For example, a certain individual might be a student and a trustee of a university; if it is assumed that the “student” and “trustee” classes are not generalization/specializations of each other, it might be necessary to duplicate the information about that individual so that it is recorded under both classes.

Also, integrating the multiple views of different users into a single global schema is a necessary task in logical database design because a “class-based approach requires the identification of a set of base classes” (Parsons and Wand 2000).
Constructing the global database by reconciling different views at the class level would be very difficult, “since the same portion of reality is usually modeled in different ways in each schema” (Batini et al. 1992). Similar to this issue is the interoperability problem, which arises when independent sources of data need to be merged; this would require schema reconciliation between the two sources (Parsons and Wand 2000). For example, if two airlines decide to form an alliance and start exchanging information, their database schemas should be integrated; a “platinum” frequent flyer in one airline might be defined differently from the other airline’s “platinum” client. As well, classes might be defined similarly yet have different names (e.g., “platinum” flyer in one airline might be the same as “gold” flyer in the other). As mentioned earlier, developing a global schema would be necessary.

Moreover, as the users’ views change the information they need also changes, and therefore the schema should be able to evolve to support the addition of new classes, changes in class definitions, and obsolescence of some classes. Having a fixed structure of classes would complicate the operations needed to ensure that “no information is lost, no outdated information remains, and integrity is not compromised” (Parsons and Wand 2000).

The other type of problems arising from inherent classification is database operation problems. These problems manifest themselves in cases such as handling exceptional instances, in which the instance might possess unique properties. For example, in a relational database, if one wants to record that a particular student is able to play the piano, one must either add a new column to the “student” class that would refer to the musical talents of all students (and needs to be filled with null values for most students) or create a special subclass for students with musical talents.
This is termed the class proliferation problem (Parsons and Wand 2000).

Reclassifying an instance could also be problematic. If the definition of the new class does not include all the properties of the reclassified instance, some information about the properties might be lost (Parsons and Wand 2000). Creating a subclass to preserve the properties would lead to class proliferation. For instance, if a university professor decides to resign and become a student again, he/she would be placed under the student class, and the information about the courses that he/she taught previously could be lost because the schema designer had not anticipated such a property for the “student” class.

Furthermore, adding and removing instances that belong to many classes requires repetitive operations unless the classes are generalization/specializations of each other. Otherwise, the problem of proliferation of operations can arise, since there would not be a common superclass, and therefore the instance should be added to or removed from all the relevant classes (Parsons and Wand 2000). For example, if it is decided to remove information related to an individual who is a student and a school trustee, it is necessary to find his/her records in both classes and delete them separately (since these two classes are not generalization/specializations of each other).

Similarly, ensuring the integrity of data is a concern when instances are added or removed. In cases when a class is no longer relevant and it is decided to remove it from the schema, the risk of data loss is faced. The instances that belong to only that particular class would be removed as well, and properties that belonged only to the instances of that class “will no longer be represented in the database anymore” (Parsons and Wand 2000).
For example, if a school decided to abolish its music program and remove the “music students” class, it might no longer be able to store information about students who were registered only in the music courses. As well, if the property of playing a musical instrument was not used in other class definitions, that property could be completely lost.

The last type of database operations problems arises when the definition of a class changes. Redefining a class by changing the properties of the class might change the class membership status of its instances (Parsons and Wand 2000). For example, if a university requires that all research assistants must also be graduate students, the information about the research assistants who are not in the graduate programs and do not belong to any other class would be lost (i.e., loss of instances). Also, scanning the previous members to verify if they qualify to be members of the newly defined class could lead to a proliferation of operations.

Besides the problems arising from locking data into a predefined structure, sometimes instances that have unique properties and are not identifiable by classes might be encountered. By definition, instances are manifested by properties. In cases where the properties and their hierarchies are not well defined, the classification of instances based on abstraction schemas might not be possible, or instances with only a few properties might be found. In such cases it is necessary to either discard the instance’s information or put the instance under different classes, which might be incorrect. For example, in a university database, people could be students or professors; imagine finding an instance that consists of a name only. This instance would be either discarded or added under both tables. These problems reflect the fact that the data in use do not always “provide a good bridge between conceptual view and the database design” (Parsons and Wand 2000).
To overcome the problems that have been mentioned so far, three principles will be introduced and discussed. The first principle is separation of instances from classes, which suggests storing the instances (i.e., the stable part of the data) and their properties without binding them to the class structure. The second principle concerns property abstraction, which is manifested in the concept of property precedence.6 Having a property hierarchy (or lattice) is helpful in devising abstraction mechanisms which are used to infer classification. Finally, the last principle introduces guidelines for creating useful class structures; in a useful class structure, other than the membership of an instance in a given class, the additional properties that can be inferred from the class membership are of great interest.

6  For example in a database about food, “having toppings” is preceded by “having ingredients”. In other words, “having toppings” is a manifestation of “having ingredients”.

2.1 Separation of instances from classes

This principle suggests that instances should be disconnected from the data structure and the user views (as reflected in the data structure); this is also referred to as the instance-based approach. The underlying postulate is that “the world is viewed as made of things that possess properties” (Parsons and Wand 2000). In this paradigm “thing” is what is represented and would be the same as an instance in a data repository. In Bunge’s ontology an instance is “fundamental and precedes any notion of classes” (Parsons and Wand 2000); hence, instances can exist independently of any classification. However, in a more recent paper, Parsons and Wand (2008b) have gone beyond “things” in the Bunge sense and have studied phenomena encountered in daily life; “an instance can be a material object, action, event, or any other phenomenon” (Parsons and Wand 2008b).
For the purposes of this thesis, the scope is limited to instances that represent things in the world, but it is acknowledged that the concept can be extended so that instances could represent events or any phenomena. The principles here are based on Parsons and Wand (2000), in which they have suggested the instance-based approach. Constructs to represent instances along with their properties are needed, as are operations to create, maintain, and examine information about the domain of instances in the aforementioned representation. In this paradigm, instance is defined as a “cognitive symbol designating the perceived existence of a thing in the world” (Parsons and Wand 1997). It should be noted that a “thing in the world” might be an instance of a non-physical or even a fictional concept (e.g., mermaid); hence, an instance might refer to such concepts.

A property could be any observation of an instance and could typically (but not necessarily) be defined as a “characteristic of an instance relating to form or internal structure, to linkages with other instances, or to behavior” (Parsons and Wand 1997). A class is defined by a set of properties and is intended to help abstract the knowledge about instances. All these concepts are defined within the relevant universe, which consists of a finite set of known instances and a finite set of properties possessed by the instances, but one might not know all of them in advance. Parsons and Wand (2000) introduce the instance model, which consists of the instance base and its operations. More specifically, the instance model should include the instances or things, all the properties that each instance possesses (both intrinsic and mutual among several things), and the composition of each composite thing (Parsons and Wand 2000). The operations would be retrieval and modification of data.
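As a minimal sketch of the instance model described above, the following Python fragment stores instances, their properties, and the composition of composite things without any class structure. The names InstanceBase, add_property, instances_with, and the sample identifiers are hypothetical and chosen only for illustration; Parsons and Wand (2000) do not prescribe an implementation.

```python
from collections import defaultdict


class InstanceBase:
    """Hypothetical instance base: instances and properties, no classes."""

    def __init__(self):
        # instance identifier -> {property name: value}
        self._properties = defaultdict(dict)
        # composite instance -> identifiers of its component instances
        self._components = defaultdict(set)

    # Modification operations
    def add_instance(self, instance_id):
        self._properties[instance_id]  # accessing the key creates an empty entry

    def add_property(self, instance_id, name, value=True):
        self._properties[instance_id][name] = value

    def remove_property(self, instance_id, name):
        self._properties[instance_id].pop(name, None)

    def add_component(self, composite_id, component_id):
        self._components[composite_id].add(component_id)

    def remove_instance(self, instance_id):
        self._properties.pop(instance_id, None)
        self._components.pop(instance_id, None)
        for parts in self._components.values():
            parts.discard(instance_id)

    # Retrieval operations
    def properties_of(self, instance_id):
        return dict(self._properties.get(instance_id, {}))

    def components_of(self, composite_id):
        return set(self._components.get(composite_id, set()))

    def instances_with(self, *names):
        """Instances possessing all named properties (a class-like view)."""
        return {i for i, props in self._properties.items()
                if all(n in props for n in names)}


# A former professor who becomes a student keeps the teaching property.
base = InstanceBase()
base.add_property("alice", "teachesCourse", "MATH101")
base.add_property("alice", "enrolledIn", "MSc")
base.add_property("bob", "enrolledIn", "BSc")
base.add_property("bob", "playsPiano")
print(base.instances_with("enrolledIn"))
```

Because a "student" view is computed from properties rather than stored as a table, recording that one student plays the piano needs neither a mostly empty column nor a new subclass.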
Retrieval and query operations answer questions regarding the properties of a given instance and the components of a given instance, whereas the modification operations deal with the addition and removal of instances, properties, and their relationships. Separation of instances from classes is the fundamental principle in this solution paradigm. Tying instances to classes leads to schema design problems (e.g., when class definitions need to change) and database operation problems (e.g., when the information about instances changes) (Parsons and Wand 2000). Adopting this principle resolves most of those concerns. An elaborate discussion on the nature of the problems and their resolution by this approach is beyond the scope of this thesis but can be found in full in Parsons and Wand (2000).

The following two principles would complement this approach.

2.2 Abstraction of properties

As mentioned earlier, new technologies on the web have facilitated the generation of content by users, and therefore the number of sources of information is growing at an increased pace. One of the main purposes of this thesis is introducing and implementing a data management paradigm that is able to handle the integration of multiple sources flexibly. To make two independent systems interoperable, it is necessary to “resolve the semantics of the data in the separate sources” (Parsons and Wand 2003). Semantic reconciliation of data sources through schema-based approaches is suitable for structured databases, since it would identify semantic relationships in data by mapping the schema (Parsons and Wand 2003). Another approach would be attribute based; this identifies correspondence (Parsons and Wand 2003) between attributes in different sources, based on the real-world information that the attributes reflect (Clifton et al. 1997).
Since the first principle in this chapter suggests disconnecting instances from structure, schema-based approaches to reconcile types would not be relevant in this paradigm because they are suitable for structured databases. In this work, an attribute-based approach that uses semantic information is adopted. It is based on a postulate, adapted from Bunge’s model, that the world is made of things that possess properties. In other words, properties are always attached to instances and they provide “the basis for determining similarity of things”; this premise would guide this attribute-based approach7 (Parsons and Wand 2003).

One very important concept in this approach is property precedence, which abstracts properties to a higher level at which they may be regarded as the same. It should be noted that precedence is not one of the principles in the proposed approach, but it is a way to formalize property abstraction. For example, the sensation related to a wavelength in the 620–740 nm range is perceived as the color red. Any sensation related to a visible wavelength can be abstracted and be called a color. One can state that “being red” is preceded by “being colored”. Thus, to formalize abstraction the concept of precedence can be used. In other words, precedence means that if an instance possesses a certain property, then it possesses the preceding property as well.

Different users may view a property differently, owing to different levels of information granularity, meaning that users with a certain view might be interested only in a generalized form of a property, while another user may want to base his/her database operations on a manifestation of a certain property. For example, in a database that lists instances based on their family relations, the “hasSister” property would be a specialized form of “hasRelative”.
If a user is interested in seeing whether there exists any relationship between two instances, he/she could base his/her view/query on the “hasRelative” property, whereas another user might only be interested in locating individuals who have sisters. In this case the specialized property would be of interest.

7  As mentioned earlier, Parsons and Wand (2008a) have gone beyond “things” and have studied phenomena in general, and it is not the intent here to limit the domain to “things”. For the scope of this principle, the same approach would be common between “things” and phenomena.

In a formal way, one could say P1 precedes P2 if and only if the set of things possessing P2 is a subset of things possessing P1. As an example, the property “hasAncestor” precedes the “hasFather” property. As well, manifestation could be defined as a set of properties that are preceded by the property P. For example, having a certain color is a manifestation of the property “hasColor”. In this study, precedence and manifestation notions facilitate the identification of semantic similarity across different domains by considering manifestations of a generic property as “similar” (Parsons and Wand 2003), and this will be the basis of the attribute-based semantic reconciliation.

To semantically reconcile properties in practice, Parsons and Wand (2003) have suggested that intra-source and inter-source precedences should be identified first, and then they have stated two lemmas to infer additional inter-source precedences. The purpose is to come up with a precedence lattice which can be considered the “property ontology” of the domain. For example, in an ontology about food, “hasIngredient” would be a property on top of the lattice, while “hasTopping” and “hasBase” would be subproperties or, in other words, manifestations of the “hasIngredient” property. More detailed examples of a property lattice will be introduced in chapter 5.
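The formal definition of precedence given above can be sketched as a subset test over the known instances. This is only an illustration under stated assumptions: the function names (possessors, precedes, manifestations) and the sample food instances are hypothetical, and the test is purely extensional, so it reflects only the instances recorded so far.

```python
def possessors(instances, prop):
    """Set of instances that possess the given property."""
    return {name for name, props in instances.items() if prop in props}


def precedes(instances, p1, p2):
    """P1 precedes P2 iff every instance possessing P2 also possesses P1."""
    return possessors(instances, p2) <= possessors(instances, p1)


def manifestations(instances, p, all_properties):
    """Properties other than p, possessed by some instance, that p precedes."""
    return {q for q in all_properties
            if q != p and possessors(instances, q) and precedes(instances, p, q)}


# Hypothetical instance data for the food example.
instances = {
    "margherita": {"hasIngredient", "hasTopping", "hasBase"},
    "plain_crust": {"hasIngredient", "hasBase"},
    "soup": {"hasIngredient"},
}
properties = {"hasIngredient", "hasTopping", "hasBase"}

print(precedes(instances, "hasIngredient", "hasTopping"))   # True
print(precedes(instances, "hasTopping", "hasIngredient"))   # False
print(manifestations(instances, "hasIngredient", properties))
```

On this small data set the same test also reports that “hasBase” precedes “hasTopping”, simply because the only instance with a topping happens to have a base; a precedence derived from few instances should therefore be treated as provisional.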
To semantically reconcile different sources of information, Parsons and Wand (2003) have suggested two lemmas. The first lemma states that “if G2 is fully manifested by the set of S2, and for each property in S2 there is a preceding property in the set of S1, then G1 (which precedes the set of S1) precedes G2. Since G2 is fully manifested, everything possessing G2 also possesses a property in S2. This property is preceded by a property in S1 that is a manifestation of G1; hence the thing possesses also G1” (Parsons and Wand 2003). Figure 1 shows the relationships in lemma 1.

Figure 1: Lemma 1, adapted from Parsons and Wand (2003)

As an example of the first lemma, one could consider a given police station, in which officers are assigned to “traffic”, “vice”, or “arson” cases. In this station, the property “being a police officer” is fully manifested by {traffic, vice, arson}. In this example domain it is assumed that traffic officers “wear uniform”, while vice and arson officers “wear civilian clothes”. Also, “being a government employee” is manifested by {wears uniform, wears civilian clothes}. In this case, it can be inferred that “being a police officer” is preceded by “being a government employee” because of lemma 1.

The second lemma states that “if G2 is preceded by G1, then each property in S2 (which is a set of properties all preceded by G2) is preceded by one or more properties in S1. Since every property in S2 is preceded by G2, and in retrospect by G1, G1 is fully manifested; hence, everything possessing G1 possesses at least one property in S1” (Parsons and Wand 2003). Figure 2 depicts the second lemma.

Figure 2: Lemma 2, adapted from Parsons and Wand (2003)

The previous scenario (i.e., police stations and government employees) could also be used for this lemma, only by imagining that the domain knowledge is different in this case.
It is known that “being a police officer” is preceded by “being a government employee” and that government employees either wear uniforms or civilian clothes. In this case one can infer that being a traffic, vice, or arson officer is preceded by the set of {wears uniform, wears civilian clothing}.

However, if the data are not complete, such as cases when the knowledge about property precedence was not known when the database was originally designed, inferring all property precedences might not be possible. Having incomplete data might “lead to ‘spurious’ precedences that will disappear when more instances are known” (Parsons and Wand 2003). More specifically, if the property lattice within one source is not defined, then it might not be possible to semantically reconcile two sources of information. In the previous example, it would not have been possible to make any inferences based on the first lemma if it were not known that being a traffic officer is preceded by being a police officer.

The purpose of introducing these lemmas is to show that there is a formal way to construct the property lattice across different sources. When multiple sources of information are studied, the property lattice is a basis for analysis and making inferences. For example, in the scenario stated earlier in this chapter, the integration of databases of the two different airlines requires unified semantics in the form of a property lattice. Without a lattice that represents all the properties across the relevant universe and the property precedences, the flexibility in managing unstructured data over the World Wide Web that was promised in this thesis could not be achieved.

In summary, the principle of property abstraction provides a means to analyze the properties possessed by an instance and infer whether they belong to the same property lattice (i.e., generalization or a manifestation of a property).
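The conditions of the first lemma can be checked mechanically once the known precedences are recorded. The sketch below is an illustration only, assuming that precedences are held as a set of (preceding, preceded) pairs and that full manifestation is recorded explicitly; the function name lemma1_infers and this data layout are hypothetical and are not taken from Parsons and Wand (2003).

```python
def lemma1_infers(g1, s1, g2, s2, precedence, fully_manifested):
    """Return True when lemma 1 licenses the inference 'g1 precedes g2'.

    precedence: set of (p, q) pairs meaning 'p precedes q'.
    fully_manifested: dict mapping a generic property to the set of
    properties that fully manifests it.
    """
    # G2 must be fully manifested by S2.
    if fully_manifested.get(g2) != set(s2):
        return False
    # Every property in S1 must be a manifestation of G1.
    if not all((g1, p) in precedence for p in s1):
        return False
    # Every property in S2 must be preceded by some property in S1.
    return all(any((p1, p2) in precedence for p1 in s1) for p2 in s2)


# The police station example, encoded with hypothetical property labels.
s1 = {"wears uniform", "wears civilian clothes"}
s2 = {"traffic", "vice", "arson"}
precedence = {
    ("government employee", "wears uniform"),
    ("government employee", "wears civilian clothes"),
    ("wears uniform", "traffic"),
    ("wears civilian clothes", "vice"),
    ("wears civilian clothes", "arson"),
}
fully_manifested = {"police officer": s2}

print(lemma1_infers("government employee", s1, "police officer", s2,
                    precedence, fully_manifested))  # True
```

If the precedence for arson officers is removed from the known pairs, the function returns False, which mirrors the observation that incomplete precedence knowledge can block the inference.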
With the lemmas, the property lattice can be constructed and used to analyze multiple sources of information.

2.3 Class structure

The final principle addresses the issue of identifying a set of fundamental concepts (i.e., classes) to describe a domain. Parsons and Wand (1997) have suggested using cognitive principles as guidelines for representing human knowledge of a domain. “Classification involves forming concepts (also called categories or classes) to abstract common characteristics of instances and assigning new instances into these categories” (Parsons and Wand 1997). Humans assign classes to phenomena they encounter and they “communicate about the phenomena by referring to the class(es)” (Parsons and Wand 2008b). Once an object is classified, its unobserved properties can be inferred from the classes that it belongs to. Moreover, the complexities associated with dealing with individual instances can be reduced once the instances are classified and treated as members of a particular class (Parsons and Wand 2008b). However, the preconceived notion of a class in a human mind is not the same as a class in a conceptual model; this text studies the latter.

The work done by Parsons and Wand (1997) was used as a reference from which guidelines were extracted by studying concept theory and classification theory. Concept theory mandates that the classification should be governed by cognitive economy and inference (Holland et al. 1986; Rosch 1978), while classification theory incorporates the premises that knowledge of instances precedes the “formation of concepts” and that concept formation supports cognitive economy and inference. Providing cognitive economy and supporting inference are two fundamental purposes of classification, and they “have been used to define necessary conditions for a set of classes to capture the relevant knowledge about a domain effectively and efficiently” (Parsons and Wand 2008b).
The following conditions regarding the cognitive economy address effectiveness of classification, which is the ability to use classification for the purpose for which it was intended (i.e., finding the members of the class). The additional conditions that support inference also address efficiency, which could be extended to mean eliminating redundant inferences and minimizing the resources needed for classification (Parsons and Wand 2008b). Cognitive economy means that classifying instances under a concept should provide “maximum information with the least cognitive effort” (Rosch 1978). It implies that maximizing knowledge about individual instances and minimizing knowledge of irrelevant details should be balanced by concept formation. On the other hand, inference means that after a given instance is identified as a member of a certain class, it should be possible to infer unobserved properties of the aforementioned instance based on its membership in the class (Parsons and Wand 1997) owing to perceived “high correlational structure” of the things in the world (Rosch 1978).

In Parsons and Wand (1997) the first two requirements for a potential class to be “permissible” are suggested as follows: the set of properties that form a potential class should be abstracted from instances and provide maximal abstraction. Abstraction from instances mandates that classes can be defined only when there are “instances in the relevant universe, possessing all properties defining the class” (Parsons and Wand 1997). Maximal abstraction requires that if all instances of a class possess a relevant property, then it should be included in the class definition (Parsons and Wand 1997). As mentioned earlier, cognitive economy means providing maximum information with the least cognitive effort. These two requirements are used to provide maximum information and help the expression of “what the instances in a set have in common” under the umbrella of a certain class.
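The two requirements for a permissible class can be illustrated with a short check over a set of known instances. The sketch below is a simplification under stated assumptions: every recorded property is treated as relevant, and the names members, is_permissible, and the sample instances are hypothetical.

```python
def members(instances, class_props):
    """Instances that possess every property in the class definition."""
    wanted = set(class_props)
    return {i for i, props in instances.items() if wanted <= set(props)}


def is_permissible(instances, class_props):
    """Check abstraction from instances and maximal abstraction."""
    found = members(instances, class_props)
    # Abstraction from instances: at least one instance must possess
    # all properties that define the class.
    if not found:
        return False
    # Maximal abstraction: any property shared by all members must
    # already be part of the definition (all properties assumed relevant).
    shared = set.intersection(*(set(instances[i]) for i in found))
    return shared == set(class_props)


# Hypothetical instance data.
instances = {
    "tweety": {"feathered", "hasWings", "laysEggs"},
    "polly": {"feathered", "hasWings", "laysEggs"},
    "rex": {"laysEggs"},
}

print(is_permissible(instances, {"feathered", "hasWings"}))              # False
print(is_permissible(instances, {"feathered", "hasWings", "laysEggs"}))  # True
print(is_permissible(instances, {"hasFur"}))                             # False
```

The first candidate fails because both of its members also lay eggs, so the definition omits a property they have in common; the last fails because no known instance possesses the defining property.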
Two additional requirements to constrain the possible class structure are proposed to address the concerns regarding cognitive economy (Parsons and Wand 1997), which intends to balance maximization of information content with the minimization of the number of stored concepts, while allowing inference of properties from class membership. The principles of completeness and nonredundancy address those concerns. To support the inference of all relevant properties of an instance of a class while supporting the “maximum information” aspect of cognitive economy, the completeness principle mandates that “every property should be used in the definition of at least one class in the set of classes” (Parsons and Wand 1997). However, the nonredundancy principle requires that a class that is a subclass of several other classes “should be defined by at least one property not in any of its superclasses” (Parsons and Wand 1997). This would address the aspect of “least cognitive effort” within the cognitive economy.

Based on these guidelines, a class structure is defined as a set of potential classes in which every property within the relevant universe is used in at least one class definition, and no class is defined only in terms of the properties of its superclasses without having a distinguishing property. These guidelines form “a basis for reasoning about classifications” (Parsons and Wand 1997), and from a cognitive point of view they would help prevent loss of information. Within the relevant universe, multiple class structures could exist that conform to the aforementioned conditions, but the conditions do not provide a framework for evaluating the possible class structures in comparison with each other.
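The completeness and nonredundancy conditions lend themselves to a simple check. In the sketch below a class is represented only by its set of defining properties, and class A is taken to be a subclass of class B when the properties of B form a proper subset of the properties of A; this representation, the function names, and the sample classes are assumptions made for illustration.

```python
def superclasses(name, classes):
    """Classes whose defining properties are a proper subset of this class's."""
    return {other for other, props in classes.items()
            if other != name and props < classes[name]}


def is_complete(classes, all_properties):
    """Completeness: every property is used in at least one class definition."""
    used = set().union(*classes.values())
    return set(all_properties) <= used


def is_nonredundant(classes):
    """Nonredundancy: a subclass of several classes has a property of its own."""
    for name, props in classes.items():
        supers = superclasses(name, classes)
        if len(supers) >= 2:
            inherited = set().union(*(classes[s] for s in supers))
            if not (props - inherited):
                return False
    return True


# Hypothetical class definitions.
classes = {
    "Student": {"enrolled"},
    "Employee": {"employed"},
    "TeachingAssistant": {"enrolled", "employed", "teaches"},
}

print(is_complete(classes, {"enrolled", "employed", "teaches"}))  # True
print(is_nonredundant(classes))                                   # True

# A class defined only by the properties of its two superclasses is redundant.
with_redundant = dict(classes, WorkingStudent={"enrolled", "employed"})
print(is_nonredundant(with_redundant))                            # False
```

Several different sets of classes can pass both checks for the same properties, which is consistent with the remark that the conditions alone do not rank alternative class structures.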
Another limitation of those conditions is that they “only address inferences from class membership to instance properties, but do not take into account the benefit of identifying class membership based on partial observations about instances” (Parsons and Wand 2008b). Therefore, the concept of useful abstraction (or useful class) can be used to address the limitations (Parsons and Wand 2008b). In conceptual modeling of a domain, a class is useful when inferences regarding class membership can be made based on only some of the properties that define it, and once the class membership is determined other properties possessed by an instance can be inferred (Parsons and Wand 2008b). Additional rules that can guide the classification of phenomena and help system analysts in identifying useful class structures have been formalized in Parsons and Wand (2008b). The rules can be used to “evaluate potential classes in terms of their inference value” (Parsons and Wand 2008b) and are of three types: a screening rule (which specifies conditions for a useful class), nonredundancy rules (which evaluate the subclasses of multiple superclasses based on their usefulness), and formation rules (which go beyond the two common mechanisms of identifying subclasses to evaluate their potential usefulness) (Parsons and Wand 2008b).

It is important to distinguish between the terms “category” and “class”. “A category simply reflects a repeating pattern of properties. A class additionally indicates that relationships exist between these properties, even if the mechanisms behind the relationships are unknown” (Parsons and Wand 2008a). A “class” is a more useful “category”, in which the inference of further information is possible (Parsons and Wand 2008a). For example, someone’s belongings form a category, since they share the property of belonging to that particular person, but one might not be able to infer more underlying information.
In comparison, one could think of “birds”, which are feathered, have wings, lay eggs, have beaks, are warm blooded, bipedal, etc. Knowing the first three properties would be enough to identify a bird and the other properties could be inferred further on; thus, one could say “birds” is a class. In this work the categorization process based on instances and their properties is of interest, but the inferences which are the underlying reasons of why classes might be useful are out of the scope of this thesis. Thus, the conditions mentioned earlier in this section (i.e., requirements for permissible classes and constraining the possible class structure) would not be limiting here because there is no intention to find the most “useful classes”. Although making inferences based on the observed instances is briefly touched upon in chapter 4, it is beyond the scope of this thesis and might be further studied in future research.

2.4 Chapter conclusions

The research domain was set out in the Introduction chapter; this chapter focused on the problems that could arise from inherent classification of instances and on presenting an alternative approach in which three principles were introduced. Separation of instances from classes, property abstraction, and guidelines for class structures are the principles that not only resolve most of the problems but also provide facilities for a flexible information system that is able to manage unstructured data over the World Wide Web. Having instances separated from classes will resolve the view and class reconciliation issues that occur in the traditional paradigms, since the instance-based approach only requires an agreement on the instances and the properties and not on the types and class levels. Evolution of a schema was mentioned earlier in this chapter; both properties and class definitions can evolve, owing to either the evolution of the defining properties or changes in the needs of the users.
Also, class definitions could evolve over time without affecting the instance repository, since there is no tie between the two. Semantic agreement of classes between two repositories would not be required either, since an agreement on the things and their properties would be enough to make two databases interoperable. As well, as taken from the second principle (i.e., property abstraction), identifying generalizations of a certain instance’s properties via precedences would enable one to put those instances in a class that is defined by the common (generic) properties. For example, in a database of animals, if one identifies that “flying” and “moving on land” properties are preceded by being “mobile”, one can put a mockingbird and a cow under the category of mobile animals.

The guidelines presented in the third principle form “a basis for reasoning” (Parsons and Wand 1997) regarding classification, and their purpose is to avoid losing information from a cognitive point of view. In the relevant universe more than one class structure can be defined for a domain. This means that one should not impose unnecessary limitations by fixing the data to a rigid classification or a single view. Hence, it would be useful to support multiple views that can exist over the same relevant universe. Separating the instance base from the classes allows the schema to evolve, multiple views and data sources to be integrated, and database operations to become more flexible, since the instances are not bound to the structure.

In the following chapters these principles will be applied within the concepts of semantic web domain, and the proposed implementation will be tested in a sample crowdsourcing project. As well, the implications and limitations of the approach will be discussed.

3. Semantic Web and Related Technologies

Berners-Lee, the founder of the World Wide Web Foundation, declared his vision of the semantic web as a dream in which computers “become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize” (Berners-Lee and Fischetti 2000, Chapter 12).

Enabling the machine to understand the semantics of the web requires machine-readable metadata that contains information on how the data are related to each other. Therefore, the challenge became to provide technologies that express both meaning (semantics) and rules for reasoning about the data. With the principles of the solution that were introduced in the previous chapter in mind, a way to store instances and link properties to the instances is required to have an instance base separate from the classes. To enable the vision of semantic web within the confines of this thesis, three elements are required. Table 1 lists these elements along with the corresponding requirements of the principles of the solution:

Table 1: Semantic web domain and requirements of the solution

Semantic Web Element: A way to exchange data and metadata. One technology is eXtensible Markup Language (XML), which allows users to add arbitrary structure (e.g., tags) to their documents but says nothing about what the structure means (i.e., semantics of the data).
Semantic Web Element: A way to express meaning of information resources or represent knowledge, such as resource description framework (RDF), which enables one to express statements regarding particular “things” that have “properties” with certain “values”.
Requirements of the principles of the solution: A way to store instances and link the properties to the instances in order to have an instance-base separate from the schema (i.e., first principle).

Semantic Web Element: A way to describe the ontology, which is a set of concepts, relationships among the concepts, and premises about what can exist and happen (i.e., axioms). Web Ontology Language (OWL) is a technology that incorporates taxonomies (definition of classes of objects, properties, and relations among them) and sets of inference rules.
Requirements of the principles of the solution: A way to determine class membership of instances and to make inferences regarding their unobserved properties. Means to describe property precedence, since property abstraction (i.e., second principle) is manifested in precedence.

In the following sections the technologies that facilitate the required elements will be described in more depth.

3.1 Resource description framework

Resource description framework (RDF) is a means to represent information on the web. RDF has an abstract syntax that reflects a “simple graph-based data model” (Klyne and Carroll 2004) and allows machine-processable information to be used beyond the environment for which it was created in order to facilitate “interworking among applications” (Klyne and Carroll 2004).

RDF is a simple data model that can be used to make statements about any resource. A resource is a fundamental element in the domain of the web; a more general definition of the resource in Berners-Lee et al. (1998) states that “a resource can be anything that has identity”. An example of a resource in the general sense could be an addressable file in the computer memory, such as C:\Documents\SampleResource.txt, since it can be located, accessed, and referenced. A Web Resource could be defined more specifically as everything that can be identified and retrieved on the web or in any other networked system (W3C 1999; Berners-Lee 2009).
In semantic web and RDF terminology, a resource could be defined simply as a source of information on the web which is usually identified by a uniform resource identifier (URI, i.e., a string of characters used to identify a name or a resource over a network) (W3C 1999; Klyne and Carroll 2004). According to this definition, http://en.wikipedia.org/wiki/Elizabeth_II is an example of a web resource that could be identified, accessed, and referred to. RDF also allows for representation of anonymous resources with the notion of blank node. A blank node is a resource for which the URI or literal is not given; it is just a “unique node that can be used in one or more RDF statements, but has no intrinsic name” (Klyne and Carroll 2004). As an example, stating that “John has a friend” would require using a blank node to represent the unknown friend. In this scenario, “John” would be an identifiable resource and “having friends” is an identifiable attribute, whereas “a friend” is an anonymous resource.

A blank node is a convention which allows “several statements to reference the same unidentified resource” (Klyne and Carroll 2004). This will be explored further in the examples to be discussed later. With the RDF syntax, statements can be made about resources in the format of <subject, predicate, object> expressions; such an expression is known as an RDF triple. The subject represents the resource to be described; it could be either a URI or a blank node and, as a resource in the general sense, can refer to anything that can be identified (W3C 1999). A predicate, which is also called property, indicates a relationship between the subject and the object of the statement, and it is represented by a URI. The thing denoted by the object could be a URI, literal, or a blank node and could be considered the value of the corresponding property. The RDF data model can intrinsically be represented as a directed graph (Klyne and Carroll 2004).
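The constraints on the three positions of a triple described above can be sketched in plain Python. This is an illustration only and does not use any RDF library; the class names URI, Literal, BlankNode, and Triple, the helper make_triple, and the example.org identifiers are all hypothetical.

```python
from typing import NamedTuple, Union


class URI(str):
    """A resource identified by a URI."""


class Literal(str):
    """A literal value such as a name, a number, or a date."""


class BlankNode:
    """An anonymous resource: usable in statements, no intrinsic name."""
    _counter = 0

    def __init__(self):
        BlankNode._counter += 1
        self.label = f"_:b{BlankNode._counter}"  # local label only

    def __repr__(self):
        return self.label


class Triple(NamedTuple):
    subject: Union[URI, BlankNode]
    predicate: URI
    object: Union[URI, Literal, BlankNode]


def make_triple(subject, predicate, obj):
    """Build a triple, enforcing which kinds of term each position accepts."""
    if not isinstance(subject, (URI, BlankNode)):
        raise TypeError("subject must be a URI or a blank node")
    if not isinstance(predicate, URI):
        raise TypeError("predicate must be a URI")
    if not isinstance(obj, (URI, Literal, BlankNode)):
        raise TypeError("object must be a URI, a literal, or a blank node")
    return Triple(subject, predicate, obj)


# "John has a friend": the unknown friend is a blank node.
EX = "http://www.example.org/"
john = URI(EX + "John")
friend = BlankNode()
statement = make_triple(john, URI(EX + "hasFriend"), friend)
print(statement)
```

Each triple corresponds to one directed edge of the graph, from the subject node to the object node, labelled with the predicate.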
Figure 3a depicts the general structure of an RDF graph along with an example of an RDF expression which states that “John Smith has an email address, and it is johnsmith@gmail.com” (Figure 3b).

Figure 3: An RDF graph: (a) general structure, (b) an example.

The same expression in the form of an RDF triple would be <John Smith, hasEmail, johnsmith@gmail.com>. Making some statements might require referencing the same node in order to preserve the semantics of the represented phenomenon. For example, in order to state “John Smith has worked for Google since 2007”, one would need to create a connection between John Smith and Google to describe the employment relationship as well as another connection to show the time of his employment with Google. Two unconnected statements, <John Smith, Works for, Google> and <John Smith, has worked since, 2007>, would be interpreted separately, and the fact that John has been employed by Google since 2007 would not be visible in such a representation. Using a blank node allows both objects (i.e., Google and 2007) to be related to John within the same graph to preserve the meaning of the original statement. With the RDF data model the above statement could be represented as follows:

<John Smith, Works, blank_node_1>
<blank_node_1, Company, Google>
<blank_node_1, Since, 2007>

A graphical representation is shown in Figure 4:

Figure 4: An example using a blank node

The storage, transmission, and retrieval (i.e., serialization [8]) of RDF data are commonly done with the use of the eXtensible Markup Language (XML) format. Other formats such as Notation3, Terse RDF Triple Language (Turtle), and N-Triples are also used for storing RDF data. N-Triples is a subset of Turtle, which itself is a subset of Notation3. These formats are more compact and human-readable alternatives to XML (Berners-Lee 2005). For example, with the XML format the statement <John Smith, Works for, Google> can be represented as shown in Table 2.
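The role of the blank node in the employment example can be sketched as follows. This is a plain-Java illustration under stated assumptions, not a triplestore API: statements are held as string arrays, "_:b1" is an illustrative label for blank_node_1, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BlankNodeSketch {
    // The three statements from the text, as {subject, predicate, object}.
    // "_:b1" is an illustrative label standing in for blank_node_1.
    static final String[][] GRAPH = {
        {"John Smith", "Works", "_:b1"},
        {"_:b1", "Company", "Google"},
        {"_:b1", "Since", "2007"}
    };

    // For each node reached from the person through "Works", collect the
    // properties attached to that node, so company and start year stay together.
    static List<Map<String, String>> employments(String person) {
        List<Map<String, String>> result = new ArrayList<Map<String, String>>();
        for (String[] t : GRAPH) {
            if (t[0].equals(person) && t[1].equals("Works")) {
                Map<String, String> details = new LinkedHashMap<String, String>();
                for (String[] u : GRAPH) {
                    if (u[0].equals(t[2])) {
                        details.put(u[1], u[2]);
                    }
                }
                result.add(details);
            }
        }
        return result;
    }
}
```

Calling BlankNodeSketch.employments("John Smith") yields a single employment record that carries both the company and the start year, which is the association that two unconnected statements about John would lose.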
Making the same statement using Notation3 format would be as follows:

@prefix ex: <http://www.example.org/> .
<http://www.example.org/John_Smith>
    ex:worksFor "Google" .

In this case, the domain in which the resources are located is www.example.org. The first line declares the “ex” prefix, the second line gives the subject, and the third line states the predicate and object of the statement.

[8] “Serialization is the process of converting data into a format that can be stored and retrieved later in the same or another computer environment” (www.wikipedia.org/wiki/Serialization).

Table 2: An example of the XML format

Command: <rdf:RDF
Description: This line declares that this code is a piece of RDF data.

Command: xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
Description: This line refers to the location from which the RDF syntax can be accessed.

Command: xmlns:ex="http://www.example.org/">
Description: “ex” is defined as a prefix for the knowledge domain, which in this case is www.example.org. Instances, properties, and the classification scheme can be located within the knowledge domain.

Command: <rdf:Description rdf:about="http://www.example.org/John_Smith">
  <ex:worksFor>Google</ex:worksFor>
</rdf:Description>
Description: The URI of the subject of the RDF statement (i.e., John Smith) is specified. The next line states that Google (i.e., the object of the RDF statement) is the value of the “worksFor” predicate.

Command: </rdf:RDF>
Description: This line closes the RDF element.

3.2 Triplestore

A triplestore is a purpose-built database for the storage and retrieval of RDF data; in other words, a system that manages the storage, modification, and retrieval of RDF data is called a triplestore. Some triplestores have been built on top of existing commercial relational database engines (i.e., SQL-based). An implementation using the Oracle database management system is one example, in which the RDF triples are stored, indexed, and queried as object relational data types (Oracle 2009).
However, mapping an RDF query, which is based on the graph structure of RDF, onto SQL queries might not be very efficient (Pelligrini 2006), since SQL is not particularly suited for storage and retrieval of RDF data (which might not fall under a rigid structure). Natively developed triplestore applications may therefore be more advantageous. Because the RDF data model can intrinsically be represented as a directed graph (Klyne and Carroll 2004), an RDF-based graph database management system, which uses nodes and edges to represent instances and their properties, was selected for the implementation of the proposed “class-less information management system”. In a graph triplestore the subject and object of an RDF statement could be represented as nodes, and the predicate of the statement would be shown as a directed edge that connects the nodes. The experiments in this project were run on AllegroGraph, which is a graph database that is designed to store RDF triples and supports querying and reasoning over the RDF data. The choice here of a graph triplestore was based solely on the structural similarity and compliance of the RDF model with the graph data structure of such triplestores. No tests were performed to compare the performance of a graph triplestore with that of a relational triplestore.

3.2.1 Query language

A computer language that is able to retrieve and manipulate data stored in RDF format is called an RDF query language. SPARQL protocol and RDF query language (SPARQL) is a language that was made standard by the World Wide Web Consortium, and it will be used in the tests run on the implementation discussed here (Prud’hommeaux and Seaborne 2008).
SPARQL allows users to run queries on the pattern of the triples (i.e., different permutations of the <subject, predicate, object> of the RDF statement can be queried) and to combine patterns with logical conjunctions and disjunctions (e.g., intersection and union of classes) (Prud’hommeaux and Seaborne 2008). As an example, creating a SPARQL query that lists all the female employees of Google requires listing all the subjects that are present in <?subject, Works For, Google> statements and then finding the intersection of the result with the “Female” class. The SPARQL code of this example is given in Table 3.

Table 3: SPARQL code of a query that lists female employees of Google

Command: SELECT ?subject WHERE
Description: Selects the subject of RDF statements which satisfy the following conditions.

Command: { ?subject <worksFor> <Google> .
Description: The subject has the “worksFor” property, and the object of this statement is Google.

Command: ?subject <hasGender> <Female> . }
Description: The subject has the “hasGender” property, and the object of this statement is Female.

Please refer to Appendix A for a list of commonly used SPARQL commands. More examples will be provided in section 4.4 and in chapter 5.

3.3 Web Ontology Language

Ontology can be defined as a set of domain concepts, relationships among the concepts, and premises about what can exist and happen (Berners-Lee et al. 2001; Smith et al. 2004). In the initial stages of the emergence of the semantic web, RDF schema (RDFS), which is “a vocabulary for describing properties and classes of RDF resources with semantics for generalization-hierarchies of such properties and classes” (McGuinness and van Harmelen 2004) [9], was considered to be a language which had enough descriptive power for the needs of that period. As the need for a more expressive ontology language grew, the World Wide Web Consortium started endorsing the development of Web Ontology Language (OWL).
OWL has added more vocabulary for describing properties and classes to the constructs of RDFS, such as the ability to assign cardinality and to define relationships between classes (e.g., disjoint classes) and characteristics of properties (e.g., functional, symmetrical) (Smith et al. 2004; McGuinness and van Harmelen 2004). Simply put, OWL provides means to define classes and properties for those classes, define individuals and assign properties to them, and finally make inferences (i.e., reasoning) regarding the individuals and classes.

[9] In RDF terminology, a set of instances is a class and “the rdf:type property may be used to state that a resource is an instance of a class” (W3C 1999).

OWL is a general term that represents a family of three sublanguages: OWL Lite, OWL DL, and OWL Full (Smith et al. 2004). Each of these sublanguages is designed for specific purposes. OWL Lite is the least expressive of the three; it helps users define a classification hierarchy and simple constraint features such as cardinality. OWL Description Logics (OWL DL) utilizes the constructs, restrictions, and constraints of the OWL language while guaranteeing computational completeness and decidability of the reasoning application; in other words, it makes sure that the reasoning process will terminate with an answer. This means that the reasoning software would be able to support all the features of the defined ontology without falling into infinite loops of undecidability.

OWL Full provides maximum expressiveness, without any restrictions in the ontology (McGuinness and van Harmelen 2004). For example, OWL DL requires a separation between classes, properties, and instances, meaning that a class cannot be an instance or a property at the same time, and similarly an instance cannot be a property. However, this restriction does not exist in OWL Full; therefore, running a reasoner to make inferences might lead to undecidable states.
Figure 5 illustrates the descriptive power of the aforementioned languages in relation to each other.

Figure 5: OWL language family

3.3.1 Representation of the Web Ontology Language

As mentioned earlier in section 3.3, OWL was developed by adding more vocabulary and classes to the RDFS constructs in order to create a more expressive ontology language. According to the World Wide Web Consortium, “an OWL ontology is an RDF graph, which is in turn a set of RDF triples” (McGuinness and van Harmelen 2004). Therefore, RDF serialization methods such as XML can be used to represent OWL ontologies. As a very simple example, the code below shows how the “Employer” class can be created in the ontology, using the XML format.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://www.example.org/#Employer"/>
</rdf:RDF>

The properties that are used to define the “Employer” class are not shown above. More advanced examples will follow in chapter 4.

3.3.2 Reasoning and inference

A semantic reasoner is a software program that is able to make inferences based on the defined rules and axioms in the ontology (such as those defined by OWL DL). A reasoner is able to perform tasks such as a validation check (e.g., to see if a value assigned to a property is within its assigned range), check the consistency of the ontology, and determine the class membership status of instances (McGuinness and van Harmelen 2004). The last is of paramount importance in the proposed implementation; it will help the classification of an instance based on the properties it possesses and, based on the class to which it has been assigned, allow inferences on its other properties to be made.

3.4 Jena framework

Jena is an open source semantic web Application Programming Interface (API), which is based on the Java programming language. Jena provides means to create, store, modify, and query RDF data; it can be used as a stand-alone application for managing RDF data.
Also, as mentioned in section 3.3.1, an OWL ontology is considered an RDF graph; therefore, one can use the Jena API to create and modify OWL ontologies. As well, there are Jena client applications for many commercial databases and triplestores (such as AllegroGraph). Thus, code written in Jena can be used to create the ontology and manage the information. In the implementation proposed here, Jena could also be used for the management of larger scale data. The technologies introduced in this section will be used to build an implementation for the class-less information management system. In chapter 4, each principle will be paired with the technologies that support it, and then the capabilities of this implementation will be demonstrated in a sample scenario for further analysis.

4. Principles of Representation and the Implementation

In chapter 2, three principles were introduced that provide an alternative approach that could potentially solve most of the problems related to schema design and database operations by separating instances from the classification. Also, it was claimed that creation of a property lattice through the notion of property precedence may facilitate integration of multiple sources of information and make the management of data more flexible. Finally, the last principle, regarding the class structure definition, is not heavily focused on in this work. However, if followed, it provides extra facilities for the designers and users of the information system when they intend to make inferences and gain extra knowledge regarding the instances, properties, and their relationships. As mentioned earlier, the principles of the solution can be applied to any set of phenomena, e.g., business events, activities, and physical things (Parsons and Wand 2008b). However, the principles are very strongly linked to the notion of things in the Bunge–Wand–Weber (BWW) ontology, which is the underlying paradigm that guides this thesis.
This work incorporates only a subset of the concepts of the BWW ontology, not all of its ontological constructs. Table 4 introduces the concepts used and provides a definition for each one.

Table 4: Selected constructs from the Bunge–Wand–Weber ontology (Wand et al. 1999)

Ontological construct: Thing
Definition: In the BWW ontological model, the world is made up of things. Thing is the elementary unit in BWW. In the principles of the solution, things, as well as events, actions, and other phenomena, are represented by instances.

Ontological construct: Property
Definition: Property is any observation of a thing. Things possess properties, and they cannot exist without properties. A thing is known by its properties. For the purpose of applying the principles, a property can be defined as any statement about an observation of a phenomenon. For example, a car’s height, length, width, engine volume, etc. are its properties. In BWW, not having a property is not a property.

Ontological construct: Mutual property
Definition: A relational property that depends on two or more things. For example, “hasSpouse” is a mutual property that is shared between two individuals.

Ontological construct: Intrinsic property
Definition: A property that a thing possesses by itself, for example, a person’s height or an element’s melting point.

Ontological construct: Class
Definition: A set of things that are defined by the properties they possess or “a set of things possessing a common property”.

Ontological construct: Kind
Definition: A concept that is in most domains interchangeable with class. A kind is a set of things that are defined only by two or more properties they possess or in other words “a kind is defined by a set of properties”.

Ontological construct: Natural kind
Definition: A set of things whose properties are governed by certain laws, or in other words, “a set of lawfully related properties”.

*For convenience, thing can be considered as a typical metaphor, but it can be generalized to any phenomena.
Having defined the ontological constructs that are the foundation of this paradigm, one could show how the principles of the solution are mapped to the concepts in the semantic web domain. This section provides a brief and high-level overview of the mapping of the principles of representation used to achieve the proposed implementation. The detailed steps required to implement the instance-based flexible information system will be investigated later in this chapter, and multiple examples will be given. The justifications for mapping a certain construct to a particular semantic web concept will also be investigated in the section named after the concept (i.e., property lattice, class definitions and hierarchy, and repository of instances and properties). Table 5 shows an overview of this mapping.

Table 5: Mapping the principles of the solution to the concepts of the semantic web domain

Concept in the principles of the solution: Instance
Representation in the semantic web domain: A resource/node in the RDF data model, either as the subject of a statement or, if the instance is the value of a mutual property, as the object of an RDF statement.
Example: John Smith; <John Smith, hasGender, Male>; <Christina Miller, hasSpouse, John Smith>

Concept in the principles of the solution: Property
Representation in the semantic web domain: Properties can be defined in OWL using the property construct, and these properties can be represented using the RDF data model. In RDF, predicates or properties are represented as edges that connect the nodes.
Example: “hasGender” or “hasSpouse” are properties.

Concept in the principles of the solution: Mutual property
Representation in the semantic web domain: In the OWL domain, object properties link an instance to another instance.
Example: <Jane, hasSibling, Eric>

Concept in the principles of the solution: Intrinsic property
Representation in the semantic web domain: In OWL, datatype properties link an instance to a datatype or a literal value.
Example: <Samantha, hasHeight, 165>

Concept in the principles of the solution: Class
Representation in the semantic web domain: OWL classes are defined based on properties. Certain restrictions are used to define the properties that the members of a class possess. The value of the properties can be restricted as well. It should be noted that OWL classes are not the same as BWW classes. In this implementation, OWL is used to represent relationships among classes and to show class definitions.
Example: A musician can be defined as a person who plays a musical instrument. Class: Musician EquivalentTo: (type only Person) and (playsInstrument some Musical Instrument)

Concept in the principles of the solution: Kind
Representation in the semantic web domain: A class in the ontology.

Concept in the principles of the solution: Natural kind
Representation in the semantic web domain: A class in the ontology. In OWL, axioms can be defined to specify the rules governing the relationships between the properties that define a class.

As mentioned in section 3.1, resource description framework (RDF) is a means to represent information on the web, and the RDF data model has an abstract syntax that reflects a “simple graph-based data model” (Klyne and Carroll 2004). Because of this, a graph-based triplestore was chosen for the implementation described here, in which the RDF subjects and objects were stored as nodes, and the predicates (or properties) were represented as edges connecting the nodes. With the RDF data model, the instance base can be implemented by storing the instances (i.e., nodes) along with their properties (i.e., edges) in a triplestore. In this implementation, the instances in the triplestore would not be bound to the classification; therefore, the first principle of the solution, which was separation of instances from classes, would be satisfied.

The second principle was property abstraction, and it was manifested through the notion of property precedence. In this work an OWL ontology was used to define the property hierarchy. OWL allows for defining the properties and the relationships that they have with respect to each other. For example, one could define that “hasBrother” and “hasSister” are subproperties [10] of “hasSibling”. Also, “hasSibling” could be defined as a subproperty of “hasRelative”.
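The effect of such a hierarchy on retrieval can be sketched in plain Java. This is a hedged illustration only, not the OWL reasoner or the Jena API: the class name PropertyHierarchySketch, its methods, and the sample statements about “Jane” used in the usage note are assumptions made for this example, and the hierarchy is assumed to be acyclic.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PropertyHierarchySketch {
    // Direct subproperty-to-superproperty links from the example in the text.
    // A list is used so that a property may have several direct superproperties.
    static final Map<String, List<String>> SUPERS = new HashMap<String, List<String>>();
    static {
        SUPERS.put("hasBrother", Arrays.asList("hasSibling"));
        SUPERS.put("hasSister", Arrays.asList("hasSibling"));
        SUPERS.put("hasSibling", Arrays.asList("hasRelative"));
    }

    // True if p equals q, or q is reachable from p by following superproperty links.
    // Assumes the hierarchy has no cycles.
    static boolean isSubPropertyOf(String p, String q) {
        if (p.equals(q)) {
            return true;
        }
        List<String> parents = SUPERS.get(p);
        if (parents == null) {
            return false;
        }
        for (String parent : parents) {
            if (isSubPropertyOf(parent, q)) {
                return true;
            }
        }
        return false;
    }

    // Objects related to the subject by the property or by any of its subproperties.
    static Set<String> values(String[][] triples, String subject, String property) {
        Set<String> out = new LinkedHashSet<String>();
        for (String[] t : triples) {
            if (t[0].equals(subject) && isSubPropertyOf(t[1], property)) {
                out.add(t[2]);
            }
        }
        return out;
    }
}
```

With the statements <Jane, hasBrother, Eric>, <Jane, hasSister, Anna>, and <Jane, hasSpouse, Tom>, asking for the values of “hasRelative” for Jane returns Eric and Anna but not Tom, because only the first two predicates are manifestations of “hasRelative” in this hierarchy.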
Figure 6 shows a representation of this property hierarchy:

Figure 6: A property hierarchy

Having a property hierarchy can facilitate the integration of multiple sources of information, as well as running complex queries. Examples and details will be investigated in section 4.1.

[10] “If property P2 is a subproperty of property P1, then all pairs of resources which are related by P2 are also related by P1” (W3C 1999).

Finally, to support the third principle, which was a set of guidelines for creating useful class structures (in order to facilitate inference), one could define the classes in an OWL ontology. In the ontology, classes are defined by their properties, and axioms could be defined to specify the relationship between properties in the form of restrictions. Using the guidelines of the third principle, one can define class structures that are useful and informative from a cognitive point of view. It should be noted that enforcing the third principle is beyond the scope of this thesis; this work simply investigates the possibility of supporting this principle. The details of creating classes and defining restrictions will be introduced in section 4.2.

The purpose of this subsection was to lay out the roadmap for the rest of the chapter and to claim that the principles can be implemented. The “how-to” of the implementation will be elaborated on in the following subsections. First, the creation of the property lattice will be described. Then, having specified the set of properties in the relevant universe, one can define the classes. Later, in section 4.3, the details of creating and populating the instance base will be described. In this chapter, Protégé (software developed at Stanford University and the University of Manchester) is used as an interface to create the property lattice, class definitions, and the hierarchy. To store the instances and make queries, AllegroGraph, an RDF-based triplestore, is used.
4.1 Property lattice

In this proposed implementation, an OWL ontology will be used to define the class hierarchy and the property lattice. Parsons and Wand (1997) defined a class as “a set of properties, which help us abstract our knowledge about instances”. Also, as noted earlier, the relevant universe is defined by the set of known properties and the set of known instances. Thus, it makes sense to design the property lattice before the class definitions and their hierarchy. One underlying assumption is that all the relevant properties in the universe are defined in the property lattice. This is not a very limiting assumption. If the need to define a new property arises, the property lattice can be updated without modifying the instance model; the latter would not be easily possible in a relational database paradigm.

4.1.1 OWL property types

In OWL, three types of properties exist:

• Object properties, which link an instance to another instance. This corresponds to the notion of mutual property (a relational property that depends on two or more things) in the BWW ontology (Wand and Weber 1990; Stanford Center for Biomedical Informatics Research 2011). For example, “hasSpouse” is a property that depends on two people who are married to each other. In other words, if the value of the property is an instance of a class, it would be an object property (e.g., the value of “hasSpouse” would be an instance of “Person”).
• Datatype properties, which link an instance to a datatype or an RDF literal (Stanford Center for Biomedical Informatics Research 2011). As a design decision, datatype properties would be used to model intrinsic properties (a property that a thing possesses by itself) of the Bunge–Wand–Weber (BWW) ontology (Evermann and Wand 2009). Properties such as height, mass, and calorie value can be captured this way.
• Annotation properties, which give the option of annotating classes, properties, and the ontology header with metadata.
Comments, version information, and labels are examples of annotation properties. Although annotation properties do not directly correspond with the principles of the implemented solution, they could add valuable information to the ontology.

Object properties link two individuals; therefore, a domain and a range can be specified for them. A property links an instance from the domain to an instance from the range. For example, take the “hasEmployer” property. This property connects an instance from the “Employee” class to an instance from the “Employer” class. The statement that John (an employee) works for Google (an employer) is represented using the RDF data model in Figure 7.

Figure 7: RDF statement: <John, hasEmployer, Google>

Table 6 shows how an ontology model can be created (in this scenario named “m”) and the “hasEmployer” object property added to the ontology using the Jena application programming interface (API).

Table 6: Creating an ontology and adding an object property

Command: OntModel m = ModelFactory.createOntologyModel();
Description: In this line, the ontology is created and the name “m” is assigned to it.

Command: ObjectProperty hasEmployer = m.createObjectProperty("http://www.example.org/#hasEmployer");
Description: “hasEmployer” is a mutual property (in the BWW sense) between Employee and Employer. However, the domain and range of this property have not been defined yet. In this code, “m” is the ontology model in which all the classes and properties will be defined. The method “createObjectProperty” creates the object property and adds it to the ontology model (i.e., “m”). In this scenario, "http://www.example.org/#hasEmployer" is the URI of the “hasEmployer” object property.

OWL also supports the notion of property manifestation with the concept of SubProperty. For example, one can define that “hasParent” is a subproperty of “hasAncestor”.
This feature of OWL would support the concept of precedence that was introduced in the second chapter and allows one to define generalizations and manifestations of properties. The code below (written in Jena) shows how “hasParent” has been defined as the manifestation of “hasAncestor”.

hasAncestor.addSubProperty( hasParent );

To assign a domain and a range to a property, it is necessary to have the classes defined. The domain and range would also be used by the reasoner to make inferences. In the case of the “hasEmployer” property, within this context it has been defined that the range of “hasEmployer” is the “Employer” class. If it is asserted that <Jack, hasEmployer, Microsoft>, the reasoning software will use the ontology to infer that “Microsoft” is an instance of the class “Employer”. On the other hand, imagine that one specifies that “Honey” is an instance of “Food” and that the classes “Food” and “Employer” are disjoint from each other. In this case, if one adds <Jane, hasEmployer, Honey>, the reasoning software will report an inconsistency, since the range of “hasEmployer” implies that “Honey” is an instance of “Employer”, which contradicts its membership in the disjoint class “Food”.

It should be noted that the reasoner is used in this implementation to facilitate the inference process; without the reasoning software, the inferences are still possible in principle. Adapting to the reasoning software’s requirements is not part of the principles used in this implementation. The reasoner is introduced as a structured way of analyzing the data and finding answers (i.e., making inferences); the same inferences could otherwise be made manually.

The code below shows how the range of the “hasEmployer” property can be assigned to be instances of the “Employer” class.
hasEmployer.addRange( Employer );

In order to model the intrinsic properties of an instance, one would utilize datatype properties in OWL; datatype properties link an instance to a schema datatype or an RDF literal (Stanford Center for Biomedical Informatics Research 2011). In the BWW ontology, intrinsic properties are the properties that depend on one thing only. In a conceptual model, as well as an ontological model, an attribute [11] of an entity [12] can be a representation of an intrinsic property of a thing in the real world (Wand et al. 1999). In short, since there is no need for binary or n-ary relationships to model intrinsic properties, one can use the notion of datatype property in OWL to model the entity attributes or the properties that depend on one thing only (i.e., intrinsic). The nutritional value of a certain food or someone’s height are examples of intrinsic properties that can be modeled using the integer datatype. Later, queries can be designed to single out (i.e., select) instances with certain values that fall within a given range. For example, if one stores the birthdates of people using the DateTime datatype, one can run queries to find certain individuals who were born between 01/01/1940 and 31/12/1949. In chapter 5, more examples will be studied. Creating the property lattice also allows the classes to be defined. Figure 8 shows how the object properties are represented in the Protégé environment:

Figure 8: Property hierarchy in Protégé

[11] “Properties of conceptual things are termed attributes” (Wand et al. 1999).
[12] “An entity is a ‘thing’ which can be distinctly identified” (Chen 1976).

4.2 Class definitions and hierarchy

With OWL, classes will be defined based on properties; also, the relationship of classes with respect to each other across the class hierarchy will be specified. In OWL every class is a subclass of “Thing”. It should be noted that “thing” in OWL is different from the “thing” in the BWW ontology.
In OWL, “thing” is defined as a set that includes all instances (Stanford Center for Biomedical Informatics Research 2011) across the relevant universe. After a class is defined in OWL, it is possible to specify its subclasses or the sibling classes that exist on the same level within the hierarchy. Table 7 shows how one can create the “Employer” class using the Jena API. It is assumed that the ontology is defined on the www.example.org domain, and therefore http://www.example.org/#Employer can be set to be the URI of the “Employer” class:

Table 7: Creating classes in the ontology

Command: OntClass Employer = m.createClass("http://www.example.org/#Employer");
Description: “Employer” is a class that is added to the “m” ontology.

Command: OntClass Employee = m.createClass("http://www.example.org/#Employee");
Description: “Employee” is a class that is added to the “m” ontology.

Figure 9 shows the class hierarchy pane in Protégé:

Figure 9: Class hierarchy pane

Here, properties will be used to define the classes. Property restrictions in OWL are the means of defining classes. There are three types of restrictions that could be used (Stanford Center for Biomedical Informatics Research 2011):

• Quantifier restrictions, which can be categorized as existential and universal restrictions.
o Existential restrictions identify a set of instances that “participate in at least one relationship along a given property, to instances that are members of a specified class” (Stanford Center for Biomedical Informatics Research 2011). The keyword “some” is used to denote existential restrictions. For example, if one wants to state that a computer has at least one CPU, one needs to use an existential restriction. The computer may have different components as well, but CPU is one of them.
o Universal restrictions identify a set of instances that for a certain property “only have relationships along this property to instances that are members of a specified class” (Stanford Center for Biomedical Informatics Research 2011). The keyword “only” is used to denote universal restrictions. As an example, ingredients of vegetarian food belong to the vegetable class only.
• Cardinality restrictions, which could be used to describe instances that participate in “at least, at most, or exactly a specified number of relationships with other instances or datatype values”. For instance, an active student needs to be registered in at least one course.
• hasValue restrictions, which describe the set of instances that have “at least one relationship along a specified property to a specific instance”. It should be noted that hasValue restrictions distinguish themselves by restricting the property value to a particular instance, rather than members of a certain class (as done in quantifier restrictions). For example, to define the “French” class, the value of the “hasCountryOfOrigin” property should be restricted to France, which is an instance of the “Country” class.

For example, using these restrictions to define the “Employee” class, one would say that an employee is an individual who has at least one employer. To express this statement in OWL, one would need to use an existential restriction to define the “Employee” class. As shown in Table 8, first an existential restriction is created (here it is named Employment) which restricts the “hasEmployer” property to instances from “Employer”. Then, it is asserted that this restriction is defined as a superclass of Employee, meaning that all employees should inherit this restriction; simply put, that is how the employee class is defined. The restriction in this example was created using Jena commands.
Table 8: Creating existential restrictions in the ontology
Command | Description
SomeValuesFromRestriction Employment = m.createSomeValuesFromRestriction(null, hasEmployer, Employer); | An existential restriction called “Employment” is defined on the “hasEmployer” property.
Employee.addSuperClass(Employment); | This line asserts that the restriction is a superclass of “Employee”.

If Protégé is used, this restriction can simply be typed in the class description page. Figure 10 shows the definition of the “Employee” class.

Figure 10: Defining the “Employee” class

As an example of hasValue restrictions, one could consider creating a class of Google employees; to do so, one needs to use a hasValue restriction to restrict the value of the “hasEmployer” property to only “Google” (which itself is an instance of the class “Employer”). In Table 9, Google is defined as an instance of employer, and then the hasValue restriction is created:

Table 9: Creating hasValue restrictions
Command | Description
Individual Google = m.createIndividual("http://www.example.org/#Google", Employer); | “Google” as an individual (instance) is added to the ontology.
HasValueRestriction GEmploy = m.createHasValueRestriction(null, hasEmployer, Google); | This line asserts the hasValue restriction.

In the process of defining classes, some restrictions could be defined as necessary and sufficient for membership in a class. By listing those conditions, a defined class, in which there is at least one necessary and sufficient condition [13], could be created. A class that is defined only by necessary conditions is called a primitive class. As an example of a defined class, an “Employee” is a person who might have many other properties, such as having a gender, but the necessary and sufficient condition for being listed as an employee is having an employer. This adheres to the cognitive economy principle, which tries to minimize the effort needed for classification.
In this methodology all the properties in the class are necessary, but only some are sufficient for the reasoning software to determine class membership. Figure 11 depicts the scenario in which an employee is a person who inherits the property of having a gender from its superclass (i.e., “Person”); it is also stated that the necessary and sufficient condition (i.e., equivalent class) to be an employee is having an employer.

Figure 11: An “Employee” with its necessary and sufficient condition

[13] In OWL, the term “Equivalent Class” is used. The Jena code is Employee.addEquivalentClass(Employment);

It should be noted that reasoning in OWL is based on the open world assumption (OWA), which means that “we cannot assume something doesn't exist until it is explicitly stated that it does not exist; in other words, because something hasn't been stated to be true, it cannot be assumed to be false” (Stanford Center for Biomedical Informatics Research 2011). This could lead to potential problems in classes that use only existential restrictions. As an example, assume that one defines a “Car” as a self-propelled vehicle that moves on land; a “Plane” could be defined as a self-propelled vehicle that moves on land and moves in air. In this case, due to the underlying OWA, a “Plane” would be considered a subclass of “Car”, since it has the two properties that define “Car” (i.e., self-propelled, and moves on land). In the BWW ontology, not having a property is not a property (Wand et al. 1999); that means one cannot specify that “Car” is not able to move in air. To resolve this issue, in the BWW ontology, “Car” can be defined as a self-propelled vehicle that moves strictly on land, or in other words there is a law that states “if it moves, it must be on land”. This law links the properties of “moving” and “being on land”. In BWW terminology, “Car” would be a natural kind which “contains only things whose combination of properties abides by laws” (Wand et al. 1999).
However, as a design issue in the proposed implementation, the notion of natural kind cannot be defined in OWL, as mentioned earlier in this chapter. To avoid classification of “Plane” as a subclass of “Car” in OWL, closure axioms could be added at the end of those existential restrictions. In other words, one needs to state explicitly that only the union of the values stated in the existential restrictions can occur and nothing else. The class of “Car” would be defined as a vehicle that has an engine and moves on land, and the definition can be closed on the union of those properties. Even though this might contradict BWW, which states that not having a property is not a property, in this work closure axioms are used for the sake of representation and to comply with cognitive economy. If a closure axiom is used, an instance of plane would not be classified as a “Car”, since it can also move in air, which is not included in the definition of “Car”. It should be noted that in this case the axiom closes the range of the “moves” property and restricts it to land and nothing else. The closure axiom does not put any restrictions on other properties; one can state that an instance of the car has a certain color, or a certain top speed, etc., and still get the expected classification from the reasoner or any other inference mechanism used [14] (i.e., it would still be classified as a car). In short, a closure axiom implies that there is additional information (such as laws that link certain properties that define a class), but for the sake of convenience in representation, the closure axiom is used. It should be acknowledged that ontologically there are distinguishing properties between a car and a plane (that might be explicit), but the modellers may not be fully aware of them or may decide not to include them in the design.
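The effect of closing an existential definition can be illustrated with a minimal sketch in plain Java. This is not Jena code and not an OWL reasoner; the class and method names are hypothetical, and the sketch assumes that the listed values of the “moves” property are complete, which is a closed-world simplification made only for illustration.

```java
import java.util.Set;

// Hypothetical plain-Java sketch; not Jena code and not an OWL reasoner.
final class ClosureAxiomSketch {

    // Existential restrictions only: every required value must be present,
    // and any additional values are tolerated.
    static boolean matchesExistential(Set<String> observed, Set<String> required) {
        return observed.containsAll(required);
    }

    // Existential restrictions plus a closure axiom: every required value must
    // be present, and no value outside the required set may occur.
    // Assumption: "observed" is treated as the complete list of values.
    static boolean matchesClosed(Set<String> observed, Set<String> required) {
        return observed.containsAll(required) && required.containsAll(observed);
    }

    public static void main(String[] args) {
        Set<String> carMovesOn = Set.of("Land");           // definition of "Car"
        Set<String> planeMovesOn = Set.of("Land", "Air");  // a sample plane

        // Without closure the plane satisfies the "Car" definition.
        System.out.println(matchesExistential(planeMovesOn, carMovesOn)); // true
        // With closure it no longer does, because "Air" is outside the definition.
        System.out.println(matchesClosed(planeMovesOn, carMovesOn));      // false
    }
}
```

Only the “moves” property is closed in this sketch; other properties, such as color or top speed, are not inspected and so do not affect the outcome.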
4.3 Repository of instances and properties

To populate the repository with instances and their properties, one could import the ontology into the triplestore first. It should be noted, however, that populating the instance base is not dependent on the class hierarchy or the property lattice. Nevertheless, the designer’s awareness of the relevant properties in the universe is needed to ensure compliance between the ontology and the instance base. In other words, it is not desirable to have instances with properties that are not relevant in the universe. If multiple sources are being integrated, it is recommended to reconcile the property lattices of the sources first, to obtain a universal property lattice across the sources, and to use that lattice as a reference in populating the instance base. The details of implementing this hypothetical situation (i.e., integration of multiple sources) are beyond the scope of this thesis and could be studied in future research.

[14] Reasoning software only facilitates the inference process. The principles of the solution do not rely on the processing power of a reasoner.

It should be emphasized that the principles of the approach do not require that the class hierarchy, property lattice, and instances be loaded together. However, to be able to run queries in practice (especially those that look for abstraction and precedence of properties, such as a query that lists “everything red”), one needs to load the ontology (either the file or its resource location on the web) into the triplestore. The first principle of this approach was separation of instances from classes. Having the classification and instances within a triplestore does not contradict this principle. As long as instances are not bound to a class, the first principle is satisfied, even though the instances and class definitions might exist in the same data management system.
In this implementation AllegroGraph® 3.3, which is a graph-based triplestore, was used. Figure 12 shows a snapshot of the loaded ontology in the triplestore. This figure is neither the representation of the property lattice nor the class hierarchy, but it shows that “Employee” and “Employer” are subclasses of “Person” and that all three (i.e., Employee, Employer, and Person) are of the type “Class”. It also shows that the “hasEmployer” property has the “Employee” class as its domain and the “Employer” class as its range.

Figure 12: Loading the ontology in the triplestore

Since the first principle of the solution was separation of instances from classes, new instances would not be classified as they are created. A new instance will be defined to be an instance of “Thing”; Thing [15] is defined as a set that includes all instances (Stanford Center for Biomedical Informatics Research 2011) across the relevant universe. Instances will be classified, if needed, by using a reasoner, which uses the class definitions of the ontology to draw inferences based on the properties. As an example, consider the scenario of employment. Assume that in this universe the users’ view is that only individuals can be employers and that the data instance refers only to individuals. Some individuals are employees, and some are employers (and possibly employees as well).

[15] Not to be confused with “Thing” in the BWW ontology.

In the implementation, assume that two instances of “Thing” (the pre-defined construct in OWL), “E1” and “ER1”, have been added to the triplestore and then the following RDF statement was asserted: <E1, hasEmployer, ER1>. In this RDF statement, E1 acts as the subject, hasEmployer is the predicate, and ER1 is the object. This is illustrated in Figure 13.
Figure 13: Adding instances to the triplestore without classifying them

If a reasoner is run, it will be inferred that E1 is an “Employee” because in the ontology it was specified that the domain of the “hasEmployer” property is “Employee”. Thus, any instance that is positioned as the subject of “hasEmployer” will be classified as an “Employee”. Also, the range of “hasEmployer” was assigned to the “Employer” class. Therefore, ER1, which happens to be the object in the RDF statement above, would be classified as an “Employer”. In the users’ view described above, only people employ other people. Therefore, “Employee” and “Employer” were defined (in the OWL ontology) to be subclasses of “Person”. It is expected that the reasoner would infer, based on their properties, that E1 and ER1 are also instances of “Person”. Figure 14 shows the classification of instances after the reasoning software has been run within the triplestore. Running the reasoning software does not change the data but only represents the inferences made regarding the class membership of instances.

Figure 14: Running the reasoner in the triplestore

As is shown, the reasoner has identified E1 as an instance of “Employee” and hence as an instance of “Person” (another example of reasoning, using the Jena API, is provided in Appendix B).

Now assume that the users’ view of the domain changes, and in addition to people, organizations could also employ people. Since the needs within the domain changed, the ontology describing this view has to evolve as well. The following shows how the implementation proposed in this work provides the flexibility to accommodate evolving users’ views. Within the context of this example, the definition of “Employer” has to be changed in the ontology. The new definition would indicate that any instance that happens to be in the “range” [16] of the “hasEmployer” property is an instance of “Employer”.
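The domain and range inference described above can be sketched in plain Java. This is a simplified illustration under stated assumptions, not the AllegroGraph or Jena reasoner: the class name, the method names, and the hard-coded schema maps are hypothetical, and only domain, range, and single-parent subclass rules are modelled.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of domain/range classification; not a real OWL reasoner.
final class DomainRangeSketch {
    // Assumed schema, mirroring the example ontology in the text.
    static final Map<String, String> DOMAIN = Map.of("hasEmployer", "Employee");
    static final Map<String, String> RANGE = Map.of("hasEmployer", "Employer");
    static final Map<String, String> SUPERCLASS =
            Map.of("Employee", "Person", "Employer", "Person");

    // Each triple is {subject, predicate, object}. The triples are only read,
    // never modified: inference does not change the stored data.
    static Map<String, Set<String>> inferTypes(String[][] triples) {
        Map<String, Set<String>> types = new TreeMap<>();
        for (String[] t : triples) {
            addWithSuperclasses(types, t[0], DOMAIN.get(t[1])); // subject gets the domain
            addWithSuperclasses(types, t[2], RANGE.get(t[1]));  // object gets the range
        }
        return types;
    }

    // Records the class and then walks up the subclass chain.
    static void addWithSuperclasses(Map<String, Set<String>> types,
                                    String instance, String cls) {
        while (cls != null) {
            types.computeIfAbsent(instance, k -> new TreeSet<>()).add(cls);
            cls = SUPERCLASS.get(cls);
        }
    }

    public static void main(String[] args) {
        String[][] triples = { {"E1", "hasEmployer", "ER1"} };
        System.out.println(inferTypes(triples));
    }
}
```

Under these assumptions, changing the users’ view amounts to editing the schema maps (for instance, removing the “Employer” to “Person” entry); the stored triple <E1, hasEmployer, ER1> stays untouched.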
Referring to the RDF representation of the ontology, one could say that any instance that is at the receiving end of the “hasEmployer” predicate (i.e., the RDF object) is an employer. After updating the domain ontology, one would still be able to run the same queries/operations on the existing data, even though the class definition has changed; that is because instances are not bound to classes. In a traditional database (assumed to be a relational database), reflecting this change in the domain would require adding a new relationship from the “Employer” table to the “Organization” table. Additional operations would be needed to reorganize the database, such as updating the “Employer” table by adding some instances of “Organization” (with their foreign keys). In a relational database, in order to query all the employers, one needs to know the structure of the data (i.e., whether employers are people or organizations), since the query needs to perform a join operation on “Person” and “Organization”, whereas in the proposed implementation the employers can be queried simply by locating the instances that are at the receiving end of the “hasEmployer” property. In short, the system proposed in this work is flexible in applications in which the class definitions might change. Changing the schema does not require changing the data, and the operations/queries can be run as before.

[16] Refer to section 4.1.

4.4 Making queries

As mentioned earlier, SPARQL allows users to run queries on the pattern of the triples. This means that different variations of the generic RDF statement <subject, predicate, object> can be queried. As an example, one could query for all the instances that happen to be the subject of a particular property, like “hasEmployer”, as shown in Table 10.
Table 10: Listing all the instances that have an employer
Command | Description
SELECT ?subject WHERE | Lists all the nodes that are the subject of an RDF statement in which the conditions of the second line are satisfied.
{ ?subject <hasEmployer> ?object . } | This line states that the subjects should participate in statements that have the “hasEmployer” property.

This query lists every instance that participates in the “hasEmployer” relationship with any other instance (i.e., object node). In the simple database discussed here, the only result would be E1. As another example, one could write a SPARQL query that lists every subject and predicate that has ER1 as the object in their respective RDF statements. Table 11 contains the SPARQL code of this query:

Table 11: All the RDF subjects and predicates that have “ER1” as their object
Command | Description
SELECT ?subject ?predicate WHERE | Lists all the subjects and predicates of the RDF statements in which the conditions of the second line are satisfied.
{ ?subject ?predicate <ER1> . } | This line states that the object of the RDF statements needs to be “ER1”.

In Table 12 the retrieval operations needed to implement an instance-based database are listed, along with the semantic web representation of each operation. In Parsons and Wand (1997) the ‘update’ operations were also discussed in their Table IV. In this work, the storage operations were described in detail in the previous sections of this chapter.

Table 12: Retrieval operations needed for the semantic web implementation of the instance-based database (adapted from Parsons and Wand (1997))
Description | Name | SW Implementation | Details

Instance Base
Properties an instance possesses | Form | SELECT ?variable_property WHERE { <Given_Instance> ?variable_property ?variable_node . } | Using a SPARQL query allows all the properties that a given instance possesses to be listed.
Things which an instance is composed of | Composition | SELECT ?Components WHERE { <Given_Instance> ?hasComponent ?Components . } | This query lists the components that are objects of RDF statements in which “hasComponent” is the predicate of a given instance.
Things which an instance is a component of | Inclusion | SELECT ?Composite WHERE { ?Composite ?hasComponent <Given_Instance> . } | Generates a list of subjects in RDF statements in which a given instance is the object of the “hasComponent” predicate.
Instances which possess a property | Scope* | SELECT ?scope WHERE { ?scope ?hasGivenProperty ?variable_instance . } | The instances that have a given property are listed. This list is generated by looking at RDF statements in which the given property is the predicate of the RDF triple.
Instances which possess a mutual property | Linked Scope | SELECT ?mutual_property WHERE { ?instance_1 ?mutual_property ?instance_2 . ?instance_1 rdf:type owl:Thing . ?instance_2 rdf:type owl:Thing . } | This query lists all the instances that share a mutual property. In this implementation, there is a caveat that needs to be added: the instances must be of the OWL type “Thing”, otherwise intrinsic properties would have been listed.
Properties that a given property precedes | Preceded | SELECT ?preceded WHERE { ?preceded rdfs:subPropertyOf <Given_Property> . } | The query lists all the properties that are sub-properties of a given property.

Class Base
Classes that have been defined | Classes | SELECT ?classes_defined WHERE { ?classes_defined rdf:type owl:Class . } | This query generates a list of nodes in RDF statements which happen to be of OWL type “Class”.
Properties of a given class | Properties | SELECT ?property_possessed ?property_value WHERE { ?node_variable owl:onProperty ?property_possessed ; owl:someValuesFrom ?property_value . <Given_Class> owl:equivalentClass ?node_variable . } | A list of all the properties of a given class is generated by this query. In OWL, a class is defined by restrictions on certain properties; therefore in this query the restrictions defined on a given class are found first, and then the properties that form those restrictions are shown in the output.
Instances that belong to a class | Membership | SELECT ?members WHERE { ?members rdf:type <Given_Class> . } | All the instances that participate in RDF statements in which a given class is the value of the “rdf:type” property will be listed by this query.

* In OWL terminology, property domain is equivalent to scope.

SPARQL provides very powerful facilities, such as cardinality restrictions or various filters for running queries. Some of these advanced features will be demonstrated in chapter 5.

4.5 Chapter summary and conclusion

As mentioned in the first chapter, one of the objectives of this work was to show that by applying the innovative principles, a flexible information management system can be created using semantic web technologies. Such a system is able to separate instances from semantics as much as possible, support abstraction of properties (via a property hierarchy), and flexibly manage data. This chapter demonstrated that the proposed implementation is able to store the instance base and the ontology (i.e., class definitions and property lattice) without binding them together. An OWL ontology was used to represent property precedence (i.e., the manifestation of property abstraction) and class definitions. The implementation proposed here is able to handle dispersed sources of data over the web, with no central control; one only needs to know the URI that identifies each source of information on the web. This implementation also supports the evolution of users’ needs, by allowing modifications to the ontology without making any changes to the existing data (e.g., changing the definition of the “Employer” class).
Moreover, the proposed implementation facilitates having multiple views for multiple users. Each view can be defined as an ontology, and multiple ontologies could be used over the same repository of instances. While the semantic web technologies are not new, the novelty of this approach is in combining the principles with select technologies to obtain the advantages of the instance-based approach and, in particular, to show the flexibility that can be achieved. The next chapter intends to demonstrate the benefits of the proposed approach in a sample crowdsourcing application, and chapter 6 will study the implications and advantages of this implementation in more depth.

5. Case Study

In this case study, an ontology based on ancestry and generations will be examined to better demonstrate the features of the proposed implementation and at the same time point out the advantages of this approach. The objective is to show the flexibility of the semantic web implementation of a class-less information repository in its ability to create and use classes on the fly, change the property hierarchy without losing any data, and run queries that were not easily possible in the traditional data management paradigm. The class-less information repository that will be studied in this chapter could be used as a crowdsourcing application in which users from across the world enter their family data into the system. The generation-ontology would be used to infer relationships based on the users’ information and generate family trees of the people who have participated in this project. Users might find distant family members or realize the rich history of their heritage from past generations. The data are not known in advance, but it can be assumed that they are about people and the properties that potentially exist in human relationships.
Also, the data can be scattered over different sources, but having the URL of each instance in the triplestore allows the users to perform retrieval and/or updating operations on them. The mapping between the principles of the solution and concepts of the semantic web was provided in Table 5.

In the first phase of the case study, the ontology defines four object properties that will be used to define the classes and their respective relationships within the hierarchy. The properties “hasParent” and “hasChild” can be defined as inverse properties of each other, meaning that if <a, hasParent, b> is true, then <b, hasChild, a> is also true. The other two properties are “hasGender”, which will be used to assign gender to individuals, and “hasSibling”, which is used in the definition of the “Brother” and “Sister” classes (e.g., a female individual who has a sibling is considered to be someone’s “sister”). The first class defined is an enumerated class in OWL, called “Sex”, with the two possible values {MaleSex, FemaleSex}. [17] The class “Female” can be defined by restricting the value of the “hasGender” property to “FemaleSex”. Then the “Person” superclass is defined, which would include all the possible relationships within a generation or ancestry tree. Figure 15 shows a screen shot of the property and the class hierarchy within this ontology.

For example, “Father” can be defined as a person (i.e., subclass of “Person”) who is a male (i.e., the value of the “hasGender” property is “MaleSex”) and has a child. These are necessary and sufficient conditions for a person to be a father; hence, it will be a defined class in this ontology (in contrast to a primitive class, which contains only the necessary conditions). See chapter 4 for more information. Figure 16 shows how these existential and hasValue restrictions have been applied to the “Father” class (see sections 3.3 and 4.2 regarding ontology restrictions and their implementations, respectively).
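The combination of the inverse properties and the defined “Father” class can be sketched in plain Java. This is a hypothetical illustration, not the reasoner used in the case study: the class and method names are invented, the instance names in the usage example are made up, and the check that the child is a “Person” is omitted for brevity.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not Jena code and not an OWL reasoner.
final class FatherSketch {

    // Each triple is {subject, predicate, object}.
    // Returns the instances that satisfy the "Father" definition:
    // hasGender value MaleSex, and at least one hasChild relationship.
    static Set<String> fathers(String[][] triples) {
        Set<String> withChild = new TreeSet<>();
        Set<String> male = new TreeSet<>();
        for (String[] t : triples) {
            if (t[1].equals("hasChild")) {
                withChild.add(t[0]);
            }
            if (t[1].equals("hasParent")) {
                // Inverse property: <a, hasParent, b> implies <b, hasChild, a>.
                withChild.add(t[2]);
            }
            if (t[1].equals("hasGender") && t[2].equals("MaleSex")) {
                male.add(t[0]);
            }
        }
        withChild.retainAll(male); // both conditions must hold
        return withChild;
    }

    public static void main(String[] args) {
        String[][] triples = {
            {"Tom", "hasParent", "Sam"},
            {"Sam", "hasGender", "MaleSex"},
            {"Tom", "hasGender", "MaleSex"},
            {"Ann", "hasChild", "Tom"},
            {"Ann", "hasGender", "FemaleSex"}
        };
        System.out.println(fathers(triples)); // [Sam]
    }
}
```

In the usage example, Sam is recognized as a father even though only the “hasParent” statement was stored for him, which is the role the inverse property plays in the ontology.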
[17] An enumerated class in OWL does not directly correspond with BWW ontological concepts. In this case study an enumerated class was used for the sake of convenience in the implementation.

Figure 15: Property hierarchy vs. class hierarchy in the “Generation” ontology

Figure 16: Defining the “Father” class

This ontology can be imported into the triplestore for further experimentation. Later in this chapter the ontology will change as well, and the class definitions and property lattice will evolve. This will be done to support the claim that this approach is able to accommodate schema evolution without suffering from the problems that the relational database management paradigm might face (Parsons and Wand 2000). Also, some queries that might not seem straightforward will be tested to demonstrate the power and flexibility of the instance-based paradigm within the semantic web.

5.1 Making queries

To test the queries, the triplestore was populated with sample data (individuals who have parents, children, siblings, etc.). A small portion of the instance base (i.e., instances along with their properties) that was used for this case study can be viewed in Appendix C. The first query tested will help identify grandchildren of a certain individual in the instance base. It should be mentioned that the instances are not bound to classes (in compliance with the first principle of the instance-based paradigm) and the instances are accompanied only by their properties in the repository. This query could be written using two RDF statements that are connected through a blank node. In Figure 17 “node-variable-1” is the blank node, which would help identify John Smith’s children, and in the second statement it would act as the subject of the RDF statement, in order to generate a list of grandchildren. In these statements, the node named “?GrandChildren” is a variable node and not a class or an identified node.
That variable name was chosen simply so that when the results are generated, they would be listed under a heading that is clear and straightforward to the reader. Figure 17 shows a graphical representation of this query, and Table 13 shows the SPARQL code along with comments regarding each line.

Figure 17: Query that identifies grandchildren of a certain individual

Table 13: SPARQL code that lists John Smith’s grandchildren
Command | Description
SELECT ?GrandChildren WHERE | Lists all the “RDF objects” (conveniently titled GrandChildren) which participate in the following RDF-statement pattern.
{ <John_Smith> <hasChild> ?node_variable_1 . | This statement represents John Smith’s children (with blank nodes).
?node_variable_1 <hasChild> ?GrandChildren . } | Here, John Smith’s grandchildren are listed.

The example above was a Select query, which returns variables bound in a query pattern match. One could use Construct queries as well; these queries construct new triples out of the matching values in the query (Prud’hommeaux and Seaborne 2008). This practical feature of SPARQL facilitates creating classes on the fly. For example, the SPARQL code below constructs new triples that use a new “hasGrandChild” predicate; these triples can then be added to the triplestore. Construct queries will be referred to later in this chapter.

CONSTRUCT { ?node_1 <hasGrandChild> ?node_3 }
WHERE { ?node_1 <hasChild> ?node_2 . ?node_2 <hasChild> ?node_3 . }

Another more complicated example would be a query that lists the nephews of a certain individual. This can be done by first identifying the siblings of an individual, then listing the children of the siblings, under the condition that those children are male. As in the previous example, this was achieved by using blank nodes. In the code in Table 14, “?Nephews” is the name of a variable node that would contain the result list, while “node_variable_4” is a blank node.
Table 14: A query that lists all the nephews of a certain individual
Command | Description
SELECT ?Nephews WHERE | Lists all the “RDF objects” (titled Nephews) which participate in the following RDF-statement pattern.
{ <John_Smith> <hasSibling> ?node_variable_4 . | node_variable_4 (a blank node) represents John Smith’s siblings.
?node_variable_4 <hasChild> ?Nephews . | This line lists the children of John’s siblings.
?Nephews <hasGender> <MaleSex> . } | The children (of John’s siblings) should be of male sex (otherwise they will be nieces).

In the previous two examples, the queries were defined based on the property patterns that existed between instances without referring to the class structure in the ontology. The next example will use the class definition to achieve the query results. The code in Table 15 uses the definition of the “Father” class (shown in Figure 16) to find individuals that are a father to a child. To perform this query in a triplestore where the instances are not linked to the classes (i.e., having the first principle in place), one needs to run the reasoning software first. The reasoning software will determine the instances that belong to the “Father” class by using the definition of that class in the ontology (i.e., the individual must have a child and be of the male gender) to identify the class members. Once the reasoning software is run and the assertions regarding class membership are made, a new property named “rdf:type” is added to the instances; the value of this property is the class that the instance is a member of. In this particular example, instances that have the “Father” class as the value of their “rdf:type” property are of interest.

Table 15: A query that lists all the fathers that exist in the domain of the case study
Command | Description
SELECT ?node_variable WHERE | Lists all the instances which participate in the following RDF-statement pattern.
{ ?node_variable rdf:type <#Father> } | All the instances that are members of the “Father” class will be listed.

5.2 Evolving the property lattice

It was claimed in this text that an instance-based information repository would be flexible and accommodating to the evolution of the schema; this section will examine a case in which the need for new properties has arisen. After the ontology is modified and new properties are added to the lattice, the ontology can be applied to the instance base (in this case, the ontology file can be loaded into the triplestore), and different scenarios can be flexibly tested. It is intended that an address for each individual be kept as an intrinsic property (in this sample scenario, “having an address” was designed as a property that an instance possesses by itself and not as a property shared between a person and a location). Also, another property called “hasAncestor”, which would be a generalized form of “hasParent”, will be defined as a transitive property in OWL. This means that if <a, hasAncestor, b> and <b, hasAncestor, c>, then the statement <a, hasAncestor, c> would also be true. Using the “hasAncestor” transitive property, one can list someone’s family tree very easily, even though none of the instances might use the “hasAncestor” property. The reasoner will only need to find the instances that have the “hasParent” property, and from there it builds a tree that would be the answer to the query. Thus, the reasoning software in the semantic web domain is able to make inferences based on generalized and manifested properties. This example may not highlight the advantages of the first principle of the approach (i.e., separation of instances from classes) per se, but this feature of the implementation supports the second principle (abstraction of properties) and allows one to define generalized/specialized properties.
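The way a transitive “hasAncestor” property can be derived from stored “hasParent” statements can be sketched as follows. This plain-Java example is only an illustration of the idea under stated assumptions, not the reasoner’s actual algorithm; the class name, the method name, and the sample names (Ann, Bob, Carl, Dana) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of inferring a transitive, generalized property.
final class AncestorSketch {

    // parentsOf maps an instance to the objects of its hasParent statements.
    // Because hasParent is a sub-property of hasAncestor and hasAncestor is
    // transitive, every instance reachable through hasParent is an ancestor.
    static Set<String> ancestors(Map<String, List<String>> parentsOf, String person) {
        Set<String> found = new TreeSet<>();
        Deque<String> toVisit =
                new ArrayDeque<>(parentsOf.getOrDefault(person, List.of()));
        while (!toVisit.isEmpty()) {
            String candidate = toVisit.pop();
            if (found.add(candidate)) { // skip instances already visited
                toVisit.addAll(parentsOf.getOrDefault(candidate, List.of()));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parentsOf = Map.of(
                "Ann", List.of("Bob", "Dana"),
                "Bob", List.of("Carl"));
        System.out.println(ancestors(parentsOf, "Ann")); // [Bob, Carl, Dana]
    }
}
```

No “hasAncestor” statement is stored anywhere in this sketch; the ancestors are computed on demand from the “hasParent” data, which mirrors the point that the instance base does not need to change when the generalized property is added.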
As for the new addition to the property lattice, the “hasPostalCode” property will be defined as a datatype property in OWL (see chapter 4 for the justification of using datatype properties to represent intrinsic properties), and the postal code will be stored as a literal. For simplicity it is assumed that the instance domain is within the United States, and therefore the postal codes are five-digit numbers (i.e., can be stored as an integer datatype). The code below shows how siblings that live within a certain range of postal codes can be found; this means that one can find family members that live within a certain vicinity. First, individuals who have the “hasSibling” mutual property are found. Then, a filter is put on their address property, which in this example mandates that their postal code should be within the range of 90000 to 91000. This example demonstrates that by separating instances from a rigid class structure, the database would be able to flexibly handle exceptional instances (not all instances have the postal code property, and with a rigid classification, it would have been necessary to make changes on all the instances of the corresponding classes). Also, adding a new property (changing the schema) does not cause the proliferation-of-operations problem.

Table 16 shows the SPARQL code of this query, and Figure 18 shows a graphical representation in which sibling A and sibling B are variable instance nodes and the addresses are values of the datatype property “hasPostalCode”:

Table 16: SPARQL code of a query that lists siblings within a certain vicinity
Command | Description
SELECT ?Sibling_A ?Address_A ?Sibling_B ?Address_B WHERE | Lists four RDF nodes that conform with the following RDF-statement pattern.
{ ?Sibling_A <hasPostalCode> ?Address_A ; <hasSibling> ?Sibling_B . | Finds a person’s postal code and his/her sibling.
?Sibling_B <hasPostalCode> ?Address_B . | Lists the sibling’s postal code.
FILTER ( ?Address_A > 90000 && ?Address_A < 91000 ) FILTER ( ?Address_B > 90000 && ?Address_B < 91000 ) } | Filters the pairs of siblings that have a postal code within a certain range.

Figure 18: Query that finds siblings within a certain vicinity

5.3 Changing the class definitions and hierarchy

So far it has been demonstrated that the proposed implementation is able to handle the evolution of the property lattice. In this section the changes in the class hierarchy will be studied. In other words, in these two sections the effects of schema evolution within the proposed approach are analyzed. First, a datatype property called “hasBirthdate” is added, and with this property the birthdates of individuals are stored. This way, new concepts such as “Senior Citizen” and “Minors” can be defined. Senior citizens would be defined as individuals aged over 65, and minors would be people aged less than 18. Even if these two classes are not added to this ontology, queries can be made and one can filter for a certain age. In this example, a query will be run that would help to identify grandparents who are considered senior citizens, along with their female grandchildren who are minors. Figure 19 demonstrates this query:

Figure 19: Query to find senior grandparents and minor female grandchildren

First, the nodes that have children who also have children of their own are identified, and a list of grandparents and grandchildren is generated. Then, one can filter for grandparents who were born more than 65 years before the date of making the query (assume the query was done on 2011-12-31). Next, one would filter for the grandchildren who are female and were born within the 18 years previous to the date of the query. If the instances had been bound to classes, defining new classes such as “Seniors” and “Minors” and populating them with their corresponding members would have required extensive effort by the database designer (i.e., proliferation of operations).
Also, it would not have been possible to operate on the instances and run such queries flexibly. Table 17 shows the SPARQL code of this query:

Table 17: SPARQL code that lists senior grandparents and their minor female grandchildren
Command | Description
SELECT ?Grandparent ?Grandchild WHERE | Lists two RDF nodes that conform with the following RDF-statement pattern.
{ ?Grandparent <#hasBirthdate> ?Grandparent_s_Birthday ; <#hasChild> ?node_variable_2 . ?node_variable_2 <#hasChild> ?Grandchild . | A person’s birthday and his/her grandchildren are listed.
?Grandchild <#hasBirthdate> ?Grandchild_s_Birthday ; <#hasGender> <#FemaleSex> . | The grandchildren’s birthdays are listed, and their gender is specified to be female.
FILTER ( str(?Grandparent_s_Birthday) < "1946-12-31" ) FILTER ( str(?Grandchild_s_Birthday) > "1993-12-31" ) } | Grandparents should be older than 65 and grandchildren younger than 18 on the assumed query date of 2011-12-31.

It should be noted that in this query classes were not created for senior citizens and minors. SPARQL gives the option to create classes on the fly by using the CONSTRUCT clause instead of SELECT [18]. However, it would not dynamically update the classes, since every day new individuals gain senior citizen status and many young adults lose their status of being a minor. Thus, it would be better to run a query every time that one needs to identify instances with the aforementioned statuses, and classes need to be populated only when used. Another approach would be to define a class for senior citizens and minors. In a traditional database, categorizing individuals under those classes is problematic, since some individuals gain or lose their senior or minor status every day, and as mentioned in chapter 2, reclassifying instances leads to database operation problems and potential loss of information.

[18] Please refer to Appendix A for a more comprehensive description of SPARQL commands.
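The point that senior and minor status should be computed at query time, rather than stored, can be sketched in a few lines of Python. This is an illustrative sketch only: the PEOPLE dictionary, the instance names, and both function names are hypothetical, and the age thresholds follow the definitions given in section 5.3 (over 65, under 18).

```python
from datetime import date

# Hypothetical instance base: each instance holds only its properties.
# No "Senior" or "Minor" class membership is stored anywhere.
PEOPLE = {
    "george": {"hasBirthdate": date(1940, 5, 1), "hasChild": ["helen"]},
    "helen": {"hasBirthdate": date(1968, 3, 9), "hasChild": ["iris", "jack"]},
    "iris": {"hasBirthdate": date(2000, 7, 4), "hasGender": "female"},
    "jack": {"hasBirthdate": date(2002, 1, 15), "hasGender": "male"},
}


def age_on(birthdate, today):
    """Whole years elapsed between birthdate and today."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)


def senior_grandparents_with_minor_granddaughters(people, today):
    """Pairs (grandparent, grandchild) where the grandparent is over 65
    and the grandchild is female and under 18 on the given date."""
    pairs = []
    for grandparent, props in people.items():
        if age_on(props["hasBirthdate"], today) <= 65:
            continue
        for child in props.get("hasChild", []):
            for grandchild in people[child].get("hasChild", []):
                g = people[grandchild]
                is_minor = age_on(g["hasBirthdate"], today) < 18
                if g.get("hasGender") == "female" and is_minor:
                    pairs.append((grandparent, grandchild))
    return pairs
```

Because the query date is a parameter, the same stored instances give different answers on different days, which is the reason a materialized class would go stale.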
As well, it is not efficient to check the class membership eligibility of instances every day, and doing so might pose a threat of information loss. An instance-based data management paradigm is able to handle such situations (i.e., classes with instances that are frequently reclassified) more flexibly. In the previous scenario, if a restriction (regarding the age of the instances) is created in the class definition within the ontology, the reasoning software can be used to check the restriction every time it makes an inference regarding an instance. For example, when inferring the members of the senior citizen class, the reasoner will check the birthdate of each instance and see if it was at least 65 years before the date on which the reasoning software is run. In this particular scenario, running reasoning software might not be more computationally efficient than running a query (however, the computational efficiency needs to be empirically tested). It is nevertheless possible to think of scenarios in which reasoning could be more efficient (e.g., if the users need to access the instances of “Senior Citizens” many times in one day, running the reasoning software once is better than running queries many times during that day). Guidelines that would help database designers decide when to query and when to classify are beyond the scope of this project.

5.4 Conclusion and summary

In this chapter some of the challenging tasks that were not possible, not efficient, or led to information loss in a class-based database were tested, and it was demonstrated that they could be done flexibly within the new paradigm by using semantic web technologies. The capabilities of this new paradigm have been demonstrated by showing how it can accommodate the addition of new properties and new class definitions and the evolution of the schema.
It was also demonstrated that it was possible to manage instances that needed to be frequently reclassified by using the instance-based database management paradigm. The advantages and implications of the principles used (i.e., separation of instances from classes, property abstraction, and class structure definition guidelines) and the implemented technology will be explored in chapter 6, and comparison with the traditional data management paradigm will be made.

6. Analysis and Discussion

This chapter begins with an analysis of the effects of adopting the instance-based data management paradigm to see if this paradigm has been able to address the schema design and database operation problems (see chapter 2) that exist in the traditional paradigm. That will be followed by a general discussion regarding different implementations of instance-based databases as well as a semantic web database with inherent classification to highlight the benefits of the proposed implementation. After elaboration on the limitations and implications of the proposed approach, the chapter will conclude with a description of the contribution of this research.

6.1 Repercussions of adapting the instance-based paradigm

Chapter 2 introduced some problems that could arise when instances are bound to well-defined classes (i.e., traditional data management paradigm). These problems were categorized into schema design and database operation problems (Parsons and Wand 2000). Applying the first principle (i.e., separation of instances from classification) in the proposed implementation could resolve some of the aforementioned problems. In the category of schema design problems, one of the major issues was multiple classifications. In the proposed implementation, however, an instance can belong to multiple classes, depending on the properties that the instance possesses; these classes do not necessarily have to be specializations/generalizations of each other or of a higher-level class.
As well, unlike the traditional approach, there is no need to duplicate the information about the instance to record it under the multiple classes. In this implementation, the information regarding that instance is stored in one place, along with the properties that it possesses. Inferences regarding the class membership of instances can be made based on the class definitions and the properties that instances possess. As an example, in the proposed implementation one can store an instance which has the necessary and sufficient properties to be inferred to be a “client of a certain company” [19] as well as a “father” [20]. The “Client” and “Father” classes are not necessarily specializations/generalizations of each other, and they do not share the properties that were necessary and sufficient in their respective class definitions (“client” was defined by the nature of transactions with the company, and “father” was defined by gender and the “hasChild” property).

[19] A client of a company can be defined as a person or legal entity that engages in financial transactions with the said company.
[20] A person of the male gender who has a child.

Another issue, as mentioned earlier, is developing a global schema, which is a necessary task in logical database design. Database designers must develop a global schema whenever more than one view of the application domain exists (Parsons and Wand 2000). Constructing a global schema, in the traditional approach, requires reconciliation of different views at the class level “because a class-based approach requires the identification of a set of base classes” (Parsons and Wand 2000). In the proposed implementation, since instances are not bound to the classes, it would not be necessary to reconcile class definitions. In the instance-based paradigm, schema integration requires only “agreement on the instances and properties that might exist” (Parsons and Wand 2000). Once the property lattices within different views are collected and reconciled for each instance (using the lemmas in section 2.2), the global schema could contain all the class sets (Parsons and Wand 2000). For example, within an organization only the payroll office might have access to employee names and their pay-grade, whereas the human resources department could have access to names, addresses, and marital status of employees. If one wants to integrate these views in an instance-based database, one need only combine all the properties that exist within different sources and reconcile them for each employee. Although there is still a need for schema integration in an instance-based database, it is “unnecessary to reconcile different views at the class level” (Parsons and Wand 2000). Therefore, it would not be necessary to reconcile class definitions (Parsons and Wand 1997), and multiple class structures can exist within the same instance repository.

It is also known that the schema evolution in the traditional paradigm can be problematic due to the complexity of the operations needed to ensure that the integrity of information is intact (Parsons and Wand 2000). When the instances are separated from the classification, the schema can evolve independently. In this implementation the schema evolution would be done by implementing the changes in the ontology without affecting the instances and their properties in the repository, as was demonstrated in chapter 5. Moreover, in the traditional paradigm, when two (or more) sources of information are integrated there is need for agreement between the class structures within the application. The agreement is twofold: on semantics (i.e., phenomena and properties that exist) and on the class definitions that cover the phenomena (Parsons and Wand 2000). In the proposed implementation, integrating two sources of information requires agreement “only on the things that can exist and their properties” (Parsons and Wand 2000).
As an example, when two airlines decide to form an alliance and integrate their databases, it is only necessary to verify that the phenomena (e.g., pilots, flights, frequent flyers) and the properties (e.g., having frequent flyer number, having luggage, boarding flights) agree over the two data sources. However, the properties might not be defined identically. Identifying property precedences and using the lemmas in section 2.2 allows these two sources to be semantically reconciled. With all the instances and their properties in one place, each airline, based on its needs, can define an ontology (schema) that would be relevant to its domain. It should be noted that this approach does not completely eliminate the difficulties of semantic agreement, but “the integration effort under an instance-based approach is strictly less than the effort required under a class-based approach” (Parsons and Wand 2000) because there is no need to agree on the definition of classes and their meanings. The problems related to database operations could also become easier to handle if the first principle in the proposed implementation is employed. Handling exceptional instances that might have unique properties can be easily done, since it is possible to add properties to instances at any time without referring to the class structure. For example, one can record that a particular student is able to play the piano, run the marathon, speak Esperanto, etc. without changing the classification scheme. In contrast, in a traditional database, one needs to add a new column to the student table to record a property value for exceptional instances and fill it with null values for every other student or add a new subclass to store the exceptional instances.  
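The handling of exceptional instances can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the triplestore implementation: the students dictionary, the property name “speaks”, and the helper functions add_property and having are all hypothetical.

```python
# Hypothetical instance base: each instance owns its own property map,
# so there is no shared table layout that would need a new column.
students = {
    "student512": {"hasMajor": "kinesiology"},
    "student513": {"hasMajor": "history"},
}


def add_property(instances, instance_id, prop, value):
    """Attach a property to one instance; no other instance is touched."""
    instances.setdefault(instance_id, {})[prop] = value


def having(instances, prop):
    """Return the ids of all instances that possess the given property."""
    return sorted(i for i, props in instances.items() if prop in props)


# Record an exceptional property for a single student.
add_property(students, "student512", "speaks", "Esperanto")
```

After the call, only student512 carries the new property; the other instance has neither a null placeholder nor a changed structure, in contrast to the added-column approach described above.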
Reclassification of instances in the proposed implementation would not be a problem either, because in the instance-based paradigm “reclassification occurs automatically and implicitly when a class is defined or when an instance acquires or loses properties” (Parsons and Wand 2000). In the implementation proposed here, since the instances are not bound to classes, when the ontology changes (e.g., a new class for senior citizens is added), the reasoning software will make inferences based only on the properties that an instance possesses (e.g., age of individuals). Also, when the properties of an instance change, it might lose the necessary and sufficient conditions of membership in a certain class and gain the conditions for becoming a member in another class (e.g., one individual reaches the age of 18 and loses the status of being a “minor”). When the reasoner is run, it will make inferences based on the properties that the instance possesses, regardless of the previous state of the instance. Adding or removing instances in the traditional paradigm would require operations (adding or removing) on all the classes to which the instance is related. Applying the first principle (i.e., separation of instances from classes) in the proposed implementation allows an instance to be added or removed without requiring any operations on the classes. It was also mentioned that in the traditional paradigm removing a class or changing its definition could lead to loss of information (either of instances or of properties) (Parsons and Wand 2000). In the proposed approach, instances are not affected by any changes at the class level, and there would be no risk of losing information. Finally, redefining a class in the proposed implementation requires modifying only the ontology, without any changes to the instances and their properties. After the ontology is changed the reasoning software will make inferences based on the new definitions.
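The separation between instances and class definitions described above can be sketched in Python. The sketch is an assumption-laden illustration rather than the reasoner used in this work: the instances and ontology dictionaries, the instance names, and the members function are hypothetical, and the “Father” rule follows the definition given earlier (male gender and the “hasChild” property).

```python
# Hypothetical instances: only properties are stored, never class labels.
instances = {
    "daniel": {"hasGender": "male", "hasChild": ["emma"]},
    "sandra": {"hasGender": "female", "hasChild": ["liam"]},
    "peter": {"hasGender": "male"},
}

# Hypothetical ontology: each class is a membership test over properties.
ontology = {
    "Father": lambda p: p.get("hasGender") == "male" and bool(p.get("hasChild")),
    "Parent": lambda p: bool(p.get("hasChild")),
}


def members(instances, ontology, class_name):
    """Infer class membership from the properties each instance possesses."""
    rule = ontology[class_name]
    return sorted(i for i, props in instances.items() if rule(props))
```

If an instance acquires a “hasChild” value, the next call to members places it in “Father” with no explicit reclassification step, and adding or deleting an entry in ontology leaves the instances dictionary untouched.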
The database can operate with no risk of losing instances or properties or information regarding them (Parsons and Wand 2000). In this section the implications of an instance-based paradigm were analyzed with respect to the problems that arise in a class-based model. It should be noted that employing the second principle (i.e., property abstraction) facilitates developing the global schema, schema evolution, and reconciliation of multiple data sources (refer to section 2.2). Even though the problems mentioned in chapter 2 are not completely eliminated in the proposed implementation, most disappear or become less complex. The following sections will discuss the possible implementation variations that were studied in this work. A roadmap of these investigations can be found in Table 18:

Table 18: Possible implementations that were investigated in this work
 | Relational Implementation | Semantic Web Implementation
With Classification | Issues were discussed in chapter 2. | Section 6.3
Without Classification | Section 6.2 | The proposed implementation of this thesis.

6.2 Discussing the implementation

6.2.1 Another implementation of the instance-based model

Another implementation suggested by Parsons and Wand (2000) was based on a two-layered architecture using a relational database. In that implementation the bottom layer stores instances and their properties, the top layer stores class definitions, and an “engine” is used to access and process each layer. The instance engine is able to modify (adding and deleting of instances and/or properties) and retrieve information about instances and their properties. The class engine is used to modify (adding, removing, and changing) class definitions and retrieve class definitions and class membership (Parsons and Wand 2000). As a mechanism to identify instances, rather than using keys as in the traditional paradigm, one global identifier for every instance is used in this implementation.
This global identifier “serves as a surrogate designating the existence of a corresponding real-world thing” (Parsons and Wand 2000). To represent the properties of an instance three approaches were suggested. One approach is to use binary relations in the form of a tuple; the first column would be the instance identifier, the second column would contain the property value (e.g., <student512, kinesiology>), and the relation name represents the preceding property (in this case, having a major). Another approach is storing each instance as a bundle (its identity and its properties) in a list; this list would contain all the properties that the instance possesses. The third approach would keep each property as a list, and all the instances possessing that property would be recorded on the list (Parsons and Wand 2000). To represent classes, a list of pointers can be used. Each pointer would “refer to a property, either a binary table or a list of instances possessing the property” (Parsons and Wand 2000). This way, class memberships can be computed instantly “by navigating from the properties to the instances that possess them” (Parsons and Wand 2000).

6.2.1.1 Implications of the two-layered architecture in comparison with the semantic web implementation

In this subsection some of the implications and concerns related to the two-layered architecture will be stated, and then analysis will be done to see if the semantic web implementation has been successful in eliminating those concerns. In the two-layered architecture there are two efficiency concerns: one is related to retrieval of the properties of instances and the other is retrieval of instances of a class (Parsons and Wand 2000). To retrieve properties of a group of instances it might be necessary to scan the whole database to determine whether an instance possesses a certain property.
This can be addressed “by keeping linkage relationships between properties and instances” (Parsons and Wand 2000), but in a case where properties need to be updated this would cause inefficiencies in the updating operations, as all the instance links would need to be explored and updated. In the semantic web implementation of the instance-based paradigm (as proposed in this work), this issue would not cause major inefficiencies, as every instance is kept as a node and all of its properties are connected to it. Thus, retrieving a property would only require locating the instance node. Also, every property is considered a resource, and its URI points to a certain location. If the need to update a certain property arises, it is necessary to update only one location (i.e., the location that the URI refers to). For example, considering the situation in Figure 20, if one wants to retrieve properties of Daniel Cooper it is only necessary to locate his node in the triplestore; all of his properties would then be visible (e.g., hasSibling, hasChild). To demonstrate updating properties it is assumed that one wants to assert that “hasSibling” is a symmetric property, i.e., if the statement <Daniel Cooper, hasSibling, Sandra Cooper> is true, then <Sandra Cooper, hasSibling, Daniel Cooper> must be true. This can be done easily by going to hasSibling’s location on the property lattice (within the ontology) and making the assertion within the ontology. This way, every other instance or class that uses the “hasSibling” property will access the updated definition. In short, operations related to managing the property lattice will be more flexible using semantic web technologies.

Figure 20: An example of retrieving and updating properties

The other inefficiency concern revolves around the process of retrieving instances of a class.
In the two-layered architecture, the class membership could be recomputed as needed (in the form of a query); “however in a query-intensive environment, this would be impractical” (Parsons and Wand 2000). Another approach suggested is keeping a class index that would keep record of the class’s instances. A downside to this approach appears in update-intensive environments, where instances and their properties would be frequently added, deleted, or updated, because the index must be maintained “whenever an instance acquires or loses properties, and whenever a class definition changes” (Parsons and Wand 2000). This problem exists in the semantic web implementation to some extent. Although there would be no need to keep an index and update it, the class memberships would need to be recomputed using reasoning software. In a triplestore, the inferences made on an instance would not be flushed from the memory (as is done for a query). Therefore, in a query-intensive environment it is only necessary to run the reasoning software once, and all the inferences regarding the class membership of instances would be reusable, unless an instance is updated (added, deleted, or has its properties changed); in such cases the reasoning software needs to be run again. In summary, in a query-intensive environment in which the instances are modified constantly, the reasoning software needs to recompute class memberships frequently; this might lead to inefficiencies. In other environments (update-intensive or even query-intensive where the instances are not modified frequently) the semantic web implementation of the instance-based paradigm could potentially work as efficiently as it does in other database operations. Another benefit of the semantic web implementation of the instance-based paradigm is highlighted in running selection queries (e.g., finding all instances that have a color).
In such cases, a simple SPARQL query (as shown in Table 19) will allow one to run such selection queries without facing any inefficiencies that relational databases or the two-layered architecture of the instance-based model would face (Parsons and Wand 2000).

Table 19: Running a selection query on a triplestore
Command | Description
SELECT ?node_variable_1 WHERE | Lists all RDF subjects that participate in the following RDF pattern.
{ ?node_variable_1 <hasColor> ?node_variable_2 . } | Finds all nodes that have the “hasColor” property.

However, it should be noted that in the domain of management information systems concerns related to practical possibility of performing operations, such as reclassifying instances without losing data, creating new abstract properties, and identifying new manifestations of properties that did not exist before, are more important than efficiency concerns. So far, this work and Parsons and Wand (2000) have shown that the two-layered architecture is able to perform such operations without losing data.

6.3 A semantic web database with inherent classification

To highlight the benefits of the instance-based paradigm within the semantic web domain, this section will examine a hypothetical triplestore (an RDF database) that has not adopted any of the principles introduced in chapter 2. This means that in this scenario all the instances are bound to the classification. All the problems associated with inherent classification (i.e., schema design and database operation problems) will be studied, and how and if they would appear in this hypothetical triplestore will be analyzed. Multiple classification of an instance would not cause any problems in a triplestore, as each instance is represented by a single node and that node can be associated to many classes (even though they might not be generalizations/specializations of each other) without redundancy. As well, updating the node would need to be done only once. However, view integration problems would still exist, since the instances are linked to classes, and generating a single global conceptual schema would make the users “adapt their view to the global view, or additional mechanisms need to be implemented to allow users to view the database in terms of their previously identified local schemas” (Parsons and Wand 2000). The interoperability problem would still emerge owing to the inherent classification and the need to reconcile the class definitions in the integrated system (please refer to chapter 2). Schema evolution would still lead to problems in a class-based triplestore. In the case where a class definition changes, the instances that belonged to the class before the change might no longer qualify to be members; therefore, they have to be assigned to other classes (in order to conform to the class-based approach). As for the problems related to database operations, there would be no problem in storing instances with unique properties, as every node can be stored with as many properties (edges) as it might have, unlike in a relational model in which the number of possible properties is limited by the number of columns defined in the table. Also, in reclassifying instances properties would not necessarily be lost because there is no limit to the number of properties each instance (node in a triplestore) can possess. Therefore, loss of information would not be a major problem. Adding, removing, and updating instances in a class-based triplestore could be easily done by updating the instance in the location to which the URI points, and one modification on the instance would not proliferate the operations needed to add, remove, or update an instance (unlike the case in a relational database).
Removing a class would not lead to loss of instances or properties in the class-based triplestore, since the instances (RDF nodes) and properties (RDF edges) would not be deleted from the triplestore if just the edge that determines the instances’ class is removed. Similarly, redefining a class would not lead to loss of instances because an instance can still exist in the triplestore even though its class membership ties have been severed. However, the integrity constraints could become an issue; the discussion regarding those constraints is beyond the scope of this work. These problems were all related to not adhering to the first principle (separating instances from classes). The second principle was property abstraction and the facilities it provides in running selection queries and other special types of queries that were mentioned in the case study. Neglecting to define property precedence will limit the depth of information and knowledge that the tool can provide. Also, if the precedences are not defined, the lemmas introduced for the attribute-based semantic reconciliation of multiple data sources (in section 2.2) will not be usable. In-depth analysis of the problems caused by ignoring the property abstraction principle is itself an interesting research problem that is beyond the scope of this thesis. As for the third principle, the conditions introduced in section 2.3 (i.e., requirements for permissible classes and conditions constraining the possible class structure) “provide necessary conditions for a set of classes to reflect cognitive economy and inference” (Parsons and Wand 1997). Violation of these conditions “indicates something is ‘lost’ from a cognitive point of view” (Parsons and Wand 1997); however, it is not claimed that these guidelines must always be followed in designing a class structure (Parsons and Wand 1997), and as mentioned before, the underlying considerations of class structure are beyond the scope of this work.
6.4 Considerations and limitations in the proposed implementation

This thesis proposed an information system that, in particular, is able to flexibly handle user-generated data over the web. In this thesis, “flexibility” means that the system can accommodate changes in the way users view the domain, new uses for the data, and multiple users with different views. However, these advantages may not make it an ideal approach for every application. In particular, certain efficiency issues may arise (Parsons and Wand 2000). Specifically, identifying members of a class in an instance-based database may require scanning all the data. The inefficiency associated with this task might become a significant burden in some query-intensive applications. Therefore, this flexible information system approach would be beneficial for applications in which the benefits outweigh the costs. Measuring the benefits or identifying application mixes that benefit most from the proposed implementation is beyond the scope of this thesis and could be studied in the future. However, within the scope of this work, types of applications for which the approach would be especially beneficial can be suggested. Consider, for example, crowdsourcing and citizen science applications. In citizen science projects users may discover new concepts, which can result in changes to the schema [21]. Therefore, the flexible paradigm would be advantageous for these projects. In contrast, a small shoe store might need only one view over the data, and the uses of data are often known in advance (e.g., view the shoes in the inventory). In such cases, a traditional database likely would be a more feasible option.

[21] Moreover, such new concepts may change information requirements.

Table 20 provides a general overview of the type of applications for which the proposed information system architecture can be beneficial.
Table 20: Applications in which the flexible paradigm can be beneficial

Requirement for the system: Multiple Sources
Description: A system in which the data are located across various sources.
Example: Citizen science applications, with data dispersed over the web.
Benefit of the proposed implementation: The data are represented using a standard format (i.e., RDF). By referring to the URI, data can be located anywhere on the web. Also, by defining property abstractions, reconciliation of semantics across the sources would be facilitated.

Requirement for the system: Volatile Users’ Views
Description: Changes in users’ needs, or introduction of new concepts, will be reflected in the users’ views. In situations in which the semantic changes happen often, the views become volatile.
Example: Business intelligence applications in which new patterns are identified and tested (discussed further in chapter 7).
Benefit of the proposed implementation: Separating instances from semantics means that changes in the semantics would not be propagated to the data. If a new concept is identified or a definition of a class is changed, it would only be reflected in the ontology. After the ontology is updated, the operations/queries can be run as before, and the existing data can be used in new ways (discussed further in section 6.5).

Requirement for the system: Multiple Views
Description: Many users with different needs may access the same data. They might need different sets of concepts and various levels of abstraction.
Example: A medical database is accessed by various users (e.g., specialists), and they have different needs within their respective domains (section 6.5).
Benefit of the proposed implementation: Supporting property abstraction allows the meaning of the data to be abstracted to support multiple views, based on users’ needs. Also, a similar concept can be defined differently within each view. Separation of instances from classification allows users to view the data the way they need, without having to change the data (i.e., how they are organized and stored). The proposed implementation allows users to define the concepts as they need in their respective ontologies and view the data through the lens of their ontologies.

Another issue to consider in this implementation is the assumption that the data are stored in RDF format (i.e., triples in the form of <subject, predicate, object>). However, semantic web formats are becoming more common and important than before, and the amount of data in these formats is increasing (Lassila and Hendler 2007). The sources of unstructured data (e.g., XML) are also growing, and a flexible approach is required in managing such data. Legacy data can also be converted into the RDF data model, but the cost and effort of converting legacy data into RDF, as well as of maintaining and processing data in the RDF structure, should be considered before this approach is adopted. Although there are open source programs [22] that automate the conversion of relational data, the efficiency and correctness of this process were not verified here.

To successfully utilize the implementation proposed in this thesis, the users/designers need to build a database adopting the RDF format or convert the old data (as suggested earlier). It is also possible to use the ever-increasing sources of RDF data over the web. For instance, DBpedia [23] is a dataset containing data extracted from Wikipedia in the RDF format, and it is interlinked with other sources such as US Census, Freebase, and CIA World Fact Book (see http://wiki.dbpedia.org/Downloads32). Please refer to Appendix D for a visualization of these dataset linkages. Another possible use for the proposed implementation will be introduced in chapter 7.

As for security concerns related to the proposed implementation, the security features provided by the triplestore chosen here (i.e., AllegroGraph) have been studied.
This triplestore allows the database administrator to manage the access level of users in some generic areas such as the number of data catalogs that they can access, read/write privileges, and administrative rights (add/delete other users). With the RDF structure in mind, this triplestore allows the administrators to define security filters that can “prevent access (both read and write) to triples with a specified value for subject, predicate, object and/or graph” (Franz, Incorporated 2012). For example, the university clinic might need access to all student, faculty, and staff records within the university, but salary is privacy-sensitive data and should not be accessible to clinic staff. In this case the “hasSalary” property can be filtered from the clinic’s view, and the users in the clinic will not be able to view or modify anyone’s salary.

AllegroGraph was the triplestore chosen for experimenting with the different scenarios analyzed in this thesis. In this particular triplestore, a user with administrative privileges can use the graphic interface of the database and define filters for roles by using the “allow” or “disallow” keywords on certain RDF subjects, predicates, or objects. Figure 21 shows a screenshot of this graphic interface.

Figure 21: Defining filters in AllegroGraph

Defining strong security filters can limit the contents users provide, especially within the emerging social web phenomena (e.g., crowdsourcing, citizen science). On the other hand, having many users with unrestricted access rights would make it harder to ensure the integrity of data. However, these concerns are beyond the scope of this thesis.

[22] D2R Map: http://www4.wiwiss.fu-berlin.de/bizer/d2rmap/d2rmap.htm
[23] http://wiki.dbpedia.org/About

6.5 Contribution of this research

This work applied three innovative principles from Parsons and Wand (1997, 2000, 2003, 2008) to create a flexible information management system.
These principles were implemented using semantic web technologies. The proposed implementation has advantages that could not have been achieved by the principles or the technologies alone. Sections 6.2 and 6.3, respectively, studied possible implementations in which only the principles or only the semantic web technologies were employed.

The semantic web technologies are a powerful set of tools designed for managing data over the web. The sources of information can be dispersed all over the web, and one can access them by knowing the URIs; in other words, instances have a uniform identification across the relevant universe. Using the semantic web technologies, users can define the domain ontologies in OWL and store RDF data in triplestores, but the technology alone cannot provide the advantages (in particular the flexibility) that the proposed implementation provides.

The implementation proposed in this work is able to separate data from the semantics. Thus, changes at the semantic level do not require changes in the data. As one type of semantic change, users may add a new abstract level of meaning to the data by modifying the property lattice. Using the proposed implementation, users would still be able to utilize the existing data for the desired purposes (refer to section 5.2). In a traditional database, however, queries made after modifying the property hierarchies would not be usable unless the data are updated to reflect the changes.

Another type of semantic change might be due to the evolution of the schema (whether adding, removing, or redefining classes). In this situation, one only needs to update the ontology and would still be able to use the existing data. To highlight this advantage, a “traditional” medical database (which would likely be relational) is considered here as an example. In this hypothetical system, diseases are defined by their symptoms, and patients would be classified based on their diagnosed diseases.
Assume that researchers identify the emergence of a new virus that causes some particular symptoms in the patients. With the traditional database, if there are symptoms not recognized before, physicians might not be able to store them in the database. Consequently, the traditional database may not allow for queries related to the newly identified knowledge (e.g., to identify the patients suffering from the newly identified disease). Doing so necessitates changes in the structure of the data (and perhaps even some reorganization of the data). In contrast, if the data were stored using the flexible paradigm, physicians could simply add instances with their properties (e.g., symptoms) and modify the property lattice in the ontology. Then the data can be queried for the new phenomenon.

Also, supporting property abstraction allows the meaning of the data to be abstracted to support multiple views. Abstracting the properties, along with separation of instances from semantics, allows users to view the data the way they need, without having to change the data or how they are organized and stored (and potentially compromising the integrity or quality of the data). In the proposed implementation different ontologies (or views) can be defined by the users to access and query the data. As an example, physicians who use a medical database require different views over the patients. An ophthalmologist needs to know if his/her patients suffer from any heart complications; the details or the nature of the heart condition might not be relevant to him/her in performing his/her task. In other words, he/she would need to know an abstraction of the properties related to heart diseases; the manifestations of that property are not relevant in his/her view.
Therefore, he/she could use an ontology in which the patients have properties that are related to the ophthalmology domain and a property that represents heart complications and abstracts the actual details related to the heart condition. A flexible information system allows the ophthalmologist to use an ontology that reflects his/her needs. In a traditional database, to create a relevant view for the ophthalmologist the properties need to be retrieved from the records kept by cardiologists, which proliferates the number of operations.

Furthermore, the proposed implementation is able to support concepts that have various meanings in different views. For example, a “patient with critical conditions” in the maternity ward is defined differently from such a patient in the ophthalmology ward. In the former, a patient in labour is considered to be in a critical condition, while in the latter, a patient with a torn cornea might be. In a traditional (e.g., relational) database separate classes may need to be created to represent patients at risk in each medical ward. That might lead to duplication of patients’ information (there could be patients that are in critical conditions in both domains). Using the implementation proposed in this work, database designers could define separate ontologies for each ward/domain. Users in the maternity and ophthalmology wards could view the patients’ data (i.e., the same repository of instances) using their relevant ontologies. In short, the general abstract concept of a “patient with critical conditions” would be applicable in various views (and would materialize differently for various users).

Moreover, this implementation facilitates interoperability of various sources of information (assuming that the principles have been applied in each source).
To make the systems interoperable, one needs only to reconcile the meanings across different sources (i.e., reconcile property lattices) and then view the data based on the reconciled semantics. For example, when two medical databases are merged the inter-source precedences should be identified (using the lemmas in section 2.2). In one database, patient allergies might be recorded by the reaction that they show (e.g., patient 259 goes into shock after a penicillin injection), whereas in the other database only the allergy stimuli are recorded (e.g., patient 598 is allergic to penicillin). After reconciliation of semantics (e.g., having an allergy to penicillin precedes going into shock because of penicillin), the two sources could be merged.

6.6 Summary and conclusion

This thesis combined three innovative principles with the capabilities of the semantic web technologies to create a flexible information system. The technology enables accessing data across various sources over the web, with no central control. By employing the principles, the system is able to flexibly handle changes in semantics without making any changes to the data. Furthermore, the proposed implementation supports property abstraction, and by doing so facilitates viewing the data from different levels of abstraction. Finally, by following the guidelines regarding class structure, database designers can make sure nothing is lost from an information point of view (Parsons and Wand 1997).

This implementation supports multiple views of multiple users by allowing them to define their own ontologies and reason based on the inference rules that they define. The domain ontology can also evolve over time to reflect changes in users’ needs or the emergence of new concepts (or obsolescence of old concepts). The semantic changes would not be propagated to the data.
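The allergy reconciliation example from section 6.5 can be sketched in a few lines of plain Java. This is only an illustrative, in-memory stand-in and not the thesis's triplestore implementation: the class name `PropertyLatticeSketch`, its methods, and the property identifiers are all hypothetical. It shows stored property assertions that are never rewritten, with a precedence relation added afterwards at the semantic level.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: property precedence used to query two merged sources. */
final class PropertyLatticeSketch {
    // Semantics: specific property -> the more general properties that precede it.
    private final Map<String, Set<String>> precededBy = new HashMap<>();
    // Data: instance id -> properties asserted in the sources (never rewritten).
    private final Map<String, Set<String>> asserted = new LinkedHashMap<>();

    void addPrecedence(String general, String specific) {
        precededBy.computeIfAbsent(specific, k -> new HashSet<>()).add(general);
    }

    void assertProperty(String instance, String property) {
        asserted.computeIfAbsent(instance, k -> new LinkedHashSet<>()).add(property);
    }

    /** Asserted properties plus every property that precedes them, transitively. */
    Set<String> propertiesOf(String instance) {
        Set<String> result = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>(asserted.getOrDefault(instance, Set.of()));
        while (!todo.isEmpty()) {
            String p = todo.pop();
            if (result.add(p)) {
                todo.addAll(precededBy.getOrDefault(p, Set.of()));
            }
        }
        return result;
    }

    /** Scans all instances and keeps those possessing the given property. */
    List<String> instancesWith(String property) {
        List<String> hits = new ArrayList<>();
        for (String instance : asserted.keySet()) {
            if (propertiesOf(instance).contains(property)) {
                hits.add(instance);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PropertyLatticeSketch s = new PropertyLatticeSketch();
        s.assertProperty("patient259", "goesIntoShockAfterPenicillin"); // first source
        s.assertProperty("patient598", "allergicToPenicillin");          // second source
        // Reconciliation touches only the semantics, not the stored assertions.
        s.addPrecedence("allergicToPenicillin", "goesIntoShockAfterPenicillin");
        System.out.println(s.instancesWith("allergicToPenicillin"));
    }
}
```

Before the precedence is added, a query for the general allergy property would find only patient 598; afterwards it finds both patients, although neither patient's stored data has changed.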
This flexible information system is able to run complex queries that were not possible before across multiple sources and views (refer to sections 5.1, 5.2, and 5.3).

7. Summary and Future Vision

Traditional methods of data management, such as relational databases, have encountered problems with schema design and database operation that are due to binding the instances to well-defined classes or, in other words, the inherent classification of data. These problems would be more prominent in the new emerging phenomena of the social web (such as crowdsourcing and citizen science), in which the data are usually unstructured or semi-structured and uses of data are unanticipated. As a solution, three principles addressing separation of instances from classes, property abstraction, and underlying considerations of class structure were brought together in this thesis, and their implications within a semantic web implementation were analyzed. A triangular view of instances, properties, and classes was adopted to support these principles.

To achieve a practical implementation of the solution, technologies of the semantic web domain were chosen. Not only are these technologies widely used in ever-growing web applications, but the facilities they offer also provide a solid foundation for implementing the theoretical principles. Semantic web technologies, such as OWL, RDF, and triplestores, enabled the implementation of a class-less information repository in which the instances are not tied to the class structure. The proposed approach allows instances to have unique properties and belong to multiple classes. It also facilitates the interoperability and semantic reconciliation of different data sources by making the operations more flexible.

The underlying principles of this approach can help database designers avoid problems of information loss (instances and/or properties), proliferation of operations, and proliferation of classes.
These features make this approach suitable for handling the unstructured or semi-structured data which come from varying sources and are intended for various uses over the social web.

In this thesis “categories” and “classes” were distinguished from one another. The difference is that classes support inferences about their instances. The inferences that can be drawn from membership of an instance in a given class (i.e., benefits of classes) were not in the scope of the project. Rather, the focus was on instances, their properties, and the classes to which they belonged. However, the additionally inferred information can indicate the relationships that exist between the properties of instances. Studying the patterns of these relationships could be a very interesting area of business intelligence research.

As mentioned in the first chapter, organizational big data is a very important phenomenon. The challenge is not just in managing the volume of data but also in utilizing the big data beneficially (specifically in decision making). Currently, organizations base their decisions on a small portion of the data, because not all of the possible applications of the data are known to them (Brown et al. 2011). If organizations manage to analyze the data and identify patterns in information, they could turn big data into corporate assets. Companies will be able to change their decision-making processes and base their decisions on the newly inferred information. Research shows that companies that use business analytics in their decision making achieve higher returns on equity (Brynjolfsson et al. 2011).

Common business intelligence applications use operational databases and assume that data are organized in some pre-defined structure (e.g., customer or inventory records, usually organized in tables), which reflects some anticipated uses. Using statistical methods, business intelligence applications identify patterns that exist within the data.
As mentioned earlier, the applications of organizational big data are not always known in advance. Therefore, flexibility in information management is needed to manage data for which the future uses and applications might be unknown. Moreover, after the data are analyzed it would be important to test the newly learned information (such as new patterns). Using the implementation proposed in this work, analysts would be able to query new concepts without making changes to the existing data.

To demonstrate the above idea, consider the example of the Insurance Corporation of British Columbia (ICBC). ICBC collects data on the accident history of each insured client, the value of each vehicle, the purpose of using the vehicle (e.g., pleasure or work), and many other factors. Currently, ICBC uses accident histories and years of driving experience to determine insurance premiums. ICBC could analyze the data it collects to identify new patterns. The corporation might realize that for clients with the same accident histories, some might be less risky than others. In this hypothetical scenario, properties such as age, home location, vehicle type, and occupation might be identified as factors that correlate with risk. This hypothesis could be tested on the existing data, and if it is corroborated ICBC could add new properties or a new class to the domain ontology (called, for example, “Low-risk Driver”). ICBC employees and insurance agents would be able to query the existing data to identify low-risk drivers using the new properties.

In future research, techniques such as machine learning and statistical methods can be used to identify new concepts and classes. The flexible paradigm allows users to add the newly learned concepts to the structure without making any changes to data. The system allows for querying the new concepts over the existing data.
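The ICBC scenario can be made concrete with a small plain-Java sketch. It is a hypothetical stand-in for the ontology and triplestore rather than the implementation used in this thesis: the class `LowRiskDriverSketch`, the property names, and the thresholds in `isLowRisk` are invented for illustration only. A class is modelled as a membership rule held apart from the instance data, so adding “Low-risk Driver” leaves the stored instances untouched.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: a class is a rule in the ontology, evaluated over unchanged data. */
final class LowRiskDriverSketch {
    // Instance repository: instance id -> (property -> value). Class changes never modify it.
    private final Map<String, Map<String, Object>> instances = new LinkedHashMap<>();
    // Ontology: class name -> membership rule over an instance's properties.
    private final Map<String, Predicate<Map<String, Object>>> ontology = new HashMap<>();

    void addInstance(String id, Map<String, Object> properties) {
        instances.put(id, properties);
    }

    void defineClass(String name, Predicate<Map<String, Object>> rule) {
        ontology.put(name, rule);
    }

    /** Finds members by scanning every instance (the efficiency cost noted in section 6.4). */
    List<String> membersOf(String className) {
        List<String> members = new ArrayList<>();
        Predicate<Map<String, Object>> rule = ontology.get(className);
        if (rule == null) {
            return members; // the class is not (yet) part of the ontology
        }
        for (Map.Entry<String, Map<String, Object>> e : instances.entrySet()) {
            if (rule.test(e.getValue())) {
                members.add(e.getKey());
            }
        }
        return members;
    }

    /** Invented rule: at least 25 years old and not driving a sports vehicle. */
    static boolean isLowRisk(Map<String, Object> p) {
        Object age = p.get("age");
        return age instanceof Integer && (Integer) age >= 25
                && !"sports".equals(p.get("vehicleType"));
    }

    public static void main(String[] args) {
        LowRiskDriverSketch store = new LowRiskDriverSketch();
        store.addInstance("client1", Map.of("accidents", 1, "age", 45, "vehicleType", "sedan"));
        store.addInstance("client2", Map.of("accidents", 1, "age", 19, "vehicleType", "sports"));

        System.out.println(store.membersOf("LowRiskDriver")); // class not defined yet
        store.defineClass("LowRiskDriver", LowRiskDriverSketch::isLowRisk);
        System.out.println(store.membersOf("LowRiskDriver")); // same data, new concept
    }
}
```

In this sketch both clients have the same accident history, yet only the first satisfies the newly added class. Membership is computed at query time, which also illustrates why class queries in an instance-based store may need to scan all the data.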
Systems with volatile views (such as a business intelligence application in which new concepts are identified and are reflected in the users’ views) would benefit from the flexibility of the implementation proposed in this work. Beyond the applications of business intelligence, the possible research areas on classification and the instance-based paradigm are unlimited. The work done for this thesis does not end here, but opens new doors.

References

Batini, C., Ceri, S., and Navathe, S. 1992. Conceptual Database Design: An Entity-Relationship Approach, Benjamin Cummings Publ. Co., Inc., Redwood City, CA.
Berners-Lee, T. 2005. “An RDF Language for the Semantic Web: Notation 3 Logic,” http://www.w3.org/DesignIssues/Notation3.html, accessed on 19/01/2012.
Berners-Lee, T. A. 2009. “A Short History of “Resource” in Web Architecture,” http://www.w3.org/DesignIssues/TermResource.html, accessed on 07/12/2011.
Berners-Lee, T., and Fischetti, M. 2000. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor, Harper, San Francisco, CA, ISBN 978-0-06-251587-2, Chapter 12.
Berners-Lee, T., Fielding, R., and Masinter, L. 1998. “Uniform Resource Identifiers (URI): Generic Syntax,” http://tools.ietf.org/html/rfc2396, accessed on 07/12/2011.
Berners-Lee, T., Hendler, J., and Lassila, O. 2001. “The Semantic Web,” Scientific American Magazine, (May), pp. 35–43.
Brabham, D. C. 2008. “Crowdsourcing as a Model for Problem Solving: An Introduction and Cases,” Convergence: The International Journal of Research into New Media Technologies, (14:1), pp. 75–90.
Brown, B., Chui, M., and Manyika, J. 2011. “Are you ready for the era of ‘big data’?” McKinsey Quarterly, October, No. 4, http://www.mckinseyquarterly.com/Are_you_ready_for_the_era_of_big_data_2864, accessed on 24/04/2012.
Brynjolfsson, E., Hitt, L. M., and Kim, H.
H. 2011. “Strength in numbers: How does data-driven decision making affect firm performance?” Social Science Research Network (SSRN), April 2011, http://ssrn.com/abstract=1819486 or http://dx.doi.org/10.2139/ssrn.1819486, accessed on 10/02/2012.
Buneman, P. 1997. “Semistructured data,” Proceedings of the Sixteenth SIGMOD-SIGACT Symposium on Principles of Database Systems: PODS 1997, Tucson, AZ, pp. 117–121.
Buneman, P., Davidson, S. B., Hillebrand, G. G., and Suciu, D. 1996. “A query language and optimization techniques for unstructured data,” SIGMOD '96 Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, 1996, ACM, pp. 505–516.
Chen, P. P. 1976. “The entity-relationship model—Toward a unified view of data,” ACM Transactions on Database Systems, (1:1), pp. 9–36.
Clifton, C., Housman, A., and Rosenthal, A. 1997. “Experience with a Combined Approach to Attribute-Matching Across Heterogeneous Databases,” in Data Mining and Reverse Engineering. Proceedings of DS-7, IFIP, pp. 428–453.
Evermann, J., and Wand, Y. 2009. “Ontology Based Object-Oriented Domain Modeling: Representing Behavior,” Journal of Database Management, (20:1), pp. 48–77.
Franz, Incorporated. 2012. “AllegroGraph 4.4.0.1 Security Implementation,” http://www.franz.com/agraph/support/documentation/v4/security.html, accessed on 30/01/2012.
Holland, J., Holyoak, K., Nisbett, R., and Thagard, P. 1986. Induction: Processes of Inference, Learning, and Discovery, MIT Press, Cambridge, Mass.
Klyne, G., and Carroll, J. J. (eds.). 2004. “Resource Description Framework (RDF): Concepts and Abstract Syntax,” http://www.w3.org/TR/rdf-concepts/, accessed on 07/12/2011.
Lassila, O., and Hendler, J. 2007. “Embracing Web 3.0,” IEEE Internet Computing, (11:3), pp. 90–93.
McGuinness, D. L., and van Harmelen, F. (eds.). 2004. “OWL Web Ontology Language Overview,” http://www.w3.org/TR/2004/REC-owl-features-20040210/, accessed on 08/12/2011.
Oracle. 2009.
“Oracle Database Semantic Technologies,” http://www.oracle.com/technetwork/database/options/semantictech/whatsnew/index.html, accessed on 08/12/2011.
O’Reilly, T. 2007. “What is Web 2.0: Design patterns and business models for the next generation of software,” Communications & Strategies, (65), pp. 17–37.
Parsons, J., and Wand, Y. 1997. “Choosing Classes in Conceptual Modeling,” Communications of the ACM, (40:6), pp. 63–69.
Parsons, J., and Wand, Y. 2000. “Emancipating Instances from the Tyranny of Classes in Information Modeling,” ACM Transactions on Database Systems, (25:2), pp. 228–268.
Parsons, J., and Wand, Y. 2003. “Attribute-based Semantic Reconciliation of Multiple Data Sources,” Journal on Data Semantics, (1), pp. 21–47.
Parsons, J., and Wand, Y. 2008a. “A Question of Class,” Nature, (455), 23 October 2008, pp. 1040–1041.
Parsons, J., and Wand, Y. 2008b. “Using Cognitive Principles to Guide Classification in Information System Modeling,” MIS Quarterly, (32:4), pp. 839–868.
Pelligrini, T. 2006. “Jeen Broekstra: ‘The importance of SPARQL cannot be overestimated’,” Semantic Web Company Corporate News, http://www.semanticweb.at/news/jeen-broekstra-x22-the-importance-of-sparql-can-not-beoverestimated-x22, accessed on 08/12/2011.
Prud’hommeaux, E., and Seaborne, A. (eds.). 2008. “SPARQL Query Language for RDF,” http://www.w3.org/TR/rdf-sparql-query/, accessed on 08/12/2011.
Rosch, E. 1978. “Principles of Categorization,” in Cognition and Categorization, E. Rosch and B. Lloyd (eds.), Erlbaum, Hillsdale, N.J., pp. 27–48.
Smith, M. K., Welty, C., and McGuinness, D. 2004. “OWL Web Ontology Language Guide,” http://www.w3.org/TR/2004/REC-owl-guide-20040210/, accessed on 08/12/2011.
Stanford Center for Biomedical Informatics Research. 2011. “Welcome to the Protege wiki,” http://protegewiki.stanford.edu/, accessed on 13/12/2011.
Wand, Y., and Weber, R. 1990. “An Ontological Model of an Information System,” IEEE Transactions on Software Engineering, (16:11), pp.
1282–1292.
Wand, Y., Storey, V. C., and Weber, R. 1999. “An Ontological Analysis of the Relationship Construct in Conceptual Modeling,” ACM Transactions on Database Systems, (24:4), pp. 494–528.
www.wikipedia.org/wiki/Serialization, accessed on 05/03/2012.
W3C. 1999. “Web Characterization Terminology & Definitions Sheet,” http://www.w3.org/1999/05/WCA-terms/, accessed on 07/12/2011.
W3C. 2012. “Workshop Report: Linked Enterprise Data Workshop,” www.w3.org, accessed on 07/12/2011.

Appendices

Appendix A: SPARQL

Table A-1: General SPARQL Commands
Command | Description
Select | Specifies what the query should return.
Ask | Returns “yes” if the RDF pattern exists and “no” if it does not.
Describe | Returns triples within the graph, which are related to the query variable.
Construct | Creates the graph that matches the query pattern.
Where | Specifies the query pattern.

Table A-2: The structure of a general SPARQL query
Comments | Description
SELECT ?query_variable(s) | One or more query variables could be specified to be displayed in the output.
WHERE{ | The desired query pattern will be stated after the “WHERE” command.
?subject ?predicate <Given_Instance> | Triples that have a given instance as the object in the RDF statement.
?subject <Given_Predicate> ?object | Triples that have the given predicate as their predicate in the RDF statement.
<Given_Instance> ?predicate ?object} | Triples that have a given instance as the subject in the RDF statement.
Any of the above graph patterns could exist, or they could be mixed to describe more complicated patterns.

Appendix B: Reasoning in a Triplestore, using Jena API

This simple scenario demonstrates the reasoning process using the Jena API. Here a very simple ontology has been created in which there is only one object property and one class. The property is “operatingSystemUsed” and the class is named “Personal Computer”.
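This class is restricted by the choice of operating system: every instance that uses Microsoft Windows® (i.e., “Windows”) as its operating system is a “Personal Computer”. In the code below, the ontology is first imported into the triplestore. Then an unclassified instance called “laptopX” is created, and it is asserted that “laptopX” uses “Windows” as its operating system. Finally, the code runs a reasoning command (in the form of a query); the output shows that the triplestore has inferred that “laptopX” is a “Personal Computer”.

    // Package names assume Apache Jena 3.x; the Jena 2.x releases used the
    // com.hp.hpl.jena prefix instead of org.apache.jena.
    import java.io.InputStream;
    import org.apache.jena.ontology.OntModel;
    import org.apache.jena.ontology.OntModelSpec;
    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.util.FileManager;

    public class Reasoning {
        public static void main(String[] args) {
            // In-memory ontology model with the OWL Micro rule reasoner attached.
            OntModel m = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF);

            // Load the ontology (one object property, one restricted class).
            String inputFileName = "src/reasoning/PC.owl";
            InputStream in = FileManager.get().open(inputFileName);
            if (in == null) {
                throw new IllegalArgumentException("File:" + inputFileName + " not found");
            }
            m.read(in, null);

            // Create an unclassified instance and assert its operating system.
            // The property IRI is assumed to share the PC.owl namespace.
            Resource laptopX = m.createResource("http://www.example.org/PC.owl#laptopX");
            Resource windows = m.createResource("http://www.example.org/PC.owl#Windows");
            Property operatingSystemUsed =
                m.createProperty("http://www.example.org/PC.owl#operatingSystemUsed");
            m.add(laptopX, operatingSystemUsed, windows);

            // Ask for the (asserted and inferred) types of laptopX. ?p is bound by
            // the pattern so that it appears in the result rows.
            String queryString =
                "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
                + "SELECT ?p ?o WHERE { "
                + "<http://www.example.org/PC.owl#laptopX> ?p ?o . "
                + "FILTER (?p = rdf:type) }";
            Query query = QueryFactory.create(queryString);
            QueryExecution qe = QueryExecutionFactory.create(query, m);
            System.out.println("\nLaptop X is a:");
            try {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    System.out.println(results.next());
                }
            } finally {
                qe.close();
            }
        }
    }

Executing the above code will give the following output:

    ( ?p = rdf:type )( ?o = owl:Thing )
    ( ?p = rdf:type )( ?o = <http://www.example.org/PC.owl#PersonalComputer>)
    ( ?p = rdf:type )( ?o = rdfs:Resource )

In the output it can be observed that “laptopX” is an instance of the class “Thing” and that it is also a “Personal Computer”.

Appendix C: A portion of Instances that were used to Populate the Triplestore for the Purpose of the Case Study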
(Chapter 5)

Table C-1: Instances in the triplestore (RDF format)
Subject  Predicate  Object
John Smith  hasSibling  Amanda Smith  hasPostalCode  90525  hasSibling  John Smith  hasPostalCode  90210  hasChild  Daniel Cooper  hasChild  Sandra Cooper  hasBirthdate  1940-12-31  hasGender  Male  hasSibling  Sandra Cooper  hasChild  Robert Cooper  hasGender  Male  hasBirthdate  1995-09-15  hasGender  Female  hasChild  Jack Frost  hasChild  Miranda Frost  hasGender  Female  hasBirthdate  1998-01-10  hasSibling  Jack Frost  hasGender  Male  hasBirthdate  1980-05-21
Amanda Smith  Daniel Cooper  Robert Cooper  Sandra Cooper  Miranda Frost  Jack Frost

Appendix D: Linkages of Open Data Datasets

Figure D-1: Linkages of open data datasets (adapted from lod-cloud.net)
