Access Control in X M L PDMS Query Answering by Shuan Wang B . S c , Beijing Normal University, P.R. China, 2001 M . S c , Peking University, P.R. China, 2004 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L M E N T O F THE REQUIREMENTS FOR T H E DEGREE OF Master of Science in The Faculty of Graduate Studies (Computer Science) The University Of British Columbia February 14, 2007 © Shuan Wang 2007 Abstract Peer data management system (PDMS) is a decentralized system, in which each peer is autonomous and has its own schema and database. With the help of pairwise schema mapping built between any two relevant peers, a query at one peer can be rewritten and broadcast to the whole P D M S . Then answers from multiple peers are returned to the querying peer. In our thesis, we exploit the access control issues in the query-answering process of the X M L P D M S . We propose a formal syntax for access control policy (ACP) to specify the fine-grained access control privileges on peers' local X M L database. We also design several query-answering algorithms that aim to handle access control in the P D M S , define the algorithm properties of Information Leakage Free and Completeness, and analyze every designed query-answering algorithm on the two properties. A comprehensive cost model, which consists of the major tasks and primitive operations, is proposed by us to assess the query-answering algorithms. We implement the designed query-answering algorithms, compare their running time, and test the scalability in different facets. ii Table of Contents Abstract ii Table of Contents iii List of Tables vi List of Figures vii Acknowledgements 1 2 Introduction ^. 1 1.1 Background 1 1.2 Motivation and Challenges 4 1.3 Problem Statement 6 1.4 Contributions . 1.5 Thesis Outline 7 : Related Work 2.1 3 ix 7 • Peer Data Management System (PDMS) 2.2 Query Containment 2.3 Access Control on X M L Documents . . 9 '. - 9 • • . . 12 14 Access Control in X M L P D M S 16 3.1 The Problem in General 16 3.1.1 General Sense of Information Leakage and Completeness . 17 3.1.2 General Methods of Access Control 17 3.1.3 Example 19 3.2 Access Control Policy (ACP) Formal Definition 20 3.3 P D M S Scenarios with A C P Examples 22 3.4 Semantics of P D M S Query-Answering under A C P s 28 iii Table of Contents 4 Strategies and Options for the Query-Answering Process . . . 4.1 30 Intuition 30 4.2 Formal Definitions for Strategy and Option 32 4.3 Basic Assumptions 33 4.4 Strategies and Options Designed 35 5 Information Leakage and Completeness for (S,0) Pairs . . . . 5.1 5.2 40 Information Leakage (IL) 40 5.1.1 Definitions 40 5.1.2 The Sufficient and Necessary Condition for IL-Free . . . . 42 5.1.3 IL-Free Analysis for all (S,0) pairs 43 Completeness •. 46 5.2.1 Definitions 46 5.2.2 The Sufficient and Necessary Condition for Completeness 48 5.2.3 Completeness Analysis for all (S,0) pairs 50 6 Cost Analysis for (S,0) pairs 54 6.1 The Cost Model 54 6.2 Cost Analysis Result 59 6.2.1 Query Transmitting Cost 60 6.2.2 Query Rewriting Cost 60 6.2.3 A C P Evaluation Cost 61 6.2.4 Answer Routing, Cost 62 6.2.5 A C P Distributing Cost 64 6.2.6 Annotating Cost 65 6.2.7 Annotation Shipping Cost 66 6.3 Hypothesis for Best (S,0) pairs 67 69 7 Algorithm Details 7.1 7.2 7.3 Algorithm for Query-Rewriting in light of A C P s 69 7.1.1 Algorithm Description . . . . • 69 7.1.2 Example. 70 Algorithms for 0 75 3 7.2.1 Safe-Peer-List Finding Algorithm 75 7.2.2 Answer Routing Algorithm 76 Algorithms for Option 5 7.3.1 Partitioning Semantics and Methodologies . • 80 80 iv Table of Contents 8 9 7.3.2 Data-level Partitioning Algorithm 81 7.3.3 Schema-level Partitioning Algorithm 82 Experimental Study ^ 86 8.1 Experimental Settings and Implementation 86 8.2 (S, O) Pair Comparison and Analysis 87 8.3 Scalability Results and Analysis 91 Conclusions and Future Work Bibliography 97 99 List of Tables 5.1 IL-free Result Matrix , . 5.2 Completeness II Result Matrix 6.1 Query Transmitting Cost (unit: query-hop) 60 6.2 Query Rewriting Cost (unit: qrewrite-acp) 61 6.3 A C P Evaluation Cost (unit: acp-eval) 61 6.4 Answer Routing Cost (unit: tuple-hop) 63 6.5 A C P Distributing Cost (unit: acp-hop) 64 6.6 Annotating Cost (unit: annot-update) 6.7 Annotation Shipping Cost (unit: annot-hop) . . 43 50 65 •• • 66 vi List of Figures 1.1 A Simple P D M S Example 1.2 Motivation Example for Access Control in an X M L P D M S . . . . '. 2 5 2.1 Data Integration System Example 10 2.2 P D M S Example 10 3.1 Example for General Methods of Access Control 3.2 Hospital P D M S Example 3.3 Conference Example . 3.4 Company Management P D M S Example 26 3.5 Example for Semantics of P D M S Query Answering under A C P s . 29 5.1 Problem in Completeness I (Ideal Completeness) 47 7.1 Query-Rewriting in light of A C P Algorithm 71 7.2 Example for Safe-Peer-List Finding Algorithm 75 7.3 Safe-Peer-List Finding Algorithm 77 7.4 Example for O3 Answer Routing Algorithm 78 7.5 03 Answer Routing Algorithm 79 7.6 Answer Partitioning Semantics 80 7.7 Data-level Partitioning Algorithm 83 7.8 Schema-level Partitioning Algorithm 8.1 Running Time Comparison of (S,0) Pairs, in case of Large Net- ' 19 . 22 • • 24 ; work Latency 8.2 89 Running Time Comparison of (S, O)'Pairs, in case of Small Network Latency . 8.3 85 Running Time Comparison of (5,0) 90 Pairs. Under this experi- ment setting, the P D M S topology and A C P distribution benefit O3 finding a short answer-routing path, but doesn't benefit OIA- 92 vii List of Figures 8.4 the Running Time of the Query-Answering Process V.S. the Database Size of the Target Peer for Each Compared (S, O) Pair. The network latency is 100 ms 8.5 93 the Running Time of the Query-Answering Process V.S. the Database Size of the Target Peer for (Si, 0\). The network latency is 1000 ms 8.6 94 the Running Time of the Query-Answering Process V.S. the Number of A C P s per Peer for Each Compared (S, O) Pair 8.7 95 the Running Time of the Query-Answering Process V.S. the Number of A C P s per Peer for ( S i , 0 i ) , with the Network Latency of 0 ms, 100 ms and 200 ms Separately 8.8 95 the Running Time of the Query-Answering Process V.S. the Length of the Longest Path from Source Peer to Target Peer for (Si, 0\). The experiment is done for both a P D M S with 20 peers and a P D M S with 30 peers . 96 viii Acknowledgements I would like to thank all the people who gave me support and help throughout my degree. First of all, I would like to thank my supervisors, Professor Laks V.S. Lakshmanan and Professor Rachel Pottinger, for their patient guidance and encouragement in my research work at U B C . Laks introduced me into the area of access control in X M L P D M S , and taught me the approaches how to explore unknown questions and study them clearly. Rachel brought forth many insightful ideas in our group discussion, guided my experiment study, and paid great effort in modifying my thesis. They both helped me a lot in improving my communication skills. Second, I would like to thank Professor Alan Wagner for dedicating his time and effort in reviewing my thesis. Third, many thanks to all my colleges and friends for their friendly support, especially to Jian X u , Jie Zhao, Shaofeng Bu, Wendy Wang, Suling Yang, X i aodong Zhou, Terence Ho. Thanks a lot to Holly Kwan for her kind help in my everyday life as our lab secretary. Last but certainly not least, I would like to thank my families for their endless love and firm support. ix Chapter 1 Introduction The peer data management system (PDMS) is emerging as a flexible distributed data management architecture. Moreover, with the significant increase of web data, X M L is now used as the underlying data model of peers in a P D M S . However, the existing P D M S research has paid little attention to the access control requirement in each peer for its database, which might greatly affect the query-answering process in a P D M S . The access control issues in an X M L P D M S will be explored in our thesis. In this chapter, we first introduce the background knowledge of P D M S , X M L , and X M L queries (Section 1.1), then motivate our work by a concrete example (Section 1.2). Section 1.3 concisely states the access control problem in an X M L P M D S . Our main contributions are summarized in Section 1.4. 1.1 B ackground In this section, we introduce the background knowledge of our work: peer data management systems, X M L and X M L queries. A peer data management system (PDMS) is a distributed database management system based on a peer-to-peer architecture. Each node in a P D M S is called a peer. A peer is autonomous, has its own database and schema. A peer can join and leave the P D M S dynamically. Unlike the data integration system, there is no server playing the central-control role in a P D M S . If two peers are considered to be similar, one of their administrators builds a mapping between the database schemas of the two peers. Such peers are called acquaintances. Thus, the topology of a P D M S is an arbitrary connected graph, in which each edge is such a pairwise mapping. A query can be put forth at any peer. The query is first evaluated at the peer's local database, then it is passed to each of its acquaintances. When the query is passed to each acquaintance, the mapping is used to translate the query into a new query over the acquaintance's schema. Similarly, it is then passed to all acquaintances of all those acquaintances and 1 Chapter 1. Introduction thus broadcast to the whole P D M S . Finally the answers at every relevant peer is returned to the querying peer. Vancouver General Hospital Schema Montreal General Hospital Schema Boston General Hospital Schema Toronto General Hospital Schema Figure 1.1: A Simple P D M S Example As an illustration, Figure 1.1 shows a simple P D M S with four peers: Vancouver General Hospital, Montreal General Hospital, Boston General Hospital, and Toronto General Hospital. In this example, Toronto General Hospital is an acquaintance of Boston General Hospital, so a mapping is built from Toronto General Hospital Schema to Boston General Hospital Schema (the mapping is denoted by an arrow from Toronto General Hospital Schema to Boston General Hospital Schema). Similarly, other pairwise mapping are built between peers. When a query Q is put forth at Toronto General Hospital, it is first evaluated locally. Then Q is rewritten into Q' according to the mapping from Toronto General Hospital Schema to Boston General Hospital Schema. Q' is sent to Boston General Hospital and evaluated there. The answer of Q' is routed back to Toronto General Hospital. By this way, rewritten queries are broadcast in the whole P D M S , and the answer from each hospital is returned to Toronto General Hospital. X M L (extensible Markup Language) currently is the W 3 C recommendation for publishing electronic data on the web. Nowadays, it is the de facto standard for web documents and data storage. A n X M L document is plain text interleaved with some markup, which divides the document content into character data, container elements, and attributes of the elements. There is one and only one root element in an X M L document. Sub-elements are embedded within an element. Thus, an X M L document is modeled as a tree structure, in which each node is an element or a character string. Normally, an X M L document is 2 Chapter 1. Introduction accompanied with an X M L Schema, which fully specifies the structure and data type information for this document. Therefore, X M L can be used as databases for peers in a P D M S . Mappings are built between schemas of X M L databases residing on acquaintances. The standard query form for X M L databases is XQuery. X P a t h is the main functional structure of XQuery, and it is the syntax to accurately address parts of an X M L document. A n XPath is a path expression for a sequence of steps from one node to another node. In each step, there are three components: (1) axis specifier: ' / ' denotes child, ' / / ' denotes descendant, '@' denotes attribute, etc; (2) node test: 'comment()' denotes a comment node, text()' denotes the l text value of a node, etc; (3) predicate: a mathematic expression put in a square bracket as a filter. Predefined operators can also be used in XPath, such as '|' denoting the union of two node sets. As the first example, the XPath expression "publication//paper/*[@id='001']" selects the element, whatever its name ('*'), if its id attribute value of '001', who is a child ('/') of a paper element that itself is a descendant ('//') of a publication element. As a more concrete example, the Xpath expression "publication//paper[/author/text()='Rachel Pottinger']" selects the paper element, if it is a descendant ('//') of a publication element and has an author child element ('/') whose text content ( text()') is l Rachel Pottinger. This Xpath expression retrieves the full paper list for Rachel Pottinger. XPath queries can be categorized into several fragments according to whether including ' / ' , ' / / ' , '[ ]','*', ' | \ Schema or D T D (another type of X M L schema). In this thesis, we concentrate on the X P a t h fragment only with '/'> '//'> '[ ]'• F ° instance, our second X P a t h example "publir cation//paper[/author/text()='Rachel Pottinger']" belongs to this XPath fragment. The tree pattern is the key construct for modeling XPath. A tree pattern includes two components: (1) a tree, in which the nodes are labeled with variables, (2) a set of formulas, which are constraints on the tree nodes and their properties (i.e. tags, attributes, contents). The tree has two types of edges: pc (parent-child) edges and ad (ancestor-descendant) edges, which correspond to ' / ' and ' / / ' in XPath. 3 Chapter 1. Introduction 1.2 Motivation and Challenges As a flexible data management environment, a peer data management system is suitable for many applications, such as the public medical institutions, the international company management, and the insurance system, etc. For example, a public medical institution environment may consist of several hospitals, healthcare centers, the Ministry of Health, and emergency units. Each institution is independent, has its own database and share the data across the web. Quite often these institutions need to collaborate. For instance, when a patient is transferred between hospitals, the patient's medical history needs to be shared. Probably there is no global schema for all the hospitals, so a data integration system does not help. A P D M S is useful at this time. With the help of the pairwise schema mapping, the patient's illness history can be easily transferred from one hospital to another one. Furthermore, a query asking for one patient's information can be put forth at a peer and broadcast in the whole P D M S , and results will be retrieved from every relevant peer. Although the existing P D M S projects [13, 35, 37, 40] can handle the problems of schema mapping and query rewriting, they do not effectively take into account the access control requirements of peers, i.e., all the data on each peer is public for other peers. This is not true for a realistic application. Because a peer is autonomous, it has the requirement to define access control privileges on its database, i.e., which peers have the right to access a specific part of its database. For example, a hospital may only allow other hospitals to access the illness history of a patient, but forbid any institution to access the personal information of a patient. Such access control requirements are so common in today's database management systems that they should not be ignored in a realistic P D M S . When access control exists in a PDMS,,security problems will arise. The existing query-answering algorithm does not work well in this case. Let us observe a concrete example. It is shown in Figure 1.2. There are four peers in the X M L P D M S : Vancouver General Hospital, Montreal General Hospital, Boston General Hospital and Toronto General Hospital. The schemas for their X M L databases are shown in the figure. The pairwise mappings of their schemas are denoted by dash arrows. For simplicity, we call the four peers Vancouver General Hospital, Montreal General Hospital, Toronto General Hospital and Boston General Hospital separately as 'vg', 'mg', 'tg' and 'bg'. And the database residing on each peer is 'vg.xml' for 'vg', 'mg.xml' for 'mg', 'tg.xml' for 'tg', 'bg.xml' for.'bg'. The possible message routing paths are denoted by 4 Chapter 1. Introduction Chapter 1. Introduction bold arrows. Suppose a query "retrieve the illness history of Mary Smith" is put forth at 'vg', rewritten according to schema mappings, and broadcast to the P D M S . Let us treat 'tg' as the current answering peer. The rewritten query is evaluated on 'tg.xml' to get the 'Event' elements for Mary Smith. If 'tg' does not specify any access control on its database, i.e. any peer can access all the data of 'tg.xml', the answer can be routed back to 'vg' via either the path "tg —> mg —> vg" or the path "tg —* bg —» vg". This is the existing query-answering algorithm. The case with access control may be different. Suppose in our scenario, 'mg' and 'tg' do not have a collaboration relationship such that 'tg' specifies the access control of forbidding 'mg' to access any information on it. Thus, the answer for the query "retrieve the illness history of Mary Smith" at 'tg' can not be routed via the path "tg —» mg —> vg". Otherwise, information leakage will arise, i.e., 'mg' will see data that it is forbidden to access by 'tg'. The answer can only be routed via the path "tg —> bg —* vg". From this example, we see that the access control requirements of peers affect the P D M S query-answering process. Furthermore, access control on a peer database can be more fine-grained and complicated than the previous example, especially when X M L is the data model. What is the impact of access control on the P D M S query-answering process is still unknown according to existing research work. The major challenges we are faced with in a X M L P D M S with access control include: (1) How can we specify the access control requirement for a peer's X M L database, which is fine-grained and expressive enough? (2) What is the semantics of P M D S query-answering with access control? (3) What kind of algorithms can be used for the P D M S query-answering process? (4) What is the security property of these algorithms? (5) How to build a rational cost model and assess the algorithms using this model? A l l these challenges will be tackled in this thesis. 1.3 Problem Statement A peer in a realistic peer data management system probably has access control requirement on its own database. Therefore, a precise syntax for specifying a access control requirement is necessary for an X M L P D M S . Furthermore, in a P D M S with access control, a naive query-answering algorithm no longer works in terms of the security issue. Thus, new query-answering algorithms 6 Chapter 1. Introduction need to be designed, theoretically ensuring no information leakage and other good properties. A cost model is also required to assess any query-answering algorithm for an XML PDMS. 1.4 Contributions The following contributions are made in this thesis: • We propose a formal syntax for the Access Control Policy (ACP), which is fine-grained and expressive enough for specifying the access control privilege on the XML database of a peer in the PDMS. Semantics of PDMS query-answering with ACPs is also presented. (Chapter 3) • We divide a query-answering algorithm into two parts: a (query transmitting) Strategy and an (answer routing) Option. Several strategies and options have been designed to handle the access control requirements in PDMS. (Chapter 4) • Some novel algorithms in the strategies and options, such as (i) query rewriting algorithm in light of ACPs (ii) safe peer list finding algorithm (iii) annotating and partitioning algorithm, are presented. (Chapter 7) • As important properties, Information Leakage Free and Completeness for an (Strategy, Option) pair are formalized. We propose the sufficient and necessary condition for the two properties, and analyze these properties for every (Strategy, Option) pair designed. (Chapter 5) • We build a comprehensive cost model, which includes the major tasks and the corresponding primitive operations and cost units. The cost model is used to assess the (Strategy, Option) pairs designed. (Chapter 6) •^We experiment on the designed (Strategy, Option) pairs, compare their execution speed, and test the scalability in terms of ACP amount pe peer, database size, etc. (Chapter 8) 1.5 Thesis Outline The remaining of the thesis is organized as follows. Chapter 2 reviews related works on peer data management system (PDMS), query containment, and access 7 Chapter 1. Introduction control on X M L documents. In Chapter 3, we present the general access control problem in the X M L P D M S , the formal definition of Access Control Policy (ACP), and the semantics of P D M S query-answering with A C P s . In Chapter 4, we divide a query-answering algorithm into two parts - a strategy and an option. Several strategies and options, which can handle access control, are also designed there. Chapter 5 presents the formal definitions of IL-free and completeness, the sufficient and necessary condition for each of them, and the analysis result for all (Strategy, Option) pairs designed. In Chapter 6, we propose a comprehensive cost model that is used to assess all (Strategy, Option) pairs. In Chapter 7, some novel algorithms adopted in our strategies and options are elaborated and illustrated in detail. Chapter 8 is the experimental study for algorithm comparison, algorithm scalability, etc. Finally, our conclusions are stated in Chapter 9, along with the future work. Chapter 2 Related Work As described in Chapter 1, the work of the thesis concentrates on the access control scheme of the X M L peer data management system. Peer data management system (PDMS) is the network environment we are working in; access control is the main issue we are researching on; and query containment is a necessary theoretical tool to design the query writing algorithm in the P D M S query-answering process and to ensure the algorithm correctness. Therefore, in this chapter we will summarize the previous research work on peer data management system (Section 2.1), query containment (Section 2.2) and access control on local X M L documents (Section 2.3). 2.1 Peer Data Management System (PDMS) Data integration systems have been researched and adopted in academia and industry for a long time [5, 8, 16, 22, 28, 29, 30, 31]. They work well for sharing information in a specific domain. However, data integration is faced with a big problem: it requires to.predefine a mediated schema before all nodes can share information. Thus the mediated schema has become a bottleneck' in a data integration system. Recently, the idea of a peer data management system (PDMS) [23] has emerged as a step beyond data integration systems. A PDMS is a distributed database management system based on a peer-to-peer architecture. In such a system, each web node is an autonomous peer and has its local database management system. The P D M S satisfies the need to have a decentralized, loosely-coupled data management environment, in which any web node can have different data model and contribute data, schema or mappings among schemas. Unlike data integration systems, a P D M S does not require a central control server or a global schema. Instead, mappings are constructed between the schemas of any two related peers. A simple example can help to understand the difference between data in9 Chapter 2. Related Work Figure 2.1: Data Integration System Example Figure 2.2: P D M S Example tegration system and P D M S . It is a scenario about sharing database researchrelated data that is used in the Piazza project [40]. The data integration system is shown in Figure 2.1, and the P M D S having the same functionality is shown in Figure 2.2. The schema and database of each independent node are put in a dotted frame. A n arrow denotes a mapping from one schema to another schema. The data integration system is a tree-based hierarchy, in which the Mediated Schema is the root node and all other machines are the leave nodes. There exists a mapping from the Mediated Schema to each node schema (e.g. UPenn Schema). Any query can only be put forth to the Mediated Schema, rewritten into some sub-queries (according to the schema mappings) and distributed to each leaf node (i.e. Princeton peer, UPenn peer, U W peer, Stanford peer and Berkeley peer). The sub-queries are evaluated at each leaf node, then the answers are returned to the root node. On the other side, the P D M S is an arbitrary connected graph of nodes/peers. There is no such a peer who holds a global mediated schema. Mappings are built between the schemas of any two relevant peer (e.g. the mapping from Stanford Schema to Berkeley Schema). A query can be put forth at any peer. Besides evaluated at the local database, the 10 Chapter 2. Related Work query is rewritten into new' queries with the help of the schema mappings and broadcast in the P D M S . Answers from every peer will be returned to the querying peer. For example, if a query Q is put forth at Stanford peer, Q is firstly evaluated at Stanford Database. Meanwhile, Q is rewritten into Q' according to the mapping from Stanford Schema to Berkeley Schema, sent to Berkeley peer and evaluated there. Q is also rewritten into Q" and sent to U W peer. By this way, the original query Q is broadcast in the whole P D M S . A l l answers will be returned to Stanford peer in the end. In the following, We will analyze some important P D M S projets. Piazza [23, 24, 40] is a classic P D M S using X M L as the peer data model. Because each peer may have a different schema, Piazza provides a pairwise schema mapping language similar to XQuery [12] and a query reformation algorithm for rewriting queries between peers. Piazza recognizes and motivates the access control as an important problem for a P D M S , but the only solution to the problem is a description of general plans to use encryption to enforce security. The details of this approach were left as,future work. The Hyperion project [37] is relational database-based P D M S . Hyperion builds and manages the mapping tables between peers at run time. And both schema-level and data-level mappings are supported. Hyperion's emphasis is query answering among heterogenous peers, and it doesn't address access control or security issues. HePToX [13] is a P D M S prototype using X M L as peers' underlying databases. Peers are heterogenous. HePToX emphasizes on semi-automatically generating Datalog-like mapping rules and the efficient query translation algorithm. Access control issue is not considered in HePToX. PeerDB [35] is an interesting system. The underlying database for each peer is a relational database system. However, there is no mapping between peer schemas. Instead, PeerDB uses the Information Retrieval (IR) technique to retrieve answers from different peers. Thus, PeerDB is a combination of database and IR systems. No access control issue is discussed in PeerDB. As a conclusion, we see that the existing typical P D M S systems have not studied the access control problem, although some of them have recognized it as an important issue for a realistic P D M S . That is the work we will exploit in this thesis. 11 Chapter 2. Related Work 2.2 Query Containment As we mentioned at the beginning of this chapter, query containment will be used as a theoretical tool to design our query rewriting algorithm in the P D M S query-answering process. Intuitively speaking, when a query Q is rewritten into a new query Q' in light of an access control rule R, it must ensure that the answer of Q' is contained by both the answer of Q and the answer of R. That is why query containment is important for our work. Query containment problem was firstly exploited for conjunctive queries. Conjunctive Query (CQ) is put forth by A . K . Chandra and P . M . Merlin [15]. A Conjunctive Query is defined as a datalog rule H <— G\, ...,G„, where H is the head, the right side hand is the body and G,(i = l..n) are relations, which are referred to as subgoals. The answer for a conjunctive query Q<evaluated on a relational database D is denoted as Q(D). In more details, Q(D) is the set of the head got by performing a possible value substitution for variables in Q, where the substitution turns every subgoal of Q's body into a tuple in D. Example 2.1 p(X, Y) <— a(X, W), b(W, Z), c(Z, Y) is a conjunctive query Q. Specifically, it states the following. The body describes the relations a, b, and c. The re-use of variables indicates that the values must be the same. So the body specifies that the second attribute of a must equal the first attribute of b, and that the second attribute of b must equal the first attribute of c. For each set of tuples that satisfy the body requirement, the head is instantiated. In this case, a new tuple of relation p is created, and the attributes of p are given the values of X and Y from the body. E.g., given a database D, there are three relational tables a, b and c in it. In a, there is one tuple (xi,wi); In b, there is one tuple (w\,z\); In c, there is one tuple (z\,y\). Then for the substitution {X = x\, Y = 2/i, Z = z\, W = W,}, each subgoal in the body of Q is a tuple in the database D. Thus, the instantiated headp(x\,y\) is in Q{D). Give the definition of C Q and the meaning of a C Q evaluated on a database, CQ Containment can be defined as: Let Qi and Q be two CQs. Q\ C Q iffV database D: Qi(D) C 2 2 Q2(D). There are two classic approaches to test C Q containment: (1) containment mapping, (2)canonical database. Containment Mapping can be defined as: 12 Chapter 2. Related < 2 i and Q2 are CQs, where Q\ : head\ head2 <— SG\,5G . Vi : u(SGi) <— sg\,.:.,sgk and Q2 : Then a containment mapping is a function m fi : vars(Q2) Work —> i>ars(Qi) such that (1) p(head2) = heqdi, (2) = sgj, for some j . If Qi and Q2 are CQs, then Q\ C Q iff 3 a containment mapping /x : vars(Q2) —> 2 •yars(Qi). Example 2.2 We fcai/e two CQ's: Q x : p{X,Z) <- o(X, W)> 6(IV, Z) and Q2 : p[X,Z) <— o ( X , W ) , 6(y, Z ) . to vars(Qi): W —> W, X —* X, Y —> IV, Z —> Z . There exisis a mapping fi from vars(Q2) /x makes the mapped head of Q2 as p(X, Z), which equals the head of Q\. For Q2's subgoals, we see that p(a(X, W)) = a(X, W), which is a subgoal in Qi's body, and p,(b(Y, Z)) = b(W, Z), which is a subgoal in Q\'s body. Thus fi is a containment mapping from Q2 to Qi, then Q\ C Q2. Canonical Database method is to build a small number of databases D\, D , such that Q\ C Q iff Qi(Dj) ^ Q2{Di), where i = 1, ...,n. This method n 2 is not used in our work, so we do not illustrate it here. For more details, please refer to [41]. The C Q containment problem has been recognized as NP-complete in [15]. Much attention have been devoted to finding special classes of queries that admit polynomial time algorithms for containment and minimization[6, 7, 11, 17, 26, 27, 43], With the popularity of X M L query applications, query containment research expands from CQs to X M L queries. • XQuery [12] is recognized as the X M L query standard. But it provides too many supportive structures, such as the F L W O R expressions and constructors. To avoid being distracted by these supportive structures, researchers have concentrated on X P a t h and its equivalent representation Tree Pattern, which is the main functional structure of XQuery. The concept and approaches for C Q containment have evolved to the work for containment of XPath queries (or tree patterns). The difference of X P a t h containment from C Q containment is that the query structures are trees and the queries may have recursions. The semantics of XPath containment is exactly the same as that of C Q containment. Existing approaches for checking C Q containment work for X P a t h query containment as well, but after some extension. The canonical model 13 Chapter 2. Related Work (database) technique [19, 32, 33], the homomorphism (containment mapping) technique [9, 20, 33, 36, 42] are widely used in the complexity ,analysis and algorithm design of X P a t h containment. For example, homomorphism is formally redefined for tree pattern containment in [20]: "A homomorphism h from a pattern q to a pattern p is a total mapping from the nodes of q to the nodes of p such that: (1) h preserves node types (i.e. Vu £ JV, : A (u) ^' *' => X (u) = A (/i(u)), 9 q p where u is a node in q, N is the node set of q, and AQ is the funcq tion to find a node tag.); (2) h preserves structural relationships (i.e. whenever v is a child (resp. descendant) of u in q, h(v) is a child (resp. descendant) of h(u) in p)." Checking query containment for many X P a t h fragments has been verified to be extremely hard. Fortunately, for some X P a t h fragments we can still find polynomial time algorithms. A l l the complexity results for containment of different XPath fragments are summarized in [38]. As mentioned in Section 1.1, in the thesis we deal with an X P a t h fragment only with ' / ' , ' / / ' , '[ ]', which is shown in [38] to have a polynomial time algorithm for finding a query containment. Most X P a t h expressions in usual X M L queries fall into this fragment and it is a good start for us to design algorithms for this XPath fragment. 2.3 Access Control on X M L Documents Access control in an X M L P D M S is the main problem we are working on. It is necessary for us to analyze the existing approaches for securing X M L documents. With the development of web-based applications, X M L has become the de facto standard of semi-structured data representation. It provides an easy way to publish information. Selective distribution and sharing of X M L documents requires enforcement of access control. This ensures that specific information is accessible only to authorized entities or roles. Different access control approaches for local X M L documents have been proposed. Among them, access control policy model is widely recognized as an expressive, fine-grained method. A n Access Control Policy is a rule defined to permit or deny the use of some objects/elements in an X M L document by a subject/user. XACML[39] presents an X M L schema for specifying access control policies on X M L documents. However, it is very complicated and even requires a spe14 Chapter 2. Related Work cific processing model to interpret the access control policies. Paper [25] defines an access control language using the concept of role, which is an abstract representation of a set of privileges and could be assigned to users. It supports both read/write and positive/negative policies. Paper [21] formalizes access control policies in a SQL security model compatible manner, but it doesn't support negative policies. For all these work, the permission/prohibition on an element is automatically propagated to its subelements. The above work concentrate on the formal expression of an access control policy, not on its usage. Access control policies can also be manipulated in different ways. The method of [18] is view-based. It allows the definition and enforcement of access control directly on X M L documents, then produces a separate view on the document for each user. The method of [10, 34] are encryption-based. They define a formal syntax of access control policies for X M L documents, and encrypt different portions of the same document according to different encryption keys. Then various users can use their own encryption keys to get the desired portion of the same encrypted document. Paper [14] is the first step to handle querying X M L data in light of access control policies. Its access control policies are XML-compatible. But only very simpleXQuries can be transformed to directly incorporate restrictions of access control policies on XQuery variables. In later chapters, we will see that our access control policy model supports both read/write and positive/negative privileges, and it plays an important role in the query rewriting algorithm. 15 Chapter 3 Access Control in X M L PDMS In Section 2.3, we have summarized the work of access control on local X M L documents. However, the existing research work does not reveal what problems will arise in a P D M S , where access control on distributed X M L data sources are' required. In Section 3.1, we describe a general view of the access control problems in a P D M S . Then we concentrate on our solution - Access Control Policy (ACP) in the X M L P D M S . We present the A C P formal syntax in Section 3.2, the A C P examples in Section 3.3, and the semantics of P D M S query answering under A C P s in Section 3.4. 3.1 The Problem in General Access control and its subsequent problems arise not only in local X M L documents, but in peer data management systems. As the owner of a database, a peer is not always ready to publish all its data for any other peer. Peers need to control their data in fine-granularity, i.e., which part of data can be accessible by which peers. When access control exists in peers of a P D M S , some problems will arise. In Section 3.1.1 we will present the general sense of two important problems in access control:, information leakage and answer completeness. There are multiple methods that can be used to enforce access control. Different possibilities have different characteristics. In Section 3.1.2, we analyze a few typical methods. In Section 3.1.3, we use an example to illustrate what the differences are in these methods. 16 Chapter 3. Access Control in XML PDMS 3.1.1 General Sense of Information Leakage and Completeness Generally speaking, information leakage means that some protected data are accessed by unauthorized subjects. Given certain access control requirements in a P D M S , information leakage must be avoided. It is the basic security issue for a P D M S . Information leakage in a P D M S mainly include the following aspects: • in the query-answering process, an answer tuple is routed to a peer, who is not authorized to see this tuple according to any access control rule; • some data is malevolently exposed to peer p\ by peer even p\ is not authorized to see these data by access control rules of the original data owner; • access control instances are improperly distributed to unauthorized peers. The first aspect is our focus, which can be effectively avoided by a well designed query-answering algorithm. The other two aspects are not issues that can be solved at an algorithm level. So they will not be in the scope of our work. Completeness refers to the answer completeness. After a query is put forth by a peer in a P D M S , completeness refers to that, the maximum answer set will be retrieved. Given enough time, network bandwidth and powerful local computation capability, we expect the requesting peer can get back the theoretically maximum answer set. A good query-answering algorithm for a P D M S can ensure the maximum and sound answer set to return. 3.1.2 General Methods of Access Control Now we have a general idea about access control. We need to know more about how access control can be enforced in a P D M S . Let us briefly study the general methods that can be used in the P D M S access control. (1) E n c r y p t i o n vs. N o E n c r y p t i o n When the intermediate answer is routed back to the source peer, the system must ensure no information leakage in this process. To ensure that there is no information leakage, either encryption or non-encryption method can be used. Using the encryption method means to encrypt the answer at the target peer, and decrypt the answer when it arrives at the source peer. Using the non- encryption method means to route the original answer from the target peer to 17 Chapter 3. Access Control in XML PDMS the source peer via a selected path (maybe answer transformations are needed during the process), while ensuring that every peer in the path have right to access the answer. Using the encryption method, the answer can be routed along any path. Its overhead is that the answer needs to be encrypted at the target peer and decrypted at the source peer. Furthermore, the target peer needs to know who is the source peer for each incoming query, such that the decryption key can be distributed. Using the non-encryption method, there is no overhead caused by the encryption and decryption, but there exists a risk of information leakage and if steps are taken to reduce it, the returned answer set may be incomplete. Thus, the answer routing algorithm should be carefully designed. In a P D M S , any peer can be a source peer or a target peer, so the aforementioned decryption key distribution in the encryption method is a heavy burden. Moreover, because of scheme heterogeneity, when an answer set is routed among peers, it is decrypted and rewritten adhering to the database schema of each passing peer, and then encrypted for routing to the next peer. That means, decryption and encryption are needed at each routing peer. This is another heavy burden. Thus, in the thesis we concentrate on query-answering algorithms without encryption. (2) Evaluating vs. Rewriting How is a query handled and computed in a P D M S with access control? There are two different methods: evaluating and rewriting. Evaluating a query means passing along a query as initially written (presumably along with some annotation of what the passing peers are), and then the target peer that is returning the answer is responsible for extracting only the tuples that are relevant according to the access control requirements. Rewriting a query means taking the query along the way and changing the query at each peer such that. it adheres to the access control requirements for the peer. Using evaluating, the query is enforced with access control rules once at the target peer. But it requires to keep record of all peers along a query transmitting path. Using rewriting, the query is enforced with access control rules at every peer along the query transmitting path. Rewriting requires all access control rules have been distributed to peers where they are needed. 18 Chapter 3. Access Control in XML PDMS Figure 3.1: Example for General Methods of Access Control 3.1.3 Example Let us take an small example to illustrate the methods in Section 3.1.2. There is a P D M S with four peers (a, b, c\, C2); whose topology is shown in Figure 3.1. In the P D M S , a is the source peer that puts forth a query Q, and b is the target peer that answers Q. For simplicity, we assume (1) all peers have the same schemas, (2)the database on 6 is a relational database. The database on b holds one table T, in which there are only two tuples t\ and t^. Peer b defines the access control rules R\ and / J : 2 R\: only peer a and c\ have access to tuple t\. R2: only peer a and C2 have access to tuple £2The query Q is "SELECT * from T". The access control rules R\ and R2 ensure that a can access to all tuples in table T, i.e., t\ and (2- Thus, the final answer set arriving at a should be { t i , ^ } First, let us consider the encryption and non-encryption methods for answer routing. If the encryption method is used, the target peer b evaluates the incoming query Q, gets the answer set S = {ii.te}, and encrypts S as S'. Then the 5" can be routed back to o along either the path b —* c\ —» a or the path b —> C2 —» a. The encryption ensures no information leakage. When S' arrives at a, it is decrypted back to S. If the non-encryption method is used, the target peer b evaluates the incoming query Q, and gets the answer set S = {ti, *2}• Pick the path b —» C\ —> a and route t\ back via this path, because 19 Chapter 3. Access Control in XML PDMS peer c\ just has access to t\ (according to R\) but no access to t . 2 Similarly, pick the path b —» c —> a and route t back via this path, because peer c just 2 2 2 has access to t (according to R ) but no access to t\. 2 2 Secondly, let us consider the evaluating and rewriting methods. Suppose we use the non-encryption answer routing method and route the computed answer backtracking the query incoming path. Consider the path a —» c\ —> b. If the evaluating method is used, the query Q is routed along a —» c\ —> b as initially rewritten, keeping an annotation of all passing peers {a,ci}. When Q arrives at 6, b notices the annotation {a,ci} and uses the relevant access control rules R\ and R to rewrite Q into Q'. The answer is evaluated from Q' and routed 2 back along b —+ c\ —» a. If the rewriting method is used, we must make sure that R\ has been distributed to o and c\, and R has been distributed to o and 2 c . Consider the path a —» c\ —» b. The query Q is first rewritten into Q' at o 2 according to rules Ri and R , which ensures the answer of Q' can be accessed 2 by a. Next, when Q' arrives at c i , it is rewritten into Q" according to R\, which ensures the answer of Q" can be accessed by c\. Thus Q", which will finally arrive at b, ensures its answer can be accessed by both a and c\. After Q" is computed at b, the answer can be safely routed back to a via b —* c\ —+ a. 3.2 Access Control Policy (ACP) Formal Definition Having a general idea about the access control problems in a P D M S , we will take the first step into our own solution. As shown in Chapter 2, access control policy (ACP) is a flexible, expressive and fine-grained approach. Once specified, A C P s are platform-independent and can be easily transformed and distributed in a P D M S environment. Thus our work adopts A C P as the access control model for peer databases. Our whole access control scheme is based on such an A C P model. Let us propose the A C P formal definition and syntax for the X M L P D M S : Definition 3.1 (Access Control Policy ( A C P ) ) An access control policy ACP is defined in the following form: +/ — R(u,x) <— SLA(target,u),q(x). Such a policy defines that a set of peers u has read access to some target data elements x under the restrictions of service level agreement SLA (target, u) and object constraint q(x). 20 Chapter 3. Access Control in XML PDMS • +/- denotes authorize/deny access. • R denotes that it is a READ ACP. That means, the authorized peers will have the READ privilege of the target data elements. • u denotes the set of subject/user/role, which are the identifiers of users in the system and often refer to peers. • x denotes the set of target data elements. Here we define target schema • as the data schema on which the ACP has effect. Target data elements are some elements in the data schema, which are those elements that the A CP is allowing or denying access to. • target refers to the target peer, who is the owner of the target schema. • SLA(target,u) denotes a predicate that tests the role of the peers, i.e. whether peers u have a service level agreement with the target peer. If u satisfies SLA(target,u), u will have the access privilege defined by this ACP. • q(x) is the DB predicate or value constraints, which expresses the constraints on the target XML document: q(x) can be a conjunction of atoms. An atom can be a variable binding, a relational expression of equality or inequality. However, the expression of q(x) doesn't mean these constraints only have the domain of the target elements x, normally they are the constraints on all related elements. In this abstract expression, we don't treat x as the domain of q(x). • The authorize/deny access on an element x is automatically propagated to its subelements. We believe it makes sense to be consistent with the semantics of XQuery answers. • Similarly, we use W(u, x) to denote a WRITE/EDIT ACP. That means, the authorized peers will have the WRITE privilege of the target data elements. Without any specification, the strength of SLA(target, u) in an A C P may become boundless. We place a limit on what SLA(target,u) can contribute: SLA(target, u) only checks the agreement relationship between peers, i.e. which peers are authorized the privilege on target database by this A C P . In some cases, an A C P needs to match the peer's ID with an element value in the target 21 Chapter 3. Access Control in XML PDMS database. For example, assume there is an element in the database of my peer about the visitor's ID. M y peer defines an A C P that only the classmate peers have access to my database information. The A C P needs a way to compare the visitor peer's ID with the element value in my database. In our A C P syntax, we use a function compatible(x, y) to deal with the match of a peer ID and an element value in the target database. The basic A C P structure and use of the SLA(target,u) and compatible(x,y) functions are illustrated in the examples of next section. 3.3 PDMS Scenarios with A C P Examples In the previous section, we introduced the formal syntax of an A C P . In this section we illustrate it with some concrete P D M S scenarios and A C P instances. The examples show that the A C P syntax is XPath-based and XQuery-compatible It makes a good basis for our later X M L query rewriting algorithm. The first example illustrates the basic structure of the R E A D A C P . The scenario is a hospital P D M S from the HePToX project [13]. It is shown in Figure 3.2. '•Vancouver General > ^Montreal General TorOT&General; Figure 3.2: Hospital P D M S Example The Vancouver General, Montreal General and Toronto General are the three peers in this P D M S . For simplicity, we call Vancouver General, Montreal General, Toronto General separately as 'vg', 'mg' and 'tg'. Suppose there is only one X M L database on each peer and they are 'vg.xml' for 'vg', 'mg.xml' 22 Chapter 3. Access Control in XML PDMS for 'mg', 'tg.xml' for 'tg'. Suppose there are two access control requirements on the schema of 'mg': 1. peer 'vg' has R E A D access to patient's admission and process information later than Jan 1, 1990 of peer 'mg'; 2. Nobody has read access to patients' admission and progress information later than Jan 1, 2004. Then we can create the corresponding A C P s for the above requirements: Rl: +R(u,a,p) <— SLA('mg',u), doc("mg.xml")/MG/Admission a, doc("mg.xml")/MG/Progress p, a/ID = p/PatRef, p/Symptom/Date > 'Jan 1, 1999'. Where only u = 'vg' satisfying SLA('mg',u). R2: —R(u,a,p) <— SLA('mg',u), doc("mg.xml")/MG/Admission .a, doc("mg.xml")/MG/Progress p, a/ID = p/PatRef, p/Symptom/Date > 'Jan 1, 2004'. Where for every peer u there is a ('mg',u) tuple satisfying SLA. In this example, we see the basic structure of a R E A D A C P . The first A C P Rl is positive, which authorizes a peer to have the R E A D privilege on elements a and p under restrictions. The second A C P R2 is negative, which denies a peer to have the R E A D privilege on elements a and p. Assume there is an S L A database. There is only one tuple < 'mg', 'vg' > for Rl in the SLA database, but every peer u has a tuple < 'mg',u > for R2 in the database. We also see any legal arithmetic expression in XQuery, such as p/Symptom/Date > 'Jan 1, 2004', can be used in an A C P . The second example illustrates the use of the compatible ^) function. The 1 scenario is an academic conference proceeding. The target peer is named 'conf, and the X M L database on this peer is 'conf.xml'. The schema of 'conf.xmP is shown in Figure 3.3. 23 Chapter 3. Access Control in XML PDMS mm.•Name Area Figure 3.3: Conference Example Suppose we want to express the following access control requirements on this database schema: 1. Every PC member has READ access to all papers in his area of expertise. 2. No P C member has READ access to any of his own papers regardless of his area. The corresponding A C P s are listed as follows: 24 Chapter 3. Access Control in XML PDMS Rl: +R{u,p) «- SLA('conf',u), doc("conf.xml")/PC/Member doc("conf.xml")/Papers compatible(u, pm/Area Where SLA('conf = pm, /paper p, pm/Name), p/Area. ,u) defines the membership relation of any user for the conference, the function compatibleQ checks matching of a peer ID u and a P C member's name. R2: -R{u,p) <- SLA('conf',u), doc("conf.xml")/PC/Member doc("conf.xml")/Papers/paper p/Author pa = p, pa, pm/Name, compatible(u, Where SLA('conf pm, pm/Name). ,u) defines the membership relation of any user for the conference, compatibleQ checks matching of a peer ID u and a P C member's name. In this example, we see the use of the function compatibleQ. in Section 3.2, the function compatibleQ As described checks to see if a peer ID matches an element in the target database. The third example illustrates the W R I T E A C P and the negative use of the compatibleQ function. The scenario is a company management P D M S . This company has several departments. Each department server is a autonomous peer. Each department has a manager and some employees. (The manager is also an employee.) The database schema for one department peer is shown in Figure 3.4: We name the peer in the figure as'd', its X M L database as 'department.xml'. Suppose we need to express the following access control requirements on the schema o f ' d ' : 25 Chapter 3. tDepartmentlrifb Access Control in XML PDMS '.•;Empibyee IvMager.-; DNarhe -Lo cati bri ^.eptlD Figure 3.4: Company Management P D M S Example 1. Every manager has W R I T E access to any employee's full information in his department. 2. Every manager is denied W R I T E access to any employee's information in other departments. 3. Every employee has R E A D access to his own information. 4. Every employee is denied R E A D access to other's salary information. T h e n the corresponding A C P s are specified as follows: 26 Chapter 3. Access Control in XML PDMS Rl: +W(u,e)<-SLA('d',u), doc("department.xml")/Department/Employee e, doc("department.xml")/'Department/Manager m, compatible(u,m/I D), m/DeptID = e/DeptlD. Where SLA('d', u) defines the relation for membership in this company, the function compatibleQ checks matching of a peer ID u and a manager's ID, "m/DeptID = e/DeptID" shows the manager and the employee are in the same department. R2: - W ( u , e ) <- SLA('d',u), docQ'department.xml") /Department/Employee e, doc("department.xml")/Department/Manager m, m/DeptID\ = e/DeptID. Where SLA('d',u) defines the relation for membership in this company, "m/DeptID\ = ejDeptID" shows the manager and the employee are not in the same department. R3: +R(u,e) «- SLA('d',u), doc("department.xml")/Department/Employee compatible(u,e/EID). e, Where SLA('d',u) defines the relation for membership in this company, the function compatibleQ checks matching of a peer ID u and an employ's ID. R4: -R{u,s)^-SLA{ d',u), doc{"department.xml")/Department/Employee e/Salary s, N O T compatible^, e/EID). i e, Where SLA('d',u) defines the relation for membership in this company, the function compatibleQ checks matching of a peer ID u and an employ's ID. In this example, we see the positive and negative W R I T E A C P s (Rl & R2). A n d we also see that compatibleQ function can be used with " N O T " to represent mismatching (R4). 27 Chapter 3. Access Control in XML PDMS 3.4 Semantics of PDMS Query-Answering under ACPs Given that all A C P s are specified and distributed as needed, what is the answer semantics of the P D M S query-answering? More specifically, what kind of answer do we expect to get after a query is put forth at a peer in a P D M S ? In this section, we will formalize the semantics of P D M S query answering under A C P s . The access control issue is orthogonal to the issue of schema heterogeneity. Thus, to simplify the problem, we assume that all peers use the same schema. This allows us to tackle access control without adding in the complications of schema heterogeneity. We leave the addition of schema heterogeneity to the problem as future work. Besides, we do not distinguish a peer with a requester. Several requesters may put queries on a peer to the whole P D M S . We assume all queries are put forth by a same peer. This simplification will help us to see the nature of the semantics problem. Firstly let us start with the answer semantics of a P D M S without access control. Given a P D M S with n peers (pi, P2,---, Pn), each peer has a local database DB i (i = l..n). In a practical P D M S , there may be some peers who P are virtual nodes and do not have a local database. However, this case is not considered here. Suppose a query Q is put on p\. The full answer set returned at p i is the union of the answer set from every peer. For each peer pi(i = l..n), the partial answer set is Q(DB ) (i = l..n). Thus, the semantics of answer Vi returned by the P D M S is \J Q(DB ) t Pi (i = l..n). Next let us add the factor of access control. Let AV (DB ) Pl be the access Pi view for peer p i on peer p^s database, which holds for a centralized system using any access control policy model. For each peer pi (i = l..n), the answer that p i has the permission to see on DB Pi is Q(AV (DB )). Pl Thus, naively, Pi one might expect the full answer returned to pi to be \J Q(AV t PI (DB )), Pi where i = l..n. But it is not correct. Because any answer tuple needs to be routed from the answering peer pi to p\ via some other peers. It must ensure that an answer tuple can be accessed by every peer along the routing path. Consider this problem in another way: when a query is transmitted from p i to Pi, the query will pass via some other peers, and these peers will add access control constraints on the access view of pi to make the final answer set smaller than Q(AV (DB )). pl Pi Consider the following example. Figure 3.5 is a P D M S topology. Suppose 28 , Chapter 3. Access Control in XML PDMS Figure 3.5: Example for Semantics of P D M S Query Answering under A C P s p sends a query Q, and peer q returns partial result by several routing paths. The label attached to each edge denotes the policy constraint. E.g. (p, 1,2) denotes a query through this path can only be evaluated on the access view AV {DB ) n AVi(DB ) n AV {DB ), or simply as AV (DB ). p q q 2 q ipAt2) final result returned from q to p is Q(AV( i )(DB )) Pi i2 Q(AV (DB )) {Pi3t2) q U Q{AV {DB )), (Pi3) q l)p Q{AVp (DB )), where P pq from pq q vq q Thus the q U Q(AV( i -)(DB )) Pt ]2i3 q U or simply as is a path from p to q. This answer is different AV {DB ). p q Generalize the above result for a P D M S with n peers pi,...,p , where p\ n puts forth a query Q and every peer answers it. The final answer set returned to pi is UiUp . Q(AVp (DB )) PIP PiPi Pi (i = l..n), where P PlPi is a path from p i to Pi. This is the semantics of the P D M S query-answering under A C P s . 29 Chapter 4 Strategies and Options for the Query-Answering Process Chapter 3 presented the syntax of the access control policy and showed that it is expressive enough to specify fine-grained access control on peers' local database. But we have not described how to enforce A C P s , or say, how A C P s are used in the P D M S query answering process. Intuitively, a query-answering process can be clearly divided into two parts: query transmitting and answer routing. Thus we separate a query-answering algorithm into two parts: a (query transmitting) strategy and an (answer routing) option. Study on the general methods for access control (Section 3.1.2) inspires our designing strategies and options. In this chapter, we present the intuition (Section 4.1) and formal definitions of a strategy and an option (Section 4.2), then describe the basic assumptions (Section 4.3) and our designed strategies and options that make use of access control polices (Section 4.4). 4.1 Intuition The security problem arising in a P D M S concentrates on the query-answering process. Such a query-answering process is under the control of a distributed, runtime algorithm. The algorithm distributes the query or its rewritten form from the source peer to many target peers and retrieves answers from these target peers to the source peer. The example used in Section 3.1.3 is helpful for illustrating the problem. Please refer to Figure 3.1. The example setting is exactly, the same, including the topology, peers, A C P s , query, and so on. Assume a non-encryption query-answering algorithm controls the query-answering pro30 Chapter 4. Strategies and Options for the Query-Answering Process cess of the P D M S . When a query Q is put forth at peer a, the query-answering algorithm transmits Q via every path from peer a to peer b, and for each path Q is rewritten to a new query according to relevant ACPs. Let the rewritten query via path a —> c\ —• b be Q', the rewritten query via path a —> c —> b be 2 Q". Then the answer set for Q' is { i i } , and the answer set for Q" is {t }2 If the query-answering algorithm routes {ti} back to o via b —• c —» o, the informa2 tion leakage arises. Because peer c is not authorized to access t\ by any A C P . 2 Thus, to ensure nd information leakage, the query-answering algorithm must route back to a via b —> c\ —> o. Likely, the query-answering algorithm must route {t } back to a via b —» c —> a. 2 2 To study the problem, we concentrate on the basic building block: transmitting a query asked by a single source peer to a single target peer, and then routing the answer set from that target peer back to the source peer. When a query Q is posed at a source peer c, the answer set for Q from any target peer containing relevant data needs to be computed and routed to c, modulo A C P s . Thus, the overall problem is built up oh basis of the simpler problem of single source peer and single target peer. The above pair-wise idea makes clear the building block of the query-answering algorithm. Next, let us consider the query-answering process for a pair of given source peer and target peer. The process can be clearly divided into two sequential, non-overlapping parts: I. q u e r y t r a n s m i t t i n g : informally speaking, transmitting the rewritten queries of the original query from the source peer to the target peer via some paths; II. answer r o u t i n g : informally speaking, routing back the set of answer, tuples from the target peer to the source peer via some paths. From now on, we call an algorithm handling Part I as Query Transmitting Strategy; and an algorithm handling Part II as Answer Routing Option, or Option. Then a distributed query-answering algorithm is composed Strategy, or of a Strategy and an Option. We will use a (Strategy, Option) pair to denote a query-answering algorithm, simply as an (S, O) pair. The properties of an (S, O) pair are those of the corresponding query-answering algorithm. 31 Chapter 4. Strategies and Options for the Query-Answering Process 4.2 Formal Definitions for Strategy and Option In this section, we will present the formal definitions for a Strategy and an Option. First of all, we define some terminology, which will be widely used in later discussion: • A C P of x for y: x's definition of what y can have access to x's data, where x and y are both peers • associated peer c of A C P A: peer c is defined in A C P A to access some data of another peer • V: a set of peers in a graph • o: the source peer • 6: the target peer • D: a database • £>(,: the database residing on b • Q: a query • Q{D): the database to hold the answer of evaluating Q on D. • t: an (answer) tuple in some Q{D). Here the word "tuple" refers to the building block of Q(D). For example, if D is a relational database and Q is a SQL query, t is a tuple in the relation Q{D)\ if D is an X M L database and Q is an XQuery, t is an X M L subtree or a combination of variable values. • PL(D): a new database, which defines part of database D that can be accessed for all peers in the set L . If L contains just one peer c, pi{D) can be simply written as p {D) instead of C p{ y(D). c • P: a path (sometimes it also refers to the set of all peers in a path if there is no ambiguity). Note that throughout we assume that any path conforms to the given topology. • P -,b- a path from a to 6 a • P — t h e set of all paths from a to fe a 32 Chapter 4. Strategies and Options for the Query-Answering Process The formal definitions for Strategy and Option that are used to transmit queries and answers over a single source peer and target peer set are shown in Definition 4.1 and 4.2. As mentioned in the previous section, this is the building block of the general case of one source peer and many target peers. Definition 4.1 (Strategy) Given source peer a, target peer b, query Q, choose a set of paths from a to b, and V such path: transmit some rewritten query Q' tob. Definition 4.2 (Option) Given source peer a, target peer b, query Q, a set of tuples S at b, choose a set of paths ¥b->v from b to V, where V is a set of peers, and send each tuple t € S down 0 or more paths € Pb^vThese definitions are at an abstract level. Note that "choose a set of paths" refers to the fact that the strategy or option will decide which paths the query or tuples will be sent down and does not reply that the path will be chosen apriori. The combination of a strategy and an option decides the distributed, runtime features of the query-answering process in a P D M S . Because there is little complication for query evaluation (i.e. generating answer tuples at the target peer), it is regarded as a separate phase between a Strategy and an Option, and not included in either of them. 4.3 Basic Assumptions To evaluate the approach, we created a number of general strategies and options. These strategies and options cover quite a broad spectrum, so we believe that most other strategies and options are variants of them. In this section, we propose a few important assumptions, which build the basis for our strategies and options. 1. Databases residing on all peers have the same schema. Although this assumption is not true for a realistic P D M S , schema heterogeneity will not affect the essence of security problem. Query rewriting or data transformation among different schemas is orthogonal to access control. This assumption simplifies the linguistic expressions, and allows us to concentrate on security issues in the query-answering process. So in later discussion, we ignore the query rewriting only with respect to schema heterogeneity. We leave the addition of schema heterogeneity to the problem as future work. 33 Chapter 4. Strategies and Options for the Query-Answering Process 2. When a query Q is transmitted along a path P, P is noted for Q. Assume there is a trivial way to record passing peer ID's with the routed query Q. It is handy and costs little. We ignore the cost of this task in later discussion. 3. Peers don't collude with each other. In other words, peers will not viciously share information to seek unauthorized data. Given peers c\, c , 2 C3, tuple t S D , and c has got t from c\, which is authorized by ci's Cl 2 A C P s . Then c can not share t with C3 unless c have enough knowledge 2 2 (annotations, ACPs) from c\ to verify C3's right to access t. Else we call it an illegal behavior. We don't consider such illegal behaviors, even when we discuss information leakage in Section 5.1. 4. Assume a tuple t can be accessed by peer c\ and has been routed to ci.'t can be distributed from peer c\ to another peer c , only if 2 C\ has sufficient witness from i's source peer s, on whose database t is computed. That means, even c\ has the right to access t, it doesn't have the right to willfully distribute t. The concrete cases we are concerned include: (1) ci must respect t's annotation. More specifically, if t has been routed to ci together with its annotation At (the set of safe peer ID's), c\ obeys t's annotation only to share it with peers d € At- (2) ci must have all A C P s of s for c to determine if c can access t. The precondition 2 2 for this assumption is: each peer behaves legally according to the queryanswering algorithm and trusts data from other peers. More specifically, if peer c\ receives tuple t, Ci trusts any information about t received from other peers. 5. Let Ai be an A C P of the target peer b for peer c\. If A\ is required for rewriting a query Q at c\, A\ must have been distributed to ci in a safe way. According to the definition, Strategy is a runtime algorithm. Distributing A C P s to requiring peers is the preparation for a strategy. Although we don't consider how to perform this work in a strategy, it is indeed a precondition of a strategy. Later we will elaborate on this task and count in its cost in the cost model and in (Strategy, Option) pairs' costs. 6. Let A\ be an A C P of the target peer b for peer c\. Assume c\ and c are adjacent peers in the P2P network, and c attempts to 2 2 route an answer tuple t to c\. If A\ is required at c to determine 2 34 Chapter 4. Strategies and Options for the Query-Answering Process whether c\ has access to tuple t, A\ must have been distributed to C2 in a safe way. One might think it is risky to distribute the A C P A\ for peer c\ to peer C2, which will cause information leakage. But in fact it is safe on condition of Assumption 3 and Assumption 4. Under Assumption 3 and 4, even C2 knows A C P A\, there is no way for C2 to get illegal data from c\. Because ci will use A C P s of target peer b for C2 to determine whether send 6's data to C2- According to the definition, an option is a runtime algorithm. Distributing A C P s to requiring peers is just the preparation for an option. We don't consider how to conduct it in an option. But later we will elaborate on this task and count in the cost of it in our cost model and in (Strategy, Option) pairs' costs. Without special claim, Assumption 1 to 4 hold for any strategy and option. Assumption 5 and 6 hold when the "if" conditions are met. 4.4 Strategies and Options Designed In this section, we will present the strategies and options we designed. The terminology for these strategies and options is listed at the beginning of Section 4.2. The basic assumptions are listed in Section 4.3. The following are the strategies we worked out, which make use of A C P s . We use Si to 54 to denote them. 51 Proactive Rewriting: Assumption 5 holds. V path P € P -»6, S i transa mits Q along P by: at each peer c £ P, when the query, thus far Q', is transmitted to it, it transmits Q" to the next peer in P, where Q" — Q' rewritten to adhere to A C P s of b for c. 52 Lazy Rewriting - Dumb: For one path P € P ->&, 5 transmits Q via 0 2 P. When Q is transmitted at b, Q is rewritten into Q' — Q rewritten to adhere to A C P s of b for all peers in the P D M S . 53 Lazy Rewriting - Path: V path P £ P -»i>, 53 transmits Q via P. When a Q is transmitted at b, Q is rewritten into Q' = Q rewritten to adhere to A C P s of b for all peers in P. 54 Jobless: V path P € P -.&, 54 just transmits Q via P. 0 35 Chapter 4. Strategies and Options for the Query-Answering Process In order to make Si (i=1..4) work well, there are necessary obligations for peers in the P D M S : • For Si, V peer c (except the target peer b): c use all A C P s of b for c to correctly rewrite the received query Q' into the new query Q" A c route Q" to the next peer. • For 52 and 53, V peer c (except the target peer b): c forwards the received query Q to the next peer. For the target peer b: b correctly rewrites the received query using required A C P s . • For 54, V peer c (except the target peer b): c forwards the received query Q to the next peer. The following are the options we designed, which make use of A C P s . We use Oi to Oe to denote them. Oi Whole — Backtracking: Given target peer 6, database D at 6, query Q that has been transmitted at b via some path P £ P ->&, Q and P have a been chosen by some strategy. Q{D) is regared at b as the answer set. Annotate answer set Q(D) with path P. A t each peer c £ P, c routes Q(D) to the previous peer in P until c = a. O2A Whole — Subl: Given target peer b, database D at b, query Q that has been transmitted at b via some path P € P ->6 , a Q and P have been chosen by some strategy. pp(Q(D)) C Q(D) is regarded as the answer set at b. Annotate answer set with path P. Choose a path P' s.t. P' £ Pi,_» a A P ' C P , Route p (Q(D) down P', P O2B Whole - Sub2: Given target peer b, database D at b, query Q that has been transmitted at b via some path P € P ^ 6 , Q and P have been Q chosen by some strategy. Q{D) is regarded as the returned answer set at b. Annotate answer set with path P. Choose a path P' s.t. P' £ Pb-» A a P' C P. Route Q(D) down P'. O3 Whole — Target Annotating: Given target peer 6, database D at b, query Q that has been transmitted at b via some path P £ P ->6 , a Q and P have been chosen by some strategy. The returned answer set at b is A = pp(Q(D)) C Q(D). Use A C P s of b to decide the safe peer list 36 Chapter 4. Strategies and Options for the Query-Answering Process • L = {c\p (A) = A}. Annotate the answer set A with L. Choose a path c P' s.t. P' 6 P b - a A P' C L. Route A x {L} down P'. O4A Whole - Dynamic Routing: Assumption 6 holds. Given target peer b, database D at b, query Q that has been transmitted at b via some path P £ P -»6, 0 Q a n d P have been chosen by some strategy. The returned answer set at b is A = pp(Q(D)) C Q(D). V peer c that received A: c routes A to peer d if pd{A) = A, until A arrives at a. O4B Whole - Dynamic Routing: Assumption 6 holds. Given target peer b, database D at 6, query Q that has been transmitted at b via some path P. £ P -.6, 0 Q and P have been chosen by some strategy. The returned answer set at b is A = pp(Q(D)) U S, where 5 is the set of supporting elements. V peer c that received A c routes A to peer d if pd{A) = A, until A arrives at a. O5 Partition - Target: Given target peer b, database D at b, query Q at b, Q has been chosen by some strategy. Q(D) is regarded as the answer set. (1) Partition Q(D) as follows: let the partition K = K \ , K , where n n [JKi = Q{D) A Ki n Kj = 0(t ^ j) Vffi (i = l..n), use A C P s of 6 to compute its annotation L i , where Li = {c\p {Ki) = Ki) c (2) MK~i(i = l..n): if ifj x {Lj} arrives at peer c, c routes ifj x {Li} to all its neighbors d, where d £ Li. OQA Dynamic Routing: Assumption 6 holds. Given target peer b, database D at b, query Q at 6, Q has been chosen by some strategy. The returned answer set at b is Q{D). Let c be a peer who receives a subset K C Q(D). Vc V its neighbor d: c routes Pd{K) to d. Notice: all parts sent by c to its neighbors may not be disjointed. O&B Dynamic Routing: Assumption 6 holds. Given target peer b, database D at b, query Q at 6, Q has been chosen by some strategy. The returned answer set at 6 is A = Q(D)US, where S is the set of supporting elements. Let c be a peer who receives a subset K C A. Vc V its neighbor d: c routes Pd{K) to d. Notice: all parts sent by c to its neighbors may not be disjointed. 37 Chapter 4. Strategies and Options for the Query-Answering Process In OtB and OQB, there is a new concept "supporting element". S u p p o r t i n g elements refer to the elements that should be projected out in the answer tuples but are needed for later usage, especially as filters of A C P s to determine the safe peers. In OUB and O&B, without help of tuple annotations, it is necessary to keep supporting elements within the answer tuples during the answer routing process. In 05, the notation "Ki x { L i } " is mainly intended for logical correctness. It doesn't necessarily mean to attach each answer tuple t in Ki with L j . A n implementation for O5, which factors away the common repeating L ; for all the tuples in a set Ki, is entirely consistent with this notation. Likewise the explanation works for "A x {L}" in O3. In order to make O, (i=1..6) work well, there are necessary obligations for peers in the P D M S : • For O i , V peer c (except the source peer a) who receives/has an answer set Q(D): c correctly routes Q(D) to the previous peer in path P. The target peer b correctly annotates the answer set Q(D) with the path P. • For O2A and O2B, V peer ci (except the source peer a) who receives/has an answer set: ci correctly routes the answer set to a peer c € P. The 2 target peer b correctly annotates the answer set with the path P . • For O3, V peer ci (except the source peer a) who receives/has an answer set: ci correctly routes the answer set to a peer c £ L. The target peer 2 b correctly annotates the answer set with the list L. • For OAA and O4B, V peer ci (except the source peer a) who receives/has an answer set: Ci uses all related A C P s to correctly find a safe peer c for 2 the answer set, and routes the answer set to c . 2 • For O5, V peer ci (except the source peer a) who recieves/has a partition Ki\ 6\ correctly routes Ki U Li to a peer c G L j . The target peer b cor2 rectly partitions Q(D) and annotates each partition Ki with an annotation •Li. • For OQA and O&B, V peer c (except the source peer a) who recieves/has a partition K, V peer d who is adjacent to peer c: c uses all related A C P s to correctly compute and route Pd{K) to peer d. 38 Chapter 4. Strategies and Options for the Query-Answering Process The combination of any strategy and any option in this section forms a full query-answering algorithm in a P D M S . In the next chapter, we will analyze the information leakage and completeness properties of each (S, O) pair. 39 Chapter 5 Information Leakage and Completeness for (S,0) Pairs The properties of a query-answering algorithm in a P D M S are those of an (S, 0) pair. Information leakage and completeness (of the answer) are among the most important properties for an (S, O) pair. In Section 5.1, we present the definition of information leakage (IL), the sufficient and necessary condition for IL-free, then analyze the IL-free property for all (S, O) pairs; in Section 5.2, we present the definition, the sufficient and necessary condition for completeness, then analyze the completeness property for all (S, O) pairs. 5.1 Information Leakage (IL) The general sense of information leakage has been given in Section 3.1.1: some protected data are accessed by unauthorized subjects. Information leakage for a P D M S query-answering algorithm will be formally defined and studied in this section. 5.1.1 Definitions Avoiding information leakage is an important security issue in the P D M S queryanswering process. The information leakage we are concerned with concen- trates on the query-answering process. Informally speaking, during the queryanswering process under control of an (5,0) pair, if peer c happens to receive tuple t but isn't authorized access to t, information leakage arises. Because which peer to receive a tuple is determined by an (5,0) pair, we regard infor- 40 Chapter 5. Information Leakage and Completeness for (S,0) Pairs mation leakage (or no information leakage) as a runtime property of an (5,0) pair. Now let us make a formal definition for information leakage. Definition 5.1 (Information Leakage (IL)) Given source peer a, target peer b, database D at b, query Q, (5,0) has information leakage iff (i) 3 path from a to b, P a - , b is chosen by S: S will send Q' to b through P ->b a (S defines P -*b a the rewritten query Q' from Q), (ii) 3 a tuple t £ Q'(D), (iii) 3 path P b ^ from x b to peer x, P b - , x is chosen by O: O will send t down that path A 3 c on that path s.t. t £ p {D). {S,0) c is IL-free iff it has no information leakage, V a, b, Q. IL-free is defined as the negation of Information Leakage in the above definition. For clarity, we reword the IL-free definition as follows: Definition 5.2 (IL-Free) Given source peer a, target peer b, database D at b, query Q, (S, O) has no information leakage iff (i) V path P _ , 6 from a to b, a which is chosen by S: S will send Q' to b through P ->6 a query Q' from Q), (ii) V tuple t £ Q'(D), (iii) V path (S defines the rewritten Pb-+ X from b to peer x, is chosen by O: O will send t down that path A V c on that path s.t. Pb—x t £ p (D). (S,0) c is IL-Free iff it has no information leakage, V a, b, Q. The expression "t £ Pc{D)" in the above definition needs to be explained. As mentioned in Section 4.2, "p (D)" denotes the part of database D that can be c accessed by peer c. Thus, "r. £ p (Z?)" means "tuple t is computed from D and c can be accessed by peer c". From the viewpoint of schema, t and D have different schemas and they are not comparable. Nevertheless, from the viewpoint of information containment, t is contained in D. Since we are discussing IL-Free in terms of information containment, we accept the expression "t £ p (D)". c Another similar expression is Q(D) C D". Given specific schemas of Q(D) and a D, there is no reason to say Q(D) C D. But from the viewpoint of information containment, we accept Q(D) CD" as a fact. a The above discussion is based on Assumption 3 & 4 of Section 4.3. We don't discuss the cases violating these assumptions. Furthermore, we assume no caching here. As a common approach to accelerate the query-answering process, caching is useful and worth noting. Nevertheless, as a supportive approach for query-answering, caching is not in the central place. A system works smoothly without caching. So at this moment we omit caching. We leave the security problems involved in caching as future work. 41 Chapter 5. Information Leakage and Completeness for (S,0) Pairs 5.1.2 T h e Sufficient a n d N e c e s s a r y C o n d i t i o n for IL-Pree IL-free is a security property for an (S, O) pair. However, we cannot expect to use the IL-free definition ,to check if an (S, 0) pair is IL-free. First, because the IL-free definition is based on execution, you would need to run (S, O) on all possible source peers, target peers, queries, query routing paths, etc. Secondly, the IL-free definition has a tuple-based granularity, which is far different from the description of a strategy or an option. So we need a general condition that can be directly applied to the description of any (S, O) pair and check if it is IL-free. The following ideas are illuminating for finding such a condition: • By the definitions of strategy and option, the answer tuples are computed only at the target peer, and no more tuples are created in the answer routing process. The option O determines whether to send an existing tuple to a peer, but cannot create a new tuple and send it to a peer. Thus, if peer Co has an answer set T and option O will route the set T' from peer CQ to its neighbor c, it must have T' C T . • By the definition of IL-free, if the option O sends a tuple t to some peer c, c must have access to t. In other words, O will send to c only the tuples that c has access to. Based on the above observation, we propose the sufficient & necessary condition (SNC) for an arbitrary (S, O) pair to be IL-free. SNC: For an (S,0) pair, V source peer o, V target peer b, V query Q at o, V peer Co, V neighbors c of CQ: CQ has answer set T and routes set T" C T to c^T'C P c (T). Proof: 1. S N C is Sufficient. We shall show if (S,0) satisfies SNC, it also satisfies the IL-free definition. Assume (S, O) satisfies SNC. Given source peer a, target peer b, database D at b, query Q at o, rewritten query Q' of Q at 6, peer co, co's neighbor c, let the answer set at peer CQ be T and the set sent to peer c be T' C T. By SNC, we know V C p (T). c Let an arbitrary tuple t e T'. Since T' C p (T), it follows t £ p (T). B y the definition of Strategy and c c Option, we can infer T C Q'(D). Since we also have the fact Q'{D) C D in sense of information containment, it follows T C D. Applying exactly the same restriction p () to both sides of this term, we get p {T.) C p {D). c c c Since we 42 Chapter 5. Information Leakage and Completeness for (S,0) Pairs already have t £ p (T) and p {T) C p (D), it follows t £ p (D). c c c c The expression t £ p {D) holds for every peer c, via which tuple t is routed. A n d it holds V a, c 6, D, Q, Q', t £ £/(£>). B y the definition of IL-free, (S,0) satisfies the IL-free definition. 2. S N C is Necessary. We shall show if (S, O) satisfies the IL-free definition, it also satisfies SNC. Assume (S, O) is IL-free. Given source peer a, target peer b, database D at b, query Q at a, rewritten query Q' of Q at 6, let t £ Q'{D) be an arbitrary answer tuple. B y the definition of IL-free, t is routed by O down some paths. Let Pb be one of these paths. Let Co and c be arbitrary adjacent x peers on path Pf, and CQ routes t to c. By the definition of IL-free, we know x t £ p (D). Without loss of generality, t is in set T C Q'(D) at Co, and t is in c set T" C T routed from Co to c. Since we have (i) T C Q'(D) and (ii) the fact Q'{D) C O in sense of information containment, it follows T C. D. There is another fact: given T C D, p {T) is exactly T np (D). c ATQD/\teTAt£ Since we have this fact c p {D), it follows t £ p (T). c t € p (t) c holds for any c tuple t £ V. Thus, V C p ( T ) . This expression holds V a, b, Q, c , c. By the c 0 statement of SNC, ( 5 , 0 ) satisfies SNC. • 5.1.3 IL-Free Analysis for all (S,0) pairs In the previous section, we get the sufficient and necessary condition S N C for IL-free. A n ( 5 , 0 ) pair is IL-free, if and only if ( 5 , 0 ) satisfies S N C . Table 5.1 is the IL-free Result Matrix for the ( 5 , 0 ) pairs we have designed in Chapter 4. It summarizes which (5,0) pairs guarantee to be IL-free, where " Y " denotes "guarantee IL-free" and " N " denotes "may cause information leakage". Si s s s 2 3 4 Oi Y Y Y N 0A 2 Y Y Y Y 0 2 B Y Y Y N o Y Y Y Y 3 0A 4 Y Y Y Y 0 4B Y Y Y Y . o 5 Y Y Y Y o 0 Y Y Y Y 6B eA Y Y Y Y Table 5.1: IL-free Result Matrix Let us analyze the result in the matrix by using the sufficient and necessary condition SNC. We will discuss the matrix in a column-by-column order. First, consider ( S i , O i ) and (S3,0i). B y the descriptions of S i , S3, O i , the answer set is pp(Q(D)) at each routing peer, where path P £ P ->b- Let Co and a c be adjacent peers on the reversed path of P. There exists a fact: for c £ P and 43 Chapter 5. Information Leakage and Completeness for (S,0) Pairs tuple set X, pp{X) = p (p (X)). c we get {Q(D)) = PP Since (i) p (Q(D)) (p (Q(D))). Pc Therefore, p (Q(D)) P c {p {Q{D))). Pc P and (ii) by Oi, peer CQ has answer set P to c, we see that (5i,Oi) and (53, Oi) and routes exactly p (Q{D)) P C P C p {p (Q(D))) P p {Q(D)) Replacing X by Q(D) in the previous term, P P satisfy SNC. Let us consider (S2,Oi). B y S2 and Oi, the answer set is PA(Q{D)) at each routing peer, where A is the set of all peers in the P 2 P system. Let a path P G P -»6i 0 c o and c be adjacent peers on the reversed path of P. There exists a fact: for c G P and tuple set X, PA(X) — p (p (X)). c Q(D) in this term, we get p (Q(D)) = P {PA(Q{D))). A PC(PA{Q{D))). Since (i) p (Q(D)) A Therefore, p (Q(D)) C C p (p (Q(D))) c A answer set p/i(<3(L>)) and routes exactly p (Q(D)) A Replacing X by A C A and (ii) by Oi, peer c 0 has to c, we see that (52, Oi) satisfies SNC. Let us consider (S4,Oi). B y S4 and Oi, the answer set is Q(D) at each routing peer. Let a path P G P ->6> Co a and c be adjacent peers on the reversed path of P. By Oi, Co has the answer set Q(D) and routes exactly Q(D) to That is to say, (54, Oi) c. However, there is no guarantee Q(D) C p (Q(D)). C doesn't satisfy SNC. Let us consider (5j,0 .4) (i = 1,3,4). B y 5, (i = 1,3,4) and 0 A , the 2 answer set is p (Q(D)) 2 at each routing peer, where path P P G P -.6a Let co and c be adjacent peers on the returning path P'. By 02^, we know P' C P . Thus, c G P. There exists a fact: for c G P and tuple set X, p {X) P = p (p {X)). c P Since we already have c G P, replace X by Q{D) in the previous term, then get (Q{D)) PP p (Q(D)) P = (p (Q(D))). Pc Therefore, p (Q(D)) P P C ( {Q{D))). PC PP C P C ( P P ( Q ( L ) ) ) and (ii) by 02>i, peer Co has answer set ) Since (i) p (Q(D)) P to c, we see that (£1,02.4) (i = 1,3,4) satisfies and routes exactly p (Q(D)) P SNC. Let us consider (52,02,4)• By S2 and 02,4, the answer set is p (Q(D)) A at each routing peer, where A is the set of all peers in the P 2 P system. Let co and c be adjacent peers on a returning path P'. Since A is the set of all peers in the P 2 P system, it follows c G A. There exists a fact: for c G A and tuple set X, p {X) = p (p {X)). A c get {Q(D)) = (PA{Q{D))). (i) PA{Q(D)) C PC{PA{Q{D))) PA PA(Q(D)) Pc A Replacing X by Q(D) in this term, we Therefore, p {Q{D)) A C (p {Q(D))). Pc A Since and (ii) by S2 and O2A, peer Co has answer set and routes exactly PA[Q{D)) to c, we see that (52,02A) satisfies SNC. Let us consider (Si, 0 B) (i = 1,3). Analyzing the case exactly as (Si, 0 A) 2 2 44 Chapter 5. Information Leakage and Completeness for (S,0) Pairs (i = 1,3,4), we will get that (Si,0 ) 2B Let us consider (S ,0 B)2 get that (S ,0 B) 2 (i = 1,3) satisfies SNC. Analyzing the case exactly as (S ,0 A), 2 2 we will 2 satisfies SNC. 2 Let us consider (S4,02B)- B y 54 and 0 B, the answer set is Q(D) at each 2 routing peer. Let Co and c be adjacent peers on a returning path P'. B y 0 B, 2 Co has the answer set Q(D) and routes exactly Q(D) to c. However, there is no guarantee Q(D) C p (Q(D)). C Thus, ( S , 0 B ) doesn't satisfy S N C . 4 2 Let us consider (Si, 03) (i = 1,3,4). By Si (i — 1,3,4) and 0 , the an3 swer set is pp(Q(D)) at each routing peer, where path P € Let CQ P -»6a and c be adjacent peers on the returning path P'. B y O3, we know P' C L and L = {c\ (p {Q{D))) Pc = p (Q(D))}. P {c\Pc(pp(Q{D))) = p (Q(D))}, it follows p (pp(Q(D))) = PP(Q(D)). P fore, PP(Q(D)) C Since c £ P' A P' C L A L = P There- c p (pp(0(£>))). Since (i) c by O3, peer co has answer set pp(Q(D)) p (Q(D)) P C p (p (Q(£>))) c P and routes exactly pp(Q(D)) and (ii) to c, we see that ( S i , 0 ) (i = 1,3,4) satisfies SNC. 3 Let us consider (52,03). The analysis is almost the same as (Si,03) (i = 1,3,4). The only difference is: the answer set is PA(Q(D)) at each routing peer, where A is the set of all peers in the P2P system. We will get that (S2,03) satisfies SNC. Let us consider (Si, OUA) (i = 1-4) and (Si, 04B) (* = 1..4). By 04,4 or O4B, the answer set is always some A at each routing peer. Let Co and c be adjacent peers on a returning path. B y 0\A or 0 B , we know p (A) = A. Therefore, 4 4 C p (A). c c Since (i) A C p (A), (ii) by O4A or 04B, peer CQ has answer set c A and routes exactly A to c, we see that (Si, 04^) (i = 1..4) and (Si,04s) (i = 1..4) satisfy SNC. Let us consider (Si.Os) (i = 1..4). B y O5, the answer set is divided into several partitions Ki (i — l..n). Let Li be K^s annotation, Co and c be adjacent peers on a path to route Ki back. Since by O5 we know that Li = {c\p (Ki) = c Ki} A. peer c G Lj, it follows p (^i) = c Ki. Then C p (Ki). c Since (i) .K"i C p (Ki), (ii) by 05, peer CQ has answer set i f i and routes exactly Ki to c, c we see that (Si,0s) (i = 1..4) satisfies SNC. Last, let us consider (Si,06^i) (i = 1..4) and (Si,06s) (i = 1..4). Let Co and c be adjacent peers on a returning path, be the answer set that Co has, K' be the set that co routes to c. By 06,4 or 06B, we know K' = p (K). c K' C (K). Pc Thus, (Si,0 yi) (i = 1-4) and (Si,0 ) 6 6B Then (i = 1-4) satisfy S N C . 45 Chapter 5. Information Leakage and Completeness for (S,0) Pairs 5.2 Completeness The general sense of completeness has been given in Section 3.1.1. It will be formally defined and studied in this section. 5.2.1 Definitions Given the source peer o, target peer 6, and query Q at a, the answer set A arriving at peer a may be different for different (S, O) pairs. The set A is expected to be maximized, or Complete. Because the returned answer set A is largely decided by the query-answering algorithm, the completeness is expected to be a property of an (5,0) pair. To define completeness, we need to make the following assumptions: • A s the property of an (5,0) pair, completeness is independent of the query Q at source peer a. In other words, the completeness of the answer set is not query-sensitive. If an (S, O) pair has the completeness property, it ensures the completeness of the returned answer set for every query Q. • A s the property of an (S, O) pair, completeness is independent of the source peer a, the target peer b, and the database D at b. Whatever a, b or D is, the completeness means that the corresponding maximum answer set A should be retrieved from b to a. The completeness of the answer set is independent of the source peer and target peer, given an {S,0) pair. Firstly, let us define the Ideal Completeness: Definition 5.3 (Completeness I (Ideal Completeness)) Given source peer a, target peer b, database D at b, query Q at a, (S, O) is complete for (a, b, Q) iff: V tuple t 6 p (Q(D)), 3 path P from a to b A 3 path P' from b to a, S a sends Q' to b Ate Q'(D) A O sendst to a via P' At £ p (D), V peer c on P'. c (S, O) is complete iff it is complete for (a, b,Q), V a, b, Q. The Ideal Completeness definition implies "maximal completeness based on soundness". In other words, it is the completeness on the condition of no information leakage. Because the condition "£ £ Pc{D), V peer c on P'" in the above definition ensures no information leakage. However, the Ideal Completeness depends on the network topology and the target peer's A C P s for other peers. 46 Chapter 5. Information Leakage and Completeness for (S,0) Pairs a ' C b Figure 5.1: Problem in Completeness I (Ideal Completeness) Here we have an example for this dependency. It is shown in Figure 5.1: a is the source peer; b is the target peer; D is the database at b; b has an A C P saying that a can access all 6's data; b has another A C P saying that c can access none of 6's data; Q is the query put forth by a. We see that p {Q{D)) equals a Q(D). But for the restriction of the topology and A C P s , the answer set Q(D) will be blocked at c and nothing can be routed back to a. Any (S,0) cannot satisfy Completeness I, no matter how smart (S,0) is. This is not what we want. If Completeness is to be a property to distinguish (5,0) pairs, another assumption needs to be made: Completeness is not affected by the network topology and A C P distribution. Now we will define another type of Completeness, which accounts for the network topology and A C P distribution: Definition 5.4 (Completeness II) Given source peer a, target peer b, database D at b, query Q at a, let the ideal answer set L = { t | t G Q{D) A 3 path P £ Pb_ •(* £ PP{D))}; let the returned answer set L'•= { t | 3 path P 6 P - 6 0 a 3 path P' € P6_a (Q' is the written query of Q defined by.S A S transmits Q' to b via P A t G Q'{D) A O routes t to a via P')}• (S,0) is complete for (a, b, Q) iff L = L ' , V database D at b. (S, O) is complete iff (S, O) is complete for (a,b,Q), V a, b, Q. Note that in the above definition, we have an equivalent form for the ideal answer set L : L = { t | 3 path P G P b _ {t G p (Q(D)))}. a P This form of L is more concise than the original one, and will be used in our later discussion. The Completeness II definition likewise implies " maximal completeness based on soundness". Because it requires not only L C L ' but also L ' C L . What differs Completeness II from Completeness I is that the former is independent of the network topology and A C P distribution. Therefore, Completeness II can be regarded as a property of an (5, O) pair. By the Completeness II definition, The ideal answer set L is independent of any (5, O) pair. Thus, a fact can be inferred from the definition of Completeness 47 Chapter 5. Information Leakage and Completeness for (S,0) Pairs II: Given the PDMS topology, the query Q, the source peer a, the target peerb and its database D. Assume (Si,Oi) and (Sj,Oj) both satisfy Completeness II. Let L{ be the returned'answer set of (Si,Oi), and Lj be the returned answer set of (Sj,Oj), where the returned answer set is defined as L' in the Completeness II definition. Then we have Li = Lj. The proof for this fact is trivial. Let L be the ideal answer set as defined in Completeness II. By the definition of Completeness II, we know L* = L and Lj = L . Therefore, we have Li = Lj. Because Completeness II allows us to account for the conditions imposed by the topology, throughout this thesis we consider Completeness II when we consider completeness. 5.2.2 The Sufficient and Necessary Condition for Completeness We cannot expect to use the definition of Completeness II (as defined in the previous section) to check if an (S, O) pair satisfies Completeness II. First, L ' in the Completeness II definition has an execution manner, you have to run (S, O) on all possible source peers, target peers, queries, query routing paths, etc. Secondly, both L and L ' have the tuple-based granularity, which is far away from the description of a specific strategy or a specific option. So there is a need for a condition that can be easily applied to the description of any (S, O) pair and check if it satisfies Completeness II. Since (i) the ideal answer set L is independent of any (S, O), (ii) L has an equivalent form as we pointed out after the Completeness II definition, we get the following sufficient and necessary condition for checking if an (S, O) pair satisfies Completeness II. SNC: For an (S, O) pair, V source peer a, V target peer 6 and its database D, V query Q at a: the returned answer set V at a is U PP(Q( ))D P€P -. a b Proof: 48 Chapter 5. Information Leakage and Completeness for (S,0) Pairs Before proving that S N C is a sufficient and necessary condition, we need some preparation work. (i) The fact is: if there is a path P € Pb_ai there must be a path P' G P ->b> 0 where P' is the reversed sequence of P. B y this fact, if P and P' are treated as sets of peers, we have P = P'. Therefore, for any Q(D) and P, pp(Q(D)) — PP'(Q(D)). (ii) By the definitions of set and set union, we have (J p (Q(D)) = {t\3pathPeP ^ (tepp(Q(D)))}. P a b P6P„-b Since we have (i) and (ii), it follows that |J p (Q(D)) = {t\3pathPen-+a(t£pp(Q(D)))}. P Next, let us prove that S N C is a sufficient and necessary condition for Completeness II. 1. S N C is Sufficient. We need to show if (S, O) satisfies SNC, it also satisfies Completeness II. Assume (5, O) satisfies SNC. Given source peer a, target peer b, database D at b, query Q at a, let L' be the answer set at a that (S,0) returns. B y SNC, we know £'= U M W ) . P€P„„6 Since we also have (J p (Q(D)) = {t\3pathP P G P „ _ ( i G pp(Q(D)))}, a P€P -b a it follows V = {t\3pathP G P t _ ( * € p (Q{D)))}. a P pleteness II, the ideal answer set L = {t\3pathP Since V = {t\3pathP G P i _ ( t 6 PP{Q{D)))} a By the definition of ComG Pb-.o(< £ PP(Q(D)))}- and L = {t\3pathP G P - ( * € 6 a p {Q{D)))}, it follows L = L'. The term L = L' holds V a, 6, Q. B y the P definition of Completeness II, we see that (5,0) satisfies Completeness II. 2. S N C is Necessary. We need to show if (S,0) satisfies Completeness II, it also satisfies SNC. Assume (S, O) satisfies Completeness II. Given source peer a, target peer 6, database D at b, query Q at a, let L be the ideal answer set, and V be the answer set at a that ( 5 , 0 ) returns. B y the definition of 49 Chapter 5. Information Leakage and Completeness for (S,0) Pairs Completeness II, L = {t\3pathP £ P - ( t £ p {Q{D)))} 6 a we have L' = {t\3pathP £ Pb~. (t £ p(Q(D)))}. a (J and L = V. Thus, P Since we also have P pp(Q(D)) = {t\3pathP£V ^ (t£p (Q(D)))}, b a P PePo-6 it follows P€P -b a This equation for V holds V a, 6, £>, Q. By the description of SNC, we see that (5,O) satisfies SNC. • 5.2.3 Completeness Analysis for all (S,0) pairs In the previous section, we get the sufficient and necessary condition S N C for Completeness II. A n (5, O) pair satisfies Completeness I I , if and only if (S,0) satisfies S N C . Table 5.2 is the Completeness II Result Matrix for all (5, O) pairs we designed in Chapter 4. It summarizes which (5, O) pairs have the completeness II property, where " Y " refers to "Completeness II holds" and "N" refers to "Completeness II doesn't hold". If an (5,0) pair isn't IL-free, we skip it and fill in Si 5 5 2 3 SA Ox Y N Y - 0 Y N Y Y 2A 0 2 B Y N Y -' o Y N Y Y 3 o 4A N • N N N o 0 H N N N AB Y N Y Y 5 o 6A N N N N o 6B N N N N Table 5.2: Completeness II Result Matrix Let us analyze the result in the matrix by using the sufficient and necessary condition SNC. We will discuss the matrix in a column-by-column order, except Si. • First of all, let us consider (5 ,Oj) (i = 1..6). B y the descriptions of 5 2 and Oi (i = 1..6), the answer set computed at the target peer b is 2 p (Q{D)), A where A is the set of all peers in the P2P system. Since answer tuples are only computed at b and no more tuples are created in the answer routing process, it follows that the returned answer set at a is V C p (Q(D)). A Since A is the set 50 Chapter 5. Information Leakage and Completeness for (S,0) Pairs of all peers in the P D M S , normally we have , PA(Q(D))C Since V C p (Q(D)) (J PP(Q{D)). and A \J PA(Q(D))C PP{Q(D)), P€P -b 0 it follows L'C [j (Q(D)). PP PePo-b Therefore, ( S , O i ) (i = 1..6) doesn't satisfy SNC. 2 and ( S ^ G Y ) . By the descriptions of S i , S3 and Let us consider (S\,Oi) Oi, V path P e P _ 6 : the answer set at b is p (Q(D)). a P P 6 P _ b the answer set pp(Q(D)) Then by O i , V path is routed at a via the reversed path of P. : a Thus, the returned answer set L' at a is U PP(Q( ))D P€P„_6 Therefore, ( S i , O i ) and ( S , O i ) satisfy SNC. 3 Let us consider (SI,02A) (i — 1, 3,4). By the descriptions of Si (i = 1,3,4) and O2A, V path P e P — t h e answer set at b is pp(Q(D)). a V path P G P - » 6 : the answer set pp(Q(D)) Then by O2A, is routed at a via some path P', 0 where P' € Pb—o A P ' C P . Thus, the returned answer set L' at a is U PP(Q( ))D Therefore, (Si, 0 >i) (i = 1,3,4) satisfy SNC. 2 Let us consider ( S i , 0 B ) (i = 1,3). Analyzing the case exactly as (Si,0 >i). 2 2 (i = 1,3,4), we will get that ( S i , 0 f l ) (* = 1,3) satisfies SNC. 2 Let us consider (Si,C>3) (i = 1,3,4). B y the descriptions of Si (i = 1,3,4) and O3, V path P € P _ b : the answer set at b is pp(Q(D)), a annotated with a safe peer list L. Then by O3, V path P € P -,b: the answer set pp(Q(D)) Q is routed at o via some path P', where P' £ Pb-»a A P' C L. (Hint: In the worst case, P' could be the reversed path of P, because P C L. ) Thus, the returned 51 Chapter Information 5. Leakage and Completeness for (S,0) Pairs answer set L' at a is U PP(Q(°))- P€P -.b 0 Therefore, (S , 0 ) t (i = 1,3,4) satisfy SNC. 3 Let us consider (Si,OiA) (i = 1,3,4). By the descriptions of Si (i = 1,3,4) and 0 4 A , V path P € P - * 6 : the answer set at b is p (Q(D)). A P e P —b'- the answer set pp(Q(D)) a PP(Q(D)). By 0 ^ , V path 4 P is routed to peer d if d(pp(Q(D))) = P However, without the help of tuple annotations and supporting elements, the answer set pp(Q(D)) might be blocked during the routing process. Thus, the returned answer set at a might be L'c Therefore, (S O ) u iA |J PP(Q(D)). (i = 1,3,4) doesn't satisfy S N C . Let us consider ( 5 J , 0 4 B ) (i = 1,3,4). The analysis is similar with that of ( 5 J , 0 4 A ) (* = 1,3,4). The only difference is as follows. By C>4B, V path P e P _,b: the answer set pp(Q(D)) 0 U S might be blocked at some peer c, where S per se cannot be safely accessed by c. Therefore, (Si, O 4 B ) (i = 1,3,4) doesn't satisfy SNC. . Let us consider (Si,05) (i = 1,3). By the descriptions of Si (i = 1,3) and 0 , V path P £ P -.&: the answer set at b is p (Q(D)). 5 Q P P G PQ-,6: the answer set pp(Q(D)) Then by 0 , V path 5 is partitioned and routed back at a via some paths. (Hint: In the worst case, pp(Q(D)) cannot be partitioned and is routed back to a via the reversed path of P.) Thus, the returned answer set L' at a is U PP(Q( ))D Therefore, (Sj.Os) (i = 1,3) satisfy S N C . Let us consider ( S 4 , 0 5 ) . By the descriptions of 5 4 and 0 , V path P € P -.&: 5 A the answer set at b is Q(D). By 0 , Q(D) is partitioned and routed. Notice that 5 only the subset pp(Q(D)) is possible to be routed back to o, where P e P —6A Thus, the returned answer set V at a is U PP(Q(D)). P 6 P a ^ b Therefore, (S ,0 ) 4 5 satisfies SNC. 52 Chapter 5. Information Leakage and Completeness for (S,0) Pairs Let us consider (Si, 06.4) (i = 1,3). By the descriptions of 5, (i — 1,3) and C-6A, V path P € P ->6: the answer set at b is pp(Q(D)). B y OQA, V path A P £ P -t& 0 : P e e r c, who has the subset K C p(Q(D)), P routes d{K) to peer d. P However, without the help of tuple annotations and supporting elements, some answer tuples in pp(Q(D)) might be blocked/lost during the routing process. Thus, the returned answer set at a might be L'c Therefore, (Si,0 ) 6A U PP(Q(D)). (i = 1,3) doesn't satisfy SNC. Let us consider (S4,OQA)- By the descriptions of S4 and 0§ , V path P £ A P _ i , : the answer set at 6 is Q(D). However, by OQA, the problem of no tuple a annotations and no supporting elements still exists. Some answer tuples might be blocked during the routing process. Therefore, (S4, OQA) doesn't satisfy SNC. Let us consider (5I,C>6B) (* = 1,3,4). It is similar to the case of (Si, 06.4) (i — 1,3,4). Even with help of supporting elements, 06B might block/lost some answer tuples during the routing process, if the supporting elements per se cannot be safely accessed by a routing peer. Therefore, (Si,06B) (i = 1,3,4) doesn't satisfy S N C . 53 Chapter 6 Cost Analysis for (S,0) pairs Thus far, we presented the designed strategies and options in Chapter 4, and the IL-free and Completeness property analysis for all (S, O) pairs in Chapter 5. But the cost related to each (S, O) pair has not been studied. In this chapter, we build the cost model for an (S,0) pair (Section 6.1), conduct the cost analysis for all (S,0) pairs (Section 6.2), and compare the analysis to hypothesize which (S, O) pairs perform the best under which conditions (Section 6.3). 6.1 The Cost Model A cost model estimates the cost of the query-answering process in a P D M S , in control of an (S,0) pair. The following are the assumptions for the cost model: • We only consider the cost for given one source peer and one target peer. The cost estimation can be extended to a general case with one source peer and multiple target peers. • We assume that databases residing on all peers have the same schema. As mentioned in Section 3.4, the access control issue is or- thogonal to the issue of schema heterogeneity. Tackling query-answering among different peer schemas is the main task of previous research work in PDMSs (Section 2.1), which is beyond the scope of the thesis. • In the cost model, we do not count in the cost of Answer Generating, which is the cost for the target peer to compute answer tuples for a query upon its local database. The reason of not including the cost lies in that (1) this, cost is mandatory for every (S, O) pair, and the costs 54 Chapter 6. Cost Analysis for (S,0) pairs are similar and nondistinctive, (2) answer generating is a local task, whose time cost is fairly low comparing to those of the networking transaction. Under the above assumptions, there are several major tasks in the query answering process of a PDMS: • Query Transmitting: transmit a query Q from the source peer a to the target peer b. • Query Rewriting: rewrite a query Q using ACPs when it is transmitted from the source peer a to the target peer b. • ACP Evaluation: use ACPs to determine if a peer has access to certain answer tuples. • Answer Routing: ship answer tuples back to the source peer a. • ACP Distributing: distribute ACPs from the target peer b to other appropriate peers. • Annotating: associate every partition of the answer tuples with a specific annotation. • Annotation Shipping: ship annotations together with the "pure" answer tuples back to the source peer a. Now let us identify the primitive operations and the corresponding cost unit for each task. In order to find reasonable primitive operations and cost unit for each major task, the following approximation assumptions need to be made. Assumption 1: The numbers of. constraints in different ACPs do not differ too much. Assumption 2: Using each constraint for a query rewriting will cost approximately constant time. Assumption 3: Using each constraint as a filter to determine whether a peer has access to an answer tuple will cost approximately constant time. Assumption 4: In a PDMS, the time to transmit a message between any two adjacent peers is approximately constant. Thus, it takes approximately equal time for a small size,message to be transmitted between any two 55 Chapter 6. Cost Analysis for (S,0) pairs adjacent peers. Of course, networking costs do differ, but they are small enough that the differences are dominated by the other factors. Assumption 5: The sizes of different answer tuples do not differ too much. Assumption 6: The sizes of different A C P s do not differ too much. Assumption 7: The sizes of different annotations do not differ too much. Assumption 8: A n annotation is in the form of a set of peer ID's. Insert- ing/deleting an peer ID into/from an annotation costs approximately constant time. Assumption 9: The sizes of a query and its rewritten forms do not differ too much. Assumption 10: A n annotation is directly associated and shipped with a set of tuples. It requires no supportive structure. ; Each of the following cost is the time cost for a major task in the query answering process of a P D M S . The Query Transmitting Cost refers to the cost of transmitting a query Q from the source peer a to the target peer b. According to Assumption 4 and 9, it can be inferred that the cost of shipping a query down one network link is approximately an'constant time. "Shipping a query down one network link" is then the primitive operation. We identify the cost unit as "query-hop", which is the charge associated with the primitive operation. Thus the Query Transmitting Cost can be measured in terms of query-hops. The Query Rewriting Cost refers to the cost of rewriting a query Q using A C P s when it is transmitted from the source peer o to the target peer b in the framework of an (S, O) pair. According to Assumption 1 and 2, it can be inferred that rewriting query Q using A C P A\ will cost approximately equal time to that of rewriting query Q using A C P A?, no-matter what Q is, or what A\ and A are. So the primitive operation for query rewriting can be regarded 2 as "rewriting a query using one A C P " . We identify the cost unit as "qrewriteacp", which is the charge associated with the previous primitive operation. Thus the Query Rewriting Cost can be measured in terms of qrewrite-acps. E.g. rewriting 1 query using 100 A C P s costs 100 qrewite-acps, which has the same cost of rewriting 2 queries using 50 A C P s each. 56 Chapter 6. Cost Analysis for (S,0) pairs The A C P Evaluation refers to using A C P s to determine if a peer is authorized to access certain answer tuples. It happens in two different places: (1) when certain answer tuples are to be routed from peer c\ to peer c , related 2 A C P s are used at ci to determine if c is authorized to access these answer 2 tuples, (2) A C P s are used as filters at a peer, usually the target peer, to decide the set of peers that is authorized to access each answer tuple. According to Assumption 1 and 3, it can be inferred that evaluating an A C P over a tuple costs approximately constant time, no matter how the answer tuple looks like. So the primitive operation for A C P evaluation is "evaluating an A C P over an answer tuple to decide the related safe peers". The cost unit is identified as "acp-eval", which is the charge associated with the primitive operation. It's measured on a per tuple basis, e.g. if we evaluated 1 A C P over 1000 tuples versus 10 A C P s over 100 tuples each, both cases incur the same cost: 1000 acp-evals. Thus the A C P Evaluation Cost can be measured in terms of acp-evals. The Answer Routing Cost refers to the total cost of shipping answer tuples back to the source peer. According to Assumption 4 and 5, it can be inferred that the cost of shipping one answer tuple down one network link is approximately an constant time. "Shipping one answer tuple down one network link" is then the primitive operation. We identify the cost unit as "tuple-hop", which is the charge associated with the previous primitive operation. Thus the Answer Routing Cost can be measured in terms of tuple-hops. E.g. if 100 tuples are sent down a path of 10 links, the cost is 1000 tuple-hops, which is also the same charge if 1 tuple is sent down a path of length 1000. Note that we are considering the amount of work in the network. Actually it is faster to send 100 tuples down a path of 10 links than to send 1 tuple down a path of 1000 links. The former does the primitive operations in a concurrent way. In our cost model, we simply sum up all primitive operations as if they are done sequentially. Likewise, the assumption applies to, the Query Transmitting Cost and the A C P Distributing Cost. For certain (S, O) pairs, A C P s need to be distributed from the target peer to other peers. The A C P Distributing Cost refers to such kind of distributing cost. According to Assumption 4 and 6, it can be inferred that the cost of shipping one A C P down one network link could be treated as an approximately constant time. Thus "shipping one A C P down one network link" is the primitive operation for A C P Distributing Cost. We identify the cost unit as "acp-hop", which is the charge associated with the previous primitive operation. So the A C P Distributing Cost can be measured in terms of acp-hops. E.g. if 100 57 Chapter 6. Cost Analysis for (S,0) pairs A C P s are sent down a path of 10 links, the cost we charge is 1000 acp-hops, which is the same cost if 1 A C P s are sent down a path of length 1000. The Annotating Cost is the cost of associating every partition of the answer tuples with a specific annotation. As mentioned in Assumption 8, an annotation is in the form of a set of peer ID's. Then the task of annotating refers to the operations of inserting/deleting peer ID's into/from annotations. The primitive operation then can be treated as "insert/delete a peer ID into/from an annotation". Such an primitive operation is the atomic step for any annotating algorithm. It works for both tuple-based and partition-based algorithms, i.e. an algorithm annotating one tuple at a time or an algorithm annotating a set of tuples at a time. According to Assumption 8, the cost unit is identified as "annot-update", which is the charge associated with the aforementioned primitive operation. So the Annotating Cost can be measured in terms of annotupdates. E.g. if we insert 4 peer ID's into an annotation, then delete 2 peer ID's from another annotation, the cost we'd charge is 6 annot-updates. Generally speaking, in a specific annotating algorithm, the task of annotating is often interleaved with the following tasks: • answer generating, i.e. computing answer tuples for a query • A C P evaluation, i.e. using A C P s as filters to decide the peers that have access to each answer tuple in the answer set We ignore the cost of answer generating, as mentioned at the beginning of this section. For A C P evaluation, it is included in the A C P Evaluation Cost. In some query-answering algorithms, the annotations are routed together with the answer tuples. This will increase the workload for the whole network. This cost is called the Annotation Shipping Cost. According to Assumption 4, 7 and 10, it can be inferred that the cost of shipping one annotation down one network link can be treated as an approximately constant time. Then "shipping one annotation down one network link" is the primitive operation. The cost unit is identified as "annot-hop", which is the charge associated with the primitive operation. Thus the Annotation Shipping Cost can be measured in terms of annot-hops. E.g. if 10 annotations are sent down a path of 4 links, the cost charged is 40 annot-hops, which has the same cost if 2 annotations are sent down a path of length 20. Now we have identified the cost unit for each major task in the cost model. Although these cost units are different from each other, we can certainly find 58 Chapter 6. Cost Analysis for (S,0) pairs the relationship for some of them. For instance, "tuple-hop", "acp-hop" and "annot-hop" are quite similar. The only difference is the size of "cell" to be shipped. Coefficients can be assigned to illuminate the relationship: • • 1 tuple-hop = C\ • acp-hop • 1 acp-hop = C 2 • annot-hop where c\ and c are application-specific coefficients. With these relationships, 2 it is possible to sum up the costs of Answer Routing, A C P Distributing and Annotation Shipping, which helps us to calculate the best/fastest (S,0) pairs. 6.2 Cost Analysis Result The cost model in Section 6.1 can be used to assess an (S, O) pair. In this section, we analyze and compare the costs for every (5,0) pair we already designed. Each cost presented in this section is for O N E query, O N E source peer, and O N E target peer. Because IL-free and Completeness are necessary properties for an (S, O) pair, only (S, O) pairs that are both IL-free and complete are analyzed in this section. In the results of this section, we sum up the cost of a major task for each (S,0) pair, as if the primitive operations in this task are done sequentially. But in a real P M D S , we can pipeline the primitive operations, then a task takes less time. For clarity, we define the following terminology, which is used in later discussion. Given the P D M S topology, all A C P s , the source peer a, the target peer b: • The number of all paths from a to b is | P _ ; , | ; a • Let No be the number of all peers in the P D M S ; • Let path P, € P —& (i = 1, ••, | P — N i a a be the number of peers (except b) in Pi, or the length of Pi; • When considering only one routing path in P _>b 0 (e.g. in S ) , we will 2 use the simplified symbols: P denotes the path, N denotes the number of peers(except b) in P or the length of P; • If Cj is a peer, let X , be the number of the target peer b's A C P s for Cj, dj be the shortest path from the target peer b to cy, 59 Chapter 6. Cost Analysis for (S,0) pairs • If Pi is a path, let X Pi be the number of the target peer fc's A C P s , each of which is for at least one peer in Pi\ • Let Y be the total number of the target peer b's ACPs; • If query Q' is transmitted at the target peer b via path Pi G P ->6, a let Ti be the number of returned answer tuples for Q' at b. 6.2.1 Query Transmitting Cost For Query Transmitting Cost, "Shipping a query down one network link" is the primitive operation, and the cost unit is identified as "query-hop", which is the charge associated with the primitive operation. Table 6.1 is the matrix summarizing the Query Transmitting Cost for every (5,0) pair. We do not.assess the (S,0) pairs, who are either not IL-free or incomplete. Ei^ (5 ,O )(i = l,2A,2B,3,5) (S ,Oi)(i = 1..6) 1 i - 2 ( 5 O i ) ( i = 1,2,4,25,3,5) 1 n 3) {S Oi){i = 2A,3,5) it Table 6.1: Query Transmitting Cost (unit: query-hop) Getting the result in the matrix is not hard: by S\ or S3 or S4, the query Q is transmitted along every path Pi € 6.2.2 P ->6 a (i = 1. ••, |Pa->6|)- Query Rewriting Cost For Query Rewriting Cost, "rewriting a query using one A C P " is the primitive operation, and the cost unit is identified as "qrewrite-acp", which is the charge associated with the primitive operation. Table 6.2 is the matrix summarizing the Query Rewriting Cost for every (S, O) pair. We do not assess the (S, O) pairs, who are either not IL-free or incomplete. As a fact, Query Rewriting Cost is related only to strategies, not to options. By Si, the query Q is transmitted along every path € P ->6 0 (i = 1> ••, |Pa-»b|)i Q is rewritten at every peer CJ [j = l-.Ni) on path P , "using Xj ACPs. Thus, the query rewriting cost for (Si,Oi) is E'LT 6 ' t E J ^ I Xj qrewrite-acps. 60 Chapter 6. Cost Analysis for (S,0) pairs (S Oi){i = l,2A,2B,3,5) (S ,Oi)(i = 1..6) (S ,Oi)(i = l,2A,2B,3,5) (S ,Oi)(i = 2A,3,5) 2^i=l u 2^7 = 1 ^3 - 2 2si=\ 3 ^3 l~ii=\ 0 4 Table 6.2: Query Rewriting Cost, (unit: qrewrite-acp) The case of S3 is similar to that of S i . The only difference is that S3 rewrites the query Q when it is transmitted at the target peer b. Thus, the query rewriting cost for (S3, Oi) is also EiLV ' Ej2i ^3 6 qrewrite-acps. By S4, the query Q is never rewritten. Thus, the query rewriting cost for (S ,Oi) is 0 qrewrite-acp. 4 6.2.3 A C P Evaluation Cost "> For A C P Evaluation Cost, "evaluating an A C P over an answer tuple to decide the related safe peers" is the primitive operation, and the cost unit is identified as "acp-eval", which is the charge associated with the primitive operation. Table 6.3 is the matrix summarizing the A C P Evaluation Cost for every (S, O) pair. We do not assess the (S, O) pairs, who are either not IL-free or incomplete. We will go though the matrix entries column by column. o Oi Si 0 s - 2 S3 0 S4 - 0 0 - 0 Ei=r m •Y) bl 2 S3 5 E ! r (Ti-Y) - bl - - - Ei=r m-n EJ!r (Ti-Y) bl bl 4 - -X )) Pi - - 0 6 s T,tr (Ti-(Y b] - 0A Si OiA 2B 2A 0 = - - T,f r"KTi-(Y-x )) Y&r {Ti-(Y-X )) Pi - - bl - ' Pi 0 6B - - Table 6.3: A C P Evaluation Cost (unit: acp-eval) Let us consider (Si, O i ) (i = 1,3). By O u for any path P £ P _ , the A 6 answer set at b is routed via the reversed path of P. Thus, there is no A C P evaluation cost for (Si,Oi) (i = 1..3). Let us consider (Si,0 >i) (i = 1,3). By S (i = 1,3) and 0 /i,'we know that 2 { 2 for any path P e P -*b, the query at b can be directly evaluated and the answer a 61 Chapter 6. Cost Analysis for (S,0) pairs set is routed via some path P', where P' G Pb—a A P'. C P. Thus, there is no A C P evaluation cost for (Si,C>2A) (i = 1,3). Let us consider (SA,02A)- B y 54, for any path P G P -.b, Q(D) computed a from Q is not exactly p(Q(D)), which is the answer set O2A will route back. P Thus, O2A needs to evaluate every A C P over every tuple in Q(D) to decide the returned answer set. Since we know (i) for every path P the number G P -»6, a of tuples in Q(D) is Tj and (ii) the total number of the target peer's A C P s is Y, it follows the A C P evaluation cost is E Let us consider (5,,C>2B) ' L T " ' ^ ' " Y ) - (i = 1,3). B y O2B, for any path P G P ->t, 0 the answer set at b is directly computed by the query Q at b and routed via some path P', where P' G Pf,_ A P' C P. Thus, there is no A C P evaluation cost 0 for (5i,0 ) (t = 1,3). 2B Let us consider (Si, O3) (i = 1,3,4). By O3, for every P we know.(l) G P ->6, a a query Q is transmitted at the target peer b, and the number of answer tuples is Ti, (2) for each answer tuple t, every A C P of b needs to be evaluated over t, except the X A C P s for the peers in Pi (They have been evaluated over i). Pi Thus, the A C P evaluating cost for (Si, 0 ) (i = 1,3,4) is £ i = f ( i bl r - 3 (Y-X )). Pi Let us consider (5,, O5) (i = 1,3,4). By O5, we know that for every incoming path, for every tuple t in the answer set, each A C P of the target peer needs to be evaluated over t. Thus, the A C P evaluation cost is Ei=V' 'C^ ' w h e r e Ti > is the number of answer tuples at the target peer and Y is the total number of the target peer's A C P s . However, this is the theoretical cost. In practice, the implementation of O5 may annotate (partition) the answer set before the answer tuples are computed,, and the A C P evaluation cost for (Si, O5) (i = 1,3,4) will decrease dramatically. 6.2.4 Answer R o u t i n g Cost For Answer Routing Cost, "shipping one answer tuple down one network link" is the primitive operation, and the cost unit is identified as "tuple-hop", which is the charge associated with the primitive operation. Table 6.4 is the matrix summarizing the Answer Routing Cost for every (5,0) pair. We do not assess the (5,0) pairs, who are either not IL-free or incomplete. We will go through the matrix entries column by column. Let us consider (Si,0\) (i = 1,3). B y 0\ and 5j (i = 1,3), for any path Pi € P -.6, 0 the answer set is returned via the reversed path of Pj. Since the number of tuples in the answer set is Tj and the length of Pi is Ni, it follows 62 Chapter 6. Cost Analysis for (S,0) pairs Oi 0 0A 2B 2 Si T}!=r KT -n ),n < N Ei=r (r*-n ),n <iV b s 2 - S3 El=r w-M) SA - i - bl 2 S3 i s 2 S3 — - i Si Si - E!:=T m-mi) 'Ejlr'm-mi) Eiir (ri-m ) 6l SA i i i i - fcl bl s i i E!=r (Ti-ni),n <N i o OiA,iB SI i fcl 5 « Ei'V'Wi-(<*)"]/(!-cfc)-Ti} «Ei=r {Mi-(cfc) ]/(i-'*)-ri} 6l n Table 6.4: Answer Routing Cost (unit: tuple-hop) that the answer routing cost for (Sj.Oi) (i = 1,3) is E i = f '( i ' ^ 0 b Let us consider (5j,C>2.4) (i = 1,3,4) and (S 0 B) it 2 T (i = 1,3). By 0 , 2A 0B 2 and Si, we know for any path Pi € P - > 6 , the answer set is returned via P[ C Pj. a Since the number of tuples in the answer set is Tj and the length of Pi is Ni, it follows that the answer routing cost for (Si,0 A) 2 (i = 1,3,4) and (Si,0 B) 2 (i = 1,3) should be Ei=r (^i' m), m < bl Let us consider (Si, O3) (i = 1,3,4). By Si and O3, for every path Pi € P ->& a (length of Pi is iVj), the answer set is annotated with annotation Li at the target peer and returned via path P[ C Li. Let be the length of path P-. Since the number of tuples in the answer set is Ti, it follows that the answer routing cost for (Si,0 ) 3 (i = 1,3,4) is Ei=f "'(^i • rm). Normally, rm < N . ( Before turning to (Si,Os) (i = 1,3,4). (i = 1,3,4), Let us study (Si,Oi ) Although (SUOAA) and (Si,OiB) A and (Si,0 ) 4B (i = 1,3,4) cause either in- formation leakage or incompleteness, the analysis for (Si,OiA) and (Si,OiB) (i = 1,3,4) helps to find the cost of (Si,0*,) (i = 1,3,4). B y OiA and O i B , for any path Pj € P - > 6 , for every peer c i that received the answer set Af. Ai a is routed from c i to c if p ( 4 j ) = Ai. Let k be the average fanout of a peer 2 C2 (average number of neighbors of a peer), c be a coefficient between 0 and 1 (the average reduction factor of a peer's "safe" neighbor number over all its neighbor number), n is the average length of an answer routing path. It is not hard to 63 Chapter 6. Cost Analysis for (S,0) pairs see the number of peers receiving Ai is the sum of a geometric sequence: cfc , c k ..c k .The 2 2 3 n k, sum is fc[l - (cfc) ]/(l - ck). Since the number of tuples n+1 n r in the answer set Ai is T*, it follows the answer routing cost is approximately ElL°r'{*[i4 (ckry(i-ck)-Ti}. Let us consider (Si,0s) (i = I.A). By O5, the answer set is partitioned and routed back to the source peer. In a global view, the answer routing cost in this case, in terms of "tuple-hops", has no difference from the case of routing the answer set as a whole in ( S i , 0 i ) and (Si,C?4B) (i = 1,3,4). Thus, we'd 4/ like to adopt the approximate costs for (Si, O^A) and (Si, 0 4 B ) (* = 1,3,4), i.e., « Ef=r bl 6.2.5 (Mi -(ckry(i-ck)-Ti}. A C P Distributing Cost For A C P Distributing Cost, "shipping one A C P down one network link" is the primitive operation, and the cost unit is identified as "acp-hop", which is the charge associated with the primitive operation. Table 6.5 is the matrix summarizing the A C P Distributing Cost for every (S, O) pair. We do not assess the (S, O) pairs, who are either not IL-free or incomplete. Oi Si s 2 2 - 0 0 2 S3 5 4 - - - 0 0 0 OeA,6B 04.4,4B Si s 03 E-MXi-di) - 4 0 B 2A 0 S3 s 0 £f=°i {Xi-di) - - -• - 0 0 Table 6.5: A C P Distributing Cost (unit: acp-hop) An A C P of the target peer b for peer c is distributed only to c if it is needed, not to any other peer. Only such A C P distribution cost is considered. On the other side, if peer c's neighbor c has the requirement to possess A C P s for c, c 2 can share.ACPs with c with little cost. We don't count in this part of cost. 2 The fact is: any (S, O) pair requiring A C P distribution will have the same A C P distributing cost, i.e., distributing A C P s from the target peer to the relevant peers. More specifically, for (S, 0 ) requiring A C P distribution, the A C P 64 Chapter 6. Cost Analysis for (S,0) pairs distributing cost is J2i=i(^i ' ^i), where iVo is the number of all peers int the P2P system, Xi is the number of A C P s of the target peer b for peer c di is the it distance from b to c*. By the descriptions of strategies and options, we know that only S\ requires A C P distribution. Thus, the corresponding matrix entries for Si are Ei=i( i x 6.2.6 • i)d Annotating Cost For Annotating Cost, "insert/delete a peer ID into/from an annotation" is the primitive operation, and the cost unit is identified as "annot-update", which is the charge associated with the primitive operation. Table 6.6 is the matrix summarizing the Annotating Cost for every (S, O) pair. We do not assess the (S, O) pairs, who are either not IL-free or incomplete. We will go through the matrix entries column by column. Oi 0A Si 0 0 0 s s - - - 5 - 2 3 4 2 0 0 0 o o 3 2B 5Xr - 1 u 0 2^i=i - Vl <—*l 7 o OiAAB - 2^i=i p - 2^7=1 i • l - - '» O^AfiB 5 2^i=i 2^7=1h l^i-l 2^7 = 1 3 l Table 6.6: Annotating Cost (unit: annot-update) Let us consider {S Oi) (i = 1,3), {Si,0 ) (i = 1,3,4), (Si,0 ) u 2A 2B (i =1,3). By the descriptions of Oi, 0 A and 0 , they don't require to annotate answer 2 2B tuples. Thus, the annotating cost for them is 0. Let us consider {Si,0 ) [i — 1,3;4). By 0 , for each path P; € P -.b, the 3 3 a returned answer set is annotated with the list L j . Let Z, be the number of peer ID's in L j . Thus, the annotating cost for {Si,0 ) (i = 1,3,4) is the sum of k, 3 Let us consider (S,,C>5) (i — 1,3,4). B y O5, for each path Pi £ P —b, a the returned answer set is partitioned and annotated. For a query incoming path Pi £ PQ-,6, let ki be the number of partitions of the answer set,(j be the number of peer ID's in the j-th partition. Thus, the annotating cost for (Si, O5) ( » = 1,3,4) is ElLr'Eji,^-- 65 Chapter 6. Cost Analysis for (S;0) pairs 6.2.7 Annotation Shipping Cost For Annotating Shipping Cost, "shipping one annotation down one network link" is the primitive operation, and the cost unit is identified as "annot-hop", which is the charge associated with the primitive operation. Table 6.7 is the matrix summarizing the Annotating Shipping Cost for every (S, O) pair. We do not assess the (S, O) pairs, who are either not IL-free or incomplete. We will go through the matrix entries column by column. o o s s Oi 0 0 Si - Si ~ Elr {ki[i Si 2 3 2A o OiA,iB - - 3 2B 0 0 0 0 0 - - Os s s bl OQA,6B ni Pi - 2 » E[-T {*i[l - (ciki) ]/(l - Ciki)-pi} ll 3 Si - - ( c A ) ] / ( i - cih) . } ni « S b ^ ' t e l i - M i ) ' ] / ( i - cih) - } n Pi Table 6.7: Annotation Shipping Cost (unit: annot-hop) Let us consider (Sj.Oi) (i = 1,3), (S ,0 ) t (i = 1,3). B y the description of Oi, 0 2A (i = 1 , 3 , 4 ) and 2A and 0 , 2B (Si,0 ) 2B none of them requires answer annotating, thus no annotation shipping cost. Let us consider (Si,0 ) {i = 1,3,4). By Si and 0 , we know for every path 3 3 Pi £ P -»6 (length of Pi is Ni), the answer set is annotated with an annotation L , 0 at the target peer and returned via path P[ C L j . Let m , be the length of path P[. Thus, the annotation shipping cost for (Si,0 ) (i = 1,3,4) is Ei=°f"''' «m 3 Normally, we expect m ^ Ni. t Let us consider (Si, Or,) (i = 1,3,4). There is no way to accurately quantify the annotation shipping cost in this case. However, we can do the approximation. By 05, for any path Pi € P ->&, for every peer c i that received a the answer set Ai and its annotation Lf. Ai x {Li} is routed from ci to c 2 if p (Ai) = Ai. Let ki be the average fanout of a peer (average number of C2 neighbors of a peer), Cj be a coefficient between 0 and 1 (the average reduction factor of a peer's "safe" neighbor number over all its neighbor number),rc.jis the average length of an answer routing path. It is not hard to see the number of peers receiving Ai x {Li} is the sum of a geometric sequence: ki, Cikf, 66 Chapter 6. Cost Analysis for (S,0) pairs c?/v?,...c" fc i ni+1 t . The sum is - (c /c )" ]/(l - ah). Let the number of partii i i tions (annotations) be pi. Thus, the annotation shipping cost is approximately El=r {fci[i-(c^) ]/(i-c fc )-p }. kl ni i 6.3 i i Hypothesis for Best (S,0) pairs Now let us make the hypothesis on which (S, O) pairs are the best in (1) being IL-free and complete as proved in Chapter 5, (2) fastest one as estimated by the results in Section 6.2. First, (Si-,0\) and (S ,0 ) 4 are excluded because they are not IL-free. 2B Besides, {S ,Oj) {j = 1-6), (S^OAA) (» = 1-4), (Si,0 ) 2 4B (t = 1-4), (Si,0 ) 6A (i = 1-4), (Si,OeB) (i = 1-4) are excluded because they are not complete. Next let us pick out the best (S, O) pairs according to each major cost: • Query Transmitting Cost is non-discriminative, because this cost for each (S, O) pair is exactly the same. • For Query Rewriting Cost, the best ( 5 , 0 ) pairs are (Si,Oi)(i = 2/1,3,5). • For A C P Evaluating Cost, the best (5, O) pairs are (Si, Oj) (j = 1,2A, 2B), (S ,Oj) (j = l,2A,2B). 3 • For Answer Routing Cost, the best (S, O) pairs are (Si, O2.4) (i = 1,3,4) and (Si, 0 2 ) (* = 1,3). (Si, O3) (i = 1,3) cost at the same level of those B best (S, O) pairs. • For A C P Distributing Cost, the best (S, O) pairs are (S3, Oj) (j = 1,2>1,2B,3,5) and (S ,Oj) U = 24,3,5). 4 • For Annotating Cost, the best (S, O) pairs are (Si, Oi) (i = 1,3), (Si, 0 A) 2 (i = 1,3,4) and 02B)(t.= 1,3)'. • • For Annotation Shipping Cost, the best (S, O) pairs are (Si, 0\) (i = 1,3), {Si,0 ) (i = 1,3,4) and (S 0 ) 2A U 2B (t = 1,3). In the above list, (S3, 0 A) and (S3, 02B) are always among the best, except 2 for the Query Rewriting Cost. As we know, query rewriting is done at peers locally, and is normally faster than the network transportation, such as query transmitting and answer routing. Thus, if the local computing speed of peers in a P D M S is much faster than the network transportation speed, {S3,0 A) and 2 67 Chapter 6. Cost Analysis for (S,0) pairs (53,02B) perform better than any other (S,0) pair. Notice that when S3 is used, given source peer a, target peer b, database D at b, query Q that has been transmitted at b via some path P € P -.i>, we have pp(Q(D)) = Q(D). That 0 means, ( 5 3 , 0 A ) and ( 5 , 0 2 B ) behave exactly the same. Therefore, if the 2 3 local computing speed of peers in a P D M S is much faster than the network transportation speed, (£3,02.4) is the best (S, O) pair. Another notable point is: if the local computing speed of peers in a P D M S is much faster than the network transportation speed, (S3,02A) may not always be the best/fastest (S, O) pair. If the local computing speed of peers in a P D M S is much faster than the network transportation speed, Answer Routing Cost is the major cost and really matters. From the above bullets, we know that (S,,03) (i = 1,3) perform fairly well at answer routing, though they are not among the best (S, O) pairs according to most other costs. Given a good P D M S topology, and A C P distribution, (Si,0 ) 3 (S, O) pairs, even faster than [i = 1,3) could be the best (£3,02,4). 68 Chapter 7 Algorithm Details We have implemented the strategies and options that can be combined to form an (S,0) pair being both IL-free and complete. They are Si, S3, S4, Oi, O2A, O2B, O3, and O5. In this chapter, we present the crucial algorithms in these strategies and options. 7.1 Algorithm for Query-Rewriting in light of ACPs By the descriptions of Si and 53 in Section 4.4, during the query transmitting process controlled by these two strategies, a query is rewritten into a new query at some peer to adhere to A C P s defined by the target peer. Therefore, the algorithm for query-rewriting in light of A C P s plays the central role in Si and S . We present this algorithm in Section 7.1.1, and illustrate it by an example 3 in Section 7.1.2. 7.1.1 A l g o r i t h m Description Given a query Q and an A C P R, the intuition of the query rewriting algorithm is to rewrite Q into a new query Q' such that Q' satisfies any structure/value constraint in either Q or R. In another word, for any database, the answer for Q' is contained by both the answer for Q and the answer for R, where Q, Q', R are treated as tree pattern queries. According to the query containment concept in Section 2.2, we have Q'.Q Q and Q' C R. In our algorithm, the containment mapping approach is used to ensure these query containment relationships. First, let us go through some terminology that will be used in the algorithm. Let Q be a query, R be an A C P . Both Q and R can be expressed as tree patterns. For a tree pattern, there exists an output node that means only instances corresponding to this output node are returned as the answer set. The output element in query Q is called return element, and the output element in A C P 69 Chapter 7. Algorithm Details R is called visible element. There are two types of edges in a tree pattern, i.e., pc (parent-child) edge and ad (ancestor-descendant) edge. Secondly, we assume no more than two nodes in R, or the extension form of R, have the same tag name. That is the A C P fragment our algorithm currently handles. The algorithm for Query-Rewriting in light of A C P is shown in Figure 7.1. Let us explain more on few steps of the algorithm. In step 3 of the algorithm, it requires that the schema scm is available, at least for all the ancestors and descendants of visible nodes in TP.R. In step 6, it ensures that the nodes in TP-R.ext, which are attached with new constraints in step 5 (b), must be in the range of mapping M. Because the nodes in TP-Q-ext with the original constraints are in the domain of M (in fact, every node in TP-Q-ext is in the domain of M), thus their mapped nodes in TP.R.ext to accept these constraints are in the range of M. So these nodes will not be pruned in step 6 according to the second condition. As mentioned earlier, the algorithm ensures Q 3 Q' and R 2 Q'. Q 2 Q' because there exists a containment mapping from the tree pattern of Q to the tree pattern of Q' (TP-R.ext), as described in step 5. RD Q' because there is a obvious containment mapping from the tree pattern of R to the tree pattern of Q' (TP.R.ext) since all we've done is adding elements to TP-R.ext. 7.1.2 Example Let us use an example to illustrate the algorithm in the previous section. In our example, the output nodes in tree patterns are capitalized. The database schema scm and the results after each step of the algorithm are shown in later figures. After step 1, the corresponding tree patters of given query Q and A C P R are built. We directly show the tree patterns here, skipping the string expressions of Q and R. After step 2, TP-R is marked, where marked nodes are tagged with a "*". , After step 3, TP-R is expanded to TP-R.ext. (Please refer to the schema scm.) After step 4, TP.Q is expanded to TP.Q.ext. (Actually nothing changes, just TP.Q has been replicated as TP.Q.ext.) Step 5(a) finds the obvious mapping. Because every node in TP.Q.ext can be mapped and every edge is preserved. Specially, the "ad" edge "c => b" in TP-Q-ext is mapped to a path " c - > g- > f- > 6" in TP-R.ext. The 70 Chapter 7. Algorithm Details Algorithm Q u e r y R e w r i t e A C P Input: query Q, ACP R, database schema son Output: the rewritten query Q', or NULL if Q cannot be rewritten in light of R 1. Let TP_R be the tree pattern of R Let TP_Q be the tree pattern of Q Build TPJi and TP-Q 2. Mark each element in TP.R 3. Expand TP-R to TPJi-ext. TP.R.ext includes: (1) allofTP_R (2) all ancestors to visible nodes exposed (3) all descendants of visible nodes exposed 4. Expand TP-Q to TP.Q.ext. TP.Q.ext includes: all of TP.Q 5. Do the containment mapping related work: (a) Attempt to find a containment mapping M, from TP.Q.ext to TPJi-ext: Given that nodes of TPJi-ext have distinct tags, finding M is as follows: V node x in TP-Q-ext: define h(x) = node y in TPJi-ext where x.tag = y.tag. Then test if (1) this mapping M preserves edges/paths, i.e. a "pc" edge is mapped to a "pc" edge, while an "ad" edge is mapped to a path with arbitrary number of nodes; (2) the return element x in TP-Q-ext is mapped to a node y that is a descendant of the visible element (including itself) in TPJi-ext. If the test succeeds, the containment mapping M exists. If the containment mapping M doesn't exist, return NULL (b) Identify the constraints from TP.Q-ext after converting the variable names. and attach them to TPJi-ext, 6. Prune each element e e TPJi-ext s.t. (1) e is not marked (see step 2) (2) e is not in the range of M (i.e., it is not mapped to) (3) TP-R.ext remains a tree 7. Set the return element in TPJi-ext element in TP-Q-ext 8. Translate TPJi-ext as the mapped node of the return to a query Q', and return Q' Figure 7.1: Query-Rewriting in light of A C P Algorithm 71 Chapter 7. Algorithm Details mapping succeeds. And we can see that element d in TP.R.ext, which is the mapped element of the return element D in TP.Q.ext, is a child of the visible element B in TP.R.ext. Step 5(b) identifies the constraint "c > 8" from TP.Q.ext and attach it to TP.R.ext. ' After step 6, TP.R.ext is pruned. The difference is that node g,f,e,m,n have been deleted because they are neither marked nor mapped to by TP.Q.ext. Furthermore, to keep TP.R.ext a tree, the "ad" edge between c and B is compensated. After step 7, the return element in TP.R.ext is set as D. After step 8, TP.R.ext is translated into an XQuery Q'. after step 1: TP_Q = I TP_R = a 1 1 1 c (c>8)1 1 1 b i 1 D a 1 \ c i 1 1 1 1 B i 1 1 1 1 d 72 Chapter 7. Algorithm Details after s t e p 2 : TP_Q = 1 TP_R = a a* 1 1 1 c (c>8)1 1 \ c* i * 1 1 b i 1 i II'.' B* . D 1 1 1 I 1 1 d* after s t e p 3: 1 TP.R.ext = TP_q'= 1 TP_R = a 1 I 1 c (c>8)1 II 1 1 1 b 1 1 D 1 1 1 1 1 a* 1 \ c* i * 1 i B* 1 a* 1 1 \ 1 c* i * I I i i 1 g I I i i 1 f 1 1 i i 1 i . d* IB* 1 1 1 1 1 \ d* e l\ m n after step 4: TP_Q = 1TP.Q.ext1 TP.R = a | a 1 1 II 1 c (c>8)1 c (c>8)1 II b i 1 D 1 II l b t i II ID 1 l a * 1 \ c* i * 1 1 1 1 II B* 1 1 1 1 1 1 1 1 1 1 1 1 1 | a* I l 1 TP.R.ext = d* 1 1 l 1 \ c* i * g i i If 1 i 1 1 1 B* 1 1 \ I d* e 1 1 l\ m n 73 Chapter 7. Algorithm-Details after step 5: TP_Q = |TP_Q_ext| TP_R = a* l 1 \ c* i * 1 1 1 1 1 1 B* . 1 1 I s l 1 l d* i1 ii 1 f 1 i ii | a 1 1 II c (c>8)I c 1 (c>8)1 a II b l 1 D III l b I I I I 1 I D 1 TP_R_ext = 1 | | 1 | 1 | 1 1 1 I 1 1 1 1 a * 7 ' \ c*(c>8) i * I B * ' " 1 \ d* e l \ mn after s t e p 6: TP_Q = lTP_Q_ext| TP_R = 1 TP_R_ext = a a* l 1 \ 1 / c* i * i i i i B* 1 c*(c>8) l a I I I I 1 c ( c > 8 ) I c (c>8) 1 II b i 1 D 1 II l b I I 1 1 l 1 1 I I I D , 1 1 d* 1 I a I * I 1 A 1 i * ' 1 B* 1 1 I I I d * after step 7: TP_Q = |TP_Q_extl TP_R = 1 TP_R_ext = a l a 1 a* 1 1 II 1 1 \ 1 1 c II b I 1 D (c>8)1-c 1 II l b l l I I I D (c>8)I 1 1 l 1 I c* i * 1 1 B* 1 1 d* a* / \ c*(c>8) i * 1 II 1 b* I 1 1 1 1 D* 74 Chapter 7. Algorithm Details after step 8: Q': FOR $ n l IN doc("example.xml")/a, $n2 IN $nl/c, $n3 IN $n2/b, $n4 IN $n3/d, $n5 IN $ n l / i WHERE xs:integer($n2)>8 RETURN ($n4) 7.2 Algorithms for 0 3 By the description of option O3 in Section 4.4, a safe peer list L for the returned answer set needs to be computed. After that, the answer set is routed back via peers in L. In this section, we go into the details of the safe-peer-list finding algorithm and the answer routing algorithm adopted in O3. 7.2.1 Safe-Peer-List F i n d i n g A l g o r i t h m Assume only positive A C P s are considered. The intuition of finding the safe peer list is: find the peers, each of which satisfy the intersection of the A C P sets defined by the target peer for all peers in the query incoming path. Here is an example. R2; Figure 7.2: Example for Safe-Peer-List Finding Algorithm In Figure 7.2, a is the source peer, b is the target peer. 6 defines A C P s for every other peer. There are three A C P s R\, R and R3. R\ is defined by b for 2 a and c\\ R is defined by b for a, c and C 3 ; R is defined by b for a and c . 2 2 3 2 75 Chapter 7. Algorithm Details Suppose the query Q is transmitted along the path a —» c\ —> c —* b. Let the 2 rewritten query at b be Q'. The initial safe peer list L is {a,c\,c }- The target 2 peer b knows that the A C P set (defined by itself) for a is S = {R\, R , R3}, a the A C P set (defined by itself) for ci is S by itself) for c is S 2 5 C3 = {R }2 C2 2 = {R\,R }, the A C P set (defined Cl 2 = {R ,Rz}, the A C P set (defined by itself) for C3 is 2 b notices that the intersection of the A C P sets for all peers in the query incoming path is I = 5 n 5 a C l C\S C2 = { ^ 2 } and / C 5 . That means, the C3 data, which can be accessed by o, c\ and c , can also be accessed by C3. Thus, 2 c can be added into the safe peer list: L = {a, c\, 3 c , C 3 } . 2 Then the answer set for Q' can be routed via any peer in L. The answer can be routed via the path b —* C3 —+ a, which is shorter than the path b —> c —• c\ —» a. 2 The Safe-Peer-List Finding Algorithm described by the above example is shown in Figure 7.3. For clarity, the algorithm uses two hash tables H\ and H . 2 Hi is the hash table keeping all (peer ID, {(ACP ID, target peer ID)}) pairs, where {(ACP ID, tar get peer ID)} is the set of (ACP ID, target peer ID). H is the hash table keeping all ((ACP ID, target peer ID), {peer ID}) pairs, 2 where {peer ID}) is the set of peers for whom this A C P is defined by the target peer. 7.2.2 Answer Routing Algorithm After the safe peer list L is found, O3 routes the answer set back to the source peer via peers in L. There exists an opportunity for 0 3 to find a better/shorter answer routing path than the reversed path of the query incoming path. In this section, we describe the answer routing algorithm designed for O3. Because no peer in the P D M S has the complete knowledge about the P D M S topology, there is no algorithm to find the optimal answer routing path. But given the safe peer list L, a peer, who is routing the answer set, is able to find a "local" shortcut. Here we use an example to illustrate the idea. Please refer to Figure 7.4. In the P D M S , S is the source peer, and T is the target peer. The routed answer set is accompanied with two supported structures: a stack ST with the current routing peer on the stack top, a safe peer list L created by the Safe-PeerList Finding Algorithm in the previous section. When the answer set is routed from T, the initial status of ST is a stack containing all peers in the query incoming path. In our example, the query incoming path is S —» ... —» C —» X Y -* Z -* A T, so the initial ST is {S,...,C,X,Y,Z,A,...,T}. 76 Chapter 7. Algorithm Details Algorithm F i n d S a f e P e e r s Input: the set S of all peer IDs along the query incoming path, the database DBR of tuples (ACPJD,targetPeerID,peerID) Output: the safe peer list L Let Hi be the hashtable for (peerID,{(ACP-ID,targetPeerID)}) Let H2 be the hashtable for ({ACPJD,targetPeerID),{peerID}) pairs pairs 1. Traverse DBR to build Hi and H2 2. Initialize L = S 3. Let / be the intersection of the A C P set V peer G S Initialize / as the key set of H2 F O R each peS { (a) From Hi, get the set Si = {(ACPJD,targetPeerID)} (b) Update / = / n S i } 4. F O R each p e the key set of Hi { Let S be the set {(ACPJD,targetPeerID)} 2 (a) Get 5 2 for p from Hi (b) IF S 2 / { Update L = L U {p} 2 }. } Return L Figure 7.3: Safe-Peer-List Finding Algorithm 77 Chapter 7. Algorithm Details Stack ST: Safe Peer List L: Figure 7.4: Example for 0 3 Answer Routing Algorithm 78 Chapter 7. Algorithm Details At the moment, the answer set is at A. The corresponding stack ST and safe peer list L is shown in Figure 7.4. Assume there is an easy way for each peer to know its neighbor's adjacent peers, e.g., a peer sends a message to ask its neighbors for such information. (In our implementation, we use a similar way to achieve it.) For instance, in Figure 7.4, A knows the adjacent peers of Z are {j4,y}. Then A can utilize some peer in L to skip a few peers in stack ST. (The naive answer routing method is to route the answer set via peers in ST one by one.) In our example, A checks the safe peer list L, finds that B is in L and one of B's neighbor is C, which exists in stack ST. Then the answer set is routed from A to C via B. A l l the entries above C in stack ST is popped. By this tactic, two hops are saved. (Instead of being routed via A —> Z —*Y—> X —> C , the answer set is routed via A —> B —> C.) Repeat the tactic at each passing peer, until the answer set arrives at S. The O3 Answer Routing Algorithm is formalized in Figure 7.5. In this algorithm, if such peer B can not be found, the answer set is routed to the top element (peer) of the stack ST. This ensures that the answer set will be routed back to the source peer. Algorithm AnswerRouting03 Input: stack ST, safe peer list L 1. pop a peer ID A from ST • 2. IF A is the source peer{ RETURN • } 3. find peer ID B s.t. (1) B is A's neighbor (2) B € L (3) B's neighbor C £ ST 4. IF such B exists { «. (a) Pop all entries above C from ST (b) Route the answer set to C via A —» B —* C } ELSE { (a) Get the top entry TV of ST (b) Route the answer set to N } , ' Figure 7.5: O3 Answer Routing Algorithm 79 Chapter 7. Algorithm Details 7.3 Algorithms for Option 5 According to Chapter 4, option 0 5 partitions the answer set and associates each partition Ki with an annotation L j , where L j is the safe peer list for Ki. Ki is routed via peers in L j . In this section, we present the partitioning semantics and methodologies (Section 7.3.1), the data-level partitioning algorithm (Section 7.3.2) and the schema-level partitioning algorithm (Section 7.3.3). 7.3.1 Partitioning Semantics and Methodologies Given query Q, the database D at the target peer b, all A C P s of b for other peers. A n annotating and partitioning algorithm of O 5 divides the answer set Q(D) into several non-intersecting partitions, and annotates each partition with a set of safe peers, which are authorized by A C P s to access tuples in the partition. The partitioning semantics is explained by Figure 7.6. Iff' :'|. {a;b,c} {<?}; \ernptyset Figure 7.6: Answer Partitioning Semantics In the P D M S , there are three peers a, b, c, besides the target peer. The rectangle denotes the whole answer set Q(D). The three circles separately 80 Chapter 7. Algorithm Details denote the set of answer tuples that can be accessed by a, b, c, according to the A C P s defined by the target peer. As we see, the answer set Q{D) is divided into eight non-intersecting partitions, separately attached with annotations {a}, {b}, {c}, {0,6}, {a,c}, {6,c}, {a,b,c}, 0. Each partition is maximized and its annotation set is maximized. More specifically, after partitioning, every answer tuple t G Q{D) is put into a partition that has an annotation S, such that S is the maximum set of safe peers for t. This is the semantics for partitioning and annotating. There are two possible methods of conducting the annotating and partitioning: Method 1. Interleave A C P checking with evaluation of Q{D). In another words, whenever computing an answer tuple according to Q and an A C P , modify this tuple's, annotation. After all tuples have been computed and annotated, partition them according to their annotations. Method 2. First evaluate Q(D) and get the answer set. Then find an algorithm for checking all A C P s on all the answer tuples and annotating them. Finally partition tuples according to their annotations. Method 2 relies on the supporting elements, which might have been projected out but are required to be kept in the answer tuples of Q(D). However, Method 1 doesn't have such a restriction. So we choose to use Method 1 in our partitioning algorithm. There are two types of ACPs: data-level and schema-level A C P s . Datalevel A C P s do not affect the schema of answer tuples, which all adhere to the same schema; while schema-level A C P s will project on some elements and thus affect the schema of answer tuples. We will work on the data-level partitioning algorithm in Section 7.3.2 and the schema-level partitioning algorithm in Section 7.3.3. 7.3.2 Data-level Partitioning Algorithm Assume all ACPs defined by the target peer are data-level ACPs. Then A C P s do ? not affect the schema of answer tuples. Therefore, it is easy to check if an answer tuple exists in an answer set. From this conclusion, the intuition of our datalevel partitioning algorithm is as follows. Initialize the answer set as an empty set. Each A C P is independently combined with the original query to compute answer tuples. For each computed answer tuple, check whether it exists in the 81 Chapter 7. Algorithm Details current answer set. If so, we expand its annotation to include peers associated with the current A C P ; else we add the tuple to the answer set, annotating with peers associated with the current A C P . After the process, the annotation of each answer tuple is maximized. Then according to their annotations, the tuples can be grouped to form proper partitions, which have the same property as Figure 7.6. The Data-level Partitioning Algorithm is shown in Figure 7.7. In Algorithm DataLevel Partition, step 1 and step 2 are responsible for computing and annotating answer tuples, step 3 call Procedure Grouping to form partitions. Procedure Grouping traverses the annotated answer tuples once and groups them into several partitions. Any tuple t is put into a partition with the same annotation of t. 7.3.3 Schema-level Partitioning A l g o r i t h m The partitioning algorithm in the previous section can only handle data-level ACPs. Now let us extend the algorithm to tackle both data-level and schemalevel ACPs. In order to partition answer tuples in different schemas, we must have a clear idea on what an answer tuple schema and an answer tuple are. A n answer tuple schema is the schema of an answer tuple. It is a set of attributes. A n answer tuple is a set of attribute values. More specifically, in a relational query, each answer tuple is a set of table attribute values; in an XQuery, each answer tuple is a set of user-defined variable values , if not considering result restructuring. There is a useful relationship between two answer tuples in different schemas. We define it as a new operator, Tuple Containment: Definition 7.1 (Tuple Containment) Let T\ be an answer tuple and S\ be T\'s schema. Given an attribute value v € T\, v.attrSi is v's corresponding attribute € Si. Let T be an answer tuple and S 2 2 be T 's schema. Given an 2 attribute value v € T , v.attrS is v's corresponding attribute £ S . T\ is Tuple2 2 2 Contained by T if and only ifVv € T\: 3v £ T s.t. v.attrSi = v.attrS . It 2 2 2 If T, < T and Si C S , we say Ti is strictly Tuple- is written as T, < T . 2 2 Contained by T , written as Ti 2 2 <T . 2 Here is an example for the Tuple Containment. The schema for answer tuple t is (Ai,A ) and t = ('a',, 'a' ); the schema for answer tuple t' is 3 3 (Ai,A ,A ) 2 3 and t'—('a'i, 'a' , 'a' ). According to the tuple containment definition, we have 2 t<?: 3 . 82 Chapter 7. Algorithm Details Algorithm DataLevelPartition Input: query Q, target peer b, database D at peer b, all A C P s Ai (i = l..fc) of peer b for other peers. Output: partitions of Q(D), where every answer tuple t is put into a partition that has an annotation A s.t. A is the maximal set of safe peers for t. 1. Initialize the answer set 5 = 0. 2. F O R each A C P R { (a) Let Si be the peer set associated to R Use R to rewrite query Q, and compute the answer set I (b) F O R each answer tuple t £ I { IF (t £ S) { i. Let So be t's current annotation Update So = So U Si } l/t^S ELSE { i. Assign Si as t's annotation. ii. Update S = SU{t}. } //ELSE }//FOR } //FOR 3. Call the procedure Grouping to return partitions of Q(D) Procedure Grouping Input: a set S of answer tuples with annotations Output: partitions of these answer tuples. Each partition has an annotation, which is a set of peer IDs. Each answer tuple t is put into the only partition with the same annotation of t. 1. Let Si be the set of tuple partitions Let S2 be the set of tuple annotations Initialize Si = 0, S = 0 2 2. F O R each tuple teS { (a) Let A be t's annotation IF(A € S ) { 2 i. A d d t to partition P where P £ S i A P has annotation A. } E L S E { 11A $ S 2 i. A d d A to S . 2 ii. Create a new partition P', with annotation A in Si. iii. A d d t to P'. } }//FOR 3. Return Si Figure 7.7: Data-level Partitioning Algorithm Chapter 7. Algorithm Details Intuitively, answer tuple t\ is tuple-contained by t if and only if all infor2 mation in t\ is covered by t . It infers a useful conclusion: if t\ <l t and t 2 2 2 can be accessed by peer p, then t i can also be accessed by p. With this conclusion, we design a new partitioning algorithm that extends the partitioning algorithm in the previous section to handle both data-level and schema-level A C P s . It is shown in Figure 7.8. The procedure Grouping called in step 3 is exactly the same as in the previous section. A n supporting data structure is required for every answer tuple in Algorithm SchemaLevelPartition. This datastructure is called Affected Tuple Set. The idea is: given tuples t\ and t , if t < t\, t is put in t\s Affected Tuple Set. 2 2 2 Therefore, a tuple t's Affected Tuple Set can accurately identify which tuples' annotations need to be modified when t's annotation is modified. For example, if t < t\, then t is in ti's Affected Tuple Set. When peer p is added to t i ' s 2 2 annotation, p should also be added to t2's annotation. For an answer tuple t, the intuition of the algorithm is: (1) Check whether t is in the current answer set. If so, expand t's annotation to include peer IDs associated with this A C P ; accordingly expand the annotations of tuples identified in t's Affected Tuple Set. (2) Else t isn't in the current answer set. Assign peer IDs associated with this A C P as annotation of t. Furthermore, check whether there exists an answer tuple t' in the current answer set such that t O t'. According to our previous conclusion, the annotation of t will be expanded to include peer IDs in the annotation of t'. (1) and (2) ensure the annotation for each answer tuple is maximized. Thus, the algorithm returns the correct partitions. 84 Chapter 7. Algorithm Details Algorithm SchemaLevelPartition Input: query Q, target peer b, database D at peer 6, all A C P s Ai (i = l..fc) of peer 6 for other peers. Output: partitions of Q(D), where every answer tuple t is put into a partition with an annotation A, such that A is the maximal set of safe peers for t. 1. Initialize the answer set S = 0. 2. F O R each A C P R { (a) Let Si be the peer set associated to R (b) Use R to rewrite query Q, and compute the answer set T (c) F O R each tuple t 6 T { i. I F t G 5 { A . Let So be the existing annotation of t. Update So = So U Si. B . Let ATS be the Affected Tuple Set of t. For every tuple h identified in ATS, update ti's annotation St = St, U S i . t }//IF E L S E { Ht $ S A . Assign t's annotation So = Si. B. F O R every answer tuple t', where f ' e S and t < t' { • Let S' be the annotation of t'. Update So = So U S'. • Let ATS , be the Affected Tuple Set oft'. Update ATS > = ATSf U {t}. }//FOR C. A d d t (with So) to S. D. Let ATS be the Affected Tuple Set of t. Compute ATS. For every tuple ti identified in ATS, update ti's annotation Stj = Stj U So. t t } //ELSE } //FOR } //FOR 3. Call procedure Grouping to return partitions of Q(D) Figure 7.8: Schema-level Partitioning Algorithm 85 Chapter 8 Experimental Study In Chapter 3, we introduced the information leakage and completeness problems of the query answering process in a P D M S with access control requirements. Then in Chapter 4, our solution for the problem was presented: we designed some strategies and options to handle access control. Furthermore, we built a cost model to theoretically analyze the cost for each (S, O) pair that ensures ILfree and completeness in Chapter 6, where a hypothesis for best/fastest'(5, O) pairs are proposed by us. In this chapter, we use experiments to verify our hypothesis for best (S,O) pairs, and study the algorithm scalability. Specifically, we describe the experiment implementation in Section 8.1,'compare the running time of (5,0) pairs in Section 8.2, and study the scalability in different facets in Section 8.3. 8.1 Experimental Settings and Implementation To setup the P2P networking environment, FreePastry [3] is used in our experiment. FreePastry is an open-source P2P overlay network implementation. It provides an efficient algorithm for message routing, whose complexity is 0(logN), where N is the number of nodes in the network. Moreover, userspecified applications can be easily integrated with existing FreePastry source codes. In our experiment, FreePastry version 1.4.4 is used. As to the emulation test bed, Emulab [2] is adopted in our experiment. Emulab holds a collection of hundreds of PCs for allocation. For an experiment at Emulab, the user can freely specify the topology of a network, the type of PCs in the network, latency, bandwidth, and so on. During the experiment life cycle, the user has full control on the allocated PCs. Thus, user applications can be loaded on any P C in the experiment. In our experiment, 47 PCs in Emulab are required and allocated: 31 of them work as peers in a P D M S , and the remaining 16 as the delay nodes that controls the networking traffic shaping. Unless specified otherwise, in later experiments, the network bandwidth is 50 86 Chapter 8. Experimental Study M B , the latency is 100 ms. The bandwidth is big enough to avoid the bottleneck, comparing with the size of query/answer messages (at most few K B each). Every P C allocated has 3.0 GHz 64-bit Xeon processor and 2 G B R A M , with Testbed version of RedHat Linux 9.0 as the operating system. Our P D M S application built on FreePastry and the database are loaded on each P C . To the best of our knowledge, Qizx [4] is the fastest open-source Java X M L query engine. So it is used in our experiment for peers to query their local X M L databases. Qizx supports the standard XQuery language, and also provides Java APIs to invoke the XQuery engine. In our experiment, Qizx version 1.0 is used. In order to make the X M L databases on peers general enough, we choose XMark [1] data generator to randomly create X M L data. XMark project provides a benchmark suite for users. The XMark data generator can produce random X M L documents modeling an auction website. Important structure features in a typical X M L document is included in an XMark-created X M L document. In our experiment, we create several X M L databases, whose sizes range from 10 M b to 40 M b . We manually build the schema for Xmark-created X M L databases and a library of 20 A C P s . We design a topology generator to randomly create the P D M S topology we need. A l l the strategies and options, which can be combined while keeping IL-free and Completeness properties, are implemented. The peer application, which specifies its strategy and the option, is loaded on each P C (peer) allocated by Emulab. Our implementation is written in Java 1.5 to make it cross-platform. 8.2 (S,0) Pair Comparison and Analysis In this section, we experiment to compare the running time of the queryanswering process controlled by (S, O) pairs that are both IL-free and complete. The running time here and in the next section is for O N E source peer, O N E target peer and O N E query, which adheres to the setting of the theoretical cost analysis for an (S,0) pair (Chapter 6). The first reason lies in that using the same setting, the experiment result can directly verify our theoretical cost analysis and hypothesis. The second reason is that the result for one query, one source peer and one target peer can be extended to a general case with one query, one source peer and multiple target peers, which doesn't violate our 87 Chapter 8. Experimental Study existing conclusion. The compared (S,0) pairs include ( S i , 0 , ) (j = 1,24,3,5), (S ,Oj) (j = 3 1,24,3,5), (54,05). Our experiment setting is: 50Mb bandwidth, 100ms latency, 10 peers, average 2 neighbors per peer, 10M database, 1 acp defined for each peer. The P D M S topology is fixed but randomly created. For each running time value, we execute the experiment for three times and get the average value. The result is shown in Figure 8.1. Figure 8.1(a) is the running time of the query-answering process for all (5,0) pairs. We can see that ( 5 ; , 0 5 ) (i•= 1,3,4) is much slower that other (5,0) pairs. To clearly see which (5,0) pair is the fastest, we extract the first three groups of bars and put them into 8.1(b). In this figure, we see that (1) for Oj (j = 1,24,3), {S ,Oj) is slightly faster than ( 5 i , 0 , ) ; (2) (Si,0 ) 3 2A (» = 1,3) are faster than others. Thus, ( 5 , 0 i ) is the fastest among all (5,0) pairs, 3 2 / which adheres to the hypothesis we made in Section 6.3: if the local computing speed of peers in a P D M S is much faster than the network transportation speed, (53,02.4) is the best (5,0) pair. Now let us retain the setting of the previous experiment, except decreasing the network latency to 10 ms, and repeat the experiment. This time the network transportation speed is so fast that the assumption "the local computing speed of peers in a P D M S is much faster than the network transportation speed" no longer holds. So the hypothesis (S3,0 A) U 2 is the best (5,0) pair" may not be true. The experiment result is shown in Figure 8.2. We see that (5j,Os) (i = 1,3,4) is still much slower that other (5,0) pairs. But there is no (5,0) pair that is apparently faster than others. However, as we mentioned in Section 6.3, even given the condition "the local computing speed of peers in a P D M S is much faster than the network transportation speed", (S3, 0 A) may not always be the best/fastest (5, O) pair; 2 if given a proper P D M S topology and A C P distribution, (Si, O3) (i = 1,3) could be the best, even faster than (53,02A)- T O verify the hypothesis, we conduct another experiment. The experiment setting remains the same as the first one: 50Mb bandwidth, 100ms latency, 10 peers, average 2 neighbors per peer, 10M database, 1 acp defined for each peer. But this time, the P D M S topology and A C P distribution are carefully designed to benefit O3 finding a short answerrouting path. More specifically, the topology and A C P distribution enables O3 to find a shortcut for the reversed path of the longest query incoming path, with the help of safe peers outside the query incoming path. (Otherwise O3 has to route the answer back to the source peer via the reversed query incoming path.) 88 Chapter 8. Experimental Study (a) t h e R u n n i n g T i m e o f t h e Q u e r y - A n s w e r i n g P r o c e s s for A l l (S, O) P a i r s B e i n g C o m p a r e d Algorithm Comparison 1a.2 • S1 • S3 (b) t h e R u n n i n g T i m e o f t h e Q u e r y - A n s w e r i n g P r o c e s s for A l l (S, O) p a r e d , E x c l u d i n g ( S j . O s ) (i = Pairs Being C o m - 1,3,4) Figure 8.1: Running Time Comparison of (S,0) Latency Pairs, in case of Large Network 89 Chapter 8. Experimental Study Algorithm Comparison 1b.1 40.000 35.000 30.000 25.000 time (in seconds) 01 02A 03 Options (a) t h e R u n n i n g T i m e o f t h e Q u e r y - A n s w e r i n g P r o c e s s f o r A l l ( S , O ) P a i r s B e i n g C o m p a r e d Algorithm Comparison 1b.2 time (in • S1 seconds) • S3 (b) t h e R u n n i n g T i m e o f t h e Q u e r y - A n s w e r i n g P r o c e s s f o r A l l (5,0) pared, E x c l u d i n g (S ,Os) t Pairs Being Com- (i = 1,3,4) Figure 8.2: Running Time Comparison of (5,0) Pairs, in case of Small Network Latency 90 Chapter 8. Experimental Study On the other side, the topology in this experiment doesn't benefit O2 finding a shortcut for the reversed path of the longest query incoming path, i.e., if a path is treated as a set of peers, there doesn't exist an answer routing path, which is a subset of the longest query incoming path. The experiment result is shown in Figure 8.3. We can clearly see that (Sj,03) (i = 1,3) are the fastest, even faster than (SZ,C>2A)Because (Sj,Os) (i = 1,3,4) is always much slower than other ( 5 , 0 ) pairs and even intolerable, in the experiments on scalability we will not consider ( 5 i , 0 ) (i = 1,3,4). 5 8.3 Scalability Results and Analysis In this section, we experiment on the ( 5 , 0 ) pair scalability in different facets. (1) Scalability o n Database Size In this experiment, we test the running time trend of the query-answering process for ( 5 , 0 ) pairs with the change of database size on the target peer. The experiment setting is: 50Mb bandwidth, 100ms latency, 10 peers, average 3 neighbors per peer, 1 acps per peer. In the experiment, the P D M S topology is fixed but randomly created. For each running time value, we execute the experiment for three times and get the average value. The experiment result is shown in Figure 8.4. We can see that the running time for any ( 5 , 0 ) pair is proportional to the database size of the target peer. The result is reasonable: normally, the database query time and the returned answer set size are linear functions of the target database size, which in turn determines the query-answering time is a linear function of the target database size. What is the effect if we increase the network latency? As a comparison, let us retain the setting of the previous experiment, except increasing the network latency to 1000 ms, and do the experiment again for ( S i , O r ) . The result is shown in Figure 8.5. We can see that the running time is still approximately a linear function of the target database size, but the slope is much more flat. This result is not hard to explain: with the increase of network latency, the affect of the target database size is diluted. The total running time now is mainly decided by the network transportation, which is irrelevant to the target database size. As an ultimate case, if the local computing time is by far smaller 91 Chapter 8. Experimental Study Algorithm Comparison 2.1 80.000 70.000 BSI • S3 • S4 03 02A 05 Options (a) t h e R u n n i n g T i m e o f t h e Q u e r y - A n s w e r i n g P r o c e s s for A l l (S, O) P a i r s B e i n g C o m p a r e d Algorithm Comparison 2.2 4.100 4.000 E3S1 • S3 Options (b) t h e R u n n i n g T i m e o f t h e Q u e r y - A n s w e r i n g P r o c e s s for A l l (S, O) p a r e d , E x c l u d i n g ( S j . O s ) (i = 1 , 3 , 4 ) Pairs Being C o m - Figure 8.3: Running Time Comparison of (5,0) Pairs. Under this experiment setting, the P D M S topology and A C P distribution benefit 0 finding a short answer-routing path, but doesn't benefit 0 A3 2 92 Chapter 8. Experimental Study running time v.s. db size 1 80 70 -•— (S1,01) •«-(S1 02A)| (S1,03) «* (S3.01) -»-(S3,02A)| -•—(S3.03) 60 eS I O 1 $ 40 I I 30 20 10 10 15 20 25 30 35 40 database size (in megabytes) Figure 8.4: the Running Time of the Query-Answering Process V . S . the Database Size of the Target Peer for Each Compared (S,0) Pair. The network latency is 100 ms. than the network latency, the curve for our experiment result is expected to be a horizontal line. (2) Scalability on Number of A C P s per Peer A C P s are defined by the target peer for other peers. In this experiment, we study the running time trend of (S, O) pairs with the change of the number of A C P s defined by the target peer for each peer. The experiment setting is: 50Mb bandwidth, 2ms latency, 10 peers, 2 neighbors per peer, 10M database. The P D M S topology is fixed but randomly created. For each running time value, we execute the experiment for three times and get the average value. The experiment result is shown in Figure 8.6. By this curve, the running time seems to be a polynomial function of the number of A C P s per peer. But it is hard for us to explain where this result comes from. So we conduct the second experiment to discover the hidden fact. In the second experiment, we test the running time trend of (S\,0\) with the changes of both the number of A C P s per peer and the network latency. The experiment setting is: 50Mb bandwidth, 10 peers, 2 neighbors per peer, I M database. The result is shown in Figure 8.7. From the result curves, we see that the running time is a polynomial function of the number of A C P s per peer, and 93 Chapter 8. Experimental Study running time v.s. db size 2 120 T 100 | 80 -(S1.01) 60 40 20 0 10 15 20 25 30 35 40 database size (in megabytes) Figure 8.5: the Running Time of the Query-Answering Process V . S . the Database Size of the Target Peer for (S\,Oi). The network latency is 1000 ms. a linear function of the network latency (because the distance between each two curves approximately remains a constant, and the distance between the curve of 0 ms latency and the curve of 100 ms latency equals the distance between the curve of 100 ms latency and the curve of 200 ms latency). Hinted by the experiment result, we reach a theoretical explanation: total running time T = message transmitting time + local query evaluation time = 2 * n * I + E * r", where n is the longest path from source to target, I is the network latency, E is the local evaluation time for a query, r is the number of A C P s per peer. The expression r is the number of rewritten queries at the target peer, which is den cided by S\. From the above equation, it is clear that the total running time T is a polynomial function of r and linear function of /, which explains the results in Figure 8.6 and Figure 8.7. (3) Scalability on Length of the Longest Path The running time of the query-answering process for an (S, O) pair might be largely affected by the length of the longest path for a message having a round trip between the source peer and the target peer. For ( S i . C ^ A ) and (£,,03), such a longest path is hard to decide because the answer-routing path is decided by both the topology and A C P distribution. To make our experiment clear, we 94 Chapter 8. Experimental Study running time v.s. acps/peer 1 100 90 71 80 •o 70 z 60 50 5 40 -(S1,01) -(S1,02A)| (S1,03) (S3.01) - (S3.02A) | 30 -(S3.03) 20 10 acp U per peer Figure 8.6: the Running Time of the Query-Answering Process V.S. the Number of A C P s per Peer for Each Compared (S, O) Pair running time v.s. acps/peer 2 Figure 8.7: the Running Time of the Query-Answering Process V.S. the Number of A C P s per Peer for (S\,0\), with the Network Latency of 0 ms, 100 ms and 200 ms Separately 95 Chapter 8. Experimental Study choose to study ( 5 i , O i ) , for whom such a longest path is simply twice the length of the longest path from the source peer to the target peer. In this experiment, we test the running time trend of the query-answering process for ( S i , 0\) with the change of length of the longest path from the source peer to the target peer. The experiment setting is: 50Mb bandwidth, 100ms latency, 1M database, 2 neighbors per peer, 1 acp per peer. Given the same setting, we do the experiment on two P D M S of different sizes: one P D M S with 20 peers and the other P D M S with 30 peers. The experiment result is shown in Figure 8.8. By the result, we see the running time is proportional to length of the longest path from the source peer to the target peer, but nearly has no relation to the number of peers in the P D M S (because the two lines overlap). This result can be also explained by the aforementioned formula: total running time T = message transmitting time -I- local query evaluation time = 2 * n * I + E * r , where n is the longest path from source to target, / is the network n latency, E is the local evaluation time for a query, r is the number of A C P s per peer. In our experiment setting, r = 1. Thus, T = 2*n*l + E. It indicates that T is proportional to n. running time v.s. length of longest path 1.200 1.000 I 0.800 • 20 peers • 30 peers | 0.600 c | 0.400 0.200 0.000 10 12 14 16 18 length of longest path (in hops) Figure 8.8: the Running Time of the Query-Answering Process V.S. the Length of the Longest Path from Source Peer to Target Peer for ( S i , O i ) . The experiment is done for both a P D M S with 20 peers and a P D M S with 30 peers. 90 Chapter 9 Conclusions and Future Work In this thesis, we have studied the access control issue in the X M L peer data management system. To the best of our knowledge, our work is the first attempt to systematically analyze the access control problems in the P D M S . Our main contributions include: • A formal syntax for the Access Control Policy ( A C P ) is proposed. The A C P syntax is fine-grained and expressive enough for specifying the access control privilege on the X M L database of a peer in the P D M S . (Chapter 3) • We design several (query transmitting) Strategies and (answer routing) Options, whose combinations form the query-answering algorithms and can handle the access control requirements in a P M D S . (Chapter 4) • Some novel algorithms used in the strategies and options, such as (i) query rewriting algorithm in light of A C P s (ii) safe peer list finding algorithm (iii) annotating and partitioning algorithm, are designed. (Chapter 7) • We formalize the definitions for Information Leakage Free and Completeness, which are important properties of an (Strategy, Option) pair. Furthermore, we propose the sufficient and necessary conditions for them, and analyze every (Strategy, Option) pair designed. (Chapter 5) • We propose a cost model, which consists of the major tasks and the corresponding primitive operations and cost units. A l l (Strategy, Option) pairs are assessed by this cost model. (Chapter 6) • Experiments are conducted on the designed (Strategy, Option) pairs, comparing their execution speed and testing the scalability in terms of the tar- 97 Chapter 9. Conclusions and Future Work get peer database size, A C P amount pe peer, length of the longest path from the source peer to the target peer. (Chapter 8) There are some directions that we would like to pursue in our future work: • In our work we assume that all peers use the same schema, which avoids adding in the complications of schema heterogeneity. In a realistic P D M S , the schema heterogeneity will force A C P s to be rewritten if they are distributed among the P D M S . A n d the schema heterogeneity may also affect the query-rewriting in light of A C P s algorithm. • Caching is not discussed in the thesis. However, as a common approach to accelerate the query-answering process, caching is worth noting. If caching is used in a P D M S , we need to be more careful to avoid information leakage. The IL-free and Completeness definitions in the thesis need to be modified. And other strategies and options can be designed to utilize caching. • Thus far, our algorithm for query-rewriting in light of A C P s can only handle one fragment of tree patterns, whose corresponding XPath is with '/'> '//'>' 'I ]'• And It also requires, that nodes of TP.R.ext have distinct tags. We would like to remove the restrictions and make the algorithm to handle more general cases. 98 Bibliography [1] XMark A n X M L Benchmark Project, 2003. http://monetdb.cwi.nl/xml/. [2] Emulab - Network Emulation Testbed, 2006. http://www.emulab.net/. [3] Pastry: A substrate for peer-to-peer applications, 2006. http://www.freepastry.org/. [4] Qizx/open, 2006. http://www.axyana.com/qizxopen/. [5] S. Adali, K . Candan, Y . Papakonstantinou, and V . Subrahmanian. Query Caching and Optimization in Distributed Mediator Systems. In Proc. of the ACM SIGMOD International Conference on Management of Data, 1996. [6] A . V . Aho, Y . Sagiv, and J . D. Ullman. Efficient Optimization of a Class of Relational Expressions. ACM Transactions on Database Systems (TODS), 4(4):435-454, 1979. [7] A . V . Aho, Y . Sagiv, and J . D. Ullman. Equivalence of relational expressions. SIAM Journal on Computing, 8(2):218-246, 1979. [8] J . L . Ambite, N . Ashish, G . Barish, C. A . Knoblock, S. Minton, P. J . Modi, I. Muslea, A . Philpot, and S. Tejada. A R I A D N E : A System for Constructing Mediators for Internet Sources (system demonstration). In Proc. of the ACM SIGMOD International Conference on Management of Data, 1998. [9] Sihem Amer-Yahia, SungRan Cho, Laks V . S. Lakshmanan, and Divesh Srivastava. Tree Pattern Query Minimization. The VLDB Journal, 11 (4):315— 331, 2002. [10] Elisa Bertino, Barbara Carminati, and Elena Ferrari., A Temporal Key Management Scheme for Secure Broadcasting of X M L Documents. In Proc. of the 9th ACM Conference on Computer and Communications Security (CCS 2002), 2002. 99 Chapter 9. Conclusions and Future Work [11] J . Biskup, P. Dublish, and Y . Sagiv. Optimization of a subclass of conjunctive queries. Acta Informatica, 32(l):l-26, 1995. [12] Scott Boag, Don Chamberlin, Mary F. Fernandez, Daniela Florescu, Jonathan Robie, and Jerome Simeon. XQuery 1.0: A n X M L Query Language, 2006. http://www.w3.org/TR/xquery/. [13] Angela Bonifati, Elaine Qing Chang, Laks V.S. Lakshmanan, Terence Ho, and Rachel Pottinger. HePToX: Marrying X M L and Heterogeneity in Your P2P Databases. In Proc. of the 31st International Conference on Very Large " Databases (VLDB 2005), 2005. [14] Sabrina De Capitani, Stefania Marrara, and Pierangela Samarati. A n Access Control Model for Querying X M L Data. In Proc. of the 2005 Workshop on Secure Web Services (SWS 2005), 2005. [15] Ashok K . Chandra and Philip M . Merlin. Optimal Implementation of Conjunctive Queries in Relational Data Bases. In Proc. of the 9th annual ACM Symposium on Theory of Computing (STOC 1977), 1977. [16] Sudarshan Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom. The TSIMMIS Project: Integration of Heterogeneous Information Sources. Journal of Intelligent Information Systems, 8(2):117-132, 1997. [17] Chandra Chekuri and Anand Rajaraman. Conjunctive Query Containment Revisited. Theoretical Computer Science, 239(2):211-229, 1998. [18] E R N E S T O D A M I A N I . A Fine-Grained Access Control System for X M L Documents. ACM Transactions on Information and System Security, 5(2):169-202, 2002. [19] X i n Dong, Alon Y . Halevy, and Igor Tatarinov. Containment of Nested X M L Queries. In Proc. of the 30th International Conference on Very Large Databases (VLDB 2004), 2004. ' [20] S. Flesca, F. Furfaro, and E . Masciari. queries. On the minimization of Xpath In Proc. of the 29th International Conference on Very Large Databases (VLDB 2003), 2003. [21] Alban Gabillon. A Formal Access Control Model for X M L Databases. In Proc. of the VLDB Workshop on Secure Data Management (SDM), 2005. 100 Chapter 9. Conclusions and Future Work [22] L . Haas, D. Kossmann, E . Wimmers, and J . Yang. Optimizing Queries across Diverse Data Sources. In Proc. of the 23rd International Conference . on Very Large Databases (VLDB 1997), 1997. [23] Alon Y . Halevy, Zachary G . Ives, Jayant Madhavan, Peter Mork, Dan Suciu, and Igor Tatarinov. The Piazza Peer Data Management System. IEEE Transactions on Knowledge and Data Engineering, 2004. [24] Alon Y . Halevy, Zachary G. Ives, Peter Mork, and Igor Tatarinov. Piazza: Data Management Infrastructure for Semantic Web Applications. In Proc. of the 12th International World Wide Web Conference (WWW2003), 2003. [25] C. Ilioudis, G. Pangalos, and A . Vakali. Security model for X M L data. In Proc. of the 2nd International Conference on Internet Computing (IC 2001), 2001. [26] D.S. Johnson and A.Klug. Optimizing Conjunctive Queries that Contain Untyped Variables. SIAM Journal on Computing, 12(4):616-640, 1983. [27] Phokion G . Kolaitis and Moshe Y . Vardi. Conjunctive-Query Contain- ment and Constraint Satisfaction. In Proc. of the 17th ACM SIGMODSIGACT-SIGART Symposium on Principles of Database Systems (PODS 1998), 1998. [28] E . Lambrecht, S. Kambhampati, and S. Gnanaprakasam. Optimizing Recursive Information Gathering Plans. In Proc. of the 16th International Joint Conference on Artificial Intelligence, 1999. [29] Maurizio Lenzerini. Data Integration: A Theoretical Perspective. In Proc. of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002), 2002. [30] A . Y . Levy, A . Rajaraman, and J . J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions. In Proc. of the 22nd International Conference on Very Large Databases (VLDB 1996), 1996. [31] I. Manolescu, D. Florescu, , and D. Kossmann. Answering X M L Queries over Heterogeneous Data Sources. In Proc. of the 27th International Conference on Very Large Databases (VLDB 2001), 2001. [32] Gerome Miklau. Research Problems in Secure Data Exchange. Technical report, University of Washington, March 2004. 101 Chapter 9. Conclusions and Future Work [33] Gerome Miklau and Dan Suciu. Containment and Equivalence for an X P a t h Fragment. In Proc. of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002), 2002. [34] Gerome Miklau and Dan Suciu. Containment and Equivalence for a Fragment of XPath. Journal of the ACM (JACM), 51(l):2-45, 2004. [35] Wee Siong Ng, Beng Chin Ooi, Kian-Lee Tan, and Aoying Zhou. PeerDB: A P2P-based System for Distributed Data Sharing. In Proc. of the 19th International Conference on Data Engineering (ICDE 2003), 2003. [36] Prakash Ramanan. Efficient Algorithms for Minimizing Tree Pattern Queries. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2002. [37] Patricia Rodrguez-Gianolli, Maddalena Garzetti, Lei Jiang, Anastasios Kementsietsidis, Iluju Kiringa, Mehedi Masud, Rene J . Miller, and John Mylopoulos. Data Sharing in the Hyperion Peer Database System. In Proc. of the 31st International Conference on Very Large Databases (VLDB 2005), 2005. [38] Thomas Schwentick. XPath Query Containment. SIGMOD Record, 33(1):101-109, 2004. [39] OASIS Standard, extensible Access Control Markup Language ( X A C M L ) Version 1.0, 2003. http://www.oasis-open.org/committees/xacml/. [40] Igor Tatarinov, Zachary Ives, Jayant Madhavan, Alon Halevy, Dan Suciu, Nilesh Dalvi, X i n (Luna) Dong, Yana Kadiyska, Gerome Miklau, and Peter Mork. The Piazza Peer Data Management Project. SIGMOD Record, 32(3):47-52, 2003. [41] Jeffrey D. Ullman. Principles of Database and Knowlege-Based Systems. Computer Science Press, 1989. [42] Peter T. Wood. Minimising Simple XPath Expressions. In Proc. of the 4th International Workshop on the Web and Databases (WebDB 2001), 2001. [43] M . Yannakakis. Algorithms for Acyclic Database Schemes. In Proc. of the 7th International Conference on Very Large Databases (VLDB 1981), 1981. 102
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Access control in XML PDMS query answering
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Access control in XML PDMS query answering Wang, Hsuan 2007
pdf
Page Metadata
Item Metadata
Title | Access control in XML PDMS query answering |
Creator |
Wang, Hsuan |
Publisher | University of British Columbia |
Date Issued | 2007 |
Description | Peer data management system (PDMS) is a decentralized system, in which each peer is autonomous and has its own schema and database. With the help of pairwise schema mapping built between any two relevant peers, a query at one peer can be rewritten and broadcast to the whole PDMS. Then answers from multiple peers are returned to the querying peer. In our thesis, we exploit the access control issues in the query-answering process of the XML PDMS. We propose a formal syntax for access control policy (ACP) to specify the fine-grained access control privileges on peers' local XML database. We also design several query-answering algorithms that aim to handle access control in the PDMS, define the algorithm properties of Information Leakage Free and Completeness, and analyze every designed query-answering algorithm on the two properties. A comprehensive cost model, which consists of the major tasks and primitive operations, is proposed by us to assess the query-answering algorithms. We implement the designed query-answering algorithms, compare their running time, and test the scalability in different facets. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2011-02-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0052052 |
URI | http://hdl.handle.net/2429/31428 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_2007-0271.pdf [ 7.27MB ]
- Metadata
- JSON: 831-1.0052052.json
- JSON-LD: 831-1.0052052-ld.json
- RDF/XML (Pretty): 831-1.0052052-rdf.xml
- RDF/JSON: 831-1.0052052-rdf.json
- Turtle: 831-1.0052052-turtle.txt
- N-Triples: 831-1.0052052-rdf-ntriples.txt
- Original Record: 831-1.0052052-source.json
- Full Text
- 831-1.0052052-fulltext.txt
- Citation
- 831-1.0052052.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0052052/manifest