R A C I E R - A Hierarchical Approach for Content Internetworking Through Interdomain Rout ing by Xiaojuan Cai M.E., Southeast University, China, 1995 B.E., Southeast University, China, 1992 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science) we accept this thesis as conforming to the required standard The Universi ty of B r i t i s h Columbia September 2002 © Xiaojuan Cai, 2002 In presenting t h i s thesis i n p a r t i a l f u l f i l m e n t of the requirements for an advanced degree at the U n i v e r s i t y of B r i t i s h Columbia, I agree that the L i b r a r y s h a l l make i t f r e e l y a v a i l a b l e for reference and study. I further agree that permission for extensive copying of t h i s thesis for s c h o l a r l y purposes may be granted by the head of my department or by his or her representatives. I t i s understood that copying or p u b l i c a t i o n of t h i s thesis for f i n a n c i a l gain s h a l l not be allowed without my written permission. Department of The U n i v e r s i t y of B r i t i s h Columbia Vancouver, Canada Date J Abstract Nowadays, Internet traffic is mainly comprised of the distribution of static and me-dia r ich content. Whi le ad hoc mechanisms may extend the reach of content, Content Internetworking (CI) attempts to foster the interoperability of distinct, independent Content Networks (CNs). Unfortunately, known C I content routing approaches have scaling problems and partly duplicate existing network layer functions. In this thesis, we propose a new Routing Architecture for Content IntERnetworking ( R A C I E R ) . B y extending existing IP networking infrastructures to incorporate content intelli-gence, an Internet built on the new routing architecture can be both IP and content aware. The key features of the architecture are hierarchy and parallel IP-address based routing and name based routing, which lend themselves well to scalability and performance, while maintaining compatibil i ty wi th traditional Internet architecture. The Border Gateway Protocol ( B G P ) is extended to support the content internetworking i n our approach. We implemented the system prototype based on the Mult i thread Rout ing Toolkit ( M R T ) and conducted preliminary experiments to verify our system architecture. i i Contents Abstract ii Contents iii List of Tables vi List of Figures vii Acknowledgements viii 1 Introduction 1 1.1 Motivat ion 1 1.2 Thesis Contributions 3 1.3 Thesis Organization 5 2 Background and Related Work 7 2.1 Content Networks and Proprietary Solutions 7 2.1.1 Tradit ional Content Access Accelerators 7 2.1.2 Content Networks (CNs) 9 2.1.3 Sample Proprietary Solutions for C N 11 2.2 Internet Draft for Content Internetworking 13 i i i 2.2.1 Motivat ion of C N Internetworking 13 2.2.2 Request Routing (RR) Internetworking 13 2.2.3 Request-Routing 14 2.3 A Glance of B G P 16 3 Content Internetworking Architecture — the Big Picture 20 3.1 System Architecture 20 3.2 Subsystem Descriptions 24 3.2.1 C N Internal Content Request-Routing System 24 3.2.2 Content Gateway Interconnection Protocol ( C G I P ) 25 3.2.3 Content Gateway and Border Gateway Interconnection Pro-tocol ( C G B P ) 26 3.2.4 Content Border Gateway Protocol ( C B G P ) 27 3.2.5 Route Caching at C I G 27 3.3 Content Request Resolution Process 28 4 Design of CBGP 32 4.1 Overview 32 4.2 Features of C B G P 34 4.3 Overview of Content Announcement and Withdrawal 36 4.3.1 Content Update Message Format 36 4.3.2 Processing of the Content Update 40 4.4 A Prel iminary Solution for Content Routing Decision 42 4.4.1 Content Route Preference Calculat ion 42 4.4.2 Content Routing Decision Process 44 4.5 Design with a Refined Metr ic for Network Conditions 45 iv 4.5.1 Pa th Latency Measurement 45 4.5.2 Rout ing Decision Improvement without Changing I P Rout ing 47 4.5.3 Rout ing Decision Improvement by Influencing IP Rout ing . 50 4.6 More Design Issues 51 4.6.1 Controll ing the Content Update Frequency 51 4.6.2 Usage of Network Latency Measurement 52 4.6.3 Network Proximity Inside an A S 53 4.6.4 Convergence Process Discussion 54 4.7 Deployment Considerations 54 5 Implementation and Experiment 57 5.1 Implementation Based on M R T 57 5.1.1 Implementation Overview 57 5.1.2 Introduction of M R T 58 5.1.3 Content Internetworking Implementation 59 5.2 Topology and Configuration of the Experiment 65 5.2.1 General Description 65 5.2.2 Experiment Network Topology 66 5.3 Evaluation Cri ter ia 67 5.4 Data Collection and Analysis 69 6 Conclusion and Future Work 74 6.1 Conclusion 74 6.2 Future Work 75 Bibliography 77 v List of Tables 4.1 I P routing table for the sample topology 48 4.2 Content route table for the sample topology 48 4.3 IP routing table wi th influence of L A T E N C Y 50 4.4 Content route table wi th influence of L A T E N C Y 50 5.1 The configuration of the testing environment 68 5.2 D N S QoS assessment in the internet 70 5.3 Content route table in interoperability test 73 v i List of Figures 2.1 Structure of a CN 10 2.2 Typica l user interaction wi th an Akamaized web site 12 2.3 Request routing internetworking system architecture 15 3.1 System architecture 22 4.1 C B G P routing process 35 4.2 C O N T E N T attribute format 38 4.3 A typical router architecture 46 4.4 A sample network topology 48 5.1 Phase 1 of content route updating 62 5.2 Phase 2 of content route updating 63 5.3 Network topology for the experiment 67 5.4 Load comparison on two similar configured machines 71 5.5 Response times for a single client 72 5.6 Response times for two concurrent clients 72 5.7 Interoperability testing topology 73 v i i Acknowledgements Firs t of a l l , I would like to express my gratitude to my supervisor, Dr . Son Vuong, for his essential guidance, inspiration and suggestion. Without h im, this thesis would have been impossible. I am grateful to Dr . A l a n Wagner for being my second reader and giving me useful comments for improvement. The financial support from the Computer Science Department at U B C in the forms of T A and R A as well as the part ial support from the C I T R Network of Centers of Excellence for my involvement i n the S C E project are gratefully acknowledged. I would also like to thank al l my family members for their consistent support during my graduate study here. Thanks also go to my project partner M i n g Gan and a l l other friends at U B C . X I A O J U A N C A I The University of British Columbia September 2002 v i i i Chapter 1 Introduction 1.1 Motivation The past several years have seen the explosive growth of the Internet, especially the World Wide Web. The primary use of the Internet is content delivery, such as web pages, images, and increasingly, audio and video streams. Some measurements indicate that 70 to 80 percent of Internet traffic is H T T P traffic (and most of the rest of the traffic is for routing and D N S ) [11]. That is, almost a l l of the traffic in the wide area is delivery of content, and ancillary traffic to locate it and determine how to deliver it . Today, millions of clients are accessing thousands of web sites on a daily basis, wi th the top 20 web sites supplying about 10 percent of the total content [11]. A s the scale and use of the Internet increase, web content providers find it increasingly difficult to serve a l l users wishing to access their content wi th an appropriately low response time, especially i n the face of unexpectedly high loads (often called "flash crowds"). This is the case despite the massive additions i n bandwidth across the whole Internet [1]. Content Networks (CNs) have become a popular means of addressing this problem; they typically deploy multiple surrogates 1 i n the Internet, allowing them to offload the management of these surrogates from individual content providers. This offers economies of scale and greater abili ty to respond to high loads. CNs also try to decrease latency for individual clients by satisfying client requests from nearby surrogates. W i t h these benefits, C N s are of increasing importance to the overall architecture of the Web. Despite a l l the advantages, CNs have their own limitations. It is extremely capital-intensive and operationally complex to achieve the scale necessary to be present at the edge of every network or region of the world. Even where they have content replicas close to a particular network, demand may outstrip capacity from time to time or failures may occur. Moreover, each C N also has its own physical scope and content capacity l imitat ion. Clearly, the ability to foster interoperability of independent CNs through open standards to extend the reach of content can offer additional flexibility and expand the market for content networking more rapidly. This falls into the task category of "Content Internetworking" (CI), which aims at improving scale, fault tolerance and performance for individual CNs. Content Internetworking is an emerging technology that has promising features to improve Internet access experience. I E T F working standards of CI mainly consist of the internetworking of three subsystems: Content Distribution [ 6 ] , Content Request-Routing [5] and Accounting [7]. Our research focus is on the Request-Routing. The Content Request-Routing system redirects client requests for content to an appropriate server, selecting from the origin server and surrogate servers holding that content. The choice is made according to which one is expected to best serve the client in terms of latency attributes, such as network proximity, server health and load, content availability, bandwidth usage and congestion. Whi le (77V internal request routing techniques 2 have been explored and developed extensively i n the content networking industry, Content Request Routing i n CI s t i l l remains a compelling technology for research and is the focus of our efforts in this thesis. Al though the I E T F C D I work group specializes i n CI, and other research groups also work on similar topics, most of the existing work focuses on addressing the problem at the application layer. We believe addressing the problem on the application layer constitutes a duplication of network layer routing infrastructure and thus violates the principle of clean layering architecture. Some other approaches [11,12] delegate content routing to the network layer and adopt a pure name based routing scheme as the only routing mechanism on the Internet. This is not practical from the deployment viewpoint and is difficult to scale. This thesis focuses on a new approach to solve the request routing internet-working problem i n the CI, aiming at overcoming most of the defects of the existing methods. 1.2 Thesis Contributions A novel hybrid Rout ing Architecture for Content IntERnetworking (RACIER) is proposed i n this thesis that combines the strengths of the previous approaches while overcoming their disadvantages. In our approach, content internetworking is dele-gated to the network layer to fully exploit existing routing infrastructures. RA CIER seamlessly supports both IP-address and name-based routing; these two mechanisms function i n parallel on the Internet. We adopt a hierarchical system architecture to achieve scalability, which consequently enables a hierarchical request resolution process. Given the fact that a CN is normally under the management of a single set of policies and operated by a 3 single internet service provider (ISP), it is reasonable to assume that most CNs are confined to the boundary of one Autonomous System (AS) . In the few cases where a CN encompasses multiple ASs , it could have multiple gateways residing on different ASs , serving the purpose of interconnection. We therefore bui ld an internetworking architecture wi th in the framework of A S interconnections. Our Architecture accommodates both IP and content routing, and is a su-perset of B G P 4 [8]. This makes it compatible wi th the existing IP routing archi-tecture and infrastructure and is easy to deploy. The dual addressing mechanism of Name+IP we adopt i n the routing process utilizes the features of existing network functionality efficiently and greatly reduces the round tr ip time for content request resolution. This hybrid routing architecture along wi th hierarchical routing and resolution scheme make request routing work effectively, which greatly reduces the size of the content route table for individual CNs. In our approach we choose a distributed routing scheme over a centralized one, i n which original content servers are treated equally as a replicated server. Thus, we don't have an authoritative request routing system as i n [5], and thus avoid the performance and scalability bottleneck. Furthermore, i n our design, only when a client requests content that is not replicated across the globe, and whose origin server is located i n a remote A S , w i l l the request be redirected to a normal D N S [9, 10] resolution chain. According to statistics [11], a relatively small number of web sites comprise a large proportion of the traffic. Discriminating between the handling of replicated content and non-replicated content further reduces the size of route tables and allows for finer-grained content routing. Our approach significantly reduces the response time for accessing replicated content, especially for the most popular content sites, the routes to which 4 are likely to be cached on local CIGs. Furthermore, our experiment shows that our approach only incurs negligible round tr ip time overhead for accessing non-replicated content, as compared to normal D N S resolution performance. Our approach thus lends itself well to performance and scalability, while maintaining compatibili ty wi th the traditional Internet architecture. To prevent the overloading of a single "best" server, we also explore and embed a global load-balancing scheme, by supplying multiple good routes to a piece of content simultaneously, and by using a weighted random selection algorithm to improve the overall load distribution. A complete suite of protocols for our content internetworking architecture have been developed, including the following: Content Border Gateway Protocol ( C B G P ) , Content Gateway Interconnection Protocol ( C G I P ) , Content Gateway Bor-der Protocol ( C G B P ) , encompassing inter-AS, inter- CN and CN-AS communica-tion, respectively. We have also implemented a system prototype based on a routing simulation toolkit called Multi-threaded Routing Toolkit ( M R T ) [14], and conducted experiments for system evaluation and analysis. 1.3 Thesis Organization The thesis comprises of 6 chapters. In Chapter 2, the background information is introduced which includes the definition and composition of CNs, the proposed draft for the solution of C I from I E T F C D I group, and brief introduction to B G P . Chapter 3 describes our hierar-chical system architecture and the content request resolution process. A detailed design about the interdomain content routing for the content internetworking and routing decision process is provided in Chapter 4. Chapter 5 gives the prototype implementation and experiment setup wi th performance evaluation and analysis. Chapter 6 concludes the thesis and discusses future work. 6 Chapter 2 Background and Related Work 2.1 Content Networks and Proprietary Solutions We introduce the origin of C N s , the composition of a C N and a few proprietary C N solutions i n this section. 2.1.1 Traditional Content Access Accelerators The past several years have seen the evolution of technologies centered around "con-tent." Protocols, appliances, and entire markets have been created exclusively for the location, download, and usage tracking of content. Some sample technologies in this area have included web caching proxies, content management tools, intelligent "web switches", and advanced log analysis tools. These technologies typically play a role of solving the "content delivery prob-lem" . Abstractly, the goal i n solving this problem is to arrange a rendezvous between a content source at an origin server and a content sink at a viewer's user agent. In a t r iv ia l case, the rendezvous mechanism is that every user agent sends every re-quest directly to the origin server named in the host part of the U R L identifying 7 the content [1]. A s the audience for the content source grows, so do the demands on the origin server. To achieve better performance, the apparent single logical server may i n fact be implemented as a large "farm" of server machines behind a switch. Bo th caching proxies (acting on behalf the clients' side) and reverse caching proxies (acting on behalf of the server side) can be deployed between the client and server, so that requests can be satisfied by some cache instead of by the original server. In the caching approach, an ISP can also deploy regional parent caches to form a hierarchy that further aggregates user requests and responses. B o t h server farms and hierarchical caching are useful techniques, but have limits. Server farms can improve the scalability of the origin server. However, since the multiple servers and other elements are typically deployed near the origin server, they do little to improve performance problems that are due to network congestion. Caching proxies can improve performance problems due to network congestion (since they are situated near the clients), but they cache objects based on client demand. Caching based on client demand performs poorly i f the requests for a given object, while numerous i n aggregate, are spread thinly among many different caching proxies [1]. Also, caches don't provide enough control over what data is actually served by them. Thus, a content provider wi th a popular content source may find that it has to invest in large server farms, load balancing, and high-bandwidth connections to keep up wi th demand. Even wi th these investments, the user's experience may st i l l be relatively poor due to congestion in the network as a whole. To address these limitations, C D N (Content Delivery Network) or CN (Content Network) [1], which is a special type of network regarding content delivery, has been deployed in 8 increasing numbers in recent years. 2.1.2 Content Networks (CNs) A CN essentially spreads server-farm-like configurations out to network locations more typically occupied by caching proxies. A CN has multiple replicas of each content item being hosted. A request from a browser for a single content item is directed to a "good" replica, where "good" usually means that the item is served to the client quickly compared to the time it would take to fetch it from the origin server, with appropriate integrity and consistency. A CW typically incorporates dynamic information about network conditions and load on the replicas, directing requests to a good replica and to balance the load. A CN has some combination of a content-delivery infrastructure, a request-routing infrastructure, a distribution infrastructure, and an accounting infrastruc-ture. The content-delivery infrastructure consists of a set of "surrogate" servers [22] that deliver copies of content to sets of users. The request-routing infrastructure con-sists of mechanisms that move a client toward a rendezvous with a surrogate. Many request-routing systems route users to the surrogate that is physically "closest" to the requesting user, or to the "least loaded" surrogate. However, the only require-ment of the request-routing system is that it route users to a surrogate that can serve the requested content [1]. The distribution infrastructure consists of mechanisms that move content from the origin server to the surrogates. Finally, the accounting infrastructure tracks and collects data on request-routing, distribution, and delivery functions within the CN [1]. Figure 2-1 is a diagram depicting a simple CN as described above: Compared to using servers and surrogates in a single data center, a CN is 9 Request-routing system jRequest-'routing system (1) client's content request (2) response indicating surrogate location of I——-— content surrogate surrogate (3) Client opens connection to retrieve content client Figure 2 . 1 : Structure of a CN a relatively complex system encompassing multiple points of presence, in locations that may be geographically distant. Operating a CN is not easy for a content provider, since a content provider needs to focus its resources on developing high-value content, not on managing network infrastructure. Instead, i n a more typical arrangement, a network service provider builds and operates a CN, including the request-routers, the surrogates, and the content distributors. Th is service provider establishes (business) relationships wi th content publishers and acts on behalf of their origin sites to provide a distributed delivery system. The value of that CN to a content provider is a combination of its scale (the increased aggregate infrastructure size) and its reach (the increased diversity of content locations). Some CNs create their network across other providers' physical network, and these CN providers must be cautious of the quality of the facilities they use. Because they don't control the networks, or data centers that host their equipment, they must be diligent in evaluating quality and effectiveness. If they are unable to negotiate their requirements, they need to move elsewhere, which can be difficult 1 0 and costly. So, over time, pure playCDNproviders , such as Akamai , are likely to be acquired by a larger service provider that wants to enhance its services suite. 2.1.3 Sample Proprietary Solutions for C N Most CNs constructed today do not rely on overly complex or intricate technologies. Akamai 's service was primari ly based on homegrown applications and technologies, while other providers' services were built using many standard products from com-panies such as Cisco, Nortel, and Inktomi. We give a brief introduction about solutions provided by Cisco and Akamai i n the following. • Cisco Distr ibuted Director Cisco's Distributed Director (DD) is its main product to support CN. D D performs load distribution and content routing in a manner that accounts for rel-ative client-to-server topological proximities ("distances") to determine the "best" server [28]. It uses the Director Response Protocol ( D R P ) to gather the route table information from Cisco routers that are D R P enabled [28]. D D acts as the primary D N S caching name server for a specific host name or subdomain. It redirects a name lookup from the main site to a replica site closer to the requesting client address, making "network intelligent" load distr ibution decisions without considering the server side metric [28]. W i t h this product, clients have to incur the response time penalty of accessing this main site D D before being directed to a closer site. • Akamai Solution Akamai uses its proprietary technology called FreeFlow to deliver content. Figure 2-2 illustrates a typical user interaction with a FreeFlow-enabled website. 11 First, the user's browser sends a request for a web page to the site. In response, the web site returns the appropriate HTML code as usual, the only difference being that the enclosed embedded object URLs have been modified to point to the Akamai network. As a result, the browser next requests and obtains media-rich embedded objects from an optimally located Akamai server, instead of from the home site [20]. 1. User enters standard U R L client 4. Rich content served 2. Htm] with embedded URLs pointing to Akamai 3. Browser request embedded objects Customer web server Akamai server Figure 2.2: Typical user interaction with an Akamaized web site The FreeFlow DNS system makes fast delivery of the requested content by resolving each *.g.akamai.net server name, which represents the customer content server, to the IP address of the Akamai server that delivers the requested content to the user most quickly. FreeFlow DNS is implemented as a 2-level hierarchy of DNS Web servers: high-level .akamai.net servers (HLDNS) and low-level .g.akamai.net servers (LLDNS). Each HLDNS server is responsible for directing each query it receives to an LLDNS server that is close to the requesting client. The LLDNS servers perform the final resolution of IP name to server address, directing each client to the Akamai server best located to serve the client's requests. FreeFlow DNS continuously monitors network conditions and the status of each server [20]. 12 2.2 Internet Draft for Content Internetworking We introduce the proposed solution for CI from I E T F C D I group i n this section. 2.2.1 Motivation of C N Internetworking There axe limits to how large the scale and reach of any one network can be. The increase in either scale or reach is ultimately l imited by the cost of equipment, the space available for deploying equipment, and/or the demand for that scale/reach of infrastructure. Sometimes a particular audience is tied to a single service provider or a small set of providers by constraints of technology, economics, or law. Other times, a network provider may be able to manage surrogates and a distribution system, but may have no direct relationship wi th content providers. Such a provider strives for a means of affiliating his delivery and distr ibution infrastructure wi th other parties who have content to distribute. The proliferation of content networks and content networking capabilities brings increasing interest in interconnecting content networks so as to provide larger scale and/or reach to each participant than they could otherwise achieve. The following two subsections introduce the internetworking schema devel-oped by the current Internet working draft [5]. Instead of introducing al l three subsystems' internetworking, including distribution and accounting, we focus on introducing the request routing internetworking. 2.2.2 Request Routing (RR) Internetworking Request routing internetworking is the interconnection of two or more request rout-ing systems which increases the number of reachable surrogates. In order for a publisher's content to be delivered by multiple CNs, each Content Network request 13 routing system is federated under the Universal Resource Identifier (URI) name space of the publisher object. Th is federation is accomplished by first delegating the authority of the publisher U R I name space to an authoritative request routing system. This authoritative request routing system subsequently splices each inter-connected (it is called "peeing" in the I E T F draft) content network request routing system into this U R I name space, and transitively delegates U R I name space au-thority to them for their participation i n request routing. Figure 2-3 is a diagram showing the request routing (RR) internetworking system architecture. There could be multiple levels of in te r -CN interconnection beyond what is shown i n the sample architecture of Figure 2-3. The request routing internetworking system is hierarchical i n nature. There exists exactly one request routing tree for each publisher U R I . The authoritative request routing system is the root of the request-routing tree. There may be only one authoritative request routing system for a U R I request routing tree. Subordinate to the authoritative request routing systems are the request routing systems of the first level peering CNs. There may exist recursive subordinate request routing systems of additional peering CN levels [4]. 2.2.3 Request-Routing The actual "routing" of a client request is through R R CIGs. The authoritative request routing CIG , which is a globally centralized node, receives the client requests and forwards them to an appropriate peering CN. This process of Inter- CN request-routing may occur multiple times i n a recur-sive manner between request routing CIGs unt i l the request routing system arrives at an appropriate CN to deliver the content. 14 CLIENT (RR Tree Root) 1st level CNA Authoritative RR system ZD—nz Inter-CN RR RR CIG RR System — | R R CIG |—|surrogates CNB RR CIG RR System surrogate surrogates 2nd level CNC Inter-CTV recursive RR RR CIG RR System surrogates surrogates Figure 2.3: Request routing internetworking system architecture 15 In this schema, there may exist multiple levels of off-path (path means the physical path to the content) redirections between CNs before a name lookup is finally solved, and this adds a fairly high cost to the content access process. Th is is further exacerbated by the necessity of going to the globally centralized authori-tative request routing system first, which is a performance bottleneck. In addition, advertisements about aspects of topology, geography and performance of a single CN to other CNs, required in [5], does not help much i n making the global request routing decision due to the lack of knowledge of the global topology. The feature of content networks being overlay networks inherently makes decision processes very complex [5]. 2.3 A Glance of B G P The Border Gateway Protocol ( B G P ) is an inter-Autonomous System (AS) routing protocol. The primary function of a B G P speaking system is to exchange network reachability information with other B G P systems. This network reachability in -formation includes information on the list of ASs that the reachability information traverses. For this reason, B G P is called a path vector protocol. The network reach-ability information is sufficient to construct a graph of A S connectivity, from which routing loops may be pruned and some policy decisions at the A S level could be enforced [8]. B G P 4 is the newest version of B G P and it provides a new set of mechanisms for supporting classless interdomain routing by advertising an IP prefix wi th arbitary length which is less than 32. B G P 4 also introduces mechanisms which allow the aggregation of routes, including aggregation of A S paths. B G P uses T C P as its transport protocol (port 179). This ensures that a l l 16 transport reliability, such as retransmission, is taken care of by T C P and does not need to be implemented i n B G P itself. Two B G P routers form a transport protocol connection wi th each other. These routers are called neighbors or peers. Peer routers exchange multiple messages to open and confirm connection parameters, such as B G P version. In a case of disagreement between peers, notification errors are sent, and the peer connection does not get established. After the connection is established, a l l candidate B G P routes are exchanged initially, and later, incremental updates are sent as network information changes. Routes are advertised between a pair of B G P routers in U P D A T E messages. The U P D A T E message contains, among other things, a list of jlength, prefix^ tuples that indicate the list of destinations reachable v ia each system. The U P D A T E message also contains the path attributes, which include information such as the degree of preference for a particular route [22]. Pa th attributes fall into four separate categories [8]: 1. Well-known mandatory. 2. Well-known discretionary. 3. Optional transitive. 4. Optional non-transitive. Well-known attributes must be recognized by al l B G P implementations. Some of these attributes are mandatory and must be included in every U P D A T E message. Others are discretionary and may or may not be sent in a particular U P D A T E message. A l l well-known attributes must be passed along (after proper updating, i f necessary) to other B G P peers. 17 In addit ion to well-known attributes, each path may contain one or more optional attributes. It is not required or expected that a l l B G P implementations support a l l optional attributes. The handling of an unrecognized optional attribute is determined by the setting of the Transitive bit i n the attribute flags octet. Paths wi th unrecognized transitive optional attributes should be accepted. If a path wi th unrecognized transitive optional attribute is accepted and passed along to other B G P peers, then the unrecognized transitive optional attribute of that path must be passed along wi th the path to other B G P peers [8]. Important well-known attributes include the following: • A S - P A T H : This attribute identifies the autonomous systems through which routing information carried in this U P D A T E message has passed. The com-ponents of this list can be A S - S E T s or A S _ S E Q U E N C E s . A S _ S E T s is an unordered set of ASs a route i n the U P D A T E message has traversed, while A S - S E Q U E N C E s is an ordered set of ASs a route in the U P D A T E message has traversed. • N E X T _ H O P : This path attribute defines the IP address of the border router that should be used as the next hop to the destinations listed i n the U P D A T E message. • L O C A L _ P R E F : This attribute is a degree of preference given to a route deter-mined by the local policy settings, which is used by routing decision process to compare wi th other routes to the same destination. The higher the number, the more favorable the route is. • M U L T I J E X I T J D I S C : This attribute may be used on external (inter-AS) links to discriminate among multiple exit or entry points to the same neighboring 18 A S . In case of information changes, such as a route becoming unreachable or a better path emerging, B G P informs its neighbors by withdrawing invalid routes and injecting new routing information. Wi thdrawn routes are part of the U P D A T E message, wi th a specific value of the message type field. If no routing changes occur, the routers exchange only K E E P A L I V E packets. K E E P A L I V E messages are sent periodically between B G P neighbors to ensure that the connection is kept alive. A typical path selection process for a B G P router is similar as included in the following process: (this is derived from the strategy used by C I S C O routers [24]) 1. If the path specifies a next hop that is inaccessible, drop the update message. 2. If the weights are the same, prefer the path wi th the largest local preference. 3. If no route was originated, prefer the route that has the shortest AS-path . 4. If a l l paths have the same AS-pa th length, prefer the path wi th the lowest origin type (where Interior Gateway Protocol (IGP) is lower than Exterior Gateway Protocol ( E G P ) ) . 5. If the origin codes are the same, prefer the path wi th the lowest M U L T I _ E X I T _ D I S C attribute. 6. Prefer the path wi th the lowest IP address, as specified by the B G P router ID . 19 Chapter 3 Content Internetworking Architecture — the Big Picture 3.1 System Architecture A s we mentioned before, the I E T F draft approach requires the world-wide clients of a site to incur the long round-trip time to go to the authoritative request routing system first, which in turn redirects the request to other CNs, an unknown number of times, unt i l the content request is finally resolved. These long round-trip times are purely overhead, potentially far higher than the round-trip to the origin con-tent server itself. This issue is fast becoming the dominant performance problem for clients as Internet data rates move to multiple gigabits, sometimes reducing the transfer time for content to insignificant lows. The name resolution process may also use congested portions of the network that the content delivery system is oth-erwise designed to avoid, and the same congested portion can be repeatedly used i n the process since the peering CNs have no information about network conditions. Besides having high latency, the authoritative request R R is also a bottleneck for 20 scalability. Further selecting a good content surrogate usually requires the prox-imity measurement of a particular piece of content to existing/potential clients to make routing decisions. Content provider networks must either obtain routing infor-mation from routers near their servers, or else make direct network measurements, which imposes huge traffic increases on the whole network. Meanwhile, aggregate network information for scalability is s t i l l required, which duplicates the existing routing functions of the network. Considering that the dominant traffic i n the Internet is content access, we need to provide an infrastructure that serves this purpose effectively and efficiently. We keep the following points i n mind when creating our design: • In order to avoid a performance bottleneck of the whole system, we can not adopt a centralized scheme; we need to design it in a fully distributed way for scalability. • We should make full use of the already available network level proximity infor-mation in I P routing because any content access follows the physical path de-cided by the network layer routing strategy anyway, and we can not impose the duplication of existing network functionalities, wasting Internet bandwidth. • Our approach should be compatible wi th the existing Internet infrastructure for easy and step-by-step deployment. • Our approach should be able not only to accelerate the content access to repli-cated content by selecting a good content server to serve the client's request, but also to avoid overloading a single selected server; thus, maintaining a global load balance. 21 Taking into consideration the nature of content routing and a clean layering architecture, and wi th the awareness of large scale routing using interdomain rout-ing infrastructure, we propose an approach that delegates content internetworking from application layer to network layer. In our approach, the existing interdomain routing infrastructure is extended to accommodate content routing, which is name-based. That is, in our system, IP routing and content routing share routing-related information and function i n parallel. Our proposed content internetworking architecture integrates content routing wi th traditional IP routing i n a hierarchical way as shown i n Figure 3-1. AS-1 Figure 3.1: System architecture CS - Content Server or Surrogate For the interdomain connection, we inherit the mechanism of the current Internet which uses B G P . Our content internetworking infrastructure integrates with 22 B G P . The Border Gateway in our approach supports both IP and content routing. To distinguish itself from the original Border Gateway, which is only capable of I P routing, ours is called Content Border Gateway ( C B G ) . The system architecture consists of three levels, named from lowest to high-est: the CN internal system level, the CN level, and the A S level. A s we mentioned before, i n a typical CN, a single service provider operates the request-routers, the surrogates, and the content distributors. Therefore, these CNs usually have a large presence wi th in a particular network, confined wi th in the boundary of an administrative system. In a case where there exists some overlay CN encompassing multiple ASs , we can have one CIG of that CN in each A S to perform the task of content intnetworking. For the rest of this thesis, we assume every C N is contained i n a single A S . In a single A S , there may be multiple CNs. O f these, only Content Inter-connection Gateways (CIGs) are visible, and C N s communicate v ia CIGs w i th the external world. Each CIG has knowledge of the contents residing on its local CN. Each C N has one or more CIGs. CIGs wi th in one A S communicate using the Con-tent Gateway Interconnection Protocol ( C G I P ) to maintain a consistent view of the content routing situation i n a local A S . The CIG (the representative of a l l C I G s in the local AS) also communicate wi th a local C B G , to propagate knowledge of the replicated content residing on the local A S . CIGs learn routing information about content residing on foreign ASs directly from the CBG when necessary. CBGs interconnect ASs through Content Border Gateway Protocol ( C B G P ) , replacing the original B G P router in the traditional Internet routing architecture. The CBGs seamlessly accommodate IP-address and name-based routing, and is backward compatible wi th traditional B G P routers. A CBG acquires routing infor-23 mation for a l l the replicated content residing on its local A S , through communication wi th the CIG. CBG keeps exchanging this knowledge and the corresponding updates wi th its peers; hence, each CBG has complete information about the routes to a l l replicated content throughout the Internet. To reduce the load of CIGs , the CBG does not propagate content routing information it learns from the Internet into the CIG under most circumstances. Occasional communication initiated by CBG to CIG is detailed in section 3.2.4. To summarize, our proposed content internetworking architecture consists of the following set of protocols: 1. Content Gateway Interconnection Protocol ( C G I P ) : dealing wi th interaction between CIGs. 2. Content Gateway and Border gateway interconnection Protocol ( C G B P ) : deal-ing wi th interaction between CIG and C B G . 3. Content Border Gateway Protocol ( C B G P ) : dealing wi th interaction between CB Gs. 3.2 Subsystem Descriptions In this section, we depict functions of each sub-system shown in Figure 3-1. 3.2.1 C N Internal Content Request-Routing System A CN's internal R R system is responsible for delivering client requests to the "near-est" surrogate located in that CN [3]. D N S and U R L redirection based request routings are the two most commonly accepted CN internal content routing tech-niques. In general, content internetworking views CNs as a black boxes; allowing the 24 CN supplier the maximum flexibility i n choosing its own internal content routing scheme. 3.2.2 Content Gateway Interconnection Protocol (CGIP) Initially, each CIG has only the knowledge of routing status for content residing on its own CN, learned through routing updates from the CTV's internal request routing protocol. If there is more than one CN i n an A S , a CIG can, by establishing peering relationships wi th neighboring CIGs in its local A S , exchange routing updates wi th others so that a l l CIGs in an A S share a consistent view of content routing states for al l content residing on that A S . A CIG can make content routing decisions based on its local policies and a set of metrics exchanged wi th peers. CIGs also need to communicate w i th CBGs, and since a l l CIGs i n the same A S can synchronize with each other, and since a l l have global content information inside the local A S , it is not necessary for a l l the CIGs to communicate wi th a l l CBGs. System administrators choose only one CIG to represent a l l CIGs in that A S to communicate wi th a certain CBG for content routing exchanges i n order to minimize the traffic load. The representative can be manually specified in the configuration. A s a fault tolerance consideration, besides the primary representative, a secondary representative can be configured. However, this method is not flexible enough. A better mechanism lets CIGs communicate wi th each other to negotiate how to dynamically elect a CIG that best serves as a representative i n terms of CIG load and network proximity to the C B G . 25 3.2.3 Content Gateway and Border Gateway Interconnection Pro-tocol (CGBP) In order to let the external world of an A S know about the content information inside a local A S , the CIG in the local A S needs to communicate wi th the CBG. We use C G B P to handle the communications between the CIG and the C B G . The CIG acts as a special peer of the CBG in its A S , and exchanges content routing updates wi th the CBG i n an asymmetric way. The CIG propagates a l l routing updates to the CBG, however, the CBG does not flood the CIG w i th a l l the content routing information it learns that may never be useful for clients inside a CN. The few situations where the CBG takes the initiative to propagate routing updates to the CIG are discussed i n the following paragraphs. For content routing decisions i n a single A S , usually, a good surrogate which exists in the local A S is considered the best candidate to serve a client's requests originating wi th in that A S . However, there may be exceptional situations where a remote surrogate is better than local surrogates, for example, when there is an extremely heavy load on local surrogates. The CBG has the whole picture of both global and local replicated content routing information, therefore, it is the CBG that decides, for particular content wi th replicas in both the local and remote ASs , which surrogate is best. If a routing decision is made that the best surrogate has changed from a local surrogate to a remote surrogate, or from a remote to a local, the CBG advertises this route to the CIG representative, which propagates this update further to other CIGs wi th in that A S . If this situation happens too often, then it is the local CN provider's responsibility to enhance its surrogates' capabilities. Another situation that can cause reverse communication from the CBG to the CIG representative arises when the CBG finds there is an explicit content route 26 withdrawal from its route table. Th is explicit withdrawal indicates that a content server is not available for service for some reason, and this update must be passed to the CIG immediately. 3.2.4 Content Border Gateway Protocol (CBGP) C B G P interconnects ASs for routing exchanges. C B G P is a superset of B G P . It supports B G P for traditional IP routing, as well as content routing for content internetworking purposes. Whi le the address mechanism for the IP routing part of C B G P remains the same as in the original B G P , the content routing part adopts a dual addressing scheme: Name+IP. Initially, each CBG has knowledge of content available only i n its own A S , learned through C G B P routing updates from a CIG representative (see 3.2.4). B y establishing peer relationships wi th neighboring ASs , it then propagates this knowl-edge and updates to its C B G P peers. When a CBG receives routing updates from a C B G P peer, i f necessary, it updates its Routing Information Base (RIB) and further propagates routing updates to other C B G P peers. In this way, a CBG receives the knowledge of the content routing of the replicated content i n other ASs . In this thesis, unless specified, we use C B G P to refer only to the content routing part of the complete C B G P protocol. The details of C B G P are covered in Chapter 4. 3.2.5 Route Caching at CIG In our architecture, fundamentally, a CIG has routing information only for content available i n the local A S . However, the CIG can send a query to the CBG for routing information of particular content residing on foreign ASs to resolve client requests. 27 W h e n considering a high similarity in access patterns in a group of users, one of the most commonly accepted mechanisms for improving performance is caching, which can be applied in this content routing scenario. Unlike most known web caching schemes that cache the content itself, our cache scheme caches routes for content. W h e n a CIG receives a content response from a CBG wi th an IP address for a name request, the CIG caches the routes for future use. Thus, i f a client requests content that has a replica only i n a remote A S , and i f the CIG finds a cached route for that content, the name request can be resolved i n that CIG locally. Unless there is a cache-miss i n the CIG, the CBG does not bother with name resolution, which greatly speeds up the resolution process. It reduces also the overall demand for the involvement of C B G s , which could give rise to a potential performance bottleneck. O n the other hand, caching can introduce the possibility of "out-of-date" routes. A n appropriately set T T L value may compensate for this shortage by trading off the extra volume of traffic for cache refreshing. From another point of view, the fact that those cached routes are s t i l l available for content service (i.e. those servers are not down) makes this problem appear less serious. 3.3 Content Request Resolution Process Our hierarchical system architecture enables a hierarchical request resolution pro-cess, which gives the system good scalability. In addition, the hierarchical name res-olution approach completely eliminates the globally centralized authoritative system from the resolution chain, which significantly improves service performance, system availability and reliability. Resolution happens at three levels, from lowest to high-est: the CN internal R R System level, the CN level, and the A S level. Each level tries its best wi th the knowledge of content status it has to resolve name requests 28 submitted by clients or from lower levels. Only when a request cannot be resolved at a given level does the request escalate to the next level; i n a case where even the highest level fails to resolve a request, it is redirected to the normal D N S chain for resolution. A client's name request originating wi th in a CN follows the resolution path as the steps described below: 1. CN Internal R R System Level - CN Internal Resolution If the client's request can be resolved by a CN's internal R R system, which means the requested content has either a replica or an origin (depending on the CN's internal R R scheme) located wi th in the local CN, the resolution process stops. Otherwise, the request is forwarded by the internal R R system to the pre-configured CIG, and goes to 2. 2. CIG Level 2.1 CIG Resolution If the request can be resolved by the CIG (usually CIGs replaces whatever D N S server originally exists i n the CN), which means the requested content has either an origin or a replica copy located wi th in its local A S , or i f the CIG finds cached routes for the requested content, which is located only in foreign ASs , it responds wi th the address of the "nearest" server, using the load balancing method described i n step 2.2.2. That can be either the origin server, or a surrogate server for the requested content; otherwise, it goes to step 2.2. 2.2 Content Query from CIG and Content Response by CBG 29 2.2.1 The CIG sends a content query for the requested content to a connected CBG and waits for a response from the C B G . Then it goes to step 3. 2.2.2 U p o n receiving a content response from a connected C B G , the CIG caches the response, and i f the response contains one content server's address, the CIG responds to the CN internal request routing system wi th the address of that server. W h e n there are multiple servers' addresses i n the response, the CIG chooses one of the servers based on load balancing algorithms. This algorithm randomly chooses one of the servers, but it uses a weighted random number generator to select the servers wi th the probability corresponding to the metric of each server. If no route is available from the CBG, the CIG forwards the client request further to the normal D N S request resolution chain. 3. C B G Level Resolution U p o n receiving a content query from a CIG, the CBG checks its Routing Information Base (RIB) for the requested content. If routes are found i n R I B , meaning the requested content has replicas in foreign ASs , the CBG responds to the CIG w i th a l l the "best" routes stored. Otherwise, the CBG responds indicating that the requested content is not replicated i n any foreign A S , and goes to 2.2.2 The discrimination in the handling of replicated content and non-replicated content is an important feature of our approach and an important strategy for achieving good scalability. The CIG has knowledge of non-replicated content but only propagates replicated content to the C B G . W h e n a client requests content that is not replicated, i f the content's origin server resides on the local A S , the request is resolved by the CIG directly; i f the content's origin server resides on a remote A S , 30 and there is no cached route for that content, the request is finally redirected to the normal D N S chain for resolution. Compared to traditional D N S resolution approaches, ours significantly re-duces the response time for replicated content, which is the most widely accessed content on the Internet. However, there is a small penalty imposed on requests for content without replica and wi th the origin server located in remote A S s for which there are no cached routes. This extra overhead, however, is only one round tr ip from the CIG to the CBG located i n the same A S , which is acceptable. Addit ional ly, the reason a particular piece of content is not replicated across the Internet is very likely to be that it is unpopular, and therefore, seldom requested or accessed. Thus, the small additional overhead does not noticeably degrade overall service quality. Further, by populating the CBG w i th the routing states for only replicated content, we significantly reduce the size of its routing table, compared wi th the pure name based routing approach [11]. Whi le Content RR wi th in a CN is handled by the proprietary schemes of different vendors, routing processes for C I is our research focus in this thesis. In our hierarchical architecture, the C B G P deals wi th the interconnection of content i n different ASs , while the C G I P deals wi th the interconnection of CIGs i n the same A S . A s the C G I P is described i n other works [35], our discussion focuses only on C B G P i n this thesis. We give its design details and discuss design issues i n Chapter 4. 31 Chapter 4 Design of C B G P 4.1 Overview The goal of content internetworking can be briefly described thus: by interconnecting the CNs around the world, we can find a good surrogate server globally that can serve client requests anywhere on the Internet wi th good quality, "good" can be measured in different metrics, such as server response time, server-client latency and throughput, server health and load, and so forth. Apparently, network proximity between clients and servers is an important factor, even the only factor considered i n some solutions for the content route selection process. This factor is naturally a bui l t - in feature of the physical path from server to client. The physical path information is well maintained by routers along the path, therefore, it is natural to have routers involved i n the content routing process to provide real time network accessibility information. Th is network-integrated content internetworking approach saves the application layers from doing the proximity measurements, for example, probing wi th "ping", thereby significantly reducing network traffic. It can also solve the problem that arises when servers are behind a firewall or other Network Address 32 Translation (NAT) device, where probing is prohibited. The motivation behind the decision to delegate content internetworking from the pure application layer to the network layer is the desire to fully exploit the existing network layer's functions and resources. From the scalability point of view, the interdomain routing system using BGP is regarded as one of the most successful, and the largest, distributed systems, since it enables the working of the whole Internet. Our content internetworking shares similar feature with this. Depending on the granularity of the replicated content, we may have first level content names such as "www.ubc.ca," or second level content names, such as "www.ubc.ca/student." While disk storage is not a scarce resource nowadays, most content names should only be located at the first level. Content routing, which attempts to find a short route to a replicated piece of content, is very similar in nature to a normal IP anycast that tries to find the "shortest" route based on measurements such as network hops. In content routing, all replicas of a piece of content, for example www.ubc.ca, share the same Anycast Content Name (A CN) www.ubc.ca. Further, this A CN appears in the RIB repre-senting a virtual content destination node on the network which is shown in Figure 4-1. We route content in a way similar to IP routing, except that, instead of trying to find the best route to a destination represented by an IP prefix, the A CN, representing several content replicas, indicates the content destination address. This is discussed in detail later. As shown in the general system architecture, we interconnect the content of CNs in different ASs through interdomain routing, and in particular, we modify the Border Gateway Protocol (BGP), which forms the CBGP, to enable the content 33 internetworking function. In implementation, the newest version of the B G P is used: B G P 4 . 4.2 Features of C B G P We w i l l now discuss the features of the C B G P design. 1. Name+IP addressing mechanism A s content destination is represented by content name, theoretically, we can route content using purely name-based routing globally, just as suggested in [11]. That means the IP address in traditional IP routing is completely re-placed by the A C N . However, this requires a change i n the transport layer protocol, and correspondingly, a l l the client side networking software, since al l network connection establishment and data transmission s t i l l uses the IP addresses instead of names currently on the Internet. Practically, we can not use pure name-based routing, but we rely on the routing process to find a good route to content by directly pointing at a surrogate address for access; hence we do not have a Next-Hop attribute for the content routing table, as the normal network routing table does. Therefore, we cal l our content R I B "content route table" instead of a "routing table". We adopt a combination of the Name and the IP addressing mechanisms to represent replicated content, which means a piece of content is represented by its name, plus the IP address of the server on which the content resides. For example, i f content "x.com" has a replica on server 12.0.3.56, that specific piece of replicated content is identified as a tuple (x.com, 12.0.3.56). When content represented as a tuple (ACN, IP) has changes i n its metric, the updates are 34 sent i n content routing update packets, and the CBG makes a decision by comparing the content routes. Figure 4-1 illustrates an example for the C B G P routing process. x.com( 12.0.3.56) metric 28 AS_path(...)NH 1.2.3.4 ... I x.com I 103.42.3.9 x.com(103.42.3.9) metric 162 AS_path(...)NH 5.6.7.8 ... : virtual content destination node ^ . : pointing to a virtual content destination node from a content server : indirect connection passing through networks : direct connection Figure 4.1: C B G P routing process 2. Load Balance Considerations The content route table of a CBG maintains the "best" route selected by the routing decision process to a l l replicated content by exchanging content routing information wi th C B G P peers. For load balancing and fault tolerance purposes, multiple routes to each piece of content may be stored i n order of goodness from the "best" route, to the secondary route, to the th i rd route, and so on. The number of spare routes the content routing system actually 35 keeps for a particular piece of content is an implementation trade-off between system flexibility and space overhead. 3. Server Side Metric Participating in the Rout ing A s the routing process aims at locating a server that can serve content requests well, we need to accommodate server side metrics i n overall routing decisions in addit ion to network conditions. Thus i n our approach, the server side metric directly participates in content routing. 4. Ful ly Distr ibuted Content internetworking through C B G P has the same nature as IP routing on the Internet: it is fully distributed without any centralized point. 5. Content Query Resolution Function A s well as performing normal content updating processing, the CBG also functions as a content query resolution server by answering content queries from CIG {a). 4.3 Overview of Content Announcement and Wi thdrawal We introduce the basic idea about how to modify B G P to accommodate content routing i n this section. 4.3.1 Content Update Message Format The C B G P shares many protocol features wi th the B G P , such as an open connection and periodic probing. Thus we integrate the C B G P wi th B G P for efficiency and easy deployment. The C B G P adopts similar message types as B G P 4 - O P E N , 36 K E E P A L I V E , U P D A T E , and N O T I F I C A T I O N . C B G P ' s O P E N , K E E P A L I V E and N O T I F I C A T I O N message formats are same to those of BGP4's. A s the U P D A T E message intrinsically carries content routing information, it embeds mechanisms specifically for content routing; this is done by extending the path attributes of B G P . In B G P , an U P D A T E message is used to advertise a single feasible route to a peer, or to withdraw multiple unfeasible routes from service. A s name based routing differs from IP based routing i n route attributes, we add an extra path attribute called "CONTENT" to represent content routing information. Each path attribute in the U P D A T E message is a triple [attribute type, attribute length, attribute valuer of variable length. There is an unique attribute type code in attribute type to identify individual attribute. The "CONTENT" attribute has the following value: - Attribute Length: specify the total length of this type of attribute Attribute type: Type Code: 20; the attribute flag is set to define this attribute as optional transitive - Attribute value: • Content Update Type: 1 bit ("0" - content withdraw, "1" - content announce-ment) • Host Sub-addr: w i th variable length, specify host part of the IP address of the content server. The actual length of this field can be calculated from the length of Network Prefix field in Network Layer Reachibility Information (NLRI) which is a key field i n each U P D A T E message. • Content Identifier Length: specifies the length of the Content Identifier field 37 • Content Identifier: variable length, specifies the name of the content • Server-Side Metric: specifies processing capability of the associated content server • Valid Time Period: how long this route can be thought of as valid • Optional Field: this is a field for future function extension Content Update Type (1 bit) Host Sub-addr (variable) - Content Identifier Length (8 bits) Content Identifier (variable) Server-side Metric (8 bits) Valid Time Period (16 bits) Optional Field (8bits) Figure 4.2: C O N T E N T attribute format We use the field "Val id T ime Period" mainly for considering the "health" of the server. The CBG has no way of knowing i f a specific server is frequently down when the server itself keeps the relevant statistics. Th is information is propagated to CIG which i n turn sends it to CBG. Therefore we keep a time period to represent the effective time of metrics and routes. Using 'Va l id T ime Period ' instead of the absolute time of expiration accommodates the time difference of each system if they are not synchronized. B G P is a path vector protocol as such i f there exists a loop i n the path, the update packet is discarded. Therefore, we do not need to make an extra effort to 38 prevent the loop of routing advertisement packets since we inherit the B G P routing mechanism. In C B G P , one single U P D A T E message can announce multiple content routes availability, and withdraw multiple previous content routes availability information, as long as the servers for the content share the same Network Layer Reachability Information. The announcement and withdrawal updates are not necessarily put in order in the U P D A T E message. For content withdrawal, no Server Side Metric and Valid Time Period field are presented. If the Content Identifier Length is set to 0, al l content on that content server are to be withdrawn, and no Content Identifier field is followed. Here is a sample scenario where a CBG sends an update: the CBG prop-agates content announcements for 190.10.20.26 (with yahoo.com, server side met-ric 36), 190.10.20.5 (with Disney.com, server side metric 129), 190.10.20.108 (with CNN.com, server side metric 825), and makes content withdrawal for 190.10.20.16 (google.com), and a l l contents on 190.10.20.31. The three content advertisements and two withdrawals can be aggregated into one U P D A T E message as follows: Network Layer Reachability Information: 190.10.20.0/24 Attribute Type Code: 20 Attribute Length: 58 bytes Attribute Value: * Content Update Type: 1 IP Subnet: 26/8 Content Identifier Length: 9 Content Identifier: yahoo.com Server Metric: 36 * Content Update Type: 1 IP Subnet: 5/8 Content Identifier Length: 10 39 Content Identifier: Disney.com Server Load: 129 * Content Update Type: 1 IP Subnet: 108/8 Content Identifier Length: 7 Content Identifier: C7VN.com Server Load: 825 * Content Update Type: 0 IP Subnet: 16/8 Content Identifier Length: 10 Content Identifier: google.com * Content Update Type: 0 IP Subnet: 31/8 Content Identifier Length: 0 4.3.2 Processing of the Content Update W h e n a CBG receives a content update message, its actions are as outlined below: • U p o n receiving a content announcement: i f there is no existing content entry for the announced content i n its R I B , it adds that content in the R I B wi th the associated advertised metric; otherwise, i f there already exists a route for the announced content, the CBG compares (in a routing decision process covered i n later sections) the newly updated route wi th existing routes. If the new route is better than any of the exist-ing routes calculated by the routing decision process, the R I B is updated by replacing the most undesirable route wi th the announced route; otherwise, it ignores the content advertisement. • U p o n receiving a content withdrawal: 40 it deletes the related content entries associated wi th that specific server address from its R I B , and then recalculates the new routes to the content as well as sending the withdrawal to the peering CIG i f the withdrawal affects the content route table. • U p o n receiving an I P route withdrawal that withdraws an I P prefix's avail-ability: it deletes al l related content entries associated wi th that IP prefix from its R I B , and then recalculates the new routes to the content, as well as sending the withdrawal to the CIG, i f the withdrawal affects the content route table. The entries in the content route table take the format (Content-Name, (IPi, M e t r i c i , Per iodi ), {IP2, Metric2, Period2 ) • . . (IPn, Met r ic„ , Period^ ) ), where n is the number of best routes the CBG maintains and Per iod represents the valid time period the route has. When a change occurs in its R I B , the CBG propagates this update to its C B G P peers. To achieve scalability and incorporate the diversity of different autonomous systems, B G P can not rely on accurate network proximity information when making routing decisions; instead, it uses predefined policies to calculate route preference. Generally, these policies reflect the wisdom of choosing the most suitable (though not necessarily always "best") route for the current A S . In addition, some recent studies [27,31] show the A S hop count of a path is a decent indicator of the path's proximity, reliability, and stability. To simplify things and avoid heavy modification of B G P , we decided to adopt the path preference calculation strategy of B G P and adopt a reasonable way to make comprehensive content routing decisions. This preliminary solution is described in section 4.4. For a more accurate solution, we 41 discuss another design using refined metrics i n section 4.5. 4.4 A Pre l iminary Solution for Content Rout ing Deci-sion We give a preliminary method to calculate content route preference, and describe the basic routing decision process in this section. 4.4.1 Content Route Preference Calculation The CBG may receive multiple routes for particular content v ia the C B G P peers' advertisements that originate from the same or from different content servers. These routes contain both network accessibility information and a server-side metric. When considering the network level measurement factor, we adopt a similar method as used in B G P 4 for calculating the degree of route preference, i n order to achieve compat-ibi l i ty and share computation processes. The difference is that instead of solely comparing routes to the destination of the same IP prefix, multiple I P addresses (to be accurate, multiple N L P J s ) of those content servers are used to compare the routes, since the concept of "destination" now refers to a piece of content wi th many replicas. The B G P local policies also apply to C B G P , for example, whether the current A S is wil l ing to be the transit for the content i n certain other ASs . For content routes originating from different content servers, i f a specific route is superior to the others in terms of both network level path preference and the server side metric, routing decision can easily be made. Bu t i f there is such apparent advantage to both factors, comprehensive consideration of network level factors and the server side metric is needed. 42 If two routes have unbalanced advantages on either the network level or the server side, we define a function that takes the attributes of a given route as the arguments and return a value denoting the degree of preference for the content route. For the time being, we consider all the main attributes of a route which include the LOCAL J^REF, the AS_PATH and the CONTENT, and define the overall preference of a route as the following: Pref = scale (j • Server-Metric) + scale( (i • AS-path) + a • Loc-Pref The value of the three route attributes LOCAL.PREF, AS-PATH length and Server_Metric could be measured on different scales, which gives them incomparable values. Therefore, we need to re-scale these values to the same magnitude to keep consistency in the comparison. We scale the AS path length and server side metric to the same order of the largest value of local preferences, since the router knows the value of local preferences for each of its interfaces. The coefficients for each metric are currently defined as 7=0.5, as we consider the server side metric has almost the same weight as the network level, according to the measurement of [25], which states that the correlation between the ping round trip time and the HTTP request response time is 0.51. /3—0A, as the AS path length is an indicator of network level proximity in the BGP level. Finally OJ=0.1, as local preference is also a non-negligible factor in making routing decision in BGP. These values are only our initial settings with simple heuristics; a system administrator can adjust these parameters in response to real situations, including considering different set of coefficients for different categories of routes. We also need to perform preprocessing of the value of local preference, as the router usually 43 prefers the one wi th the largest value, which is opposite to A S path length and server side metric (the lower the value is, the more preferable the route is). There are two parts of the A S - P A T H attribute, AS_set (unordered set of ASs a route in the U P D A T E message has traversed) and AS_sequence (ordered set of ASs a route i n the U P D A T E message has traversed considering the aggregation functions). We calculate the actual A S path length using the following formula: AS-Path-Len = length(AS sequence) + log2(length(ASset) - length(ASsequence)) Our rational behind this is that each time several routes originating from different ASs are aggregated into one route, the A S numbers for those independent routes are eliminated from the A S - S E Q U E N C E and instead, the local system which does the aggregation prepends its own A S number in the A S _ S E Q U E N C E . A l l the A S numbers s t i l l appear i n the A S _ S E T . The aggregation process is quite similar to (not exactly like, in the real Internet) a tree structure, so the logarithm approx-imates the height of the tree that the route traverses i n addit ion to the current A S _ S E Q U E N C E length. We can also define a local policy with a re-configurable threshold T on the server side metric for filtering out those routes that have unacceptable server re-sponse times. 4.4.2 Content Routing Decision Process The decision process followed by the C B G P to select a preferred route to a specific content from multiple ones is described below: 1. Fi l ter out the unacceptable routes using threshold T 44 2. If a route update specifies a next hop that is inaccessible, the route is dropped 3. Calculate the preference of each content route using the above formula, the one wi th the lowest value wins 4. If a tie results from step 3, the one wi th the better server-side metric is selected (since the server-side metric is a more accurate measurement than network preference as calculated using the routing strategy of B G P ) In C B G P , route aggregation i n the process of route updating is possible through integration wi th B G P 4 and by sharing the I P prefix of content destination. We have already shown an example in the scenario depicted i n section 4.3.1. 4.5 Design wi th a Refined Me t r i c for Network Condi -tions A s believed by most of the people, using A S path length is a too rough measurement to calculate network proximity, we give another design i n this section to make more accurate measurement about network conditions. 4.5.1 Path Latency Measurement A s shown in B G P , peers constantly exchange K E E P A L I V E packets to probe each other to maintain knowledge of currently available neighbors. This is a waste of bandwidth and computation resource i n most cases when peers are alive. We can utilize these packets for further measurements of the packet latency in the routing path. This is facilitated by the router's I / O architecture. A typical router's architecture is shown i n Figure 4-3. 45 ir iput interfac i i i ;e o utpu —> t interface — i — I — r — • — • —1—1—1 i i i Backplane i i i i i i - • — —1—1—1 I I I • — i — i — i i i i —1—I—I — i — i — i w Figure 4.3: A typical router architecture The K E E P A L I V E packet is sent out through one of the output queues, just as any IP data packet being forwarded, it is not put in a queue wi th a high priority by most "vendors (sometimes the update packet caused by routing table changes is sent through a high priority queue [33]). Also from [34], we find that when there is an overwhelming quantity of data packets to process, the K E E P A L I V E message transfer can be greatly affected or even lost. So, by measuring this delay, we gain relatively accurate data regarding latency between B G P routers, which includes propagation, queuing, and transmission delay (the latter two also accommodate the factor of bandwidth). This is a fair measurement to a l l ASs wi th regard to inter-domain routing, as opposed to different intra-domain routing metrics. The way to measure this latency is to use a triggered K E E P A L I V E sending scheme, i n which when one peer receives a K E E P A L I V E packet wi th a sequence number i n it , it acknowledges it immediately wi th a K E E P A L I V E message; thus the peer ini t ial izing the sending process attains the round tr ip time for the packet. 46 4.5.2 Routing Decision Improvement without Changing IP Rout-ing W h e n the latency between the CBGs is known, we can now include another route attribute, which we call " L A T E N C Y " , in the U P D A T E packet. This attribute is also an optional transitive attribute and takes the attribute type code 21. The attribute value is the accumulated network latency along the way, and it is similar to the sum of the results measured by using the uti l i ty "traceroute". In this approach, the IP routing st i l l follows the original method by which B G P works. For content routing, however, we now have more accurate measurement for network proximity. When a CBG has multiple paths to the content, it compares the sum of the accumulated network latency along the way, and the server side metric of each path, then selects the best one. We prefer to use server response time as the server side metric here; i f not, we need some way to do the correspond-ing conversion from other metrics. Th is is discussed i n Chapter 6. If the update causes content route table changes, the new updates caused are sent to its peers and the accumulated network latency in the U P D A T E packet increases by the network latency between those two peers. In Figure 4-4, there are five ASs and we assume there is no Internal B G P ( I B G P ) communication inside each as we focus on interdomain communication here. The number on the link indicates the network latency between the peers through that link, measured using the method described i n the previous section. CS2 and CS1 are two replicas of content A wi th A C N url-1 residing on A S I and II respectively. In A S V , the local preferences for the routes coming from A S III, I V and V I are 50, 100, 100 respectively. We also assume that the server side metric for CS2 and CS1 are 10 and 6 respectively; CS2 resides on a server wi th the IP address Addr2 , which 47 is attached to subnet X ; CS1 resides on a server wi th the IP address A d d r l , which is attached to subnet Y . Under the conditions stated above, in A S V , we have the following IP routing table entries ( not a l l the attributes of the route are listed ) and content route table entry (we omit the Va l id T ime Period for the route and assume at most two routes are kept for each content): Destination AS-pa th (in A S - S E Q U E N C E ) X I V , I ( L O C . P R E F E R E N C E & A S _ P A T H dominate the decision) Y V I , II ( L O C _ P R E F E R E N C E & A S _ P A T H dominate the decision) Table 4.1: IP routing table for the sample topology Name Routes url-1 ( Addr2 , 17), ( A d d r l , 19) Table 4.2: Content route table for the sample topology Figure 4.4: A sample network topology Clearly, the accuracy of this approach matches the application layer ap-48 proach. The application layer measurements follow the physical path to those con-tent servers calculated by B G P , and i n our approach, the network latency mea-surement of the paths is used to make close-to-best route selection. This method ameliorates the disadvantage of using hops as a measurement for network proxim-ity. More importantly, it greatly reduces the application layer probing traffic to servers or networks, which could be repeatedly and constantly created by many applications. Currently, we use the content server side metric as the trigger for content updates. It is possible that during the interval between content updates, some changes occur to the accumulated network latency along the path to the content server, but at present we put this scenario aside because we believe the change of network latency is largely caused by much heavier or lighter access to content and this should be reflected on the server side metric. We consider including both triggers in future work. Even though B G P routers generally exchange only connectivity information, not performance information, and i n the absence of explicit policy, the routers make decisions by minimizing the number of independent autonomous systems traversed along the way to the destination; this metric doesn't correlate wi th performance characteristics very well, but it doesn't affect our approach above. The rationale behind this is that the decision of the B G P router is fair to a l l users: as soon as the decision is made, the path is defined and every packet from any application program has to follow it. However, we can make further improvement for content routing by influencing the IP routing process, since we can attain measurements of network latency along the way. We discuss this i n the next section. 49 4.5.3 Routing Decision Improvement by Influencing IP Routing In each U P D A T E packet, there is the " L A T E N C Y " attribute which records the accumulated network latency of the path traversed, and we use this metric to find a better path (from the network routing point of view) to reach that server. For the I P routing, when a CBG receives an update, firstly, it s t i l l applies the policy routing, such as comparing the weight of routes specifically defined by some vendors, local preference of the path and so on, since we need to respect the subjective choice of the network operator. If al l the previous comparisons result i n a tie, it compares the length of the A S path. A t this stage, instead of using the A S path length, it compares the accumulated network delay of multiple paths and selects the best one. For content routing, the sum of the network delay and server response time is used as the comprehensive metric to compare the routes to a content destination. If a better route results from either the IP or content routing, updates are sent to neighbors. Again , referring to Figure 4-4, i n A S V , we have the following IP routing table entries and content route table entry: Destination AS-pa th (in A S . S E Q U E N C E ) X I V , I Y I V , V I , II Table 4.3: IP routing table wi th influence of L A T E N C Y Name Routes url-1 ( A d d r l , 16), (Addr2, 17) Table 4.4: Content route table wi th influence of L A T E N C Y The reason that the A S path to the network Y is (IV, V I , II) instead of (VI , II) 50 is because the accumulated network latency dominates, since the local preference to A S number I V and VT have the same value. For the content route table, considering both the network proximity and server side metrics, we get the result shown above. Al though the above method can produce a better route for accessing the content, we don't advocate using it , since the frequent changing of the IP routing behavior could cause route instability on the Internet. 4.6 M o r e Design Issues We discuss more design issues i n this section including: how to control content update frequency in the Internet, how to make use of the periodic network latency measurements, how to deal wi th network proximity measurement inside an A S , and what is the behaviour of the convergence process for content routing. 4.6.1 Controlling the Content Update Frequency The Internet is a huge distributed system. Considering its scale, making frequent route changes is not accepted for stability and scalability. Our intention is to inter-connect content networks around the world and find a good replica to serve client requests, so it is not necessary to respond to a l l metric changes in real time, as many changes are transient. Thus we need to control the content update frequency. A s we know from the description i n the previous section, in addition to selecting content routes based on a normal network level accessibility factor, extra effort must be made to take the server-side quality of service into consideration. This can be measured in terms of server response time to client requests, server C P U load, server health, maximum connections to the server, and so forth. To measure a server's processing capability at a certain time, we prefer to use server response time, 51 as this is the metric that reflects the service quality for a request, since our purpose for content networking is to improve content response time. A s we can imagine, the server side metric is usually quite a transient parameter, and sometimes can change dramatically i n a very short time. To smooth out the oscillation effects, this metric is usually recorded periodically, and reflects the average value in a moving window of time. (We refer to the size of this moving window as the measurement interval). Compared wi th the value in the previous measurement interval, when the value of the next interval has a significant change, then an update should be sent. The Measurement interval can be adjusted by each server. Since the server-side metric is learned v ia the C G B P through the CIG, which in turn acquires this metric from the CN internal R R system, we have an alternative way to control the content update frequency. To keep the high accuracy and real time response inside a CN, the server response time can be updated at a relatively high frequency, but the CIG can do further processing to suppress the frequent updates to CBG, using monitoring and weighted calculations of historical changes. (Note: In our implementation, we use the technique described i n the previous paragraph) 4.6.2 Usage of Network Latency Measurement We calculate the latency between CBGs periodically, but we do not use a l l latency values at each moment, because the updates may not be sent as frequently as the latency measurement. Therefore, we need to discover a way to calculate the actual latency to use in the update packet. To calculate this value, we borrow the idea from T C P about how to compute the retransmission timer, since our purpose here has similarities wi th it. We have two variables, L T and M , representing the latency we want to calculate and the latest measurement of the latency, respectively. To 52 accommodate the effect of the variance between the new measurement and the historical value, we use a variable, D , to calculate the deviation by the following formula: D = SD + (1-5) | LT - M | Where 5 has the value of 7/8. The calculation of L T is: LT = M + 4*D Each time after an U P D A T E packet is sent, L T and D are reset to start again. 4.6.3 Network Proximity Inside an A S There is one important factor that affects content routing decisions: the delay i n the original A S where the content update initiates. Due to the different size of the ASs , this factor can have a different weight in the influence to the whole route decision process. To account for this factor, we take a rough measurement of latency inside an A S . As described in the previous sections, by measuring the latency between border routers through K E E P A L I V E packets, each border router of the Internet Service Provider (ISP) stores the latency measured from this router to the other border routers i n its network (the I B G P routers). B y averaging the latencies stored, we arrive at the approximate latency i n this A S . The content update packet then includes this number as the ini t ia l value for attribute L A T E N C Y . This value can also be used as a reference to compare whether a content source inside an A S is better or worse than one outside an A S . This approach is scalable because the measurements 53 are performed locally (to an ISP's network) and the information stored at the border routers is only on the order of the number of border routers in the ISP's network. This idea is used i n [21] as well. 4.6.4 Convergence P r o c e s s D i s c u s s i o n Since our name request resolution hinges on the Name+ I P addressing mechanism that hardwires the IP address to the specified content in the routing process, con-cerns arise regarding the route convergence process. We claim that content routing in our approach does not have more serious problems in routing convergence and oscillation than that in the original B G P 4 [36]. If a content server is down, content requests directed to that server before the B G P 4 route converges drop, but this is no more serious than the convergence be-havior of the original B G P 4 . The content withdrawal is quickly propagated through the content routing process, and CBGs remove that route from the content route table, i f it is there, and the peering CIGs are informed. In addition, multiple paths kept by each CBG ameliorate this problem. If the route to certain content is i n the convergence process because of the existence of a better route when the CBG is serving a name resolution query for that content, the performance w i l l not be affected much. The old destination is s t i l l reachable and can serve the content request, but wi th only a slight performance loss, as compared to the new one. 4.7 Deployment Considerations Our content internetworking system allows an all-at-once deployment or incremental evolution, based on services needs. 54 To incorporate the individual CN into the content internetworking picture, the only requirement is to place CIGs at the edge of CN, which is a basic requirement to enable interconnection wi th other CNs. Running C G I P on CIG w i l l enable one CN to locate other CIG peers in the same A S , hence interconnecting wi th them. Enabl ing the CN interconnection on a global scale requires only the upgrade of B G P 4 to support C B G P on existing Border Gateways. B y using an U P D A T E packet compatible wi th the original B G P protocol and defining the new attributes as transitive optional, we can make a step-by-step deployment for the new routing architecture. The justification is that paths wi th unrecognized transitive optional attributes should.be accepted as defined i n the B G P standard. If a path wi th an unrecognized transitive optional attribute is accepted and passed along to other B G P peers, then the unrecognized transitive optional attribute of that path must be passed along wi th the path to other B G P peers. Then, we can guarantee that at-tributes such as C O N T E N T can be transmitted across non-CBGP-enabled regions. A n original Border Gateway that only has IP routing function ignores content at-tributes i n routing advertisements and passes them along to B G P peers, so it can s t i l l work compatibly wi th the upgraded Content Border Gateway. A n incremental deployment path greatly facilitates the extension of global content internetworking. The deployment process can be based on user needs and an ISP's motivation to provide a better web experience to customers and superior service to co-located content providers. For replicated content, name resolution can be quickly solved wi th the result of returning a good server, eliminating the need for name requests to leave their network. For content without replicas, the name resolution fails over to normal D N S behavior. Th is in i t ia l deployment requires no change to end hosts. W i t h more and more content being replicated, the content route table may 55 become bigger i n CBGs, which may run out of resources to store route table and answer queries. In that case, we can deploy an active server co-located wi th the CBG, to store content routes and processes content requests. ISPs that already peer at the I P routing level are motivated to peer at the content routing level to provide their customers faster access to nearby content servers and increase the benefit of placing content servers in their networks. A s demand grows, the routers w i l l be upgraded gradually to adapt to content internetworking requirements. 56 Chapter 5 Implementation and Experiment 5.1 Implementation Based on M R T We introduce our prototype implementation i n this section. The focus is on the implementation about C B G P . Those of other protocols can be found i n [35]. 5.1.1 Implementation Overview To test the complete system framework, we implement a l l the components in Figure 3-1, including simulating the CN itself, as we don't have an existing CN system to experiment with. Thus, the implementation includes a server side metric updating protocol between content servers and CIG, C G I P , C G B P , and C B G P . We have a monitoring agent installed on each server that records the server performance status each minute, and sends updates to the CIG through port 10789 whenever a significant change happens. CIGs inside an A S communicate wi th each other, exchanging the content 57 status inside each CN. Each CIG has a name resolution thread running on port 53, and the C I G replaces the normal D N S server running on the machine. One of the CIGs is selected as the representative to communicate wi th the CBG through C G B P . That CIG is configured as an internal peer of the C B G . The focus i n this chapter is on C B G P . C B G P is implemented by extending the Multi-threaded Routing Toolkit ( M R T ) [14, 15, 16, 17] under L inux to support content internetworking. 5.1.2 Introduction of M R T M R T is a routing toolkit developed by the University of Michigan/Mer i t Network. Besides tools such as the traffic generator and message format converter, the key part of this toolkit is a routing daemon called M R T d . It supports RIPng , B G P 4 + , multiple R I B s (route server), and R I P 1 / 2 . M R T d reads Cisco Systems-like router configuration files and supports a Cisco Systems router-like telnet interface. The routing architecture design of M R T incorporates features such as parallel lightweight processes, multiple processor support, and shared memory. Al though M R T has been designed wi th multi-threaded, multi-processor architectures i n mind, the software w i l l run in emulation mode on non-thread capable operating systems. The modular design of the software encourages the rapid addition and prototyping of experimental routing protocols and inter-domain policy algorithms. M R T is wri t ten i n C programming language, and contains about 120,000 lines of code, amongst which there are more than 30,000 lines specifically dealing wi th B G P (this excludes the code for data structures and operations shared by the whole M R T ) . Currently, we have only completed the implementation of the preliminary solution described i n the previous chapter based on M R T due to l imited time, and 58 we are still working on the refined solution. Our focus is the extension of the routing daemon MRTd. In particular, implementation of B G P 4 in the MRTd is modified and extended to support con-tent internetworking. Content query/response handling functions on the CBG is implemented in a separate MRT thread. 5.1.3 Content Internetworking Implementation • Data structures In addition to the regular IP routing table which is represented by a radix tree in MRT, we need another data structure for the RIB of content routes. Currently, we use a hash table for experimention. A patricia tree is also a good data structure for this purpose since we assume that only a low percentage of websites on the Internet are replicated, but we have not implemented it. The hash key for the content route table is the name of the content, and each entry in the content route table is a data structure defined as the following: typedef struct _croute_item_t{ char *cname; //content name one_entry *entry; //the l i s t of routes for this content } croute_item_t; typedef struct _one_entry{ char *prefix; //the server which holds the content bgp_route_head_t *bhead; //the pointer to the bgp RIB node unsigned int metricN; // metric of the network level 59 unsigned int metricS; // server side metric of the content long expire_time; // the expire time for this entry struct _one_entry *next; //next pointer i n this link l i s t } one_entry; Each route to the content is inserted into a linked list i n order of the calcu-lated preference value. The CBG maintains one data structure for each of its peers, which contains a l l the information and data storing substructure needed when communicating with that peer. The main fields of the data structure are listed below: typedef struct _cbgp_peer_t { char *name; /* peer name */ prefix_t *peer_addr; /* peer address */ int peer_as; /* peer's AS number */ u_long peer_id; /* peer's router i d */ int peer_port; /* port number in case i t i s not 179 */ nexthop_t *nexthop; /* immediate next hop */ int sockfd; /* socket connected to peer */ pthread_mutex_t mutex_lock; LINKED-LIST *ll_announce; /* store announcement from the peer */ LINKED_LIST *ll_withdraw; /* store withdrawal from the peer */ cbgp_attr_t *attr; /* attributes of the path */ LINKED_LIST *ll_update_out; /* updates to be sent to peer */ radix_tree_t *routes_in[AFI_MAXj [SAFI_MAX] ; /*incoming routes*/ 60 HASHJTABLE *content; /* content routes r i b _ i n * / } cbgp_peer_t; • Content route updating process Since the packet format is changed with the addition of new attributes, the first thing we need to do is to modify the encoding and decoding process for the UPDATE packet. When a new peer relationship is established, all the entries in the content route table are sent to the peer. By processing all the route updates from individual peers, the CBG keeps on updating its view to the outside world. Each time a CBG finds a change in its view of the IP or content routes, the changes are sent to all the other peers, except the one that announces the new route. The processing of the updates can be divided into the following three phases: (only the processing of a route announcement is described below, since content updates are not involved with the.processing of the normal IP route withdrawal. We assume there is only one prefix in the NLPJ field, since processing is the same for each). Phase 1: In this phase, the announced route gets checked and put into the RIB-in for that peer. PJB-in stores route information that has been learned from inbound UPDATE messages. The entries in it represent routes that are available as an input to the decision process. A processing flow diagram is shown in Figure 5-1. The Check Attributes step in the diagram aims at verifying the path at-tributes to guarantee it has the all the mandatory attributes, such as NEXT_HOP, AS_PATH. 61 Receiving an UPDATE entry Figure 5.1: Phase 1 of content route updating 62 Phase 2: The view of the overall content routing map in the CBG is updated according to the route change resulting from phase 1. The best routes are chosen out of a l l those available for the content destination, and installed into the local R I B . The changes to the R I B are stored in the change list. The main processing flow is shown in Figure 5-2. Policy Check re tu rn """Only CONTENT~-~~ N normal bgp —>_attr changed? - processing Calc route preference, select best content routes add content updates to change list, set flag Combine updates in change list - • / /^ Go to phase Figure 5.2: Phase 2 of content route updating The Policy Check step filters the routes and path attributes manipulating to influence its own decision process as the local policy defines. It could filter certain networks coming from a peer while accepting others. In case of aggregation, the content server address only matches a less specific route (with a shorter prefix), it is necessary to recalculate the sub-host address in the C O N T E N T attribute to match that prefix. In this phase, i f a new route is added to the content route table, the expiration time is calculated by adding the Va l id T ime Period to the current system time. 63 Phase 3: For the changes i n the list resulting from phase 2, each peer, except the one announcing the new route, gets scanned to check whether it has a contradiction with the predefined policy. The changes are disseminated to those peers i f the policy allows it. Meanwhile, the route withdrawal and the route changes to the content existing i n the local A S are also propagated to the CIG representative. • Content resolution thread CIGs send content requests to CBG whenever they can not find an answer i n their own content route table, or the entry in their cache expires, either because the caching T T L expires, or because the Va l id T ime for that entry expires. There is an individual thread i n CBG which deals wi th such requests. The content request/response processing between CIG and CBG is internal in the whole system architecture, and does not relate to the end user's request directly, so the communication port does not have to be 53. We chose 553 as the listening port. Each time a content request is received, the content route table is searched, and i f the item is found, the routes stored in the linked list are retrieved one by one and then encapsulated into one singe data structure to pass back to the CIG. In the process of route retrieval, the expiry time for each route is checked and i f it expires, the route is discarded. Otherwise, the new Val id T ime Period is calculated by computing the difference of the current system time and the expected expire time. • T ips in system implementation M R T d is a routing daemon running on Linux. When it runs, no terminal is associated wi th the threads, so it is difficult to debug. To make it convenient 64 for debugging and testing, the daemon process is removed from the original M R T d program. Further, since the multi-threading system is very hard to debug due to constant thread switching by the process scheduler, i n the system debugging stage, we set M R T d to run i n emulation mode in a single process, just as i f running on non-thread capable operating systems. The other interesting point in the implementation is that the client side appli-cations such as ping or Netscape on the Windows platform exhibit different behavior i n accepting the name resolution answer from those on Linux . Clients on Windows accept answers sent from any port on the server system, while clients on L inux only accept answers sent through the standard service port wi th the number 53. This problem proves difficult, as we wrote the name resolution thread in Windows and tested it there, however a direct port d id not work on the L inux platform. 5.2 Topology and Configuration of the Experiment The network topology description, environment settings and tools used i n the ex-periment are covered i n this section. 5.2.1 General Description We conducted experiments to test the viabil i ty of the system architecture and the performance of the system. In simulating an Internet web environment, we set up Apache web servers to hold web pages, files etc.. A s usual, the CN itself is considered a black box for content internetworking, so our simulation ignores the CN internal Request-Rout ing system. We use agents residing on the web server to collect real-time content availabilities and server side metric information, and these variations are 65 considered to be the origins of routing updates to be sent to the CIG. We simulated a single A S using a L inux box running C B G P . Other machines Played the role of web servers and CIG by running the corresponding software. W i t h l imited number of machines, however, we could have only one machine acting as a multi-function box wi th any combination of C B G , CIG and web server. The CIG communicating wi th a certain CBG is configured as the same A S number as the C B G . G N U web stress tool - OpenLoad [18], which provides near real-time perfor-mance measurements for web applications, is used to generate traffic for system per-formance evaluation and analysis. For example, "openload www.myContent l .com 50" simulates 50 clients requesting and accessing content www.myContent l .com simultaneously. The name requests are sent to our Request-Routing system for res-olution; after getting the IP address of a content server, Openload visits the content servers. Accordingly, content servers' load and network usage are affected. The metric variations are reflected in the routing system. Thus, we can observe content routing behaviors and the load balancing performance of the content routing system. Openload is further modified to reflect the content route changes i n the process of simulation. 5.2.2 Experiment Network Topology We used a Bay450 T-24 Switch wi th V L A N capability to set up a simulation environ-ment. The simulation environment contains 6 ASs wi th CNs scattered throughout, and 9 ACNs representing different content. We have clients sit t ing in different ASs access the content provided, and the overall performance data is recorded for com-parison. The simulation topology is shown i n Figure 5-3. Al though the simulation 66 scale is small, the characteristic features of the Internet structure are simulated, and we plan to make a large scale simulation in future work. The local preference of each route coming into an A S is set to the same default value, and in this special environment, we set a very small weight (0.1) for both the A S path length and local preference, while we give a heavy weight to the server side metric. AS-360 AS-350 AS-310 AS-340 AS-380 AS-330 Figure 5.3: Network topology for the experiment The above topology was configured wi th the computers in the lab. Computer names, the A S numbers they belong to, their IP addresses, the role they played in the network and the content they hold (if they take content server roles), are listed in Table 5-1. Some L inux boxes have several network cards installed to act as routers, and hence have several network addresses. The whole experiment is done on an internal network using the internal IP subnet 192.168.x.x. We also set up web servers delegating pseudo domain names, such as www.cisco.com, www.yahoo.com as shown in Table 5-1. 5 . 3 Evaluation Criteria In the experiment, we mainly conduct a stress test for performance evaluation. That is, we simulate multiple clients' concurrent and intensive access to content. 67 Computer AS No. IP address (es) Role of the Content held CN name belonged to machine No. in an AS goodearth 310 192.168.1.7; 192.168.5.7 C B G , CIG, content server Cisco, dior, lancom, yahoo : CN-1 venus 350 192.168.5.2; 192.168.3.2; 192.168.6.2; 192.168.8.2 C B G , CIG, content server Cisco, dior, lancom, yahoo CN-1 grimface 330 192.168.1.12; 192.168.3.12 C B G , content server Netscape, elle CN-1 slesse 330 192.168.3.15 CIG, content server Netscape, elle CN-2 hozameen 380 192.168.8.16 CIG, content server Netscape, elle CN-1 celestial 380 192.168.8.25; 192.168.7.25 C B G , CIG, content server Google, cnn, ubc.ca CN-2 robson 360 192.168.6.10; 192.168.4.10 C B G , CIG, content server Google, cnn, ubc.ca CN-1 moonlight 340 192.168.4.8; 192.168.7.8 C B G , CIG, content server Cisco, dior, lancom, yahoo CN-1 Table 5.1: The configuration of the testing environment 68 Client request response time, transactions completed per second, and total number of completed requests are measured by Openload and recorded for comparison and analysis. The load variations on the server side are also monitored. We evaluate the system on the following aspects: first, we expect that if there ia a large quantity of client requests, content that has more replicas will de-liver better overall performance than that which has fewer or no replicas, despite system management overhead. Second, with our load balancing strategy, content servers should demonstrate good overall load distribution performance. Third, com-pared to traditional DNS resolution schemes, our approach incurs one extra level of round trip time from the CIG to the CBG for non-replicated content request resolution. However, we think this overhead should be acceptable in contrast to the significant performance gains for resolving widely replicated content, which is most frequently requested and accessed. Fourth, our internetworking architecture should be compatible with the original border gateways, shown by good interoperability. The experiment results are shown in the next section. 5.4 Data Collection and Analysis We did testing and data analysis on the following aspects: 1) Overhead on the content name resolution: In this test, we measured the overhead of the content name resolution process (total for both request and response). The experiment is carried out on a content route table of 50,000 entries. These entries are randomly generated domain names with most of them in the .com domain and others in .org and .net. The overhead we measured on a 667 MHz Pentium III system running Linux 2.4.13 is just 5.7 milliseconds for going through two hops of the content layer that 69 are CIG and CBG. Also, we believe most of this time is spent on packet processing, as any D N S server has to do. The D N S delay measured by [32] is shown in Table 5-2. It assesses the delay by choosing top 100 U R L s wi th largest number of Internet users and 100 random U R L s . The 95th percentile delay listed i n the table is the delay sufficient to serve 95% of the queries. Top 100 U R L 100 Random U R L Median 95th per-centile Average Median 95th per-centile Average D N S delay (ms) 160 2159 498 298 6149 1706 Table 5.2: D N S QoS assessment i n the internet Compared wi th the data shown i n Table 5-2, we conclude that the overhead i n our approach is negligible. 2) Other performance testing for content accessing: In the set of tests conducted, we compare situations where server loads change wi th the timeline, the response times when different numbers of surrogate servers exist, and overall request response time measured from different clients. 2.1) Server load comparison between two machines wi th similar configura-tion: B y having clients i n different A S s access www.cnn.com continuously for 3 minutes, we observed the changing trend of the server load on robson and celestial as shown i n Figure 5-4 i n which server load is calculated i n self-defined units. Robson and celestial are two machines wi th similar system configurations, a 200MHz C P U , and 64M and 96M memory respectively. Server load is computed synthetically from 70 the C P U load as well as memory and swap space usage. From the chart shown i n the figure, we can see that these two servers have basically the same load during the service period, which showed good load distribution. - • — Robson -*— Celestial «P