UBC Theses and Dissertations

Fast failure recovery in MPLS networks Luo, Qingsheng 2001

Fast Failure Recovery in MPLS Networks

by

Qingsheng Luo

Bachelor of Engineering, Tsinghua University, China, 1991

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
September 2002

© Qingsheng Luo, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department
The University of British Columbia
Vancouver, Canada

Abstract

Recovery from network failures is a very important and challenging requirement for current Internet service providers. Real-time applications and other mission-critical tasks require networks to respond to network faults expeditiously in order not to degrade the required quality. In IP networks, failures are detected and handled through the convergence of the network topology (the topology change being reflected in all routers' routing information bases). This recovery mechanism, however, has an obvious disadvantage in that it reacts to network faults very slowly, because it may take a long time for the entire network topology to converge. In this thesis, we present the design of a fast recovery mechanism for MPLS networks, implemented on the Linux platform. We investigate its performance on our test network composed of Linux routers. Our experiments show that our recovery mechanism responds to network faults quickly, with less packet loss.
Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
  1.1 Motivation and Problems
  1.2 Current Research
  1.3 Overview

2 Background
  2.1 MPLS
    2.1.1 From IP to MPLS
    2.1.2 MPLS Overview
    2.1.3 Label Distribution
  2.2 RSVP
    2.2.1 Data Flows
    2.2.2 Reservation Model
    2.2.3 Reservation Styles
    2.2.4 Encapsulation of RSVP Messages
    2.2.5 Basic Mechanism
    2.2.6 Messages and Objects in RSVP
    2.2.7 RSVP-TE Extension for MPLS
  2.3 Linux
    2.3.1 Overview
    2.3.2 System Call
    2.3.3 /proc File System
    2.3.4 Linux Network
    2.3.5 Configuration as a Router
  2.4 Transmission Control Protocol
    2.4.1 TCP
    2.4.2 TCP High Performance Extensions
    2.4.3 TCP in Linux
  2.5 OSDN MPLS Project
    2.5.1 Introduction
    2.5.2 Data Structure
    2.5.3 Label and Labelspace
    2.5.4 Operation and Interface
  2.6 Failure Recovery

3 Design
  3.1 Overview
  3.2 Label Binding
  3.3 Primary and Protection Routes
  3.4 Failure Detection
  3.5 Failure Report
  3.6 Failure Processing
  3.7 Operations at the Locus LSR
  3.8 Operation at the Ingress Node
  3.9 Operation at the Egress LER
  3.10 Analysis of Potential Problems

4 Implementation
  4.1 Overview
  4.2 System Structure
  4.3 MPLS Session
  4.4 Primary Tunnels and Protection Tunnels
  4.5 LSP Tunnel Signaling
  4.6 Label Management
  4.7 Failure Notification
  4.8 Ingress LER
  4.9 Locus LSR
  4.10 Egress LER
  4.11 OS Influence

5 Performance Evaluation
  5.1 Overview and Experimental Environment
  5.2 Evaluation Criteria
  5.3 Evaluation Results
    5.3.1 Using TCP
    5.3.2 Using UDP

6 Conclusion and Future Work
  6.1 Contribution and Conclusion
  6.2 Limitation and Future Work

Bibliography

List of Tables

5.1 Time needed for transferring 700 MB of data
5.2 Total number of bytes lost in a single failure
List of Figures

2.1 Carrying the label in the shim layer header
2.2 LSP in MPLS
2.3 Using RSVP
2.4 RSVP Common Header
2.5 RSVP Object Format
2.6 Generic Label Request Object
2.7 Generic Label Object
2.8 LSP_TUNNEL_IPV4 Session Object
2.9 LSP_TUNNEL_IPV4 Sender Template Object
2.10 HELLO REQUEST Object
2.11 HELLO ACK Object
2.12 Linux TCP/IP layers
2.13 TCP Header
2.14 Operations in Linux-MPLS
3.1 Protection Routes
3.2 Failure Report
3.3 Loopback Tunnel
3.4 Route Types
3.5 Recovery Cycle
3.6 Out-of-order Problem
4.1 Structure of Implementation
4.2 Organization of Sessions and Tunnels
5.1 Testing Network Topology

Acknowledgements

I would like to express my gratitude to my supervisor, Professor Alan Wagner, for his essential guidance and inspiration for this thesis. I would also like to thank Professor Norm Hutchinson for being the second reader of this thesis and for giving me important suggestions for improvements. And special thanks to my family, especially my daughter Yichen, for their patience through the years when I was engaged in my graduate study.

QINGSHENG LUO

The University of British Columbia
September 2002

Chapter 1

Introduction

1.1 Motivation and Problems

It has become more and more difficult to satisfy customers and service providers, as people are seeking more advanced services from the Internet besides the existing best-effort packet forwarding. Quality of Service (QoS), Virtual Private Networks (VPN) and Traffic Engineering, for example, enable ISPs to offer better services for specific traffic flows.
Unfortunately, IP, which was originally designed for best-effort connectionless packet transfer, is unsuitable or awkward for these new features. Multi-Protocol Label Switching (MPLS), which creates virtual tunnels over IP networks, has shown its capabilities in handling these new features. Since its emergence, MPLS has attracted a lot of attention and research, because it not only improves network performance and efficiency, but also, and possibly more importantly, provides a natural way to finely control specific traffic flows, making it possible to easily and seamlessly design and implement new features in network routers.

One very important component of QoS is fault tolerance, or recovery from network failure. Real-time applications like VoIP (Voice over IP) and other mission-critical tasks require a quick response to a network failure; otherwise the quality requirement cannot be met.

Error handling in traditional IP networks is achieved by detecting a failure (link failure or node failure) at the local router, propagating this failure to the whole network area, and processing the failure after the network topology changes accordingly. This relies on the convergence of the whole network topology. One major disadvantage of this failure handling mechanism is its slow response (possibly tens of seconds or minutes). For example, in OSPF, which uses flooding for topology information exchange, it may typically take 30 seconds for the whole network topology to converge.
This problem arises because of the following:

1. Although the lower layers (layer 2 or layer 1) can detect a failure very quickly (in milliseconds), they have no idea where the failure should be reported, because they have no routing information.

2. Layer 3 (IP), on the other hand, does not report it either, because it is a connectionless protocol.

3. OSPF is designed for dynamic routing, and to prevent topology fluctuation it prefers to postpone the flooding of a topology change until the network topology converges.

However, MPLS makes it possible to handle failures more gracefully, because MPLS is connection oriented and lies between layer 2 and layer 3, which means it can detect a failure quickly while at the same time having the upper-layer routing information needed to facilitate failure handling. Unlike OSPF, MPLS knows about the current sessions or connections, and which of them need to be protected. This helps an MPLS node notify the relevant edge routers of a network failure, and take immediate restoration steps for those tunnels even before the network topology finally converges (under an OSPF propagation mechanism, for example).

1.2 Current Research

Some effort has been made on failure recovery with MPLS. Anat Bremler-Barr, Yehuda Afek, et al. proposed a restoration technique that reroutes a problematic tunnel to a backup one by concatenating the two tunnels at the local node [1]. Murali Kodialam and T.V. Lakshman presented an algorithm for dynamic routing for local restoration (meaning the restoration is completed by the local node) [6]. In [3] by Tony Mancill, and [5] by Radim Bartos and Mythilikanth Raman, algorithms for bandwidth-guaranteed restoration using MPLS local recovery were presented.

The research mentioned above focused mainly on resource reservation (bandwidth) or routing algorithms.
However, a quick response to prevent packet loss, which is very important to the quality of failure recovery, has not received enough attention. For example, in real-time applications such as voice over IP, it is desirable to reroute failed tunnels onto backup ones within tens of milliseconds; otherwise, users would notice the degradation in quality. Video broadcasting is another example where too much packet loss is unacceptable for reasonable video quality. Papers [8] and [9] provide basic concepts for failure recovery in MPLS and suggest a set of protocols piggy-backing on RSVP.

This thesis focuses on the fast recovery from network failure in MPLS networks. Fast recovery here means a quick response to the failure while trying to prevent packet loss. In terms of failure, we use a single link failure model, which means we assume only one link failure occurs at a time. This is the usual case, because during a very short period of time it would be very rare for two or more links to fail simultaneously. Also, in practice, network links are more vulnerable than nodes, which means link failure happens more often than node or router failure. The single link failure model has been used by most recovery researchers. In this thesis, unless explicitly mentioned, failure or network failure means link failure. We discuss node failure in Chapter 6.

1.3 Overview

In this thesis, a fast recovery mechanism for MPLS networks is designed and implemented over Linux routers. The recovery mechanism is evaluated by experimentally testing its performance with selected end-to-end protocols and applications.

A link failure is detected at the MPLS level using the RSVP HELLO extension (Section 2.2.7). Primary tunnels (tunnels which need to be protected) and backup tunnels are pre-signaled or pre-established from the ingress router. A protection loop is created along the reverse direction of the primary tunnel (from egress to ingress in the context of the primary tunnel).
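Although the full design is presented in Chapter 3, the locus node's choice between a pre-established backup tunnel and the protection loop can be sketched roughly as follows. This is an illustrative sketch only: the `tunnel` structure, `recover_locally`, and the link numbering are our assumptions, not the thesis implementation.

```c
#include <assert.h>
#include <stddef.h>

enum recovery { USE_BACKUP, USE_PROTECTION_LOOP };

/* Illustrative backup-tunnel record: which link it traverses next. */
struct tunnel { int id; int next_link; };

/* On a link failure, prefer a pre-established backup tunnel that
   avoids the failed link; otherwise fall back to the protection
   loop, which carries traffic back towards the ingress. */
enum recovery recover_locally(const struct tunnel *backups, size_t n,
                              int failed_link, int *chosen)
{
    for (size_t i = 0; i < n; i++) {
        if (backups[i].next_link != failed_link) {
            if (chosen) *chosen = backups[i].id;
            return USE_BACKUP;       /* concatenate with this backup */
        }
    }
    if (chosen) *chosen = -1;
    return USE_PROTECTION_LOOP;      /* loop traffic back to ingress */
}
```

In either case the locus node also sends a failure notification towards the ingress, which later makes the final rerouting decision.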
When a network failure happens, the upstream node of the failed link detects the failure and does the following:

1. (a) It tries to find a backup tunnel that passes through this node but not the failed link, and switches the traffic to the backup tunnel by concatenating the two tunnels; otherwise, (b) it concatenates the protected tunnel with the protection loop, switching the traffic back along the protection loop.

2. It informs the ingress node of the failure.

The ingress node, after receiving the notification of the failure, finally redirects the traffic to a backup tunnel.

The failure notification arrives at the ingress node before the whole network topology finally converges. This allows the failure to be handled more quickly than would otherwise occur if only a route distribution protocol, such as OSPF, were used. An MPLS router only notifies those ingress nodes which need this instant notification (that is, which have task-critical traffic passing through the failed link), in contrast to OSPF, in which all routers inside the whole area have to be informed. If an ingress router can aggregate multiple traffic flows (with the same QoS, ingress-egress pair, etc.) into multi-commodity flows, the mechanism should be scalable.

By processing the failure locally, instead of relying on the ingress router to do the recovery, packet loss can be reduced. In a practical system, the propagation time of a failure notification from the failure locus to the ingress router can be hundreds of milliseconds, and hence first using a temporary recovery tunnel (either a backup one or the protection loop) at the locus node can be of interest to quality-critical tasks. One possible drawback, however, is the out-of-order problem: during this temporary period, until the final backup decision is made by the ingress router, packets may arrive out of order at the destination.

Most of the current approaches in the literature use network simulators, such as NS, as their testbed.
Using a simulator is more convenient for the user, and in some cases it would be impossible to obtain real systems. However, it is more convincing if the algorithm or mechanism is tested on real systems. On account of this, we use Linux workstations as routers in the project platform. We use RSVP-TE (RFC 2205 [14], [10]) as the protocol for label distribution, link failure detection and error notification. We implemented the major functions of RSVP-TE for MPLS over Linux and also investigated the performance over Linux. Our experiments show that the recovery operation can be carried out within 100 milliseconds, and that packet loss can be minimized using our fast recovery mechanism (with a protection loop) under certain circumstances.

The thesis is organized as follows: Chapter 2 gives some related background knowledge; the details of the design and implementation are described in Chapter 3 and Chapter 4; Chapter 5 gives the results of the experiments; and the conclusion and future work are given in Chapter 6.

Chapter 2

Background

2.1 MPLS

2.1.1 From IP to MPLS

The growth in both the number of users of the Internet and their bandwidth and service requirements places increasing demands on Internet service providers' (ISPs) networks. To meet all the growing demand, ISPs need higher-performance switching and routing products.

Traditional IP routing is done by looking up the FEC (Forwarding Equivalence Class) in a router's FIB (Forwarding Information Base) to find the outgoing interface and next-hop address to which the packet should be forwarded when a router takes in a new packet. Normally, an FEC is composed of a destination address, and possibly a source address and a QoS class if required. An IP router organizes its FIB as a hash table indexed by the FEC.
With the massive growth of the Internet, a router's FIB may become huge, and it may take too long (compared to the speed of packet arrival) for the router to get a next hop. This creates a problem for the performance and efficiency of a router supporting a broadband network with huge data flows, for example, a Gigabit network.

MPLS can easily handle this by inserting a label into the header of an IP packet. After an MPLS tunnel (which can be considered a route created by MPLS signaling) is created, this label can be used as a quick index to the next hop to which the packet is to be forwarded. Without looking up the next hop in a hash table (as done in IP forwarding), MPLS can obviously reduce the cost of forwarding packets. This merit makes MPLS a very good choice for improving the performance and efficiency of routers and simplifying their design and production.

Traffic engineering is another motivation for the introduction of MPLS. For example, although using the shortest path (the most widely used routing approach in IP) conserves network resources, it may cause congestion, which occurs when the shortest paths from different sources overlap at some of the links. Under-utilization and over-utilization of resources may occur when the traffic over a shortest path exceeds its capacity, while a longer but less busy path between two routers is left under-utilized. To balance the traffic load on different links and routers in the network, traffic engineering routes the traffic flows across the network based on the resources that the traffic requires and the availability of those resources in the network, using CBR (Constraint Based Routing). These requirements for traffic engineering are awkward or difficult to satisfy for IP, notable for its destination-based routing, while MPLS can handle them by creating a virtual tunnel for each specific FEC.
The current MPLS technology was derived from several similar attempts. Some of these attempts included Toshiba's Cell Switching Router (CSR), Cisco's Tag Switching, IBM's ARIS, and so forth. The MPLS working group of the Internet Engineering Task Force (IETF) introduced MPLS in 1997 [12]. MPLS extends routing with respect to packet forwarding, load balancing and path control.

Figure 2.1: Carrying the label in the shim layer header. The shim layer header sits between the link layer header and the network layer header, and contains: Label (20 bits, the label value), Exp (3 bits, experimental use), S (1 bit, bottom of stack) and TTL (8 bits, time to live).

2.1.2 MPLS Overview

In MPLS, a label is a short, fixed-length entity, with no internal structure, inserted into the header of an IP packet. It lies between layer 2 and layer 3, and is called a shim layer. A label does not directly encode any of the information from the network layer header. It is used by a router as an index into a switching table to look up the next hop to which the packet should be forwarded. See Figure 2.1 for the layer structure of an MPLS packet.

The term LSR (Label Switching Router) names a router in an MPLS area. One type of special LSR is an LER (Label Edge Router), located at the edge of an MPLS routing domain. The FEC (Forwarding Equivalence Class) is used to classify packets. An FEC is a representation of a group of packets that share the same traffic requirements. All packets with the same FEC are forwarded along the same route and to the same next hop. One example of an FEC is a set of unicast packets whose network layer destination address matches a particular IP address prefix. An ingress LER inserts a label into the packet header according to its FTN (FEC-To-Nexthop) mapping table. An egress LER removes the label from the packet header before the packet departs from the MPLS domain.
Only at the ingress does the LER need to look up the FEC; at all other LSRs only layer 2 switching is needed. In MPLS, an LSP (Label Switching Path) tunnel is created using some type of distribution protocol (see Section 2.1.3) before it can be used. The establishment of an LSP, and the traversal of a labeled packet across an MPLS domain, are shown in Figure 2.2. In this example, an LSP is created from LER A to LER C for a certain traffic flow. Normally, the creation of an LSP is initiated at an ingress LER. Here, A initiates the tunnel setup by sending a label request to the downstream node B. B then sends a label request to C. When the tunnel is set up, a label is assigned for this tunnel from C to B and from B to A. Depending on the mechanism, label assignment between B and A and that between C and B can be done independently, or the former can be held until the latter is completed. When the traffic flow arrives, ingress router A admits the traffic by searching its LIB (Label Information Base) using the flow's FEC as the index. A inserts the label 32 into the header of the IP packet and forwards it via outgoing interface 2 to LSR B, according to the routing information it receives from its LIB. Now B receives an MPLS packet and looks up the routing information in its LIB using the incoming label 32 as the index. B gets outgoing interface 3 and outgoing label 33 as the related routing information from its LIB, replaces label 32 with 33, and forwards the packet through interface 3 to node C. C performs similar steps, but removes the MPLS label and forwards the packet as a regular IP packet to an IP router outside the domain.
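The label-indexed forwarding just described, using the shim entry layout of Figure 2.1, can be illustrated with a small sketch. The table size and helper names are assumptions for illustration; the sketch reproduces LSR B's behaviour in the Figure 2.2 example (label 32 in, label 33 out on interface 3).

```c
#include <assert.h>
#include <stdint.h>

/* Figure 2.1 shim entry: Label (20 bits) | Exp (3) | S (1) | TTL (8). */
static uint32_t shim_pack(uint32_t label, uint8_t exp, uint8_t s, uint8_t ttl)
{
    return ((label & 0xFFFFFu) << 12) | ((uint32_t)(exp & 0x7) << 9)
         | ((uint32_t)(s & 0x1) << 8) | ttl;
}
static uint32_t shim_label(uint32_t entry) { return entry >> 12; }

/* A toy LIB for LSR B: the incoming label is a direct index, so no
   hash lookup is needed; oif == 0 marks a free slot. */
struct lib_entry { int oif; uint32_t outlabel; };
static struct lib_entry lib[64];

static int lsr_forward(uint32_t *entry)
{
    uint32_t in = shim_label(*entry);
    if (in >= 64 || lib[in].oif == 0)
        return -1;                       /* no binding for this label */
    /* Swap the label, keep Exp and S, and decrement the TTL. */
    *entry = shim_pack(lib[in].outlabel, (*entry >> 9) & 7,
                       (*entry >> 8) & 1, (uint8_t)((*entry & 0xFF) - 1));
    return lib[in].oif;
}

/* Reproduces B's behaviour in Figure 2.2: label 32 in, 33 out, oif 3. */
static int figure_2_2_swap(void)
{
    lib[32] = (struct lib_entry){ .oif = 3, .outlabel = 33 };
    uint32_t entry = shim_pack(32, 0, 1, 64);
    int oif = lsr_forward(&entry);
    return oif == 3 && shim_label(entry) == 33 && (entry & 0xFF) == 63;
}
```

Because the incoming label indexes the LIB directly, the per-packet cost is a single array access rather than the hash lookup needed for an IP FIB.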
Figure 2.2: LSP in MPLS. An LSP is set up from LSR A to C, with label 32 reserved between A and B and label 33 between B and C; traffic then flows through the LSP, with each LSR mapping (incoming interface iif, incoming label inlabel) to (outgoing interface oif, outgoing label outlabel), e.g. iif 2, inlabel 32 to oif 3, outlabel 33 at B.

Logically, MPLS routing can be partitioned into two basic components: the control plane and the forwarding plane. The forwarding component (plane) is responsible for the actual forwarding of packets from input to output across a switch or router. The forwarding component uses a forwarding table and the information carried in the packet itself to forward a packet. The forwarding table is maintained and managed by the control component.

Although in this thesis, and in the current literature, MPLS is used over IP networks, it is not intrinsically designed to work over IP. Instead, it supports any layer 3 protocol and is capable of operating over any link layer protocol; hence the name multi-protocol. However, this is beyond the scope of this thesis, and we restrict the discussion to IP only.

2.1.3 Label Distribution

Label Creation

Label creation for LSPs can be triggered by a topology change, the arrival of data packets, or tunnel requests. The three creation techniques related to these triggering mechanisms are topology-driven, data-driven and request-driven. In the topology-driven mechanism, LSPs are created before the data transmission, while the data-driven method uses the reception of a data flow, and the request-driven method uses user requests to initiate LSP label creation.

Label Assignment

A label can be assigned by a downstream node or an upstream node. Currently, the downstream assignment method is more widely used, especially in explicit routing. We use downstream assignment in this thesis.
Label Setup

Different label setup methods are allowed in MPLS, depending upon the routing methods that are used. In hop-by-hop routing, each LSR determines the next hop based on its own FTN table. This is similar to the method currently used in IP networks. In explicit routing, an explicit path (strict or loose) is specified by the LER before the LSP is established. This explicit path is normally chosen by the administrator or produced by CBR (Constraint Based Routing). Each LSR follows the explicit path to set up the tunnel and assign labels. Explicit routing is mainly used in traffic engineering.

Distribution Protocols

There are two kinds of protocols for label distribution: newly defined protocols, and protocols piggy-backed on an existing protocol. LDP (Label Distribution Protocol) is a set of new protocols defined by the MPLS working group, used solely for MPLS label distribution. It is useful when no existing protocol can be used. However, using it can be expensive, because a completely new protocol must be handled. Piggy-backing label distribution on an existing protocol is preferable. Candidate protocols include RSVP, OSPF, PIM and BGP. Adding extensions to these protocols can save a lot of time compared to defining and handling a completely new one, such as LDP.

In this thesis we use the RSVP ERO extension to support MPLS label distribution, which is one of the piggy-backing methods.

2.2 RSVP

RSVP (Resource ReSerVation Protocol) is a resource reservation setup protocol designed for an integrated services Internet [14]. RSVP is used by a host to request specific qualities of service from the network for particular application data streams or flows. It is also used by routers to deliver QoS requests to all the nodes along the paths of the flows, and to establish and maintain state for providing the requested service.
An RSVP request generally results in some resource being reserved in each node along the data path. RSVP operates on top of IP. However, RSVP does not transport application data; it is rather an Internet control protocol. Like ICMP and IGMP, RSVP works as a control part, not a forwarding part.

RSVP is not itself a routing protocol. An RSVP process consults the local routing database to obtain routes. RSVP is only concerned with the QoS of those packets that are forwarded in accordance with routing. For scalability, RSVP makes receivers responsible for requesting a specific QoS. The RSVP protocol carries the request to all the nodes along the reverse data path to the data source. RSVP establishes a soft state for the resource reservation, which means that RSVP sends periodic refresh messages to maintain the state along the reserved path(s). In the absence of refresh messages, the state automatically times out and is released.

2.2.1 Data Flows

RSVP defines a session to be a data flow with a particular destination and transport-layer protocol. An RSVP session is defined by the tuple of DestAddress, ProtocolId and DestPort. DestAddress is the IP destination address. ProtocolId is the IP protocol ID. The optional DestPort is a generalized destination port, used, for example, for further demultiplexing.

2.2.2 Reservation Model

An elementary RSVP reservation request consists of a flowspec together with a filterspec; this pair is called a flow descriptor. The flowspec specifies a desired QoS. The format and content of the flowspec are opaque to RSVP. The filterspec, together with a session specification, defines the set of data packets (the flow) to receive this QoS. In the most general approach, filter specs may select arbitrary subsets of the packets in a given session. The basic filterspec format defined in the present RSVP specification has a very restricted form: a sender IP address and, optionally, SrcPort.
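The soft-state idea described above, where reservation state survives only while refresh messages keep arriving, can be sketched minimally; the timer fields and the 30-unit lifetime below are illustrative assumptions, not values taken from RSVP or the thesis.

```c
#include <assert.h>

/* A reservation's soft state: refreshed periodically, released on timeout. */
struct soft_state { long expires_at; long lifetime; };

static void refresh(struct soft_state *s, long now)
{
    s->expires_at = now + s->lifetime;   /* a Path/Resv refresh arrived */
}
static int alive(const struct soft_state *s, long now)
{
    return now < s->expires_at;          /* past this, the state is released */
}

static int soft_state_example(void)
{
    struct soft_state s = { 0, 30 };     /* hypothetical 30-unit lifetime */
    refresh(&s, 0);
    if (!alive(&s, 29)) return 0;        /* still within the lifetime    */
    refresh(&s, 29);                     /* a refresh message extends it */
    if (!alive(&s, 58)) return 0;
    return !alive(&s, 60);               /* no more refreshes: times out */
}
```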
RSVP messages carrying reservation requests originate at receivers and are passed upstream towards the sender(s). At each intermediate node, a reservation request triggers two general actions, as follows:

• Make a reservation on a link. The RSVP process passes the request to admission control and policy control. If either test fails, the reservation is rejected, and an error message is returned to the appropriate receiver(s). If both succeed, the node sets its forwarding plane (e.g., packet classifier) and forwards the request upstream if needed.

• Forward the request upstream. A reservation request is propagated upstream towards the appropriate senders. The reservation request that a node forwards upstream may differ from the request that it received from downstream because of, for example, traffic merging.

2.2.3 Reservation Styles

The reservation styles are defined below:

• A Wildcard-Filter (WF) style implies a shared reservation and wildcard sender selection.

• A Fixed-Filter (FF) style implies a distinct reservation and explicit sender selection, which creates a distinct reservation for data packets from a particular sender, instead of sharing it with other senders' packets for the same session.

• A Shared-Explicit (SE) style implies a shared reservation and explicit sender selection, which creates a single reservation shared by selected upstream senders.

2.2.4 Encapsulation of RSVP Messages

Generally, RSVP messages are sent and received as IP datagrams using protocol 46. However, they can also be encapsulated in UDP if the host systems do not support raw network I/O.

2.2.5 Basic Mechanism

There are two fundamental RSVP messages: Path and Resv. Each RSVP sender host transmits RSVP Path messages downstream along the uni-/multicast routes provided by the routing protocol(s), following the paths of the data. These Path messages store a path state in each node along the way. Each receiver host sends
Each receiver host sends 16 Figure 2.3: Using RSVP RSVP reservation request (Resv) messages upstream towards the senders. They create and maintain a reservation state in each node along the path. Figure 2.3 is an example that uses RSVP for multicasting, with SE as the reservation style. 2.2.6 Messages and Objects i n R S V P Each RSVP message consists of a common header, followed by a body consisting of a variable number of variable-length, typed objects. Common Header Figure 2.4 shows the structure of the common header. Ver Protocol version number. The current version is 1. Flags Not yet defined. Msg Type l=Path,2=Resv,3=PathErr,4=ResvErr,5=PathTear,6=ResvTear,7=ResvConf 17 Ver (4) Flags (4) Msg Type (8) R S V P Checksum (16) Send_TTL (Reserved) R S V P Length Figure 2.4: R S V P Common Header RSVP Checksum The one's complement of the one's complement sum of the message. Send-TTL IP T T L value with which the message is sent. RSVP Length The total length of this R S V P message in bytes, including the common header. RSVP Object Every object is composed of one or more 32-bit words with a one-word header, with the format shown in Figure 2.5. Length Total length of the object in bytes. Class-Num Defines object class. C-Type Object type. Unique within each Class-Num. Combined with Class-Num, determines an object. Message Types Path A Path message travels from a sender to a receiver(s) along the same path(s) used by the data packets. It contains a S E N D E R _ T E M P L A T E object defining 18 Length (bytes) (16) Class-Num (8) C-Type (8) (Object contents) Figure 2.5: RSVP Object Format the format of the data packets and a SENDER_TSPEC object specifying the traffic characteristics of the flow. Resv A Resv message carries a reservation request hop-by-hop from the receiver to the sender, along the reverse path of the data flow for the session. PathTear A PathTear is used by an upstream node to request the downstream node (next hop) to tear down the session. 
Receipt of a PathTear message deletes the matching path state. A PathTear must be routed exactly like the corresponding Path message.

ResvTear: A ResvTear message is initiated by the receiver, or by any node in which the reservation state has timed out, and travels upstream towards all matching senders. Receipt of a ResvTear deletes the matching reservation state. A ResvTear must also be routed exactly like the corresponding Resv message.

PathErr: A PathErr message reports errors in processing Path messages. It travels upstream towards the sender and is routed hop-by-hop using the path state. PathErr messages do not modify the state of any node through which they pass; the error is only reported to the sender.

ResvErr: A ResvErr message reports errors in processing the Resv message. It travels downstream towards the appropriate receiver, routed hop-by-hop using the reservation state.

ResvConf: A ResvConf is sent downstream to the receiver when requested, to acknowledge the reservation request.

2.2.7 RSVP-TE Extension for MPLS

RSVP-TE is an extension of RSVP for establishing label switched paths in MPLS networks. It was initially proposed by Daniel O. Awduche, Lou Berger, et al. [10]. The RSVP-TE protocol supports the instantiation of explicitly routed LSPs, with or without resource reservation. The LSPs created with RSVP can be used to carry traffic trunks. Since the traffic that flows along an LSP is defined by the label applied at the ingress node, and is opaque to all intermediate nodes, these paths can be treated as tunnels: LSP tunnels. The signaling protocol model defines the specification of an explicit path as a sequence of strict or loose routes. Although it is not mandatory, resource reservation can be done together with LSP tunnel creation. LSPs without resource reservation can be used for, for example, best-effort network traffic.
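Returning to the common header of Section 2.2.6 (Figure 2.4), its byte layout and checksum rule can be sketched concretely. The helper names are ours, and for simplicity the example builds a bodyless message whose RSVP Length is just the 8-byte header; real messages append objects and checksum the whole message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One's complement of the one's complement 16-bit sum, as in IP. */
static uint16_t rsvp_cksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);   /* fold the carries */
    return (uint16_t)~sum;
}

/* Fill in an 8-byte common header for a message with no objects. */
static void rsvp_header(uint8_t out[8], uint8_t msg_type, uint8_t send_ttl)
{
    memset(out, 0, 8);
    out[0] = 1u << 4;          /* Ver = 1, Flags = 0           */
    out[1] = msg_type;         /* 1 = Path, 2 = Resv, ...      */
    out[4] = send_ttl;         /* Send_TTL                     */
    out[7] = 8;                /* RSVP Length, header included */
    uint16_t c = rsvp_cksum(out, 8);   /* checksum field still zero */
    out[2] = (uint8_t)(c >> 8);
    out[3] = (uint8_t)c;
}

/* A message carrying a correct checksum sums to all ones, so running
   rsvp_cksum over the whole message yields zero. */
static int path_header_example(void)
{
    uint8_t h[8];
    rsvp_header(h, 1, 64);     /* a bare Path header */
    return h[0] == 0x10 && h[1] == 1 && rsvp_cksum(h, 8) == 0;
}
```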
Basic Concepts

When RSVP and MPLS are combined, a session can have a more flexible meaning, compared to the data flow defined in RSVP. The ingress node can use a variety of means to determine which packets are assigned a particular label, according to its admission control policy.

In RSVP-TE, a SESSION object uniquely defines a traffic engineered tunnel (TE tunnel), which may consist of several LSP tunnels, for example, to spread traffic over multiple paths or to reroute traffic during failure. A tunnel ID and an extended tunnel ID are parts of a SESSION object. SENDER_TEMPLATE (downstream) and FILTER_SPEC (upstream) objects carry the LSP_ID. An LSP tunnel is uniquely identified by the SESSION object, together with the SENDER_TEMPLATE or FILTER_SPEC object.

To create an LSP tunnel, the first MPLS node of the path (ingress) creates an RSVP Path message carrying a SESSION object with a session type of LSP_TUNNEL_IPV4 or LSP_TUNNEL_IPV6 and a LABEL_REQUEST object. The LABEL_REQUEST object indicates that label binding for this path is requested. The ingress may also add an EXPLICIT_ROUTE object (ERO) to the Path message. The ERO specifies the route as a sequence of abstract nodes. The Path message is forwarded along the path specified by the ERO when there is one. When this Path message is received by the destination of the LSP path, the destination (normally the egress) responds to the label request by including a LABEL object in its response RSVP Resv message. The Resv message is sent back upstream toward the sender, following the path state created by the Path message, in reverse order. When a node receives a Resv message with a LABEL object inside of it, the node binds this label as the outgoing label. If a node can not bind a label, it sends a PathErr message with an error code inside, back to the sender. As the Resv message propagates towards the sender node, an LSP path is established.
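The per-hop label binding step just described can be sketched as follows. This is a simplified illustration, not the thesis code: the structure names and the toy label allocator are invented for the example.

```c
#include <assert.h>

#define MPLS_LABEL_MAX 1048575  /* generic labels range from 0 to 1048575 */

struct lsp_state {
    int in_label;   /* label advertised upstream (our incoming label) */
    int out_label;  /* label received from downstream (our outgoing label) */
};

static int next_free_label = 32;  /* toy allocator; a real LSR manages a pool */

static int alloc_label(void)
{
    return (next_free_label <= MPLS_LABEL_MAX) ? next_free_label++ : -1;
}

/* Called when a Resv carrying a LABEL object arrives from the next hop.
 * Binds the received label as the outgoing label and returns the label
 * to place in the Resv forwarded upstream, or -1 (which would trigger
 * a PathErr) if the label can not be bound. */
int handle_resv_label(struct lsp_state *lsp, int downstream_label)
{
    if (downstream_label < 0 || downstream_label > MPLS_LABEL_MAX)
        return -1;
    lsp->out_label = downstream_label;  /* bind as outgoing label */
    lsp->in_label = alloc_label();      /* advertise this one upstream */
    return lsp->in_label;
}
```

As the Resv travels hop-by-hop toward the ingress, each node repeats this step, building the incoming-to-outgoing label mapping that the data plane later uses.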
Major Message and Object Formats

Extra objects such as LABEL_REQUEST and LABEL are added, and some extensions to existing messages are introduced to RSVP for supporting LSP signaling.

Messages:

Path
<COMMON HEADER> [<INTEGRITY>] <SESSION> <RSVP_HOP> <TIME_VALUES>
[<EXPLICIT_ROUTE>] <LABEL_REQUEST> <sender descriptor>

<sender descriptor> ::= <SENDER_TEMPLATE> <SENDER_TSPEC>

Resv
<COMMON HEADER> [<INTEGRITY>] <SESSION> <RSVP_HOP> <TIME_VALUES>
<STYLE> <flow descriptor list>

<flow descriptor list> ::= <FF flow descriptor list> | <SE flow descriptor>

<FF flow descriptor list> ::= <FLOWSPEC> <FILTER_SPEC> <LABEL>
    | <FF flow descriptor list> <FF flow descriptor>

<FF flow descriptor> ::= [<FLOWSPEC>] <FILTER_SPEC> <LABEL>

<SE flow descriptor> ::= <FLOWSPEC> <SE filter spec list>

<SE filter spec list> ::= <SE filter spec> | <SE filter spec list> <SE filter spec>

<SE filter spec> ::= <FILTER_SPEC> <LABEL>

PathErr
<COMMON HEADER> [<INTEGRITY>] <SESSION> <ERROR_SPEC> [<sender descriptor>]

ResvErr
<COMMON HEADER> [<INTEGRITY>] <SESSION> <RSVP_HOP> <ERROR_SPEC> <STYLE>
[<error flow descriptor>]

PathTear
<COMMON HEADER> [<INTEGRITY>] <SESSION> <RSVP_HOP> [<sender descriptor>]

ResvTear
<COMMON HEADER> [<INTEGRITY>] <SESSION> <RSVP_HOP> <STYLE> <flow descriptor list>

Hello
<common header> <HELLO>

The RSVP Hello extension enables RSVP nodes to detect when a neighbouring node is not reachable. The Hello extension is composed of a HELLO_REQUEST object and a HELLO_ACK object. Each object carries source and destination instance values.

Figure 2.6: Generic Label Request Object

Figure 2.7: Generic Label Object (label (32))

Objects:

LABEL_REQUEST Class=19, C_TYPE=1. L3PID: identifier of the layer 3 protocol using this path. Standard Ethertype values are used.

LABEL Class=16, C_TYPE=1. 4 octets. Generic MPLS labels range from 0 to 1048575.
ERO Class=20, C_TYPE=1. An ERO consists of multiple sub-objects, each of which represents an address of an abstract node.

LSP_TUNNEL_IPV4 Session Object Class=SESSION, C_TYPE=7

LSP_TUNNEL_IPV4 Sender Template Object Class=SENDER_TEMPLATE, C_TYPE=7

HELLO_REQUEST Class=HELLO, C_TYPE=1

HELLO_ACK Class=HELLO, C_TYPE=2

Figure 2.8: LSP_TUNNEL_IPV4 Session Object (IPv4 tunnel end point address (32) | must be zero (16) | Tunnel ID (16) | Extended Tunnel ID (32))

Figure 2.9: LSP_TUNNEL_IPV4 Sender Template Object (IPv4 tunnel sender address (32) | must be zero (16) | LSP ID (16))

Figure 2.10: HELLO REQUEST Object (Src_Instance (32) | Dst_Instance (32))

Figure 2.11: HELLO ACK Object (Src_Instance (32) | Dst_Instance (32))

2.3 Linux

2.3.1 Overview

Linux has become very popular since it emerged in 1991. Generally, as an OS it can be divided into two parts: user-level processes and kernel-side (protected) processes. The kernel contains device drivers, memory management, process management and communication management. Communication between user-level processes and the kernel is achieved through message passing using system calls. In the following, we describe only the aspects closely related to this thesis; a large number of references about Linux can be found on the web. The kernel version which we use as the platform is Linux 2.4.2.

2.3.2 System Call

In Linux, system calls are used by user processes to request kernel services. The signal system call, for example, is used to install a user-defined signal handler for a specific signal.

2.3.3 /proc File System

In Linux, /proc files and directories do not actually exist. The /proc file system registers itself with the Virtual File System. For example, the kernel's /proc/devices file is generated from the kernel's data structures describing its devices. The kernel traps /proc file access, and instead of executing ordinary file operations on them, it calls special (individually registered) functions.
When a file in the /proc directory is created, it is registered with a set of functions that tell the kernel what to do when the file is read from or written to. The /proc file system thus presents a user-readable window into the kernel's inner workings. Some of the files can be modified like a regular file, which facilitates changing the kernel's working mode.

2.3.4 Linux Network

Linux TCP/IP Networking Layers

Figure 2.12 shows how Linux implements the internet protocol address family as a series of layered software. BSD sockets are managed by generic socket management software, which is concerned only with BSD sockets. Supporting it is the INET socket layer, which manages the communication end points for the IP-based protocols TCP and UDP. Underneath the IP layer are network devices, for example, Ethernet and PPP.

Figure 2.12: Linux TCP/IP layers (network applications in user space; BSD socket interface, INET sockets, the TCP/UDP/IP protocol layers, and network devices such as PPP, SLIP and Ethernet in the kernel)

BSD Socket Interface

The BSD socket interface is a general interface which not only supports various forms of networking, but is also an inter-process communication mechanism. Linux supports several classes of sockets, and these are known as address families. The socket address families or domains supported by Linux include the following:

UNIX Unix domain sockets
INET TCP/IP
AX25 Amateur radio X25
IPX Novell IPX
APPLETALK Appletalk DDP
X25 X25 protocol

There are several socket types, and they represent the type of service that supports the connection. Linux BSD sockets support a number of socket types:

Stream supported by the TCP protocol of the INET address family
Datagram supported by the UDP protocol of the INET family
Raw
Reliable Delivered Messages
Sequenced Packets
Packet

Communication among processes via sockets uses the client-server model. A server provides a service, and clients make use of that service.
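As a minimal illustration of the INET address family and the datagram socket type, the following sketch creates a UDP socket, binds it to an ephemeral port on the loopback address, and queries the address the kernel assigned. The function name and error handling are ours, not from the thesis implementation.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the bound port in host byte order, or -1 on error. */
int open_udp_loopback(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);  /* INET family, datagram type */
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  /* let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    int port = ntohs(addr.sin_port);
    close(fd);
    return port;
}
```

The control plane described in Chapter 4 uses exactly this kind of INET datagram socket to carry its RSVP-TE signaling messages.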
Linux abstracts the socket interface with the BSD socket layer. The exact meaning of operations on a BSD socket depends on its underlying address family.

2.3.5 Configuration as a Router

In Linux, several tools are provided for configuring the network functions. The tool ifconfig is used to configure network interfaces, IP addresses, netmasks, and so forth. The tool route is used to add and delete routing entries. To enable a machine to work as a router, the contents of the file /proc/sys/net/ipv4/ip_forward should be 1. More information about the Linux router can be found in [15].

2.4 Transmission Control Protocol

2.4.1 TCP

Transmission Control Protocol (TCP) is a connection-oriented, end-to-end reliable protocol designed to fit into a layered hierarchy of protocols which support multi-network applications [16]. We use TCP applications to test system performance.

• Reliability At the sender, TCP assigns a sequence number to each octet transmitted and requires a positive acknowledgment (ACK) from the receiver. If the ACK is not received within a timeout interval, the data is retransmitted. The receiver uses sequence numbers to correctly order segments.

• Flow Control TCP provides a means for the receiver to govern the amount of data sent by the sender. This is achieved by returning a window with every ACK, indicating a range of acceptable sequence numbers beyond the last successfully received segment.

• Connection In TCP, each connection is uniquely specified by a pair of sockets. (A socket is identified by an IP address and a port number.) Two processes must first establish a connection before they can communicate.

Figure 2.13 shows the structure of a TCP header.
Figure 2.13: TCP Header (Source Port, Destination Port, Sequence Number, Acknowledgment Number, Data Offset, Reserved, flag bits URG/ACK/PSH/RST/SYN/FIN, Window, Checksum, Urgent Pointer, Options, Padding)

2.4.2 TCP High Performance Extensions

One performance problem for TCP when it is applied on a large-bandwidth network is the window size limit. A TCP header uses a 16-bit field to report the receiver window size to the sender; therefore, the largest window that can be used is 64KB. Another problem arises from the cumulative acknowledgement mechanism, which can have a devastating effect on the throughput of TCP, because all the packets following a lost packet must be retransmitted even if only a single one gets lost. To address the above and some other problems of TCP, the TCP High Performance Extensions were introduced in RFC 1323 [18]. By defining a new TCP option, Window Scale, and using Selective Acknowledgement (SACK), the window size is allowed to be larger than 64KB, and the problems caused by cumulative acknowledgement can be avoided.

2.4.3 TCP in Linux

The TCP protocol is implemented in Linux according to the specifications of RFC 793, RFC 1122 and RFC 2001. Linux 2.4.2 supports the RFC 1323 TCP high performance extensions, including large TCP windows and selective acknowledgement (SACK). In order to make use of them, the send and receive buffer sizes must be increased, and the SACK function should be enabled. These can be set globally with the net.core.wmem_default, net.core.rmem_default and net.ipv4.tcp_sack sysctls.

2.5 OSDN MPLS Project

2.5.1 Introduction

There is a project group, MPLS-Linux, in the Open Source Development Network (OSDN) working on implementing an MPLS stack in the Linux kernel [4]. The MPLS-Linux package works as the forwarding component of MPLS. The current version is 0.996. We use version 0.990 in this thesis.
The package is implemented in C, and supplied as a difference file created with the utility diff. For performance reasons, the implementation (additions and modifications) is spread across many Linux kernel files; however, this makes its processing harder to understand, because it is not modularized. It uses inlabel and outlabel to represent the label of the incoming packet and the label which is put onto the outgoing packet, respectively. It uses labelspace to enable global or local label contexts. Labels on different interfaces can exist in the same context or in their own contexts, depending on the labelspace definition for each interface. The implementation also supports label stacks and ATM, PPP, and so forth, as the lower layer. We use it over Ethernet. The /proc/net/mpls_* files provide a window for viewing the internal workings of the MPLS forwarding part.

2.5.2 Data Structures

Three major data structures are defined. They are as follows:

ILM Incoming Label Map. This is used to help interpret the incoming label. An ILM table consists of all the incoming labels that an LSR or egress LER recognizes. The content of each ILM entry is the label, opcode, FEC and an optional link to the outgoing data structure. The general operation for incoming label processing is as follows:

• extract the label from the top shim
• look up the label in the ILM table
• further process the packet based on the opcode stored with the label

NHLFE Next Hop Label Forwarding Entry. The NHLFE is used to assist in adding outgoing labels. The NHLFE table consists of all the labels which can be pushed onto packets. An NHLFE entry contains the label, outgoing interface and next hop information.
The general processing that occurs when a packet reaches the NHLFE table is as follows:

• form a new shim that contains the label
• push the shim onto the packet
• forward the packet to the next hop via the outgoing interface

FTN FEC To NHLFE. The third structure is used to help the ingress LER figure out what label should be attached to a packet. The FTN table consists of all FECs onto which labels need to be added. An FTN entry contains an FEC and an NHLFE entry. The general order of operations for processing an FTN is as follows:

• decide what FEC a packet belongs to
• find the FEC in the FTN table
• forward the packet to the NHLFE entry that corresponds to the FTN

2.5.3 Label and Labelspace

The generic label is a 32-bit integer, the highest 12 bits of which are the labelspace. The labelspace is the label context of the interface this label is associated with, either as an incoming or an outgoing label. All inlabels are organized as a radix tree with a width of 4 and a depth of 8, while outlabels are stored in another radix tree.

2.5.4 Operation and Interface

The Linux-MPLS package provides an interface through which it can communicate with user-level control components. To achieve this, it extends the IOCTL and RTNETLINK message types.

• IOCTL
  – labelspace: setting the labelspace
  – MPLS tunnel: adding and/or deleting an MPLS tunnel
• RTNETLINK
  – inlabel: adding and/or deleting an inlabel, setting the opcode

An inlabel is represented in the form <label type>:<label value>:<label space>. For example, gen:32:0 represents a generic label valued 32 with 0 as its labelspace. Opcodes or instructions for an inlabel include the following:

* pop: remove the top label from the label stack
* push: push another label onto the label stack
* dlv: send this packet to the layer 3 protocol
* peek: look at the label on top of the label stack, and start executing the instructions associated with it
* set: set the incoming interface to something different from the real incoming interface
* fwd: send the packet to an outgoing label structure to be processed

outlabel: adding and/or deleting an outlabel, setting the opcode. An outlabel has the form <label type>:<value>:<interface>:<nexthop family>:<nexthop address>. For example, gen:33:eth0:ipv4: represents a generic outlabel valued 33 with eth0 as its outgoing interface, and an IPv4 address as the next hop to which the packet is forwarded. Opcodes for outlabels include the following:

* pop: remove the top label from the label stack
* push: push another label onto the label stack
* set: copy the outgoing interface and the next hop layer 2 destination from the outgoing label structure
* fwd: send the packet to an outgoing label structure to be processed

FEC: adding and/or deleting the mapping of an FEC to an NHLFE. One limitation is that the FEC entry should already exist in the layer 3 routing table before it can be used, because the implementation makes use of the data structure of the original layer 3 routing table.

Figure 2.14: Operations in Linux-MPLS (at A: add outlabel gen:32:1:ipv4:, set opcode push:gen:32:1, map the FTN to gen:32:1; at B: set labelspace 0 on interface 0, add inlabel gen:32:0, set opcode pop:fwd:gen:33:1, add outlabel gen:33:1:ipv4:, set opcode push:gen:33:1; at C: set labelspace 0 on interface 0, add inlabel gen:33:0, set opcode pop:dlv)

Figure 2.14 is an example of creating an MPLS tunnel from node A to C for the FEC with 192.168.1/ as the destination prefix. Suppose all the labels that are to be assigned and reserved along this LSP (this is the task of the control plane of MPLS, using RSVP-TE for example) are: at A, an outlabel of 32 for interface 1; at B, inlabel 32 on interface 0 and outlabel 33 for interface 1; and at C, inlabel 33 on interface 0. To create this LSP tunnel, the operations that the control part should perform at each of the nodes are shown in the figure.
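The interplay of the ILM and NHLFE structures described in Section 2.5.2 can be sketched in C. This is a simplified user-space illustration with invented types; the real MPLS-Linux code lives in the kernel and stores labels in radix trees.

```c
#include <stddef.h>

struct nhlfe {            /* Next Hop Label Forwarding Entry */
    int out_label;        /* label pushed onto the packet */
    int out_ifindex;      /* outgoing interface */
};

struct ilm {              /* Incoming Label Map entry */
    int in_label;
    struct nhlfe *next;   /* optional link to the outgoing structure */
};

/* Transit-node processing: interpret the incoming label, swap it for
 * the outgoing label (pop + push), and pick the outgoing interface.
 * Returns 0 on success, -1 if the label is unknown (packet dropped). */
int forward_labeled(const struct ilm *table, size_t n, int label,
                    int *out_label, int *out_ifindex)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].in_label == label && table[i].next) {
            *out_label = table[i].next->out_label;
            *out_ifindex = table[i].next->out_ifindex;
            return 0;
        }
    }
    return -1;
}
```

Node B in Figure 2.14 corresponds to one such ILM entry (inlabel 32, opcode pop:fwd) linked to an NHLFE (outlabel 33, interface 1).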
2.6 Failure Recovery

There are two main approaches to failure recovery: failure protection and failure restoration. Protection requires that resources be preallocated, while restoration relies on dynamic resource establishment. It may take a longer time for a restoration strategy to recover from failure, because extra steps are needed to reserve resources. In this thesis, we basically focus on failure protection, in which resources are preallocated.

Chapter 3

Design

3.1 Overview

In this chapter, a fast failure recovery algorithm is designed for MPLS. We try to respond to network faults quickly, and to reduce packet loss as much as possible, using our fast reroute mechanism. The model is designed for Linux kernel 2.4.2. We assume that the network topology and the primary and protection routes are already available; these can be obtained from OSPF or from configuration files.

3.2 Label Binding

Downstream label binding is used in our design. A label is assigned by the downstream node; the upstream node uses this label as the outgoing label which is put into the outgoing packet, while the downstream node uses it as the incoming label. The MPLS data plane (Section 2.1) uses this incoming-outgoing label mapping to forward packets after the LSP tunnel is set up.

The creation or destruction of label bindings can be triggered by data packets (referred to as data-driven label binding) or by control information (referred to as control-driven label binding). In the data-driven approach, a label binding is triggered when the first packet, or first several packets, of a flow are seen. However, this mechanism, which is often used for best-effort traffic, is not applicable here, because the primary and protection tunnels need to be created and initialized in advance so that fast recovery is ready when a failure happens. Also, the control-driven approach is preferred for its scalability and simplicity.
In this thesis, all primary and protection tunnels come from control information, so the control-driven approach is adopted.

3.3 Primary and Protection Routes

In our model, primary and protection tunnels are chosen and set up starting from ingress LERs. Protection routes may come from the network topology (obtained from OSPF, for example) and/or from control information. An ideal protection route is one whose links are disjoint from the primary tunnel, because this ensures that protection is always available for any link or node fault along the primary path. For simplicity, we call this ideal backup tunnel the Disjoint tunnel in this thesis. We suppose all protection routes are already set up.

An ingress router plays a key role in our recovery, since all backup routes are chosen and stored at ingress routers. Compared with the distributed approach, where backup routes are selected at every node to protect individual links, this centralized approach may have a disadvantage with respect to scalability. For example, in Figure 3.1, to achieve the same protection robustness against failures of links BC and CD, a distributed approach needs 2 local backup routes for single-link protection, whereas 3 backup routes are needed here.

Figure 3.1: Protection Routes. To protect the route A-B-C-D against failure of link BC and/or CD, backup routes A-B-E-C-D, A-B-E-C-F-D and A-B-C-F-D can be used.

Importantly, our centralized strategy makes it easier to apply global resource allocation policies, and simplifies the implementation of internal routers. Scalability issues are further discussed in Chapter 6.

3.4 Failure Detection

Link failure can be detected at either the physical layer or the data link layer. Normally, when detection occurs at a lower level, a fault can be sensed more quickly.
In our model, we can not obtain a failure detection service at lower layers, and hence failure is detected in the MPLS layer (the shim layer between the IP layer and the data link layer).

In MPLS, PATH/RESV messages are used for creating and binding labels. A timeout of these messages is used for tearing down LSP tunnels. However, it can not be used for failure detection, because the refresh period of these messages is in seconds, for scalability reasons. The HELLO message is defined in RSVP-TE for detecting and recognizing neighbouring LSRs. Its refresh period can be set as short as 5 milliseconds or less. We use it to detect network failure in our design.

3.5 Failure Report

After a failure is detected by the locus node (the node closest to the fault), a failure notification is reported back to the ingress LER. A failure is detected and reported to the ingress node by the upstream node.

In [3], Tae H. Oh, Thomas M. Chen and Jeffery L. Kennington use the downstream node to detect and process the failure. One disadvantage of using the downstream node for this role is that sometimes no route may exist between the locus node and the ingress LER over which the network failure can be reported. This can be seen in Figure 3.2: when link BC fails, downstream node C can not find a path through which the failure can be reported to the ingress LER A. Although this limitation can be avoided by reporting the error to the egress LER D, and letting it then communicate with LER A, this nevertheless introduces an inconsistency between backup route selection and failure reporting, and unnecessarily delays the final notification arrival at LER A. When the upstream node is used for this role, there always exists a path for this task: the reverse direction of the primary route, the same one that RESV messages traverse.
Hence, upstream nodes are used to detect and report failures; this approach is simpler, responds more quickly, and is more efficient for failure handling.

Figure 3.2: Failure Report. When link BC breaks, there is no direct path for downstream C to report the failure to ingress LER A.

3.6 Failure Processing

After a failure is detected and reported from the locus node to the ingress node, through the reverse direction of either a primary path or a protection path, it can then be processed at the ingress node. The ingress LER chooses an alternative tunnel to which all following flows are redirected, and stops the former FTN mapping. Depending on the type and location of the failure, as well as the control policies, it selects a Disjoint backup tunnel if one is available, or otherwise another backup tunnel. The problematic tunnel is invalidated and allowed to die out gradually, without being torn down explicitly. The session still exists, but with no label binding, unless a deletion instruction from the upper layer is received (possibly because the network topology finally converges through, for example, OSPF). This lazy deletion strategy avoids the unnecessary system cost of releasing and recreating tunnel sessions when the failure is a temporary event.

Besides notifying the ingress LER and letting it make the final recovery operation when a failure is detected, the locus node also adopts a local recovery step. If the recovery relied completely on the ingress LER, all packets would be lost after the failure until the recovery operation is completed at the ingress LER. Depending on the length of this time, the total number of packets lost may vary.

Figure 3.3: Loopback Tunnel. When the link BC fails, locus node B switches traffic back to ingress LER A through the loopback tunnel.
Given this observation, we attempt to be more aggressive and try to save as many packets from being lost as we can, using a protection loopback tunnel for each primary route to reroute traffic back to the ingress LER during that temporary period. The protection loopback tunnel is created and signaled along the reverse direction of the primary tunnel as soon as the primary tunnel is set up. When a failure occurs, the locus node first reroutes the affected traffic back to the ingress LER through this loopback tunnel, and simultaneously notifies the ingress LER of the fault. This can be thought of as a temporary repair step before the final one is made at the ingress LER. Figure 3.3 shows the use of a loopback tunnel. Thus, we achieve fast recovery through the combination of a centralized and a local recovery mechanism.

3.7 Operations at the Locus LSR

When a failure is detected, the operations at the locus LSR are as follows:

• Finding a protection tunnel An eligible protection tunnel should, at the least, not include the downstream link, and preferably not include the downstream node, in order to cover downstream node failure. If a regular protection tunnel can not be found, the protection loop is selected. In brief, the preferential order for choosing a protection tunnel is as follows:

— Node Failure Protection Tunnel
— Link Failure Protection Tunnel
— Loopback Tunnel

• Switching the traffic Reroute the traffic to the chosen protection tunnel.

• Notifying the ingress LER of the failure After a failure happens, the locus LSR stops sending refresh RESV messages to its PHOP (the upstream node) until it receives a refresh RESV message from the NHOP (the downstream node). The failed LSP tunnel is invalidated through the timeouts of the RESV messages. Receipt of an RESV message from the NHOP for a failed tunnel is an indication that the failed link or node has recovered, and reactivates the tunnel.
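The preferential order above can be sketched as a small selection routine. The types and availability flags are illustrative, not taken from the implementation:

```c
/* Sketch of the locus LSR's protection tunnel preference: first a
 * tunnel that survives downstream node failure, then one that only
 * avoids the failed link, and finally the loopback tunnel. */
enum tunnel_kind {
    TUNNEL_NONE = 0,
    TUNNEL_NODE_PROTECT,  /* avoids the downstream node */
    TUNNEL_LINK_PROTECT,  /* avoids the downstream link only */
    TUNNEL_LOOPBACK       /* reverse tunnel back to the ingress LER */
};

struct locus_state {
    int have_node_protect;
    int have_link_protect;
    int have_loopback;    /* set up with the primary tunnel, so normally 1 */
};

enum tunnel_kind choose_protection(const struct locus_state *s)
{
    if (s->have_node_protect)
        return TUNNEL_NODE_PROTECT;
    if (s->have_link_protect)
        return TUNNEL_LINK_PROTECT;
    if (s->have_loopback)
        return TUNNEL_LOOPBACK;
    return TUNNEL_NONE;
}
```

The loopback tunnel is the fallback of last resort: it buys time by returning traffic to the ingress LER while the failure notification is still in flight.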
Figure 3.4: Route Types (Disjoint Protection Tunnel, Primary Tunnel; abfd: backup tunnel which can protect against node c failure; abfcd: backup tunnel which protects against link cd failure)

3.8 Operation at the Ingress Node

An ingress LER initializes the setup of all primary and protection tunnels. MPLS tunnel sessions are not deleted unless management instructions are received or the network topology changes. The ingress LER also concatenates the protection loopback tunnel with one protection tunnel, so that a recovery tunnel is ready to respond to a network fault.

The ingress LER learns of a failure through the failure notification carried in a PATHERR message generated at the locus node. In response to a failure, it tries to select a backup tunnel to which the traffic is rerouted. The preference order for choosing a backup tunnel is as follows:

• A Disjoint Tunnel

• Node Failure Protection Tunnel Basically, two possible factors cause a network fault: link failure and node failure. It is best to choose a protection tunnel resistant to both cases if it is not possible to distinguish between the two types of failure.

• Link Failure Protection Tunnel If no other choice is available, a link failure protection tunnel is chosen. If, unfortunately, the fault is a downstream node failure, the selected protection tunnel (which can only resist a link failure) is invalidated later, after a failure notification for that tunnel arrives at the ingress node. In this case, the ingress LER stops forwarding the affected traffic until at least one tunnel is available again.

The problematic primary tunnel is still used until it finally times out; it is re-activated if a refresh RESV message is received from the NHOP, or replaced by a new one if an instruction from the upper level arrives. It is up to the system whether traffic which has already been redirected to a protection tunnel needs to be switched back to the primary one. In our design, we choose to switch it back.
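The ingress LER's failure handling described in this section can be sketched as follows. All types are illustrative; a tunnel id of -1 stands for "not available":

```c
/* Sketch of the ingress LER's reaction to a failure notification
 * (PATHERR): redirect the FTN mapping to a backup tunnel chosen in
 * preference order (Disjoint, then node-protection, then
 * link-protection), and invalidate the primary lazily rather than
 * tearing it down. */
struct ingress_session {
    int active_tunnel;     /* tunnel id the FTN currently maps to */
    int primary_valid;     /* cleared after a failure notification */
    int disjoint, node_protect, link_protect;  /* tunnel ids, -1 if absent */
};

/* Returns the tunnel id traffic is switched to, or -1 if no backup is
 * available (forwarding stops until a tunnel recovers). */
int ingress_handle_patherr(struct ingress_session *s)
{
    s->primary_valid = 0;  /* lazy invalidation: let the state time out */

    if (s->disjoint >= 0)
        s->active_tunnel = s->disjoint;
    else if (s->node_protect >= 0)
        s->active_tunnel = s->node_protect;
    else if (s->link_protect >= 0)
        s->active_tunnel = s->link_protect;
    else
        return -1;         /* stop forwarding the affected traffic */
    return s->active_tunnel;
}
```

Note that the routine only changes the FTN mapping; the failed session itself is left to time out, matching the lazy deletion strategy of Section 3.6.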
3.9 Operation at the Egress LER

The only operation necessary for an egress LER as part of the failure recovery is to signal the protection loopback tunnel in the reverse direction for each primary tunnel. The first time an egress LER receives a PATH message for a primary tunnel, it creates a reverse protection tunnel session and sends out PATH messages in the reverse direction. As the PATH message traverses all nodes hop-by-hop along the reverse path of the primary route, a protection loopback tunnel is set up from the egress LER to the ingress LER (in the context of the data flow, not of the reverse loop tunnel).

3.10 Analysis of Potential Problems

Centralized failure recovery at the ingress LER, combined with local repair enabled by a reverse protection loop as previously described, makes it possible to respond quickly to network faults while dropping as few packets as possible. However, the effectiveness of using the protection loop may be influenced by several factors, which are analyzed in the following.

• Timing The failure recovery cycle is illustrated in Figure 3.5. T1 reflects the delay in detecting the occurrence of a failure. T2 and T4 reflect the delay for recovery operations to take effect after they are issued. T3 arises mainly from network propagation. If we ignore the influence of the operation delays T2 and T4, which we can not control, the goal in using a loopback tunnel is to avoid packet loss during T3. Observing that we always lose some packets during T1, T2 or T4, we may ask whether our approach would be undermined by the packet loss which must happen in T1, T2 or T4. For example, if T1 is much longer than, or comparable with, T3, does this loopback tunnel protection still make sense?

• Reordering Another issue which may undermine our strategy is the reordering of packets, which means that the destination may receive packets out of order.
Out-of-order delivery of packets occurs when the final recovery step is finished at the ingress node: all the following packets go through the backup tunnel, while some earlier packets still in the loopback tunnel must arrive later at the ingress LER. This out-of-order problem lasts until all the packets already in the loopback tunnel have been redirected to the backup tunnel by the ingress LER.

Figure 3.5: Recovery Cycle (events: failure happened, failure detected, local repairing issued, local repairing in effect, failure notification sent, notification arrived, recovery completed; intervals T1-T4)

Figure 3.6 shows this problem. After the recovery steps are completed at the ingress LER, packets 101 and 102 are already on their way in the new backup tunnel, while packets 98, 99 and 100 arrive later at the ingress node through the loopback tunnel and are forwarded into the chosen protection tunnel. The arrival sequence of these packets at the destination may be 101-102-98-99-100 instead of the in-order delivery 98-99-100-101-102.

Figure 3.6: Out-of-order Problem

• End-to-end Protocol From the previous explanation, it is clear that the end-to-end protocol used may influence the performance of our packet-saving endeavor using loopback tunnels. If the end-to-end protocol can not tolerate out-of-order delivery, or can not reorder packets, this aggressive strategy (using loopback tunnels) may not be suitable. Also, if there is no buffering mechanism in the end-to-end protocol, the loopback strategy may need to be carefully reconsidered. TCP is a widely used end-to-end protocol which guarantees reliable transport (see Section 2.4). To make use of the benefits of our loopback strategy, the two TCP options, SACK (Section 2.4.2) and large buffering (to hold all out-of-order packets), may be needed.
We note that before our recovery commands are issued and take effect, some packets are already lost. Without SACK, TCP may resend all the packets following the first lost one, even if only one packet was lost. This means that TCP without SACK may not be able to make use of the packets saved by our loopback strategy. The buffering capacity reflects the ability of TCP to handle packet reordering. If the buffer (window size) is too small, TCP may too quickly resend packets it considers lost, although they may arrive a little later. In addition, TCP has a congestion control mechanism which curbs the sending of new packets if an acknowledgement is not received within the expected time-out period. This congestion control may also help alleviate the reordering problem. Linux 2.4.2 supports SACK and the large buffering option (Section 2.4.3), which facilitates our design. In this project, UDP is also used for our evaluation. Packet loss is measured in UDP to evaluate our packet-saving strategy.

The implementation of the above design over Linux clusters is given in Chapter 4. Performance evaluation over TCP and UDP is given in Chapter 5.

Chapter 4

Implementation

4.1 Overview

As described in Section 2.5, the MPLS forwarding plane (data plane) used is from OSDN. We used Linux kernel 2.4.2 workstations as our LSRs, and implemented the basic control plane in user space. The control plane uses RSVP-TE as the label distribution protocol in accordance with the IETF draft [10], and realizes our fast recovery algorithm. The system infrastructure, key data structures, and their interoperation are described in this chapter.

4.2 System Structure

Our MPLS control plane runs as a user-level program in Linux. We use UDP to transport LSP signaling messages, and adopt RSVP-TE as the label distribution protocol [10].
NETLINK messages are used to convey control information to and from the data plane executing in kernel space. Figure 4.1 depicts the structure of our implementation.

[Figure 4.1: Structure of Implementation — Tunnel Manager, Session Manager, RSVP Packager, UDP Transceiver, Label Manager, Neighbour Detector and NETLINK Messager in user space, above the Linux kernel.]

A description of each functional unit is given below.

UDP Messager This is used to transmit and receive messages from neighbouring LSRs. These messages include MPLS tunnel signaling messages (Path/Resv messages) and neighbouring LSRs' hand-shaking (HELLO messages). Both types of messages are encapsulated in UDP packets. Path/Resv messages are sent or updated and received at a frequency of seconds, while HELLO messages occur more frequently, normally every 100 milliseconds.

NETLINK Messager This is used to communicate with the kernel-side MPLS data plane, mainly for label manipulation, such as adding labels, deleting labels, label binding, and so on.

RSVP Packager This is responsible for packaging and unpackaging RSVP messages to or from the UDP Transceiver.

Neighbour Detector The Neighbour Detector makes use of the HELLO messages from the RSVP Packager to detect network failure, and notifies the Session Manager of the failure for the related MPLS sessions. It also periodically sends out HELLO messages, through the service provided by the RSVP Packager, to all its neighbours at the frequency set during initialization.

Session Manager A session is a dynamic concept in our implementation. An LSP tunnel created in memory, ready for MPLS signalling, is called an LSP session. At an ingress LER, the Session Manager reads in information on all primary and protection tunnels from the Tunnel Manager, and triggers an MPLS session for each tunnel.
The Session Manager receives LSP tunnel signaling messages from its neighbouring LSRs through the RSVP Packager, and performs the related tunnel operations. Upon receipt of Path/Resv messages it reserves labels and performs label bindings, and upon receipt of Pathtear/Resvtear messages it revokes the reserved labels and undoes the label bindings through the Label Manager. Other relevant operations are also issued when Patherr/Resverr messages are received. The Session Manager also periodically updates the states of tunnels and sends out Pathtear/Resvtear messages in the case of a time-out.

The Session Manager learns of a network failure through the service provided by the Neighbour Detector when a network fault happens. The failure is then reported by the locus LSR to the ingress LER with a Patherr message. The Session Manager of the locus LSR also reacts to a network failure by redirecting the flow to a pre-signaled protection LSP tunnel. The redirection at the locus LSR is achieved by concatenating the failed tunnel with a protection one (Section 4.9). All other LSRs, except the locus and ingress nodes, forward the Patherr message back to the ingress LER.

Label Manager The purpose of the Label Manager is to manage label spaces. It maintains a table of labels in use for each network interface. On request from the Session Manager it may reserve or recycle a label for a certain interface. Label binding or unbinding is then completed through the NETLINK Messager.

Tunnel Manager The Tunnel Manager maintains all primary tunnels and protection tunnels. The Tunnel Manager gets the tunnel information from the routing table if the node is an ingress LER, or from the Session Manager, which receives tunnel sessions from its neighbours, if it is an internal LSR. The Tunnel Manager also provides information to the Session Manager for selecting a protection tunnel in case of network failure.
In our implementation, a simple configuration file is used to contain all the tunnel information at an ingress LER.

4.3 MPLS Session

A session is an LSP tunnel whose status is active (i.e., created in memory, though possibly not yet successfully signaled). Each LSP session corresponds to a certain path or route of a data flow. All sessions are organized into a hash table. Each traffic flow (FEC) may have multiple corresponding tunnels (primary and protection), which are treated as one session group. All tunnels (corresponding to the same FEC) in the same session group are stored as a linked list in a hash bucket. The organizational structure is shown in Figure 4.2.

[Figure 4.2: Organization of Sessions and Tunnels — a hash table whose buckets hold session groups 1..n, each a linked list of sessions 1..n.]

We adopted this organization because all the tunnels for the same FEC are stored together, which facilitates tunnel operations. A session group is identified by the Session Key, which is composed of a Destination Address Prefix, a Source Address and a Session ID. The Session ID is assigned at the ingress LER, and corresponds to a certain FEC. In addition, an FEC basically includes the destination address or prefix, plus extra information if needed, such as the source address and/or QoS requirements. It is possible for multiple session groups for different traffic flows to exist between the same source-destination pair, and hence Session IDs are used, combined with the source and destination addresses, to identify these traffic flows. In our implementation, the Destination Address Prefix is the address prefix of the final receiver (not the egress LER), while the Source Address is the default IP address of the ingress LER.

As already stated, each session group contains primary and protection tunnels. Each tunnel within a session group is uniquely identified by the Sender Template field, which contains the sender's IP address and the tunnel's LSPID.
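The organization of Figure 4.2 can be sketched as follows. This is a minimal Python stand-in, not the thesis's C implementation: `SessionKey`, `SessionTable` and the use of a dict in place of the real hash table are all illustrative assumptions.

```python
from collections import namedtuple

# Hypothetical names; the thesis does not give its actual C structures.
# A Session Key = Destination Address Prefix + Source Address + Session ID.
SessionKey = namedtuple("SessionKey", "dest_prefix src_addr session_id")

class SessionTable:
    """Session groups keyed by Session Key; each bucket holds the list of
    tunnel sessions (primary and protection) sharing one FEC."""
    def __init__(self):
        self.groups = {}   # a dict stands in for the hash table

    def add_tunnel(self, key, session):
        # All tunnels for the same FEC are chained under the same key,
        # mirroring the per-bucket linked list of Figure 4.2.
        self.groups.setdefault(key, []).append(session)

    def tunnels_for(self, key):
        return self.groups.get(key, [])
```

Keeping the whole group under one key is what makes operations such as "find a protection tunnel for this failed primary" a single lookup.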
Here, the sender refers to the ingress LER, not the originator. The LSPID is used to distinguish the different tunnels (LSPs) within a session group.

Each LSP session needs two directions of signaling: upward and downward (for example, each LSR receives Paths from, and sends Resvs back to, upstream LSRs, while it also receives Resvs from, and sends Paths to, downstream LSRs). In our notation, an LSP session has two sub-sessions, Incoming and Outgoing, each of which relates to one of the two traffic flow directions. In summary, the important data fields in an LSP session (for both Incoming and Outgoing) are described below.

Sender Template The sender's IP address and LSPID.

Hop Phop (Previous Hop) for Incoming and Nhop (Next Hop) for Outgoing; the IP address of the Phop or Nhop.

Time Value The interval between two refresh signals required by a Phop or Nhop. (A refresh Path/Resv message should be sent out to a Nhop/Phop before the time-out.)

Iif/Oif Incoming/Outgoing network interface index.

Label For Incoming, the label assigned to the Phop; for Outgoing, the label assigned by the Nhop.

Pathtimer For Incoming, the last time a Path message was received; for Outgoing, the last time a Path message was sent to the Nhop.

Resvtimer For Incoming, the last time a Resv message was sent to the Phop; for Outgoing, the last time a Resv message was received.

Up, Down The upward or downward session from or to which the tunnel is currently concatenated.

4.4 Primary Tunnels and Protection Tunnels

Tunnels and Explicit Route Objects

In our implementation, a primary LSP and all its protection LSPs are contained in the same session group, sharing the same FEC. An LSP configuration file (routes.conf) is used for configuring the tunnel groups. Each record in routes.conf represents an FEC and all its LSPs with explicit routes. During LSP signaling, an LSP tunnel is identified at each LSR by the Session Key, Sender Template and LSPID.
At an ingress LER, each FEC is given a session ID unique within that ingress LER as part of the Session Key. All LSRs match the primary and protection tunnels by checking the Session Key and LSPID when RSVP messages are received.

Explicit routing is used to create the LSP tunnels. An ERO (Explicit Route Object) is carried in each Path message. Each LSR checks the ERO in an incoming RSVP message (Path messages only) to see whether the next hop to which the message is to be sent is the LSR itself. A Patherr message is sent back to the Phop when an unexpected Next Hop appears in the ERO. Each internal LSR (other than egress nodes) strips off the first hop (which is the LSR itself) from the ERO and puts the shortened ERO in the Path message sent to the Nhop. Each LSR thus holds the part of the explicit route for an LSP starting from itself, which can be used to help choose a protection tunnel in case of a failure (see Section 4.9).

Explicit Route for a Loopback Tunnel

As explained in the previous section, each LSR only records a shortened suffix of the complete route, and hence an egress LER does not know the upstream hops of the explicit route for an LSP tunnel. As a result, the egress LER is not able to signal a loopback protection tunnel to the ingress LER unless we add an extra mechanism. In our implementation, the problem is solved by carrying the loopback route as an extra segment following the normal explicit route of a primary LSP tunnel.

When an egress LSR receives the ERO in a Path message, it finds the loopback route left in the ERO. The egress node recognizes the loopback route (within an ordinary ERO), and thus the request for loopback protection, when the Nhop in the ERO equals the Phop from which the Path message was received. For example, consider a primary tunnel with loopback protection whose explicit route (at the ingress node) is A -> B -> C -> D -> C -> B -> A, where A and D are LERs.
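The per-hop ERO handling described above can be sketched as follows. This is a simplified model, assuming the ERO is just a list of hop addresses; the helper name `process_ero` and the tuple return convention are illustrative, not the thesis's actual C interface.

```python
def process_ero(ero, self_addr):
    """ERO handling at an internal LSR (sketch).

    ero is a list of hop addresses whose first entry should be this LSR.
    Returns ("patherr", None) when we are not the expected Next Hop
    (the real code sends a Patherr back to the Phop), otherwise
    ("forward", shortened_ero) with our own hop stripped off before the
    Path message is sent on to the Nhop.
    """
    if not ero or ero[0] != self_addr:
        return ("patherr", None)   # unexpected Next Hop in the ERO
    return ("forward", ero[1:])    # strip ourselves, keep the suffix
```

Each LSR therefore retains only the tail of the route starting from itself, which is exactly the information later used for choosing a protection tunnel.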
When the Path message from C arrives at egress LSR D, D sees the route D -> C -> B -> A (the segment A -> B -> C has been stripped off at earlier hops) and determines that the remaining segment is a loopback route by checking that the Nhop of this ERO, which is C, equals the Phop of this Path message, which is also C. D, instead of forwarding the Path message to the Nhop as it normally would, initiates another tunnel session towards A for the loopback protection tunnel. An additional benefit of this mechanism is that an ingress LER can decide whether a protection loopback tunnel is enabled simply by adding or omitting the loopback segment in the ERO.

Distinguishing Tunnels

We have three different kinds of tunnels: primary, protection and loopback tunnels. Each needs to be treated differently in many cases. For example, we need to find out which tunnel in a session group is the loopback tunnel when we need it, such as in the case of a fault. As well, because the loopback protection tunnel is concatenated to another protection tunnel at the ingress LER, we obviously want to ensure the looping operation does not happen on ordinary tunnels. The question is: how can we distinguish them?

Primary tunnels are distinguished from protection tunnels by using different LSPIDs: 0 for primary tunnels, and a non-zero value for protection tunnels. A loopback protection tunnel uses the egress LER's address in its Sender Template, in contrast to a regular tunnel, which uses the ingress LER's address. A loopback protection tunnel can therefore be recognized by checking whether the Sender IP Address in its Sender Template differs from the Source Address in its Session Key; ordinary tunnels use the same address for both.

4.5 LSP Tunnel Signaling

RSVP-TE is used to signal LSP tunnels. Each LSR maintains the states of all LSP tunnels it currently holds. After a tunnel session is created, there are two main states for the session: Waiting and Connected.
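The two checks described above — the egress-side loopback test and the tunnel classification rules — can be sketched directly. Both function names are hypothetical; the ERO is again modelled as a plain list of hop addresses.

```python
def is_loopback_request(ero, self_addr, phop):
    """Egress check from the A->B->C->D->C->B->A example: after earlier
    hops stripped themselves, the remaining ERO starts at this node; if
    the Nhop (the hop after us) equals the Phop the Path message came
    from, the tail of the ERO is a loopback route, not a forwarding one."""
    return len(ero) >= 2 and ero[0] == self_addr and ero[1] == phop

def tunnel_kind(lspid, sender_addr, source_addr):
    """Classify a tunnel using the rules of 'Distinguishing Tunnels':
    a loopback tunnel carries the egress address in its Sender Template,
    so it differs from the Source Address in the Session Key; otherwise
    LSPID 0 means primary and non-zero means protection."""
    if sender_addr != source_addr:
        return "loopback"
    return "primary" if lspid == 0 else "protection"
```

For the worked example, at D the remaining ERO is ["D", "C", "B", "A"] and the Phop is C, so the loopback test succeeds.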
A session in the Connected state means the tunnel has been successfully signalled, the incoming or outgoing labels are assigned, and the label mapping is created. A session in the Waiting state means the session has been created but is still waiting for a response from the downstream LSRs, for example for label assignment. Events that can change a tunnel session's state include: RSVP messages (Path, Patherr, Resv, Resverr, etc.), timeouts, and network failure.

All RSVP messages are inspected for validity before any further processing. These validity checks include: interface check, hop (Phop/Nhop) check and label check. An RSVP message from an unexpected interface (for example, a Path message for an existing tunnel session received from an interface different from the previous one) is discarded. A Path/Resv message from an unexpected Phop or Nhop is also discarded.

A detailed description of the processing of these events follows:

1. A new Path message initiates the creation of a new LSP tunnel session.

2. A Path message for an existing LSP tunnel session refreshes the session state (for example, the Pathtimer/Resvtimer of the session is refreshed to the current time value).

3. A new Resv message with a label inside makes the LSR reserve a label for the incoming interface of this tunnel session and send out a Resv message to its upstream neighbour. A new Resv message makes an ingress LER redirect the traffic to the successfully created tunnel.

4. A Resv message for an existing session refreshes the session state.

5. A Pathtear message instructs the LSR to tear down the created tunnel without deleting the session, and, for intermediate nodes, to forward a Pathtear message downstream.

6. A Resvtear message makes the LSR tear down the created tunnel without deleting the session, and forward a Resvtear message upstream, if applicable (for intermediate nodes).

7.
Processing a Patherr message depends on the kind of error carried inside. For a network failure notification, the Patherr message is simply forwarded upstream until it reaches the ingress LER, where the recovery operation is carried out (Section 4.8).

8. Receipt of a Resverr message may make the LSR reserve a different label and reconfigure the tunnel.

9. The timeout of a tunnel session deletes the tunnel session from memory. This is the only way of deleting a tunnel session.

4.6 Label Management

A label is assigned by the downstream LSR on a per-interface basis. Each interface has an independent label space. Each incoming or outgoing label is identified by the corresponding interface index and the label value. A valid label value is a 20-bit unsigned integer. The kernel-side MPLS data plane maintains and manages all in/out labels currently in use. The user-side control plane in our implementation also maintains a copy (organized differently from the one at the kernel side) of all labels in use, for performance reasons: for example, a free label can be quickly assigned without having to consult the kernel side every time. The control plane maintains a label allocation table for each interface, and each MPLS session records the in/out labels it is currently using.

4.7 Failure Notification

When a network failure is detected at the locus LSR, the error needs to be reported as quickly as possible to the ingress LER. The question is how this error should be reported, or what kind of information should be carried in the notification message. To handle a network failure, the ingress LER needs to know the location of the failure. It also needs to know for which LSP tunnel session this error occurred. As a result, the IP address of the locus LSR where the error is detected, together with an error code for the network fault, are carried in the Patherr message for the failed LSP tunnel (with its Session Key, Tunnel ID, etc.).
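The per-interface label allocation of Section 4.6 can be sketched as follows. This is an assumption-laden illustration: the class name `LabelManager`, the first-fit search, and the choice of 16 as the first allocatable value (labels 0-15 are reserved in common MPLS practice, a detail not stated in the text) are all mine; the 20-bit label range is from the text, and the NETLINK calls to the kernel side are omitted.

```python
class LabelManager:
    """Per-interface label allocation (sketch). Valid label values are
    20-bit unsigned integers; each interface has its own label space."""
    MAX_LABEL = (1 << 20) - 1

    def __init__(self):
        self.in_use = {}   # interface index -> set of labels in use

    def reserve(self, ifindex):
        used = self.in_use.setdefault(ifindex, set())
        # First-fit search starting at 16 (assumed reserved-range cutoff).
        for label in range(16, self.MAX_LABEL + 1):
            if label not in used:
                used.add(label)
                return label
        raise RuntimeError("label space exhausted on interface %d" % ifindex)

    def recycle(self, ifindex, label):
        # Return a label to the interface's free pool.
        self.in_use.get(ifindex, set()).discard(label)
```

Keeping this table in user space is what lets the control plane hand out a free label without a round trip to the kernel for every request.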
The Patherr message traverses the reverse route of the Path message, and each LSR forwards the Patherr message, without modification, back to its Phop for the LSP tunnel. On receipt of the Patherr message, the ingress LER knows the actual location where the error happened, and can perform the appropriate recovery operation based on the error information. If the Patherr message is lost, fault recovery is delayed until a timeout occurs for the refresh Resv message of the tunnel session at the ingress node. No failure is reported for the loopback protection tunnel, because the loopback tunnel does not need to be re-protected.

4.8 Ingress LER

General Operations

An ingress LER maintains and manages mappings of FECs to LSP tunnels. The session ID (in the Session Key) is assigned for each FEC by the ingress LER. The LSPID is assigned by the ingress LER and used, together with the sender's IP address, to identify an LSP tunnel within an LSP session group. LSPID 0 is assigned to the primary tunnel, while protection tunnels are assigned values starting from 1. For each traffic flow, only one FTN (FEC To Nexthop) mapping can be active at the ingress node, which means that traffic for a given FEC cannot take multiple independent paths through our MPLS network. An ingress LER gets all the tunnel information (FECs and explicit primary/protection routes) from a configuration file. During initialization, an ingress LER reads in a session group entry (with its FEC and all tunnel information). The ingress LER then assigns a Session ID and LSPIDs, and signals the sessions for all of its tunnels.

When a Path message for a loopback protection tunnel is received (1), the ingress LER assigns a label and sends back a Resv message carrying this label backwards (downstream).
Then, instead of issuing a POP/DLV operation for this inlabel, which would normally forward the traffic it represents outside the MPLS area, the ingress LER concatenates the loopback tunnel to an already created protection tunnel for later use. The concatenation is done as follows:

Do the label binding: Loopback.inlabel -> Protection.outlabel

Record the concatenation: Loopback.down = Protection and Protection.up = Loopback

The chosen protection tunnel should ideally be a disjoint tunnel. An ingress LER requests loopback protection by including the loopback segment in the ERO of the Path message, and disables loopback protection by omitting that segment.

Recovery Operations

When a network failure report is received at an ingress LER, the ingress node processes the error with the following steps:

Recording the Error The tunnel state is changed to Failed, and remains in that state until a Resv message is received from its downstream neighbour.

Choosing a Protection Tunnel A protection tunnel in good status is chosen according to the privilege order described in Chapter 3.

Rerouting the Traffic Traffic is rerouted to the chosen protection tunnel and the former FTN mapping is revoked.

After a tunnel is marked as Failed, all LSRs (except the ingress LER, which has no upstream neighbour) stop sending refresh Resv messages upwards until refresh Resv messages are received from downstream nodes. All LSRs still send refresh Path messages downwards to keep the session alive (not removed from memory). The tunnel along its path remains Failed until the network failure is fixed.

(1) Here, this ingress can be considered the egress of the loopback tunnel; however, we still call it the ingress in the context of the traffic flow, and a loopback tunnel is considered a special protection tunnel of the primary tunnel.
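The concatenation step above can be sketched as a single rebinding operation. The dict-based tunnel representation and the function name `concatenate` are stand-ins for illustration; in the real system the binding is installed in the kernel data plane over NETLINK.

```python
def concatenate(loopback, protection):
    """Splice the loopback tunnel into a protection tunnel at the ingress
    LER (sketch; tunnels are plain dicts here).

    Performs the label binding Loopback.inlabel -> Protection.outlabel,
    so traffic returning on the loopback tunnel is switched straight out
    on the protection tunnel, and records the up/down concatenation for
    later teardown.
    """
    binding = (loopback["inlabel"], protection["outlabel"])
    loopback["down"] = protection
    protection["up"] = loopback
    return binding
```

The same rebinding pattern is reused at the locus LSR for local repair, with the failed tunnel in place of the loopback tunnel.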
Later, the reception of a Resv message from downstream lets the ingress LER know that the once-failed tunnel is operational again, and in our implementation the rerouted traffic is then switched back to the primary tunnel.

4.9 Locus LSR

Fault Detection

A network failure is detected using RSVP Hello messages. Each LSR periodically sends out Hello messages to all its neighbouring LSRs. Each LSR maintains its own instance number, which is increased by one each time a Hello message is sent to a neighbour. Also included in a Hello message is the destination's instance number, as most recently received. These two source/destination instance numbers can be used by neighbouring LSRs to detect each other. The interval between two consecutive Hello messages should be as short as possible: 50ms in our implementation. Missing two consecutive Hello messages is taken as the indication of a network failure. The upstream LSR of the failure processes the failure (reports it to the ingress LER and adopts a temporary local recovery).

Sometimes external interference can cause a link or node to be unstable for some period. This may appear as alternating occurrences of failure and recovery. To avoid this flapping problem of a network link/node failure, the Hello mechanism is not used for detecting network recovery (i.e., that the failed link/node has recovered). Instead, the receipt of a Resv message from the downstream neighbour is used to indicate link recovery. This is attractive for two reasons:

1. For recovery, we do not need to be as hasty in detecting it and rerouting the traffic back to the primary path, in contrast to a network failure, to which we have to respond as quickly as possible to reduce packet loss.

2. Resv messages are processed over much longer periods (normally several seconds). Compared with Hello messages, using Resv messages to sense link recovery helps avoid overburdening LSRs.
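The Hello-based detection rule can be sketched as a small timer check. The class name `HelloDetector` is hypothetical, and instance-number bookkeeping is omitted; the 50ms interval and the two-missed-Hellos threshold are taken from the text.

```python
class HelloDetector:
    """Failure detection sketch: a Hello is expected from each neighbour
    every interval; missing two consecutive Hellos signals a failure."""
    def __init__(self, interval=0.05, misses=2):
        self.interval = interval       # 50ms Hello period
        self.misses = misses           # consecutive misses tolerated
        self.last_seen = {}            # neighbour -> time of last Hello

    def hello_received(self, neighbour, now):
        self.last_seen[neighbour] = now

    def failed(self, neighbour, now):
        last = self.last_seen.get(neighbour)
        if last is None:
            return False               # never heard from; nothing to judge
        return now - last > self.misses * self.interval
```

With these parameters a failure is declared roughly 100ms after the last Hello, which matches the 100-200ms detection window reported in Section 4.11 once timer granularity is accounted for.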
Local Recovery Operation

When a network failure is detected, the upstream node notifies the ingress LER by sending back a Patherr message with its own IP address inside, and also invokes a local recovery operation. If the tunnel is a primary tunnel (protection tunnels are not re-protected!), the local recovery operation includes the following:

Choosing a Protection Tunnel If a regular protection tunnel cannot be found, the loopback protection tunnel is chosen.

Rerouting the Traffic The original label mapping is replaced with one that redirects the traffic to the protection tunnel chosen above: Failed.inlabel -> Protection.outlabel, Failed.down = Protection, Protection.up = Failed.

A loopback tunnel finally dies out with the timeout of the Path message (2). This means the loopback concatenation does not last very long, just a short period (a temporary recovery concatenation). This behaviour is desirable because we do not want the looping to last too long (consuming too many network resources). When the network is fixed (a Resv message with a label inside is received for the session), this concatenation (if still active) is disconnected by re-mapping the tunnels onto themselves, as follows: Fixed.inlabel -> Fixed.outlabel, Fixed.down = 0, Protection.up = 0, Protection.inlabel -> Protection.outlabel.

(2) This happens because either the downstream link or node is broken and no more Path messages will come in.

4.10 Egress LER

The main tasks of an egress LER include the following:

1. reserving a label and sending back Resv messages, which set up the tunnel; and

2. creating and initiating the signaling of the loopback tunnel session for a primary tunnel, if required.

When a Path message for a primary tunnel is received for the first time, the egress LER creates a loopback tunnel session in memory, and starts signaling the tunnel if there is a loopback segment in the ERO of the received Path message.
This loopback tunnel session remains active as long as the primary tunnel session remains alive. Receipt of a Resv message for a loopback tunnel session from the upstream LSR is ignored, because the first segment of the loopback tunnel is practically useless (there is no FTN for the loopback tunnel). A loopback tunnel session uses the same Session Key value as its regular primary tunnel, but a different Sender Template value. This means that for a loopback tunnel the Session ID and LSPID are not special and the ingress IP address is still used for the Source, but the egress IP address is used in the Sender Template. As noted earlier, this feature can be used by all LSRs to distinguish a loopback tunnel from an ordinary one.

4.11 OS Influence

This section discusses some special factors in Linux which influenced our implementation.

1. Timer

In practice there is an upper limit to the frequency at which a timer can be set in Linux. We found through testing that a period of less than 30ms would cause the Linux kernel to crash. We set a period of 50ms for the timer in our experiment; this is used for sending and receiving Hello messages. As analyzed in Section 3.10, we prefer that a network failure be detected as quickly as possible. The timer limitation means a failure can only be detected within 100-200ms.

2. Communication and Scheduling

A delay cannot be completely avoided for an operation to be completed in Linux. This delay arises from the communication from user space to the kernel side and from OS scheduling. It can vary and is out of our control (3). A possible effect of this delay is that a command issued in our user-space MPLS control plane may take longer to complete than a message takes to travel through our Ethernet network and arrive at the ingress node.
We therefore artificially add some delay at certain points in our code to make the timing consistent without losing generality, as presented in our experiments in the next chapter.

(3) Although the delay could be reduced by implementing the whole system as a kernel module, that would be far more complex than this thesis can handle.

Chapter 5

Performance Evaluation

5.1 Overview and Experimental Environment

In the previous two chapters we described the design and implementation of a fast recovery system for network failures in MPLS networks. To respond quickly to network failure, a fast failure detection mechanism using RSVP HELLO messages is adopted. By notifying the ingress LER of the failure, which makes the ingress router reroute the affected traffic to the best pre-signaled backup tunnel, and simultaneously invoking a local recovery operation using a loopback tunnel at the locus node where the failure is detected, packet loss is expected to be reduced. In this chapter, we perform experiments to evaluate our recovery performance, mainly over two end-to-end protocols: TCP and UDP.

Linux workstations are used to build our experimental topology. Since our entire testing system is set up over Linux, and our concern is focused on failure recovery response and packet loss rather than on a routing algorithm, a simple network topology suffices for the experiment. Figure 5.1 shows the topology we use as our test bed. Linux servers A, B, C
We investigated the communication between P and Q while using P as the sender and Q as the receiver. T w o M P L S tunnels for the communication P —» Q are pre-signaled: a primary route ABD, and a protection route ACT). T h e link BD is made unreliable to simulate a network failure. T h e primary and protection tunnels, plus a loopback protection tunnel DBA are pre-signaled as O K . Normally, the traffic goes through the primary tunnel ABD. W h e n the link BD fails, the traffic is rerouted to the protection tunnel ACD at ingress A, and routed back to the loopback protection tunnel at locus node B. W h e n the failed link recovers, the traffic is routed back to the primary tunnel ABD. A s described in Section 4.11, the processing and scheduling delay in Linux O S may result in t iming inconsistencies for our experiment on a L A N . To achieve t iming 71 consistency, and to simulate network propagation delay (latency), an artificial delay is introduced to the transport of Patherr messages (for failure notification) and used as a parameter in our experiment. T w o T C P features, S A C K and large T C P window size options are enabled in Linux (2.4.3). Buffer size for sending and receiving is another parameter in the experiment. We show how these parameters affect the results below. 5.2 Evaluation Criteria To evaluate our recovery performance, response time and packet loss are investigated in our experiment. However, response time is sometimes hard to accurately measure in practice, and in that case we compare our recovery performance with the one without network failure which indirectly reflects the recovery time. We also compare the performance of our recovery mechanism with and without the protection loop. 
5.3 Evaluation Results

5.3.1 Using TCP

Experiment 1

In the first experiment, we transfer a large file (700MB) between the two hosts P and Q via TCP through our MPLS network, with the link B-D periodically out of service (simulated by periodically blocking the communication between B and D). We use the total elapsed time for our evaluation. The experimental results are shown in Table 5.1. The failure between B and D occurs every 10 seconds, and lasts 5 seconds each time. We rely on TCP for speed control (it tries to send data out as fast as possible).

              wsize:   64KB     100KB    640KB
  No Failure           59.7s
  d=100ms   Loop       60.4s    60.4s    60.2s
            n-Loop     60.3s    60.3s    60.3s
  d=200ms   Loop       60.00s   60.00s   59.96s
            n-Loop     60.33s   60.33s   60.40s
  d=300ms   Loop       60.59s   60.26s   60.20s
            n-Loop     64.00s   63.90s   63.30s
  d=500ms   Loop       60.8s    61.4s    60.5s
            n-Loop     66.3s    69.9s    68.7s
  d=700ms   Loop       68.9s    64.2s    63.8s
            n-Loop     72.2s    72.8s    63.5s

Table 5.1: Time needed to transfer 700MB of data

In the table, d stands for the latency, wsize for the TCP window size, Loop for recovery with loopback protection and n-Loop for recovery without loopback protection. By examining these results and comparing the total time spent in the different cases, we notice the following:

1. For a latency less than or equal to 500ms, Loop performs very well (very close to the 0-loss performance). For a latency less than or equal to 200ms, n-Loop also performs very close to the 0-loss case. Failure is detected quickly (within 100ms) by our recovery algorithm, and hence packet loss is small.

2. When the latency is low (100ms, 200ms), the performance of Loop and n-Loop is very close. This may imply that using a protection loop to reduce packet loss is not necessary for low-latency networks. Network failure is detected within 100ms.
As analyzed in Section 3.10, when the latency is comparable with, or close to, the delay of failure detection, loopback protection does not make much difference.

3. TCP buffer size has no effect in any of the cases. This is unexpected, since a larger buffer size should perform better, as we had expected in Section 3.10. The answer lies in TCP congestion control, in which TCP slows or stops sending out packets when an acknowledgement is not received within an expected maximum delay. Therefore, when a network failure happens, TCP does not continue to blindly send out packets until the buffer is full, but instead stops for a while until the next acknowledgement is received from its peer (when a protection tunnel is in use).

4. When the latency is 700ms, the Loop mechanism performs worse than it does for a latency of 500ms. This may at first glance contradict our expectation that Loop should perform better for a higher latency. However, it is actually just a result of the cost of using a protection loop. Our average transport speed is 90Mb/s, which is close to the network bandwidth (100Mb Ethernet). Since it would be impractical for us to implement resource allocation in our test network of Linux routers, a traffic jam arises when the same physical link is used by the bidirectional protection loop for a longer time. Thus in practice, although Loop is more suitable for high latency networks, consideration of the cost of network resources is still necessary.

Our recovery algorithm (either Loop or n-Loop) works well with TCP. It reacts quickly to network failure, so that packet loss can be prevented as soon as possible. Using a protection loop can help save extra packets from being lost, and expedite failure recovery for high latency networks. However, the resource consumption cost should be carefully calculated.
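As a quick sanity check, the Table 5.1 numbers can be expressed as a relative slowdown over the 59.7s failure-free baseline. The values below are taken from the 64KB-window column of the table; the percentages and rounding are ours.

```python
# Relative slowdown versus the no-failure baseline, computed from the
# 64KB-window column of Table 5.1.
BASELINE = 59.7  # seconds: transfer time for the 700MB file with no failure

times = {  # latency -> (Loop, n-Loop) total transfer times in seconds
    "100ms": (60.4, 60.3),
    "300ms": (60.59, 64.00),
    "700ms": (68.9, 72.2),
}

for latency, (loop_t, n_loop_t) in times.items():
    loop_pct = 100 * (loop_t - BASELINE) / BASELINE
    n_loop_pct = 100 * (n_loop_t - BASELINE) / BASELINE
    print(f"d={latency}: Loop +{loop_pct:.1f}%, n-Loop +{n_loop_pct:.1f}%")
```

At 100ms both mechanisms stay within about 1.2% of the baseline; at 700ms n-Loop is about 21% slower while Loop is about 15% slower.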
Interestingly, the TCP congestion control mechanism helps prevent the packet out-of-order problem, and eliminates the huge buffer requirement for otherwise handling out-of-order packets.

Experiment 2

In the second experiment, we try to measure the total packet loss during a one-time network failure, comparing our recovery algorithm with and without a protection loopback tunnel. A script using the tcpdump utility measures the total number of retransmitted packets, which is also the total packet loss when the SACK option is enabled in TCP. We found that whatever the TCP buffer size and the network latency (from 100ms to 700ms), Loop and n-Loop suffer the same packet loss (2 packets)! This confirms our analysis above that TCP congestion control is in effect in this situation. We can say that, with TCP, buffering is not an extra issue for our recovery Loop; it does not impose an extra burden for TCP buffering.

5.3.2 Using UDP

Since TCP has a mechanism for congestion control, it stops sending out new packets when it does not receive acknowledgements from the other end for the packets already sent. This implies that when a network failure happens, the sender avoids sending more packets until the failure is fixed, or the traffic is rerouted to a backup tunnel. This was shown in the previous experiment, where packet loss is the same for Loop and n-Loop.

To more accurately measure and evaluate our recovery response time, and especially the performance of our Loop mechanism, we would also like to know how many packets can be prevented from being lost during a network failure, and how long it takes for our system to respond to a failure. The UDP protocol is used in this experiment. A communication program using UDP was written to measure packet loss between the two peers, P and Q.
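The UDP measurement program itself is not listed in the thesis; its core idea can be sketched as numbering every datagram on the sender side and counting the gaps at the receiver. Everything below is a simplified, hypothetical illustration.

```python
# Simplified sketch of UDP loss measurement: the sender tags each datagram
# with a sequence number; the receiver counts the missing numbers as losses.
# (Hypothetical illustration; not the actual measurement program.)

def count_losses(received_seqs, total_sent):
    """Packets lost = packets sent minus distinct sequence numbers received."""
    return total_sent - len(set(received_seqs))

# Hypothetical run: 10 datagrams sent, numbers 3 and 4 never arrive.
received = [0, 1, 2, 5, 6, 7, 8, 9]
assert count_losses(received, 10) == 2
```

Unlike the TCP experiment, no retransmission masks the loss here, so the raw count directly reflects how many packets the recovery mechanism failed to save.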
We measured packet loss for Loop and n-Loop during a one-time network failure, for different network latencies and source speeds. Table 5.2 shows the results of the experiment. We use the ratio of packet loss in Loop to that in n-Loop and call it the loss ratio; the lower the loss ratio, the better the saving effect. Investigating the results for the source speed of 30Mbs, we find that the higher the network latency, the lower the loss ratio: 0.48, 0.82, 0.34, 0.23 and 0.13 for a network latency of 100ms, 200ms, 300ms, 500ms and 700ms, respectively. This is also true for all other source speeds.*

* The source speed control in our testing UDP program is subject to a Linux real-time signal limit of 20ms, compared with the ideal situation of sending packets out continuously at a constant speed. This limitation influences some of our experiment data when the speed is slow: for example, the loss ratio is 0.8 for a network latency of 700ms with a source speed of 1Mbs.

We can roughly calculate the response time R using the formula R = loss * 8 / [data speed]. For example, for a latency of 100ms, the response time is R = 530KB * 8 / 30Mbs = 140ms for Loop, and R = 1.1MB * 8 / 30Mbs = 300ms for n-Loop. The response delay arises from the delay of network failure detection and failure notification, plus processing at the routers. As implied by the results, no more packets are lost after the traffic flow is redirected to the protection loop. For the latency of 100ms, the response time (140ms) is roughly the sum of the detection delay, which is 40ms or so, and the latency of the failure report, which is 100ms.

The results show that our recovery system can respond to a network failure within 200ms. Considering that this has been undermined by some limitations of the
Linux OS, the response time can be much shorter in real routers, say less than 100ms.

    Data Speed         1Mbs    5Mbs    10Mbs   20Mbs   30Mbs
    d=100ms   Loop     21KB    100KB   200KB   350KB   530KB
              n-Loop   57KB    290KB   350KB   1.1MB   1.1MB
    d=200ms   Loop     16KB    100KB   150KB   830KB   1.4MB
              n-Loop   54KB    250KB   580KB   1.4MB   1.7MB
    d=300ms   Loop     14KB    100KB   250KB   350KB   620KB
              n-Loop   55KB    270KB   720KB   1.3MB   1.8MB
    d=500ms   Loop     20KB    450KB   160KB   340KB   500KB
              n-Loop   100KB   1.8MB   740KB   1.5MB   2.2MB
    d=700ms   Loop     20KB    72KB    430KB   380KB   510KB
              n-Loop   25KB    580KB   1.2MB   2.5MB   3.8MB

    Table 5.2: Total number of bytes lost in a single failure

The results also show that Loop is more suitable for higher latency networks, consistent with the results in the previous section and with our analysis in Chapter 3. However, buffering is needed at the receiver side to handle the packet reordering problem. The estimated buffer size should be 2 * Latency * Bandwidth (here, Bandwidth is the bandwidth reserved for the traffic).

Chapter 6

Conclusion and Future Work

6.1 Contribution and Conclusion

Timely response to network failure is an important requirement for many real-time applications, in order to avoid packet loss and quality degradation. In this thesis, we designed a fast recovery mechanism for MPLS networks which can quickly respond to failures. First, network failure is detected using RSVP HELLO at the MPLS layer. A failure can be detected within milliseconds, and the time needed can be shorter if detection occurs at the hardware or datalink layer. Second, the failure is reported to the ingress router through a PathErr message. At the same time, the local node performs local recovery by redirecting the flow to a backup tunnel, or to a reverse loopback tunnel if a normal backup tunnel does not exist. Finally, the ingress router makes the final recovery. We implemented our algorithm, as well as an MPLS control plane, over the Linux platform.
The RSVP ERO extension for MPLS is used in this thesis as the label signaling protocol. We investigated the recovery performance of our mechanism with Linux routers. TCP and UDP are the two end-to-end protocols used for our evaluation. The results show that our recovery mechanism can react to a failure very quickly (within 200ms). They also show that for a high network latency, and with the support of end-to-end protocols like TCP, loopback protection can help expedite failure response and reduce packet loss.

6.2 Limitation and Future Work

Our recovery mechanism requires all protection tunnels, including the loopback one, to be pre-created before a failure happens. This may, however, not be feasible in some cases where network resources are limited. We may choose to provide this protection only for those traffic flows that really need it.

Our failure report, or notification, is done on a per-tunnel basis. This may result in scalability problems: there may exist many tunnels for which the failure needs to be reported to ingress routers. The scalability problems can be solved by using multi-commodity traffic, for example, merging several similar flows (with the same pair of ingress-egress routers and the same QoS requirements) into a single commodity traffic flow.

For simplicity, we only focused on link failure in this thesis. However, the recovery algorithm can be extended to cover node failure as well. Node failure can be recognized at an ingress edge router by combining multiple link failure reports from different local nodes. The intricacy is that we will need to wait a little longer than we currently do to collect these reports (to confirm a node failure) before making a recovery decision, and this will somewhat increase our recovery response delay as a side-effect.
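The node-failure extension suggested above, declaring a node down when several link-failure reports naming that node arrive within the collection window, could be sketched as follows. The function, report format, and threshold are hypothetical, purely to illustrate the idea.

```python
# Illustrative sketch: an ingress infers a node failure when at least
# `threshold` reported failed links share a common endpoint.
# The threshold and report format are hypothetical assumptions.

def infer_failed_nodes(link_reports, threshold=2):
    counts = {}
    for a, b in link_reports:  # each report names a failed link (a, b)
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return {node for node, c in counts.items() if c >= threshold}

# Links B-D and C-D reported down in the same window: suspect node D.
assert infer_failed_nodes([("B", "D"), ("C", "D")]) == {"D"}
# A single report only implies a link failure, not a node failure.
assert infer_failed_nodes([("B", "D")]) == set()
```

Waiting for the collection window to fill before deciding is exactly the extra recovery delay noted above.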

