UBC Theses and Dissertations


Fast failure recovery in MPLS networks Luo, Qingsheng 2001

Full Text


Fast Failure Recovery in MPLS Networks

by Qingsheng Luo

Bachelor of Engineering, Tsinghua University, China, 1991

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
September 2002
© Qingsheng Luo, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

Abstract

Recovery from network failures is a very important and challenging requirement for current Internet service providers. Real-time applications and other mission-critical tasks require networks to respond to network faults expeditiously in order to not degrade the required quality. In IP networks, failures are detected and handled through the convergence of network topology (topology change reflected in all routers' routing information bases). This recovery mechanism, however, has an obvious disadvantage in that it reacts to network faults very slowly, because it may take a long time for the entire network topology to converge. In this thesis, we present the design of a fast recovery mechanism in MPLS networks, implemented over the Linux platform. We investigate its performance over our testing network composed of Linux routers. Our experiments show that our recovery mechanism responds to network faults quickly, with less packet loss.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
1.1 Motivation and Problems
1.2 Current Research
1.3 Overview
2 Background
2.1 MPLS
2.1.1 From IP to MPLS
2.1.2 MPLS Overview
2.1.3 Label Distribution
2.2 RSVP
2.2.1 Data Flows
2.2.2 Reservation Model
2.2.3 Reservation Styles
2.2.4 Encapsulation of RSVP Messages
2.2.5 Basic Mechanism
2.2.6 Messages and Objects in RSVP
2.2.7 RSVP-TE Extension for MPLS
2.3 Linux
2.3.1 Overview
2.3.2 System Call
2.3.3 /proc File System
2.3.4 Linux Network
2.3.5 Configuration as a Router
2.4 Transmission Control Protocol
2.4.1 TCP
2.4.2 TCP High Performance Extensions
2.4.3 TCP in Linux
2.5 OSDN MPLS project
2.5.1 Introduction
2.5.2 Data Structure
2.5.3 Label and Labelspace
2.5.4 Operation and Interface
2.6 Failure Recovery
3 Design
3.1 Overview
3.2 Label Binding
3.3 Primary and Protection Routes
3.4 Failure Detection
3.5 Failure Report
3.6 Failure Processing
3.7 Operations at the Locus LSR
3.8 Operation at the Ingress Node
3.9 Operation at the Egress LER
3.10 Analysis of Potential Problems
4 Implementation
4.1 Overview
4.2 System Structure
4.3 MPLS Session
4.4 Primary Tunnels and Protection Tunnels
4.5 LSP Tunnel Signaling
4.6 Label Management
4.7 Failure Notification
4.8 Ingress LER
4.9 Locus LSR
4.10 Egress LER
4.11 OS Influence
5 Performance Evaluation
5.1 Overview and Experimental Environment
5.2 Evaluation Criteria
5.3 Evaluation Results
5.3.1 Using TCP
5.3.2 Using UDP
6 Conclusion and Future Work
6.1 Contribution and Conclusion
6.2 Limitation and Future Work
Bibliography

List of Tables

5.1 Time needed for transferring a data of 700MB
5.2 Total number of bytes lost in a single failure

List of Figures

2.1 Carrying label in the shim layer header
2.2 LSP in MPLS
2.3 Using RSVP
2.4 RSVP Common Header
2.5 RSVP Object Format
2.6 Generic Label Request Object
2.7 Generic Label Object
2.8 LSP_TUNNEL_IPV4 Session Object
2.9 LSP_TUNNEL_IPV4 Sender Template Object
2.10 HELLO REQUEST Object
2.11 HELLO ACK Object
2.12 Linux TCP/IP layers
2.13 TCP Header
2.14 Operations in Linux-MPLS
3.1 Protection Routes
3.2 Failure Report
3.3 Loopback Tunnel
3.4 Route Types
3.5 Recovery Cycle
3.6 Out-of-order Problem
4.1 Structure of Implementation
4.2 Organization of Sessions and Tunnels
5.1 Testing Network Topology

Acknowledgements

I would like to express my gratitude to my supervisor, Professor Alan Wagner, for his essential guidance and inspiration of this thesis. I would also like to thank Professor Norm Hutchinson for being the second reader of this thesis and giving me important suggestions for improvements. And special thanks to my family, especially my daughter Yichen, for their patience through the years when I was engaged in my graduate study.

QINGSHENG LUO

The University of British Columbia
September 2002

Chapter 1

Introduction

1.1 Motivation and Problems

It has become more and more difficult to satisfy customers and service providers as people are seeking more advanced services from the Internet besides the existing best-effort packet forwarding. Quality of Service (QoS), Virtual Private Network (VPN) and Traffic Engineering, for example, enable ISPs to offer better services for specific traffic flows.
Unfortunately, IP, which was originally designed for best-effort connectionless packet transfer, is not well suited to these new features. Multi-Protocol Label Switching (MPLS), which creates virtual tunnels over IP networks, shows its capabilities in handling these new features.

Since its emergence, MPLS has attracted a lot of attention and research because it not only improves network performance and efficiency, but also, and possibly more importantly, provides a natural way to finely control specific traffic flows, making it possible to easily and seamlessly design and implement new features in network routers. One very important component of QoS is fault tolerance, or recovery from network failure. Real-time applications like VoIP (Voice over IP) and other mission-critical tasks require a quick response to a network failure; otherwise the quality requirement cannot be met.

Error handling in traditional IP networks is achieved by detecting a failure (link failure or node failure) at the local router, propagating this failure to the whole network area, and processing the failure after the network topology changes accordingly. This relies on the convergence of the whole network topology. One major disadvantage of this failure-handling mechanism is its slow response (possibly tens of seconds or minutes). For example, in OSPF, which uses flooding for topology information exchange, it may typically take 30 seconds for the whole network topology to converge. This problem arises because of the following:

1. although the lower layers (layer 2 or layer 1) can detect a failure very quickly (in milliseconds), they have no idea where the failure should be reported, because they have no routing information;

2. layer 3 (IP), on the other hand, does not report it either, because it is a connectionless protocol; and

3. OSPF is designed for dynamic routing and, to prevent topology fluctuation, it prefers to postpone the flooding of a topology change until the network topology converges.

However, MPLS makes it possible to handle failures more gracefully because MPLS is connection oriented and lies between layer 2 and layer 3, which means it can detect a failure quickly while at the same time having the upper-layer routing information needed to facilitate the handling of the failure. Unlike OSPF, MPLS knows the information about current sessions or connections, and which of them need to be protected. This helps an MPLS node notify the relevant edge routers of a network failure, and take immediate restoration steps for those tunnels even before the network topology finally converges (under an OSPF propagation mechanism, for example).

1.2 Current Research

Some effort has been made on failure recovery with MPLS. Anat Bremler-Barr, Yehuda Afek, et al. proposed a restoration technique that reroutes a problematic tunnel to a backup one by concatenating the two tunnels at the local node [1]. Murali Kodialam and T.V. Lakshman presented an algorithm for dynamic routing for local restoration (which means the restoration is completed by the local node) [6].
In [3] by Tony Mancill, and [5] by Radim Bartos and Mythilikanth Raman, algorithms for bandwidth guaranteed restoration using MPLS local recovery were presented. The above mentioned research focused mainly on resource reservation (bandwidth) or routing algorithms. However, quick response to prevent packet loss, which is very important to the quality of failure recovery, has not received enough attention. For example, in real-time applications, such as voice over IP, it is desirable to reroute failed tunnels onto backup ones within tens of milliseconds, otherwise, users would notice the degradation in quality. Video broadcasting is another example where too much packet loss is unacceptable for reasonable video quality. Papers [8] and [9] provide basic concepts for failure recovery in MPLS and suggest a set of protocols piggy-backing on RSVP. This thesis focuses on the fast recovery of network failure in MPLS networks. Fast recovery here, means a quick response to the failure while trying to prevent packet loss. In term of failure, we use a single link failure model which means we assume only one link failure occurs at a time. This is the usual case because during a very short period of time it would 3  be very rare for two or more links to fail simultaneously. Also in practice, network links are more vulnerable than nodes, which means link failure happens more often than node or router failure. The single link failure model has been used by most recovery researchers. In this thesis, unless explicitly mentioned, failure or network failure means link failure. We talk about node failure in Chapter 6.  1.3  Overview  In the thesis, a fast recovery mechanism in M P L S networks is designed and implemented over Linux routers. The recovery mechanism is evaluated by experimentally testing its performance with selected end-to-end protocols and applications. A link failure is detected at the M P L S level using the R S V P - H E L L O extension (Section 2.2.7). Primary tunnels (tunnels which need to be protected) and backup tunnels are pre-signaled or pre-established from the ingress router. A protection loop is created along the reverse direction of the primary tunnel (from egress to ingress in the context of the primary tunnel). When a network failure happens, the upstream node of the failed link detects the failure and does the following: 1.  (a) tries to find a backup tunnel through this node but not the failed link, and switch the traffic to the backup one by concatenating the two tunnels, otherwise (b) concatenates the protected tunnel with the protection loop, switching the traffic back to the protection loop, and  2. informs the ingress node of the failure.  4  The ingress node, after receiving the notification of the failure, finally redirects the traffic to a backup tunnel. The failure notification arrives at the ingress node before the whole network topology finally converges. This allows the failure to be handled more quickly than otherwise would occur if only a route distribution protocol, such as O S P F , is used. A n M P L S router only notifies those ingress nodes which need this instant notification (have task-critical traffic passing through this failed link), in contrast to O S P F in which all routers inside the whole area have to be informed. If an ingress router can aggregate multiple traffic (with the same QoS, ingress-egress pair, etc.)  into  multi-commodity flows, it should be scalable. 
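As a rough illustration of the locus-node steps listed earlier in this section, the following C sketch collects the failover decision in one routine. It is only a sketch under assumed data structures: the tunnel and session types and the helper routines (splice_tunnels, notify_ingress) are invented here for clarity and are not the implementation described in Chapter 4.

    /* Illustrative sketch only: types and helpers are invented for this
     * example and do not correspond to the Chapter 4 implementation.       */
    struct tunnel {
        int through_link;                 /* outgoing link used at this node */
    };

    struct protected_session {
        struct tunnel *primary;           /* the protected (primary) tunnel  */
        struct tunnel *backups[8];        /* pre-signaled backups through    */
        int            nbackups;          /* this node                       */
        struct tunnel *loopback;          /* reverse protection loop         */
    };

    static void splice_tunnels(struct tunnel *from, struct tunnel *to)
    {
        (void)from; (void)to;             /* concatenate the two LSPs (stub) */
    }

    static void notify_ingress(struct protected_session *s, int failed_link)
    {
        (void)s; (void)failed_link;       /* e.g. report the fault upstream  */
    }

    /* Invoked by the upstream (locus) node once it detects that a link has
     * failed; mirrors steps 1(a), 1(b) and 2 in the list above.             */
    void locus_on_link_failure(struct protected_session *s, int failed_link)
    {
        struct tunnel *repair = s->loopback;        /* 1(b): default fallback */

        for (int i = 0; i < s->nbackups; i++) {     /* 1(a): prefer a backup  */
            if (s->backups[i]->through_link != failed_link) {
                repair = s->backups[i];
                break;
            }
        }
        splice_tunnels(s->primary, repair);         /* switch the traffic     */
        notify_ingress(s, failed_link);             /* 2: tell the ingress LER */
    }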
By processing the failure locally, instead of relying on the ingress router to do the recovery, packet loss can be reduced. In a practical system, the propagation time of a failure notification from the failure locus to the ingress router can be hundreds of milliseconds, and hence using a temporary recovery tunnel (either a backup one or the protection loop) at the locus node first can be of interest to quality-critical tasks. One possible drawback however, is the out-of-order problem: during this temporary period, until the final backup decision is made by the ingress router, packets may arrive out of order at the destination. Most of the current approaches in the literature use network simulators, such as NS, as their testbed. Using a simulator is more convenient for the user, and in some cases, it would be impossible to obtain real systems. However, it would be more convincing if the algorithm or mechanism is tested over real systems. In account of this, we use Linux workstations as routers in the project platform. We use R S V P - T E (RFC2205 [14], [10]) as the protocol for label distribution, link failure detection and error notification. We implemented the major functions of  5  RSVP-TE for MPLS over Linux and also investigated the performance over Linux. Our experiment shows that the recovery operation can be carried out within 100 milliseconds, and packet loss can be minimized using our fast recovery mechanism (with a protection loop) under certain circumstances. The thesis is organized as follows: Chapter 2 gives some related background knowledge, then the details of the design and implementation are described in Chapter 3 and Chapter 4. Chapter 5 gives the results of the experiment, and the conclusion and future work are given in Chapter 6.  6  Chapter 2  Background 2.1 2.1.1  MPLS From IP to MPLS  The growth in both the number of users of the Internet and their bandwidth and service requirements places increasing demands on Internet service providers' (ISPs) networks. To meet all the growing demand, ISPs need higher performance switching and routing products. Traditional IP routing is done by looking up the F E C (Forwarding Equivalence Class) in a router's F I B (Forwarding Information Base) to find the outgoing interface and next hop address to which the packet should be forwarded when a router takes in a new packet. Normally, an F E C is composed of a destination address, and probably a source address and a QoS class if required. A n IP router organizes its F I B as a hash table indexed by the F E C . W i t h the massive growth of the Internet, a router's F I B may become huge, and it may take too long (compared to the speed of packet arrival) for the router to get a next hop. This would introduce a problem for the performance and efficiency of a router to support a broadband  7  bandwidth network with huge data flows, for example, Gigabit networks.  MPLS  can easily handle this by inserting a label into the header of an IP packet. After an M P L S tunnel (which can be considered as a route created by M P L S signaling) is created, this label can be used as a quick index to the next hop to which the packet is to be forwarded. Without looking up the next hop in a hash table (done in IP forwarding), M P L S can obviously reduce the cost of forwarding packets. This merit makes the M P L S a very good choice for improving the performance and efficiency of routers and simplifying their design and production. 
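The forwarding-cost argument above can be made concrete with a small C sketch contrasting the two lookups. The structures, table sizes and function names are invented for illustration; in particular, the flat 2^20-entry label array is only a didactic simplification (the MPLS-Linux forwarding code described in Section 2.5.3 keeps labels in a radix tree).

    #include <stdint.h>

    /* Conventional IP forwarding: hash the FEC (here just a destination
     * prefix) and walk a bucket chain in the FIB; the chain grows with the
     * size of the routing table.                                            */
    struct fib_entry {
        uint32_t dst_prefix;
        int      out_if;
        struct fib_entry *next;
    };

    #define FIB_BUCKETS 1024
    static struct fib_entry *fib[FIB_BUCKETS];

    int ip_next_hop(uint32_t dst_prefix)
    {
        struct fib_entry *e = fib[dst_prefix % FIB_BUCKETS];
        while (e && e->dst_prefix != dst_prefix)
            e = e->next;                  /* possibly a long search          */
        return e ? e->out_if : -1;
    }

    /* Label switching: the 20-bit label taken from the shim header indexes
     * the incoming-label map directly, so the lookup cost does not depend
     * on the number of routes the router carries.                           */
    struct ilm_entry {
        int      valid;
        uint32_t out_label;
        int      out_if;
    };

    static struct ilm_entry ilm[1 << 20];

    int mpls_next_hop(uint32_t in_label, uint32_t *out_label)
    {
        if (!ilm[in_label].valid)
            return -1;
        *out_label = ilm[in_label].out_label;      /* label swap             */
        return ilm[in_label].out_if;
    }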
Traffic engineering is another motivation for the introduction of M P L S . For example, although using the shortest path (the most widely used routing protocol in IP) can conserve network resources, it may cause congestion problem, which. occurs when the shortest paths from different sources overlap at some of the links. Underutilization and over-utilization of the resources may occur when the traffic over a shortest path exceeds its capacity, while a longer but less busy path between two routers is left under-utilized. To balance the traffic load on different links and routers in the network, traffic engineering routes the traffic flows across the network based on the resources that the traffic requires, and the availability of the resources in the network using C B R (Constraint Based Routing). These requirements for traffic engineering are awkward or difficult for IP, notable for its destination based routing, to satisfy, while M P L S can handle it by creating a virtual tunnel for each specific F E C . The current M P L S technology was derived from several similar attempts. Some of these attempts included Toshiba's Cell Switching Router (CSR), Cisco's Tag Switching, I B M ' s ARIS, and so forth. The working group of the Internet Engineering Task Force (IETF) introduced M P L S in 1997 [12]. M P L S extends routing with  8  • ^  Network layer Network layer data header  "Shim" layer header  Link layer header ^  —- ^ ^ ^ ^ ^ ^ ^ ^ ^  Exp S (3)  Label(20) Label: Label value S: Bottome of Stack  —i  TTL  Exp: Experimental use TTL: Time to Live  Figure 2.1: Carrying label in the shim layer header  respect to packet forwarding, load balancing and path controlling. 2.1.2  MPLS  Overview  In MPLS, a label is a short, fixed-length entity, with no internal structure, inserted into the header of an IP packet. It lies in between layer 2 and layer 3, and is called a shim layer. A label does not directly encode any of the information from the network layer header. It is used as an index by a router to look up the next hop in a switching table to which the packet should be forwarded. See Figure 2.1 for the layer structure of an MPLS packet. The LSR (Label Switching Router) is used to name a router in an MPLS area. One type of special LSR is an LER (Label Edge Router), located at the edge of an MPLS routing domain. The FEC (Forwarding Equivalence Class) is used to classify packets. An FEC is a representation of a group of packets that contain the same traffic requirements. All packets with the same FEC are forwarded or transferred along the same route and to the same next hop. One example of an FEC is a set of unicast packets whose network layer destination address matches  9  a particular IP address prefix. A n ingress L E R is used to insert a label into the packet header from its F T N (FEC-To-Nexthop) mapping table. A n egress L E R is used to remove a label from the packet header before the packet departs from the M P L S domain. Only at an ingress does the L E R need to look up the F E C , and at all other LSRs only layer2 switching is needed. In M P L S , an L S P (Label Switching Path) tunnel is created using some type of distribution protocol (see 2.1.3) before it can be used. The establishment of an LSP, and traversal of a labeled packet across an M P L S domain, is shown in Figure 2.2. In this example, an L S P is created from L E R A to L E R C for a certain traffic flow.  Normally, the creation of an L S P is initiated at an ingress L E R . 
Here, A  initiates the tunnel setup by sending a label request to the downstream node B . B then sends a label request to C . When the tunnel is set up, a label is assigned for this tunnel from C to B and B to A . Depending on different mechanisms, label assignment between B and A , and that between C and B , can be done independently, or the former one should be held until the latter is completed. When the traffic flow arrives, ingress router A admits the traffic by searching its L I B (Label Information Base) using the flow's F E C as the index. A inserts the label 32 into the header of the IP packet and forwards it via outgoing interface 2 to L S R B , according to the routing information it receives from its L I B . Now B receives an M P L S packet and looks up the routing information in its L I B using the incoming label 32 as the index. B gets the outgoing interface 3 and the outgoing label 33 as the related routing information from its L I B , replaces label 32 with 33, and forwards it through interface 3 to node C . C performs similar steps but removes the M P L S label and forwards it as a regular IP packet to an IP router outside the domain. Logically, M P L S routing can be partitioned into two basic components: the  10  LSP setup (from LSR A to C)  Label Reservation (label=32)  Label Reservation (label=33)  IF:interface LSR: Label Switching Router LER: Label Edge Router  Traffic flow through LSP  iif inlabel oif outlabel 2  32  3  33  FEC: Forwarding Equivalence Class oif: outgoing interface iif: incoming interface inlabel: incoming label outlabel: outgoing label  Figure 2.2: LSP in MPLS  11  control plane and the forwarding plane. The forwarding component (plane) is responsible for the actual forwarding of packets from input to output across a switch or router. The forwarding component uses a forwarding table and the information carried in the packet itself to forward a packet. The forwarding table is maintained and managed by the control component. Although, in this thesis and by the current literature M P L S is used over IP networks, it is not intrinsicly designed as working over IP. Instead, it supports any layer 3 protocol and is capable of operating over any link layer protocol, hence, the name multi-protocol. However, this is beyond the scope of this thesis, and we restrict the discussion to IP only.  2.1.3  Label Distribution  Label Creation Label creation for LSPs can be triggered by a topology change, the incoming of data packets or tunnel requests. The three creation techniques related to the three triggering mechanisms are topology-driven, data-driven and request-driven. In the topology-driven mechanism, LSPs are created before the data transmission, while the data-driven method uses the reception of data flow, and the request-driven method uses user requests to initiate L S P label creation.  Label Assignment A label can be assigned by a downstream node or an upstream node. Currently, the downstream assignment method is more widely used, especially in explicit routing. We use downstream assignment in this thesis.  12  Label Setup Different label setup methods are allowed in M P L S , depending upon the routing methods that are used. In hop-by-hop routing, each L S R determines the next hop based on its own F T N table. This is similar to the one currently used in IP networks. In explicit routing, an explicit path (strict or loose) is specified by the L E R before the L S P is established. 
This explicit path is normally chosen by the administrator or from C B R (Constraint Based Routing). Each L S R follows the explicit path to setup the tunnel and assign labels. It is mainly used in traffic engineering.  Distribution Protocols There are two different kinds of protocols for label distribution: new defined protocol and protocol piggybacking on an existing protocol. L D P (Label Distribution Protocol) is a set of new protocols defined by the M P L S working group, used solely for M P L S label distribution. It is useful when no existing protocol can be used. However, using it can be expensive because a completely new protocol must be handled. Piggy-backing label distribution on another existing protocol is more preferable. Candidate protocols include: R S V P , O S P F , P I M and B G P . Adding some extensions to these protocols can save a lot of time rather than defining and handling a completely new one, such as L D P . In the thesis we use the R S V P - E R O extension to support M P L S label distribution, which is one of the piggy-backing methods.  13  2.2  RSVP  R S V P (ReSerVation Protocol) is a resource reservation setup protocol designed for an integrated services Internet [14]. R S V P is used by a host to request specific qualities of service from the network for particular application data streams or flows. It is also used by routers to deliver QoS requests to all the nodes along the paths of the flows, and to establish and maintain a state for providing the requested service. A n R S V P request generally results in some resource being reserved in each node along the data path. R S V P operates on top of IP. However, R S V P does not transport application data, but is rather an Internet control protocol. Similarly to I C M P and I G M P , R S V P works as a control part, not a forwarding part. R S V P is not itself a routing protocol. A n R S V P process consults the local routing database to obtain routes. R S V P is only concerned with the QoS of those packets that are forwarded in accordance with routing. For scalability, R S V P makes receivers responsible for requesting a specific QoS. R S V P protocol carries the request to all the nodes along the reverse data path to the data source. R S V P establishes a soft state for the resource reservation, which means R S V P sends periodic refresh messages to maintain the state along the reserved path(s). In the absence of refresh messages, the state automatically times out, and is released.  2.2.1  Data Flows  R S V P defines a session to be a data flow with a particular destination and transportlayer protocol. A n R S V P session is defined by the tuple of D'estAddress, Protocolld 14  and DestPort. DestAddress is the IP destination address.  Protocolld is the IP  protocol ID. Optional DestPort is a generalized destination port, for example, for further demultiplexing.  2.2.2  Reservation Model  A n elementary R S V P reservation request consists of a flowspec, together with a filterspec; this pair is called a flow descriptor. Theflowspecspecifies a desired QoS. The format and content of theflowspecare opaque to R S V P . The filterspc, together with a session specification defines the set of data packets: the flow to receive this QoS. In the most general approach, filter specs may select arbitrary subsets of the packets in a given session. The basicfilterspecformat defined in the present R S V P specification has a very restricted form: sender IP address and optionally SrcPort. 
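A minimal C sketch of how a session together with a filter spec selects the packets of a flow is given below. The field layouts and the matching function are hypothetical simplifications for illustration only; the actual RSVP objects and their wire formats are summarized in Section 2.2.6.

    #include <stdint.h>

    /* Simplified, hypothetical field layouts.                              */
    struct rsvp_session {           /* DestAddress, ProtocolId, DestPort    */
        uint32_t dest_addr;
        uint8_t  protocol_id;
        uint16_t dest_port;         /* 0 when unused                        */
    };

    struct rsvp_filter_spec {       /* sender address and optional SrcPort  */
        uint32_t sender_addr;
        uint16_t src_port;          /* 0 when unused                        */
    };

    /* A packet belongs to the reservation when it matches the session and,
     * for styles with explicit sender selection (FF, SE), the filter spec;
     * the WF style would skip the sender checks.                           */
    int reservation_matches(const struct rsvp_session *s,
                            const struct rsvp_filter_spec *f,
                            uint32_t src, uint32_t dst, uint8_t proto,
                            uint16_t sport, uint16_t dport)
    {
        if (dst != s->dest_addr || proto != s->protocol_id)
            return 0;
        if (s->dest_port != 0 && dport != s->dest_port)
            return 0;
        if (src != f->sender_addr)
            return 0;
        if (f->src_port != 0 && sport != f->src_port)
            return 0;
        return 1;
    }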
R S V P message carrying reservation requests originate at receivers, and are passed upstream towards the sender(s). A t each intermediate node, a reservation request triggers two general actions, as follows: • Make a reservation on a link The R S V P process passes the request to admission control and policy control. If either test fails, the reservation is rejected, and an error message is returned to the appropriate receiver(s). If both succeed, the node sets its forwarding plane (e.g, packet classifier) and forwards the request upstream if needed. • Forward the request upstream A reservation request is propagated upstream towards the appropriate senders. The reservation request that a node forwards upstream may differ from the request that it received from downstream because of, for example, traffic merg-  15  ing.  2.2.3  Reservation Styles  The reservation styles are defined below: • A Wildcard-Filter (WF) style implies shared reservation and wildcard sender selection. • A Fixed-Filter (FF) style implies options with distinct reservation and explicit sender selection which creates a distinct reservation for data packets from a particular sender, instead of sharing them with other senders' packets for the same session. • Shared-Explicit (SE) style implies options with shared reservation and explicit sender selection which creates a single reservation shared by selected upstream senders.  2.2.4  Encapsulation of R S V P Messages  Generally R S V P messages are sent and received using protocol 46 as IP datagrams. However, they can also be encapsulated in U D P if the host systems do not support raw network I / O .  2.2.5  Basic Mechanism  There are two fundamental R S V P messages: Path and Resv. Each R S V P sender host transmits R S V P Path messages downstream along the uni-/multicast routes provided by the routing protocol(s), following the paths of the data. These Path messages store a path state in each node along the way. Each receiver host sends  16  Figure 2.3: Using RSVP  RSVP reservation request (Resv) messages upstream towards the senders. They create and maintain a reservation state in each node along the path. Figure 2.3 is an example that uses RSVP for multicasting, with SE as the reservation style. 2.2.6  Messages a n d Objects i n R S V P  Each RSVP message consists of a common header, followed by a body consisting of a variable number of variable-length, typed objects. Common Header Figure 2.4 shows the structure of the common header. Ver Protocol version number. The current version is 1. Flags Not yet defined. M s g Type l=Path,2=Resv,3=PathErr,4=ResvErr,5=PathTear,6=ResvTear,7=ResvConf  17  Ver  Flags Msg Type  (4)  (4)  R S V P Checksum (16)  (8)  Send_TTL  (Reserved)  R S V P Length  Figure 2.4: R S V P Common Header  RSVP Checksum The one's complement of the one's complement sum of the message. Send-TTL I P T T L value with which the message is sent. RSVP Length The total length of this R S V P message i n bytes, including the common header.  RSVP Object Every object is composed of one or more 32-bit words with a one-word header, with the format shown i n Figure 2.5. Length Total length of the object in bytes. Class-Num Defines object class. C-Type Object type. Unique within each Class-Num. Combined with Class-Num, determines an object.  Message Types Path A Path message travels from a sender to a receiver(s) along the same path(s) used by the data packets. 
It contains a S E N D E R _ T E M P L A T E object defining 18  Length (bytes) (16)  Class-Num (8)  C-Type (8)  (Object contents)  Figure 2.5: RSVP Object Format  the format of the data packets and a SENDER_TSPEC object specifying the traffic characteristics of the flow. Resv  A Resv message carries a reservation request hop-by-hop from the receiver to the sender, along the reverse path of the data flow for the session.  PathTear  A PathTear is used by an upstream node to request the downstream  node (next hop) to tear down the session. Receipt of a PathTear message deletes the matching path state. PathTear must be routed exactly like the corresponding Path message. ResvTear  A ResvTear message is initiated by the receiver or by any node in which  the reservation state has timed out, and travels upstream towards all matching senders. Receipt of a ResvTear deletes the matching reservation state. ResvTear must also be routed exactly like the corresponding Resv message. PathErr  A PathErr message reports errors in processing path messages. It travels  upstream towards the sender and is routed hop-by-hop using the path state. PathErr messages do not modify the state of any node through which it passes. It is only reported to the sender.  19  ResvErr A ResvErr message reports errors in processing the Resv message. It travels downstream towards the appropriate receiver, routed hop-by-hop using the reservation state. ResvConf A ResvConf is sent downstream to the receiver when requested, to acknowledge the reservation request. 2.2.7  R S V P - T E Extension for M P L S  RSVP-TE is an extension of RSVP for establishing label switched paths in MPLS networks. It was initially proposed by Daniel O. Awduche, Lou Berger, et al. [10]. The RSVP-TE protocol supports the instantiation of explicitly routed LSPs, with or without resource reservation. The LSPs created with RSVP can be used to carry Traffic Trunks. Since the traffic that flows along an LSP is defined by the label applied at the ingress node, and is opaque to all intermediate nodes, these paths can be treated as tunnels: LSP tunnels. The signaling protocol model defines the specification of an explicit path as a sequence of strict or loose routes. While this is not mandatory, resource reservation can be done together with LSP tunnel creation. LSPs without resource reservations can be used in, for example, best-effort network traffic. Basic Concepts When RSVP and MPLS are combined, a session can have a more flexible meaning, compared to the data flow defined in RSVP. The ingress node can use a variety of means to determine which packets are assigned a particular label, according to its admission control policy. In RSVP-TE, a SESSION object uniquely defines a traffic engineered tunnel 20  (TE tunnels) which may consist of several LSP tunnels, for example, to spread traffic to multiple paths or reroute traffic during failure. A tunnel ID and an extended tunnel ID are parts of a SESSION object. SENDER_TEMPLATE (downstream) and FILTER_TEMPLATE (upstream) objects carry the L S P J D . An LSP tunnel is uniquely identified by the SESSION object, and the SENDER_TEMPLATE or FILTER_TEMPLATE object. To create an LSP tunnel, the first MPLS node of the path (ingress) creates an RSVP Path message carrying a SESSION object with a session type of LSP.TUNNEL J P V 4 or LSP-TUNNEL J P V 6 and a L A B E L . R E Q U E S T object. The LABEL_REQUEST object indicates that label binding for this path is requested. 
The ingress may also add an EXPLICIT-ROUTE object (ERO) to the Path message. ERO specifies the route as a sequence of abstract nodes. The Path message is forwarded along the path specified by the ERO when there is one. When this Path message is received by the destination of the LSP path, the destination (normally egress) responds to the label request by including a L A B E L object in its response RSVP Resv message. The Resv message is sent back upstream toward the sender, following the path state created by the Path message, in reverse order. When a node receives a Resv message with a L A B E L object inside of it, the node binds this label as the outgoing label. If a node can not bind a label it sends a PathErr message with some error code inside, back to the sender. As the Resv message propagates towards the sender node, an LSP path is established.  21  M a j o r Message a n d O b j e c t F o r m a t s E x t r a objects such as L A B E L _ R E Q U E S T and L A B E L are added, and some extensions to existing messages are introduced to R S V P for supporting L S P signaling. Messages: Path  <C0MM0N HEADER> [<INTEGRITY>] <SESSI0N> <RSVP_H0P> <TIME_VALUE> [<EXPLICIT_ROUTE>] <LABEL_REQUEST> <sender d e s c r i p t o r ^ <sender  descriptor>::  <SENDER_TEMPLATE> <SENDER_TSPEC>  Resv  <C0MM0N HEADER> [<INTEGRITY>] <SESSI0N> <RSVP_H0P> <TIME_VALUE> <STYLE> <flow d e s c r i p t o r <flow d e s c r i p t o r  list>::  <FF flow d e s c r i p t o r l i s t > <FF flow d e s c r i p t o r  I <SE flow  list>::  <FL0WSPEC> <FILTER_SPEC> <LABEL> I <FF flow d e s c r i p t o r <FF flow <FF flow  list>  descriptor>  descriptor>::  [<FL0WSPEC>] <FILTER_SPEC> <LABEL> <SE flow  list>  descriptors:  22  descriptor>  <FLOWSPEC> <SE f i l t e r spec l i s t > <SE f i l t e r spec l i s t > : : <SE f i l t e r spec> I <SE f i l t e r spec l i s t > <SE f i l t e r spec> <SE f i l t e r  spec>::  <FILTER_SPEC> <LABEL>  PathErr  <C0MM0N HEADER> [<INTEGRITY>] <SESSI0N> <ERROR_SPEC> [sender  ResvErr  descriptor]  <C0MM0N HEADER> [<INTEGRITY>] <SESSI0N> <RSVP_H0P> <ERROR_SPEC> <STYLE> [<error f l o w  PathTear  <C0MM0N HEADER> [<INTEGRITY>] <SESSI0N> <RSVP_H0P> [<sender  ResvTear  descriptor>]  descriptor>]  <C0MMGN HEADER> [<INTEGRITY>] <SESSI0N> <RSVP_H0P> <STYLE> <flow d e s c r i p t o r  list>  H e l l o <common header> <HELL0> R S V P Hello extension enables R S V P nodes to detect when a neighbouring node is not reachable.  23  The Hello extension is  Figure 2.6: Generic Label Request Object  label (32)  Figure 2.7: Generic Label Object  composed of a H E L L O - R E Q U E S T object and a H E L L O _ A C K object. Each object carries source and destination instance values.  Objects: L A B E L _ R E Q U E S T class=19, C _ T Y P E = 1 L3PID:identifier of layer 3 protocol using this path. Standard Ethertype values are used.  L A B E L class=16,C_TYPE=l 4 octets.  Generic M P L S labels range from 0 to  1048575. ERO  class=20,C_TYPE=l A n E R O consists of multiple sub-objects, each of which represents an address of a virtual node.  L S P _ T U N N E L J P V 4 Session Object Class=SESSION,C_TYPE=7 L S P - T U N N E L J P V 4 Sender Template Object C l a s s = S E N D E R _ T E M P L A T E , C_TYPE=7  HELLO Class=HELLO, C_TYPE=1 HELLO_ACK Class=HELLO, C_TYPE=2  24  IPV4 tunnel end point address(32) Must be zeros.  Tunnel ID( 16)  Extended Tunnel ID(32)  Figure 2.8: L S P _ T U N N E L J P V 4 Session Object  IPV4 tunnel sender address (32) Must be zeros.  LSP ID (16)  Figure 2.9: L S P . 
T U N N E L J P V 4 Sender Template Object  Src_Instance (32) Dstjnstance (32) Figure 2.10: H E L L O R E Q U E S T Object  Src_Instance (32) Dstjnstance (32) Figure 2.11: H E L L O A C K Object  25  2.3 2.3.1  Linux Overview  Linux has become very popular since it emerged in 1991. Generally, as an OS it can be divided into two parts: user level processes and kernel side (protected) processes. The kernel contains device drivers, memory management, process management and communication management. Communication between user-level processes and the kernel is achieved through message passing using system calls. In the following we cite something closely related to this thesis. One can find a large number of references about Linux on the web. The kernel version which we use as the platform is Linux 2.4.2.  2.3.2  System Call  In Linux, system calls are used by user processes to request kernel services. The signal system call for example, is used to install a user-defined signal handler for a specific signal.  2.3.3  /proc File System  In Linux, /proc files and directories do not actually exist. The /proc file system registers itself with the Virtual File System. For example, the kernel's /proc/'devices file is generated from the kernel's data structures describing its devices. The kernel traps proc file access, and instead of executing ordinary file operations on them it calls special (individually registered) functions. When a file in the /proc directory is created, it is registered with a set of functions that tell the kernel what to do when the file is read from or written to. The /proc file system presents  26  a user readable window into the kernel's inner workings. Some of the files can be modified like a regular file, which facilitates changing a kernel working mode. 2.3.4  Linux Network  Linux T C P / I P Networking Layers Figure 2.12 shows how Linux implements the internet protocol address family as a series of layered software. BSD sockets are operated by a generic socket management software (concerned only with BSD sockets). Supporting it is the INET socket layer which manages the communication end points for the IP based protocols T C P and UDP. Underneath the IP layer are network devices, for example, ethernet and PPP. BSD Socket Interface The BSD socket interface is a general interface which not only supports various forms of networking, but is also an inter-process communication mechanism. Linux supports several classes of sockets, and these are known as address families. The socket address families or domains supported by Linux include the following: UNIX Unix domain sockets INET T C P / I P AX25 Amateur radio X25 IPX Novell IPX A P P L E T A L K Appletalk DDP X25 X25 protocol  27  Network Applications User Kernel BSD Sockets  Socket Interface INET Sockets  Protocols  TCP  UDP  layers  IP  Network  PPP  SLIP  Devices  Figure 2.12: Linux T C P / I P layers  28  Ethernet  There are several socket types and they represent the type of service that supports the connection. Linux BSD sockets support a number of socket types: Stream supported by the T C P protocol of the I N E T address family Datagram supported by the U D P protocol of the I N E T family  Raw Reliable Delivered Messages Sequenced Packets Packet Communication among processes via sockets uses the client-server model. A server provides a service, and clients make use of that service. Linux abstracts the socket interface with the B S D socket layer. 
The exact meaning of operations on a B S D socket depends on its underlying address family.  2.3.5  Configuration as a Router  In Linux, several tools are provided for configuring a network function. The ifconfig is used to configure network interfaces, ip addresses, netmasks, and so forth. The tool route is used to add and delete routing entries. To enable its working as a router, the contents of the file  /proc/sys/net/ipv4/ip-forward  should be 1. More information about the Linux router can be found in [15].  29  2.4  Transmission Control Protocol  2.4.1  TCP  Transmission Control Protocol (TCP) is a connection oriented, end-to-end reliable protocol designed to fit into a layered hierarchy of protocols which support multinetwork applications [16]. We use T C P applications to test system performance. • Reliability At the sender, T C P assigns a sequence number to each octet transmitted and requires a positive acknowledgment (ACK) from the receiver. If the Ack is not received within a timeout interval, the data is retransmitted. The receiver uses sequence numbers to correctly order segments. • Flow Control T C P provides a means for the receiver to govern the amount of data sent by the sender. This is achieved by returning a Window with every A C K indicating a range of acceptable sequence numbers beyond the last successfully received segment. • Connection In TCP, each connection is uniquely specified by a pair of sockets. (A socket is identified by an IP address and a socket port number.) Two processes must first establish a connection before they can communicate. Figure 2.13 shows the structure of a T C P header.  30  Source Port  Destination Port Sequence Number Acknowledgment Number  Data Offset U  U A P R c s  Reserved  G  S  K  H  R s T  S Y N  F  Window  w  Checksum  Urgent Pointer Options  Padding  Figure 2.13: T C P Header  2.4.2  T C P High Performance Extensions  One performance problem for T C P when it is applied on a large bandwidth network is the window size limit. A T C P header uses a 16 bit field to report the receiver window size to the sender; therefore, the largest window that can be used is 64KB. Another problem arises from the cumulative acknowledgement mechanism, which may cause a devastating effect on the throughput of T C P because all the packets following a lost packet must be retransmitted even if only a single one gets lost. For addressing the above and some other problems of T C P , T C P High Performance Extensions were introduced in RFC1323 ([18]). B y defining a new T C P option and Window Scale, and using Selective Acknowledgement ( S A C K ) , the window size is allowed to be larger than 64KB, and the problems caused by cumulative acknowledgement can be avoided.  2.4.3  T C P in Linux  The T C P protocol is implemented in Linux according to the specifications of RFC793, RFC1122 and RFC2001. Linux 2.4.2 supports RFC1323 T C P high performance extensions, including large T C P windows and selective acknowledgement ( S A C K ) . In 31  order to make use of them, the send and receive buffer sizes must be increased, and the SACK function should be enabled. These can be be set globally with the net. core.wmem-default, net.core.rmem-default  2.5 2.5.1  and net.ipv4-tcpsack sysctls.  O S D N M P L S project Introduction  There is a project group, MPLS-Linux in Open Source Development Network (OSDN), working on implementing an MPLS stack in the Linux kernel [4]. The MPLS-Linux package works as the forwarding component of the MPLS. 
The current version is 0.996. We use version 0.990 in this thesis. The package is implemented in C, and supplied as a difference file created from the utility diff. For performance consideration, the implementation (addition and modification) spreads through a lot of Linux kernelfiles;however this brings an inconvenience for understanding its processing for it is not modularized. It uses inlabel and outlabel to represent the label of the incoming packet and the label which is put onto the outgoing packet, respectively. It uses labelspace to enable global or local label contexts. Labels on different interfaces can exist in the same context or their own contexts, depending on the labelspace definition for each interface. The implementation also supports label stacks and ATM, PPP, and so forth, as the lower layer. We use it over Ethernet. The /proc/net/mpls_*filesprovide a window for viewing the internal working of the MPLS forwarding part.  32  2.5.2  Data Structure  Three major data structures are defined. They are as follows: ILM  Incoming Label Map This is used to help interpret the incoming label. A n I L M table consists of all the incoming labels that an L S R or egress L E R recognizes. The content of each I L M entry is the label, opcode, F E C and an optional link to the outgoing data structure. The general operation for incoming label processing is as follows: • extract label from top shim • lookup the label in the I L M table • further processing on the packet based on the opcode stored with the label  N H L F E Next Hop Label Forwarding Entry N H L F E is used to assist in adding outgoing labels. The N H L F E table consists of all the labels which can be pushed onto packets. A n N H L F E entry contains the label, outgoing interface and next hop information. The general processing that occurs when a packet reaches the N H L F E table is as follows: • form new shim that contains the label • push the shim onto the packet • forward the packet to the next hop via the outgoing interface F T N F E C To N H L F E The third structure is used to help the ingress L E R figure out what label should be tagged to a packet. The F T N table consists of all F E C s unto which  33  labels are needed to be added. A n F T N entry contains an F E C and an N H L F E entry. The general order of operation for processing an F T N is as follows: • decide what F E C a packet belongs to • find the F E C in the F T N table • forward the packet to the N H L F E entry that corresponds to the F T N  2.5.3  Label and Labelspace  The generic label is a 32-bit integer, the highest 12 bits of which is the labelspace. Labelspace means the label context of the interface this label is associated with, either as an incoming or an outgoing one. A l l inlabels are organized as a Radix Tree with a width of 4 and depth of 8, while outlabels are stored in another Radix Tree.  2.5.4  Operation and Interface  The L i n u x - M P L S package provides an interface through which it can communicate with user-level control components. To achieve this, it extends I O C T L and R T N E T L I N K message types. •  IOCTL — labelspace: setting labelspace — M P L S tunnel: adding and/or deleting an M P L S tunnel  • RTNETLINK — inlabel: adding and/or deleting an inlabel, setting the opcode A n inlabel is represented in the form of < l a b e l type>:<label value>: <label space>. For example, gen:32:0 represents a generic label valued 32 with 0 as its 34  labelspace. 
Opcodes or instructions for an inlabel include the following: * pop: remove the top label from the label stack * push: push another label onto the label stack * dlv: send this packet to the layer 3 protocol * peek: look at the label on top of the label stack, and start executing instructions associated with it. * set: set the incoming interface to something different from the real incoming interface * fwd: send a packet to an outgoing label structure to be processed outlabel: adding and/or deleting an outlabel, setting the opcode A n outlabel has the form of <label type>:<value>:<interface>:<nexthop family>:<nexthop address>. For example, gen:33:eth0:ipv4: represents a generic outlabel valued 33 with ethO as its outgoing interface, and an IPv4 address as its nexthop to which the packet is forwarded. Opcodes for outlabels include the following: * pop: remove the top label from the label stack * push: push another label onto the label stack * set: copy the outgoing interface and the next hop layer 2 destination from the outgoing label structure * fwd: send a packet to an outgoing label structure to be processed F E C : adding and/or deleting the mapping of an F E C To an N H L F E One limitation is that the F E C entry should already exist in the layer 3 35  192.168.1/ 192.168  add outlabel gen:32:1 :ipv4: set opcode push:gen:32:1 FTN to gen:32:l set labelspace 0 to interface 0 add inlabel gen:32:0 set opcode pop:fwd:gen:33:1 add outlabel gen:33:l:ipv4:  set labelspace 0 to interface 0 add inlabel gen:33:0 set opcode pop:dlv  setopcode push:gen:33:1  Figure 2.14: Operations in L i n u x - M P L S  routing table before it can be used, because the implementation makes use of the data structure of the original layer 3 routing table. Figure 2.14 is an example of creating an M P L S tunnel from node A to C for the F E C with as the destination prefix. Suppose all the labels that are to be assigned and reserved along this L S P (this is the task of the control plane of M P L S using R S V P - T E , for example.) are: at A , outlabel of 32 for interface 1; at B , inlabel 32 of interface 0 and outlabel 33 for interface 1; and at C, inlabel 33 of interface 0. To create this L S P tunnel, the operations that the control part should perform at each of the nodes are shown in the figure.  2.6  Failure Recovery  There are two main approaches to failure recovery: failure protection and failure restoration. Protection requires that resources are preallocated, while restoration relies on dynamic resource establishment. It may take longer time for a restoration strategy to recover from failure because extra steps are needed to reserve resources.  36  I n this thesis, we b a s i c a l l y focus o n failure p r o t e c t i o n , i n w h i c h resources are p r e a l located.  37  Chapter 3  Design 3.1  Overview  In this Chapter, a fast failure recovery algorithm is designed in M P L S . We try to quickly respond to network faults and reduce packet loss as much as possible using our fast reroute mechanism. The model is designed in Linux kernel 2.4.2. We assume network topology and primary and protection routes are already available, which can be obtained from O S P F or configuration files.  3.2  Label Binding  Downstream label binding is used in our design. A label is assigned by the downstream node, and the upstream node uses this label as the outgoing label which is put in the outgoing packet while the downstream node uses it as the incoming label. 
The M P L S data plane (Section 2.1) uses this incoming-outgoing label mapping to forward packets after the L S P tunnel is set up. The creation or destruction of label bindings can be triggered by data pack-  38  ets (referred to as data-driven label bindings) or control information (referred to as control-driven label bindings). In the data-driven approach, a label binding is triggered when the first packet of a flow, or several packets, are seen. However, this mechanism which is often used for best-effort traffic, is not applicable here because the primary and protection tunnels need to be created and initialized before fast recovery can be ready when a failure happens. Also, the control-driven approach is preferred for its scalability and simplicity. In this thesis, all primary and protection tunnels come from control information so the control-driven approach is thus adopted.  3.3  Primary and Protection Routes  In our model, primary and protection tunnels are chosen and set up, starting from ingress L E R s . Protection routes may come from network topology (obtained from O S P F , for example) and/or control information. A n ideal protection route should be one whose links are disjoint with the primary tunnel because this ensures that protection can always be available for any link or node fault along the primary path. We call this ideal backup tunnel the Disjoint tunnel in this thesis for simplicity. We suppose all protection routes are already set up. A n ingress router plays a key role in our recovery since all backup routes are chosen and stored at ingress routers. Compared with the distributed approach where backup routes are selected at every node to protect individual links, this centralized approach may have a disadvantage with respect to the scalability issue. For example, in Figure 3.1, to achieve the same protection robustness for the link failure BC and CD with a distributed approach where 2 local backup routes are needed for single link protection, 3 backup routes are needed here. 39  T o protect the route A - B - C - D for failure of link B C and/or C D , backup routes A - B - E - C - D , A - B - E - C - F - D , A - B - C - F - D can be used. F i g u r e 3.1: P r o t e c t i o n Routes  Importantly, our centralized strategy makes it easier to a p p l y global resource allocation policies, a n d to implement internal routers. Scalability issues are further discussed i n C h a p t e r 6.  3.4  Failure Detection  L i n k failure can be detected at either the physical layer or the d a t a link layer. Normally, when detection occurs at a lower level, a fault can be sensed more quickly. In our model, we can not o b t a i n the failure detection service at lower layers, a n d hence failure is detected i n the M P L S layer (the s h i m layer between the I P a n d the d a t a link layer). In the M P L S , P A T H / R E S V messages are used for creating a n d b i n d i n g labels. A timeout of these messages is used for tearing down L S P tunnels. However,  40  it can not be used for failure detection because the refresh period of these messages is in seconds for scalability consideration. The HELLO message is defined in RSVP-TE for detecting and recognizing neighbouring LSRs. Its refresh period can be set as short as 5 milliseconds or less. We use it to detect network failure in our design.  3.5  Failure Report  After a failure is detected by the locus node (the node closest to the fault), a failure notification is reported back to the ingress LER. 
A failure is detected and reported to the ingress node by the upstream node. In [3], Tae H.Oh, Thomas M.Chen and Jeffery L.Kennington use the downstream node to detect and process the failure. One disadvantage of using the downstream node for this role is that sometimes there may not exist any route between the locus node and the ingress L E R for the network failure to be reported. This can be seen in Figure 3.2. When link BC fails, downstream node C can not find a path through which the failure can be reported to the ingress L E R A. Although this limitation can be avoided by reporting the error to the egress L E R D, and letting it then communicate with LER A, it nevertheless introduces an inconsistency between backup route selection and failure reporting, and unnecessarily delays the final notification arrival at L E R A. While using the upstream node for the role, there always exists a path for this task which is just the reverse direction of the primary route, the same one RESV messages traverse. Hence, upstream nodes are used to detect and report failures, and this approach is simpler, responds more quickly, and is more efficient for failure handling.  41  Primary Route ° ~ Protection Route  Egress  When link B C breaks, there is no direct path for downstream C to report the failure to ingress L E R A . F i g u r e 3.2: Failure R e p o r t  3.6  Failure Processing  After a failure is detected a n d reported from the locus node to the ingress node through the reverse direction of either a p r i m a r y p a t h or a protection p a t h , it c a n then be processed at the ingress node. T h e ingress L E R chooses an alternative t u n n e l to which all following flows are redirected, a n d stops the former F T N m a p p i n g . D e p e n d i n g o n the type a n d location of the failure, as well as the control policies, it selects a Disjoint backup tunnel as preferred, or otherwise it selects another b a c k u p tunnel.  T h e problematic tunnel is invalidated a n d allowed to die out gradually  without being t o r n down explicitly. T h e session still exists, but w i t h no label b i n d i n g unless the deletion instruction from the upper layer is received (possibly because the network topology finally converges through for example, O S P F ) . T h i s lazy deletion strategy prevents the unnecessary system cost of releasing a n d recreating t u n n e l sessions when the failure is a temporary event. Besides notifying the ingress L E R a n d letting it make the  final  recovery  operation when a failure is detected, the locus node also adopts a local recovery  42  When the link BC fails, locus node B switches traffic back to ingress LER A through the loopback tunnel. Figure 3.3: Loopback Tunnel  step. If the recovery completely relies on the ingress LER, all packets would get lost after the failure until the recovery operation is completed at the ingress LER. Depending on the length of time, the total number of packets lost may vary. Giving this observation, we attempt to be more aggressive and try to save as many packets from being lost as we can, using a protection loopback tunnel for each primary route to reroute traffic back to the ingress L E R during that temporary period. The protection loopback tunnel is created and signaled along the reverse direction of the primary tunnel as soon as the primary tunnel is set up. 
When a failure occurs, the locus node first reroutes the influenced traffic back to the ingress L E R through this loopback tunnel, and simultaneously notifies the ingress L E R of the fault. This can be thought of as a temporary repairing step before the final one is made at the ingress LER. Figure 3.3 shows the use of a loopback tunnel. Thus, we achieve fast recovery through the combination of a centralized and local  43  recovery mechanism.  3.7  Operations at the Locus LSR  When a failure is detected, operations at the local L S R are as follows: • Finding a protection tunnel A n eligible protection tunnel should at least, not include the downstream link, and preferrablely not include the downstream node, for covering downstream node failure. If a regular protection tunnel can not be found, the protection loop is selected. In brief, the preferential order of choosing a protection tunnel is as follows: — Node Failure Protection Tunnel — Link Failure Protection Tunnel — Loopback Tunnel • Switching the traffic Reroute the traffic to this chosen protection tunnel • Notifying ingress L E R of this failure After a failure happens, the locus L S R stops sending the refresh R E S V messages to its P H O P (the upstream node) until it receives a refresh R E S V message from the N H O P (the downstream node). The failed L S P tunnel gets invalidated through the timeouts of the R E S V message. A receipt of an R E S V message from the N H O P for a failed tunnel is an indication that the failed link or node has recovered and reactivates the tunnel.  44  Disjoint Protection Tunnel  Primary Tunnel abfd: backup tunnel which can protect node c failure abfcd:backup tunnel which protects link cd failure  Figure 3.4: Route Types  3.8  O p e r a t i o n at the Ingress N o d e  An ingress LER initializes the setup of all primary and protecting tunnels. All MPLS tunnel sessions are not deleted unless management instructions are received or the network topology changes. The ingress LER also concatenates the protection loopback tunnel with one protection tunnel so that a recovery tunnel is ready to respond to a network fault. The ingress LER acknowledges a failure through the failure notification carried in a PATHERR message generated at the locus node. In response to a failure, it tries to select a backup tunnel to which the traffic is rerouted. The preference order for choosing a backup tunnel is as follows: • A Disjoint Tunnel  45  • Node Failure Protection Tunnel Basically two possible factors cause a network fault: link failure and node failure. It is best to choose a protection tunnel resistant to both cases if it is not possible to distinguish between the two types of failures. • Link Failure Protection Tunnel If no other choice is available, a link failure protection tunnel is chosen. If unfortunately, the fault is a downstream node failure, the selected protection tunnel (which can only resist a link failure) is invalidated later after a failure notification of that tunnel arrives at the ingress node. In this case, the ingress LER stops forwarding the influenced traffic until at least one tunnel is available again. The problematic primary tunnel is still used until it finally times out, and is reactivated in case a refresh RESV message is received from the NHOP or replaced by a new one if an instruction from the upper level arrives. It is up to the system whether the traffic which is already re-directed to a protection tunnel needs to be switched back to the primary one. 
In our design, we choose to switch it back.  3.9  Operation at the Egress LER  The only operation necessary for an egress LER as part of the failure recovery is to signal the protection loopback tunnel in the reverse direction for each primary tunnel. Thefirsttime an egress LER receives a PATH message for a primary tunnel it creates a reverse protection tunnel session and sends out PATH messages in the reverse direction. As the PATH message traverses all nodes hop-by-hop, along the reverse path of the primary route, a protection loopback tunnel is set up from the 46  egress L E R to the ingress L E R (in the context of data flow instead of the reverse loop tunnel).  3.10  Analysis of Potential Problems  Centralized failure recovery at the ingress L E R , a n d locally repairing the link w i t h enabling a reverse protection loop as previously described, makes it possible  to  quickly respond to network faults while d r o p p i n g as few packets as possible. However, the effectiveness of using the protection loop may be influenced by several factors which are analyzed i n the following.  •  Timing T h e failure recovery cycle is illustrated i n F i g u r e 3.5.  T l reflects the delay  of the detection of a n occurrence of failure. T 2 a n d T 4 reflect the delay for recovery operations to take effect after they are issued. T 3 arises m a i n l y f r o m network propagation. If we ignore the influence of the operation delay of T 2 and T 4 , w h i c h we can not control, the goal i n using a loopback t u n n e l is to try to save packet loss d u r i n g T 3 . O b s e r v i n g that we always have to lose some packets d u r i n g T l , T 2 or T 4 , we m a y ask if our approach would be u n d e r m i n e d by the packet loss w h i c h has to h a p p e n i n T l , T 2 or T 4 . F o r example, if T l is much longer t h a n or comparable w i t h T 3 , does this loopback tunnel protection still make sense?  • Reordering A n o t h e r issue which may undermine our strategy is the re-ordering of packets, which means that the destination m a y receive packets out of order.  Out-of-  order delivery of packets occurs when the final recovery step is finished at the  47  Tl  T2  T3  T4  L o c a l Repairing in Effect  Recovery Completed  Failure Happened  Notification Arrived  Failure Detected Local Repairing Issued Failure Notification Sent F i g u r e 3.5: Recovery C y c l e  ingress node, a n d all the following packets go t h r o u g h the backup tunnel while some former packets remaining i n the loopback tunnel must arrive later at the ingress L E R . T h i s out-of-order p r o b l e m lasts u n t i l all the packets already i n the loopback t u n n e l get redirected to the backup t u n n e l by the ingress L E R . F i g u r e 3.6 shows this problem.  After the recovery steps are completed  the ingress L E R , packets 101 a n d 102 are already o n their way i n the  at  new  backup tunnel, while packets 98,99 a n d 100 have to arrive later at the ingress node t h r o u g h the loopback tunnel a n d be forwarded into the chosen protection tunnel. T h e arrival sequence of these packets i n destination may be 101-10298-99-100 instead of the in-order delivery of 98-99-100-101-102.  •  End-to-end Protocol F r o m the previous explanation, it is clear that the end-to-end protocol used may influence the performance of our packet saving endeavor using loopback tunnels. 
If the end-to-end protocol can not resist the out-of-order p r o b l e m , or can not reorder packets, this aggressive strategy (using loopback tunnels) may  48  Protection Tunnel  o*-  104 103  Egress  Loopback Tunnel Locus Primary Tunnel Figure 3.6: Out-of-order Problem  not be suitable. Also, if there is no buffering mechanism in the end-to-end protocol, the loopback strategy may also need to be carefully reconsidered. T C P is a widely used end-to-end protocol which guarantees reliable transport (see Section 2.4). To make use of the benefits of our loopback strategy, the two T C P options, S A C K (Section 2.4.2) and large buffering (to hold all out-oforder packets) may be needed. We notice that before our recovery commands are issued and take effect, some packets already get lost. Without S A C K , T C P may resend all those packets following the first lost one, even though there may be only 1 packet lost. This means T C P without S A C K may not be able to make use of those saved packets in our loopback strategy.  The  buffering capacity reflects the ability of T C P to handle packet reordering. If the buffer (window size) is too small, T C P may too quickly resend packets that it thinks of as lost, although they may arrive a little later. In addition, T C P has a congestion control mechanism which curbs sending out new packets if an acknowledgement is not received within the expected time-out period. This  49  congestion control m a y also help alleviate the reordering p r o b l e m . L i n u x 2.4.2 supports S A C K a n d the large buffering option (Section 2.4.3) which facilitates our design. In this project, U D P is also used for our evaluation. Packet loss is measured in U D P to evaluate our packet saving strategy.  T h e implementation of the above design over L i n u x clusters is given i n C h a p 4. Performance evaluation over T C P a n d U D P is given i n C h a p t e r 5.  50  Chapter 4  Implementation 4.1  Overview  A s described i n Section 2.5, the M P L S forwarding plane (data plane) used is from O S D N . W e used L i n u x kernel 2.4.2  workstations as our L S R s , a n d implemented  the basic control plane i n user space.  T h e control plane uses R S V P - T E as the  label d i s t r i b u t i o n protocol i n accordance w i t h the I E T F draft [10], a n d realizes our fast recovery algorithm.  T h e system infrastructure, key d a t a structures, a n d the  inter-operation are described i n this chapter.  4.2  System Structure  O u r M P L S control plane runs as a user level p r o g r a m i n L i n u x .  W e use U D P to  transport L S P signaling messages, a n d adopt R S V P - T E as the label d i s t r i b u t i o n protocol [10].  N E T L I N K messages are used to convey control information to a n d  from the d a t a plane executed i n the kernel space. D i a g r a m 4.1 depicts the structure of our implementation.  51  Tunnel Manager  Session  Label  Manager  Manager  T RSVP  Neighbour  Packager  Detector  UDP  NETLINK  Transceiver  Messager J  Linux Kernel  Figure 4.1: Structure of Implementation  A description for each functional unit is given below. U D P Messager This is used to transmit and receive messages from other neighbouring LSRs. These messages include MPLS tunnel signaling messages (Path/Resv messages) and neighbouring LSRs' hand-shaking (HELLO messages). Both types of messages are encapsulated in UDP packets. 
Path/Resv messages are sent or updated and received at the frequency of seconds, while HELLO messages occur more frequently, normally every 100 milliseconds. N E T L I N K Messager  This is used to communicate with the kernel side MPLS  52  data plane mainly for label manipulation, such as adding labels, deleting labels, label binding, and so on.  RSVP Packager It is responsible for packaging and unpackaging R S V P messages to or from the U D P Transceiver.  Neighbour Detector The Neighbour Detector makes use of the H E L L O messages from the R S V P Packager to detect network failure, and notifies the Session Manager of the failure for related M P L S sessions. It also periodically sends out H E L L O messages through the service provided by the R S V P Packager to all its neighbours at the frequency set during initialization to detect failure.  Session Manager A session is a dynamic concept in our implementation. A n L S P tunnel created in memory, ready for M P L S sigalling, is called an L S P session. At an ingress L E R , the Session Manager reads in information of all primary and protection tunnels from the Tunnel Manager, and triggers an M P L S session for each tunnel. The Session Manager receives L S P tunnel signaling messages from its neighbouring LSRs through the R S V P Packager, and performs related tunnel operations. Upon the receipt of Path/Resv messages it reserves labels and does label bindings, and upon the receipt of Pathtear/Resvtear messages it revokes the reserved labels and undoes label bindings through the label manager. Other relevant operations are also issued when receiving Patherr/Resverr messages. The Session Manager also periodically updates the states of tunnels and sends out Pathtear/Resvtear messages in the case of a time-out. The Session Manager acknowledges a network failure through the service provided by the Neighbour Detector when a network fault happens. The failure  53  is then reported by the locus LSR to the ingress LER with a Patherr message. The Session Manager of the locus LSR also reacts to a network failure by redirecting the flow to a pre-signaled protection LSP tunnel. The redirection at the locus LSR is achieved by concatenating the failed tunnel with a protection one (Section 4.9). All other LSRs, except the locus and the ingress nodes, forward the Patherr message back to the ingress LER. Label Manager The purpose of the Label Manager is to manage label spaces. It maintains a table of labels in use for each network interface. On the request from the Session Manager it may reserve or recycle a label for a certain interface. Label binding or unbinding is then completed through the NETLINK Manager. Tunnel Manager The Tunnel Manager maintains all primary tunnels and protection tunnels. The Tunnel Manager gets the tunnel information from the routing table if the node is an ingress LER, or from the Session Manager, which receives tunnel sessions from its neighbours if it is an internal LSR. The Tunnel Manager also provides the information to the Session Manager for selecting a protection tunnel in case of network failure. In our implementation, a simple configuration file is used to contain all the tunnel information at an ingress LER.  4.3  MPLS Session  A session is an LSP tunnel whose status is active (i.e., created in memory and may not yet be signaled as OK). 
Each LSP session corresponds to a certain path or route 54  hash table  session grpl  session grp n  T session 1 session 2  session n  Figure 4.2: Organization of Sessions and Tunnels  of a data flow. A l l sessions are organized into a hash table. Each traffic flow ( F E C ) may have multiple corresponding tunnels (primary and protection) which are treated as one session group. A l l tunnels (corresponding to the same F E C ) in the same session group are stored as a linked list in a hash bucket. The organizational structure is shown i n Figure 4.2. We adopted this organization because all the tunnels for the same F E C are stored together, which can facilitate tunnel operations. A session group is identified by the Session Key, which is composed of a Destination Address Prefix, a Source Address and a Session Id. The Session Id is assigned at the ingress L E R , and corresponds to a certain F E C . In addition, an  55  F E C basically includes the destination address or prefix, plus e x t r a information if needed, such as the source address a n d / o r Q o S requirement.  It is possible that  multiple session groups for different traffic flows can exist between the same pair of source-destination, a n d hence Session Ids are used, c o m b i n e d w i t h source a n d destination addresses to identify these traffic flows. In our implementation, the Destination A d d r e s s Prefix is the address prefix of the final receiver (not egress L E R ) , while the Source A d d r e s s is the default I P address of the ingress L E R . A s already stated, each session group contains p r i m a r y a n d protection t u n nels.  E a c h tunnel w i t h i n a session group is uniquely identified by the field Sender  T e m p l a t e which contains the sender's I P address a n d the tunnel's L S P I D . Here, the sender refers to the ingress L E R , not the originator. T h e L S P I D is used to distinguish different tunnels ( L S P s ) w i t h i n the session group. E a c h L S P session needs two ways of signaling: upward a n d downward signaling (for example, each L S R receives Path's from, a n d sends Resv's back to, upstream L S R s , while it also receives Resvs from, a n d sends Paths to, downstream L S R s ) . In our notation, a n L S P session has two sub-sessions, Incoming a n d O u t g o i n g , each of which relates to one of the two traffic flow directions. A s a summary, the i m p o r t a n t d a t a fields i n a n L S P session (for b o t h Incoming a n d Outgoing) are described i n the following. Sender Template Sender's I P address, L S P I D  Hop P h o p (Previous Hop) for Incoming a n d N h o p (Next H o p ) for the O u t g o i n g I P address of a P h o p or N h o p  56  Time Value The interval between two refresh signals required by a Phop or Nhop (A refresh Path/Resv message should be sent out to a Nhop/Phop before the time-out.) Iif/Oif Incoming/Outgoing network interface index Label For Incoming, label assigned to a Phop For Outgoing, label assigned by a Nhop Pathtimer For Incoming, the last time when a Path message was received For Outgoing, the last time when a Path message was sent to the Nhop Resvtimer For Incoming, the last time when a Resv message was sent to the Phop For Outgoing, the last time when a Resv message was received Up,Down Upward or Downward session from or to which the tunnel is currently concatenated.  
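Put together, the per-tunnel record implied by these fields might be declared roughly as follows. This is an expository sketch only; the field names and C types are assumptions made for illustration and do not reproduce the actual data structures of our implementation.

```c
#include <stdint.h>
#include <time.h>
#include <netinet/in.h>

/* Identifies a session group (one FEC and all of its tunnels). */
struct session_key {
    struct in_addr dest_prefix;   /* destination address prefix of the FEC  */
    struct in_addr source;        /* default IP address of the ingress LER  */
    uint16_t       session_id;    /* assigned at the ingress LER             */
};

/* Identifies one LSP tunnel inside the session group. */
struct sender_template {
    struct in_addr sender;        /* ingress LER (egress LER for a loopback) */
    uint16_t       lsp_id;        /* 0 = primary, >0 = protection tunnels    */
};

/* One signaling direction of an LSP session (Incoming or Outgoing). */
struct lsp_subsession {
    struct in_addr hop;           /* Phop for Incoming, Nhop for Outgoing    */
    uint32_t       time_value;    /* refresh interval required by that hop   */
    int            ifindex;       /* Iif for Incoming, Oif for Outgoing      */
    uint32_t       label;         /* Incoming: label given to the Phop;
                                     Outgoing: label assigned by the Nhop    */
    time_t         path_timer;    /* Incoming: last Path received;
                                     Outgoing: last Path sent to the Nhop    */
    time_t         resv_timer;    /* Incoming: last Resv sent to the Phop;
                                     Outgoing: last Resv received            */
};

struct lsp_session {
    struct session_key     key;
    struct sender_template sender;
    struct lsp_subsession  in, out;
    struct lsp_session    *up, *down;  /* sessions this one is concatenated to */
    struct lsp_session    *next;       /* next tunnel in the same hash bucket  */
};
```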
4.4  Primary Tunnels and Protection Tunnels  Tunnels and Explicit Route Objects In our implementation, a primary LSP and all its protection LSPs are contained in the same session group, sharing the same FEC. An LSP configuration file (routes.conf)  57  is used for configuring the tunnel groups. Each record in the routes.conf represents an FEC and all its LSPs with explicit routes. During LSP signaling, an LSP tunnel is identified at each LSR by the Session Key, Sender Template and LSPID. At an ingress LER each FEC is given a unique session ID within that ingress LER as part of a Session Key. All LSRs match the primary and protection tunnels by checking the Session Key and LSPID when RSVP messages are received. Explicit routing is used to create the LSP tunnels. An ERO (Explicit Routing Object) is carried in each Path message. Each LSR checks the ERO in an incoming RSVP message (Path message only) to see if the next hop to which the message is to be sent is the LSR itself. A Patherr message is sent back to the Phop when an unexpected Next Hop appears in the ERO. Each internal LSR (other than egress nodes) strips off the first hop (which is the LSR itself) from the ERO and puts the shortened ERO in the Path message, which is sent to the Nhop. Each LSR holds part of the explicit route for an LSP starting from itself, which can be used to help choose a protection tunnel in case of a failure (see Section 4.9). Explicit Route for A Loopback Tunnel As explained in the previous section, each LSR only records a shortened path of the complete route, and hence, an egress LER does not know all the upstream hops of the explicit route for an LSP tunnel. As a result, the egress LER is not able to signal a loopback protection tunnel to the ingress LER unless we add an extra mechanism. In our implementation, the problem is solved by carrying the loopback route as an extra segment following the normal explicit route for a primary LSP tunnel.  58  When an egress LSR receives the ERO in a Path message, it gets the loopback route which is left in the ERO. The egress node acknowledges the loopback route (from an ordinary ERO) or the request of loopback protection when the Nhop in an ERO equals the Phop from which this Path message is received. For example, consider a primary tunnel with loopback protection with an explicit route (at the ingress node) similar to A —> B —> C —> D -» C —> B —> A, where A and D are LERs. When the Path message from C arrives at egress LSR D, D gets the route D —>• C —> B —» A (the segment A —> B —> C has been stripped off at former hops) and determines that the last segment is a loopback route by checking if the Nhop of this ERO, which is C, equals the Phop of this Path message, which is also C. D, instead of forwarding the Path message to the Nhop as it would normally, initiates another tunnel session towards A for the loopback protection tunnel. An additional benefit of this mechanism is that an ingress LER can decide whether a protection loopback tunnel is enabled by adding or not adding the loopback segment onto the ERO.  Distinguishing Tunnels We have three different kinds of tunnels: primary, protection and loopback tunnels. Each needs to be treated differently in many cases. For example, we need to find out which one is the loopback tunnel in a session group when we need it, such as in the case of a fault. 
As well, because the loopback protection tunnel is concatenated to another protection tunnel at the ingress LER, obviously we want to ensure the looping operation does not happen on ordinary tunnels. The question is how can we distinguish them?  59  Primary tunnels are distinguished from protection tunnels by using different LSPIDs: 0 for primary tunnels, and a non-zero value for protection tunnels. A loopback protection tunnel uses the egress L E R ' s address in its Sender Template which contrasts to a regular tunnel that uses the ingress L E R ' s address instead. A loopback protection tunnel can be distinguished by seeing if the Sender IP Address in its Sender Template is different from the Source Address in its Session Key. Ordinarily, tunnels use the same address.  4.5  LSP Tunnel Signaling  R S V P - T E is used to signal L S P tunnels. Each L S R maintains the states for all L S P tunnels it currently holds. After a tunnel session is created, there are mainly two possible states for the session: Waiting and Connected. A session in a Connected state means the tunnel is successfully signalled, the incoming or outgoing labels are assigned, and the label mapping is created.  A session in a Waiting state means the session is  created, but still waiting for a response from the downstream LSRs, for example, for label assignment. Events that can make a tunnel session state change include: R S V P messages (Path, Patherr, Resv, Resverr, etc.), timeout, and network failure. A l l R S V P messages are inspected for their validity before further processing is made. These validity checks include: interface check, hop (Phop/Nhop) check and label check. A n R S V P message from an unexpected interface (for example, a Path message for an existing tunnel session received from an interface different from the last time) is discarded. A Path/Resv message from an unexpected Phop or Nhop is also discarded. 60  A detailed description for the processing of these events is as follows: 1. A new Path message initiates the creation of a new LSP tunnel session. 2. A Path message for an existing LSP tunnel session refreshes the session state (Pathtimer/Resvtimer of the session is refreshed to the current time value, for example). 3. A new Resv message with a label inside makes the LSR reserve a label for the incoming interface of this tunnel session and send out a Resv message to its upstream neighbour. A new Resv message makes an ingress LER redirect the traffic to the successfully created tunnel. An Resv message for an existing tunnel session refreshes the session state. 4. A Resv message for an existing session refreshes the session state. 5. A Pathtear message instructs the LSR to tear down the created tunnel without deleting the session, and for intermediate nodes, forward a Pathtear message downstream. 6. A Resvtear message gets the LSR to tear down the created tunnel without deleting the session, and forward a Resvtear message upstream, if applicable (for intermediate nodes). 7. Processing a Patherr message depends on what kind of error is carried inside. For network failure notification the Patherr message is just forwarded upwards until it reaches the ingress LER where the recovery operation is carried out (Section 4.8). 8. Receipt of an Resverr message may get the LSR to reserve a different label and reconfigure the tunnel. 61  9. T h e timeout of a t u n n e l session deletes the tunnel session f r o m memory. T h i s is the only way of deleting a t u n n e l session.  
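The event handling above amounts to a dispatch on the message type plus a timer-driven cleanup. The following sketch shows that control flow in outline; the helper routines are trivial stand-ins invented for the example, not the actual functions of our control plane.

```c
#include <stdio.h>

enum rsvp_msg { MSG_PATH, MSG_RESV, MSG_PATHTEAR, MSG_RESVTEAR,
                MSG_PATHERR, MSG_RESVERR };

struct lsp_session { int state; };   /* stand-in for the record of Section 4.3 */

/* Trivial stubs standing in for the real session and label routines. */
static struct lsp_session *find_or_create_session(void) { static struct lsp_session s; return &s; }
static void refresh_timers(struct lsp_session *s)        { (void)s; }
static void bind_label_and_send_resv(struct lsp_session *s) { (void)s; puts("label bound, Resv sent upstream"); }
static void tear_down(struct lsp_session *s)              { (void)s; puts("tunnel torn down, session kept"); }
static void start_ingress_recovery(struct lsp_session *s) { (void)s; puts("ingress recovery started"); }

static void dispatch(enum rsvp_msg type, int is_ingress)
{
    struct lsp_session *s = find_or_create_session();

    switch (type) {
    case MSG_PATH:      /* events 1-2: a Path creates the session if new and refreshes it */
        refresh_timers(s);
        break;
    case MSG_RESV:      /* events 3-4: reserve a label, answer upstream, refresh state    */
        bind_label_and_send_resv(s);
        refresh_timers(s);
        break;
    case MSG_PATHTEAR:  /* event 5 */
    case MSG_RESVTEAR:  /* event 6: undo the label binding without deleting the session   */
        tear_down(s);
        break;
    case MSG_PATHERR:   /* event 7: forward towards the Phop, or recover at the ingress   */
        if (is_ingress)
            start_ingress_recovery(s);
        break;
    case MSG_RESVERR:   /* event 8: possibly reserve a different label                    */
        break;
    }
    /* Event 9, session timeout, is driven by the periodic timer and is
       the only place a session is actually removed from memory.         */
}

int main(void)
{
    dispatch(MSG_PATH, 0);
    dispatch(MSG_RESV, 1);
    dispatch(MSG_PATHERR, 1);
    return 0;
}
```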
4.6  Label Management  A label is assigned by the downstream L S R on an interface basis. has a n independent label space.  E a c h interface  E a c h i n c o m i n g or outgoing label is identified by  the corresponding interface index a n d the label value. A valid label value is a 20-bit unsigned integer. T h e kernel side M P L S d a t a plane maintains a n d manages all i n / o u t labels currently i n use. T h e user-side control plane i n our implementation also maintains a copy (organized differently from the one at the kernel side) of all labels i n use for performance revisions: for example, a free label could be quickly assigned without having to consult the kernel side every time.  T h e control plane maintains a label  allocation table for each interface, a n d each M P L S session records the i n / o u t labels it is currently using.  4.7  Failure Notification  W h e n network failure is detected at the locus L S R , the error needs to be reported as quickly as possible to the ingress L E R . T h e question is how this error should be reported, or what k i n d of information should be carried i n the notification message. T o handle a network failure, the ingress L E R needs to know the location of the failure. It also needs to know for which L S P tunnel session this error occured. A s a result, the I P address of the locus L S R where the error is detected, together with an error code for network fault, are carried i n the P a t h e r r message for a failed L S P  62  tunnel (with its Session Key and Tunnel ID, etc.). The Patherr message traverses in the reverse route of the Path message, and each LSR forwards the Patherr message without modification back to its Phop of the LSP tunnel. On receipt of the Patherr message, the ingress LER knows the actual location where the error happened, and can make the appropriate recovery operation based on the error information. If the Patherr message is lost, the fault recovery would have to be delayed until a timeout occurs for the refresh Resv message for the tunnel session at the ingress node. No failure is reported for the loopback protection tunnel because the loopback tunnel does not need to be re-protected.  4.8  Ingress LER  General Operations An ingress LER maintains and manages mappings of FECs to LSP tunnels. The session ID (in the Session Key) is assigned for each FEC by an ingress LER. The LSPID is assigned by the ingress LER, and used together with the sender's IP address to identify an LSP tunnel within an LSP session group. LSPID 0 is assigned to the primary tunnel, while other values are assigned to protection tunnels starting from 1. For each traffic flow, only one FTN (FEC To Nexthop) mapping can be active at the ingress node, which means a certain FEC traffic can not have multiple independentflowsthrough our MPLS network. An Ingress LER gets all the tunnels' information (FECs and explicit primary/protection routes) from a configuration file. During initialization, an ingress LER takes in a session group entry (with its FEC and all tunnel information). The Ingress LER then assigns a Session ID and  63  L S P I D ' s a n d signals the sessions for all of its tunnels. W h e n a P a t h message for a loopback protection tunnel is r e c e i v e d , 1  the  ingress L E R assigns a label a n d sends back a Resv message c a r r y i n g this label backwards (downstream).  
T h e n , instead of issuing a P O P / D L V operation for this  inlabel which would normally forward its representing traffic outside the M P L S area, the ingress L E R concatenates the loopback tunnel to a n already created protection t u n n e l for later use. T h e concatenation is done as follows:  Do label binding Loopback.inlabel—> Protection, outlabel Record the concatenation Loopback. down = Protection  a n d Protection.up = Loop-  back  T h e chosen protection tunnel should ideally be a Disjoint tunnel. A n ingress L E R requests the loopback protection by giving the loopback segment i n the E R O of the P a t h message, a n d disables the loopback protection without that segment.  Recovery Operations W h e n a network failure report is received at an ingress L E R , the ingress  node  processes the error by doing the following steps:  R e c o r d i n g the E r r o r T h e tunnel state is changed to Failed, a n d remains i n that state u n t i l a Resv message is received from its downstream neighbour. Here, this ingress can be considered as the egress of the loopback tunnel: however, we still call it ingress in the context of the traffic flow and a loopback tunnel is considered as a special protection tunnel of the primary tunnel. 1  64  Choosing a Protection Tunnel A protection tunnel in a good status is chosen according to the privilege order . described in Chapter 3. Rerouting the Traffic Traffic  is now rerouted to the chosen protection tunnel and the former FTN  mapping is revoked. After a tunnel is marked as Failed, all LSRs (except the ingress LER without an upstream neighbour) stop sending refresh Resv messages upwards until refresh Resv messages are received from downstream nodes. All LSRs still send refresh Path messages downwards to keep the session alive (not removed from memory). The tunnel along its path remains Failed until the network failure is fixed. Later, the reception of a Resv message from downstream makes the ingress LER acknowledge that the once failed tunnel is operational again, and in our implementation, the rerouted traffic is switched back to the primary tunnel.  4.9  Locus LSR  Fault Detection A network failure is detected using RSVP Hello messages. Each LSR periodically sends out Hello messages to all its neighbouring LSR's. Each LSR maintains the instance number of itself which is increased by one each time a Hello message is sent to its neighbour. Also included in a Hello message is the destination's instance number which was the one received most recently. These two source/destination instance numbers can be used by neighbouring LSR's to detect each other. The interval of two consecutive Hello messages should be as short as possible, 50ms in our 65  implementation.  M i s s i n g two consecutive Hello messages is used as the i n d i c a t i o n  of a network failure. T h e upstream L S R of the failure processes the failure (reports it to the ingress L E R a n d adopts a temporary local recovery). Sometimes external interference can cause a link or node to be instable d u r i n g some p e r i o d . T h i s m a y be reflected as the alternate occurrences of failure a n d recovery. T o avoid the flapping p r o b l e m of a network l i n k / n o d e failure, the Hello mechan i s m is not used for detecting network recovery (which means the failed l i n k / n o d e is recovered). Instead, the receipt of a Resv message from the downstream neighbour is used as to indicate the link recovery. T h i s is attractive for two reasons:  1. 
F i r s t , for recovery we don't need to be so hasty to detect it a n d reroute the traffic back to the p r i m a r y p a t h , compared to a network failure to which we have to be as quick as possible to respond to reduce packet loss.  2. Resv messages are processed i n much longer periods (normally several seconds). C o m p a r e d w i t h Hello messages, using R e s v messages to sense the link recovery can help avoid overburdening L S R s .  Local Recovery Operation W h e n network failure is detected, the upstream node notifies the ingress L E R by sending back a P a t h e r r message w i t h its own I P address inside, a n d also invokes a local recovery operation. If the tunnel is a p r i m a r y tunnel (protection tunnels are not re-protected!), a local recovery operation includes the following:  Choosing a Protection Tunnel If a regular protection tunnel can not be found, the loopback protection tunnel is chosen.  66  Rerouting the Traffic The original label mapping is replaced with the one which redirects the traffic to the protection tunnel chosen in the above. Failed.inlabel—t Protection, outlabel, Failed.down = Protection, Protection.up = Failed.  A loopback tunnel finally dies out with the timeout of the Path message . This 2  means the loopback concatenation does not last very long, just a short period (temporary recovery concatenation). This approach is elegant because we do not want the looping to last too long (consuming too many network resources). When the network is fixed (a Resv message with a label inside is received for the session), this concatenation (if still active) is disconnected by re-mapping the tunnels within themselves, as follows: Fixed.inlabel^rFixed.outlabel, Fixed.down = 0, Protection.up = 0, Protection, inlabel-> Protection, outlabel.  4.10  Egress L E R  Main tasks of an egress L E R include the following: 1. reserving a label and sending back Resv messages which set up the tunnel; and 2. creating and initiating signaling the loopback tunnel session for a primary tunnel, if required. T h i s happens because either the downstream link or node is broken and no more Path messages will come in. 2  67  W h e n a P a t h message for a p r i m a r y tunnel is received for the first time, the egress L E R creates a loopback t u n n e l session i n memory, a n d starts signaling the tunnel if there is a loopback segment i n the E R O of the received P a t h message. T h i s loopback t u n n e l session remains active as long as the p r i m a r y t u n n e l session remains alive. Receipt of a Resv message for a loopback tunnel session f r o m the upstream L S R is ignored because the first segment of the loopback t u n n e l is practically useless (there is no F T N for the loopback tunnel). A loopback t u n n e l session uses the same Session K e y value as that i n its regular p r i m a r y tunnel, but a different Sender T e m p l a t e value. T h i s means that for a loopback tunnel, the Session I D a n d the L S P I D are not special a n d the ingress I P address is still used for the Source, but the egress I P address is used for the Sender T e m p l a t e . In addition, this feature can be used to identify a loopback t u n n e l from an ordinary one by all L S R s .  4.11  OS Influence  T h i s section discusses some special factors i n L i n u x which influenced our implementation.  1. T i m e r In practice there is a n u p p e r limit to the frequency i n w h i c h a T i m e r can be set i n L i n u x .  
W e tested a n d found that a p e r i o d of less t h a n 30ms would  cause the L i n u x kernel to crash. W e set a p e r i o d of 50ms for the T i m e r i n our experiment. T h i s is used for sending a n d receiving Hello messages. A s analyzed i n Section 3.10, we prefer that a network failure be detected as quickly as possible. T h e  68  timer limitation means the failure can only be detected within 100-200ms. 2. Communication and Scheduling A delay can not be completely avoided for an operation to be completed in Linux. This delay arises from the communication from user space to kernel side and OS scheduling. It can vary and is out of our control . A possible effect 3  of this delay is that it may take longer for a command issued in our user space, the MPLS control plane, to finally be completed than for a message to travel through our Ethernet network and complete at the ingress node. We have to artificially add some delay at some point in our code to make the timing consistent without losing generality, which we present in our experiment in the next chapter.  Although by implementing the whole stuff as a kernel module the delay can be reduced, it would be far more complex than our thesis can handle. 3  69  Chapter 5  Performance Evaluation 5.1  Overview and Experimental Environment  In the previous two chapters we described the design a n d implementation of a fast recovery system for network failures i n M P L S networks. F o r quickly responding to network failure, a quick failure detection mechanism using R S V P - H E L L O messages is adopted.  B y notifying the ingress L E R of the failure, which makes the ingress  router reroute the influenced traffic to the best pre-signaled backup tunnel, a n d simultaneously invoking a, local recovery operation using a loopback tunnel at the locus npde where the failure is detected, packet loss is expected to be reduced. In this chapter, we performed some experiments to evaluate our recovery performance m a i n l y over two end-to-end protocols: T C P a n d U D P . L i n u x workstations are used to b u i l d our experimental topology.  Since our  entire testing system is set up over L i n u x , a n d our concern is basically focused o n failure recovery response a n d packet loss, instead of a routing algorithm, it suffices to use simple network topology i n the experiment. F i g u r e 5.1 shows the topology we use as our test bed. L i n u x servers  70  A,B,C  A , B , C , D : M P L S routers P,Q: communication peers A l l : Linux 2.2. P H 266/128M F i g u r e 5.1: Testing Network T o p o l o g y  and  D compose our M P L S network where A a n d D are edge routers.  P and Q  are two hosts, each of which communicates to each other, traversing our testing network. T h e configuration of all the L i n u x boxes are: Intel P I I 266, 128M memory, and 100/10 E t h e r n e t . T h e L A N is 100M E t h e r n e t . W e investigated the c o m m u n i c a t i o n between P a n d Q while using P as the sender a n d Q as the receiver. T w o M P L S tunnels for the c o m m u n i c a t i o n P —» Q are pre-signaled: a p r i m a r y route ABD,  a n d a protection route ACT). T h e link BD  is made unreliable to simulate a network failure. T h e p r i m a r y a n d protection tunnels, plus a loopback protection tunnel are pre-signaled as O K . Normally, the traffic goes t h r o u g h the p r i m a r y tunnel W h e n the link BD  DBA ABD.  
fails, the traffic is rerouted to the protection tunnel ACD at ingress A, and routed back to the loopback protection tunnel at locus node B. When the failed link recovers, the traffic is routed back to the primary tunnel ABD.

As described in Section 4.11, the processing and scheduling delay in the Linux OS may result in timing inconsistencies for our experiment on a LAN. To achieve timing consistency, and to simulate network propagation delay (latency), an artificial delay is introduced into the transport of Patherr messages (used for failure notification) and is used as a parameter in our experiment. Two TCP features, the SACK and large TCP window size options, are enabled in Linux (2.4.3). Buffer size for sending and receiving is another parameter in the experiment. We show how these parameters affect the results below.

5.2  Evaluation Criteria

To evaluate our recovery performance, response time and packet loss are investigated in our experiment. However, response time is sometimes hard to measure accurately in practice, and in that case we compare our recovery performance with the performance without network failure, which indirectly reflects the recovery time. We also compare the performance of our recovery mechanism with and without the protection loop.

5.3  Evaluation Results

5.3.1  Using TCP

Experiment 1

In the first experiment, we transfer a large file (700MB) between the two hosts P and Q via TCP through our MPLS network, with the link B-D periodically out of service (simulated by blocking the communication between B and D periodically). We use the total elapsed time for our evaluation. The experimental results are shown in Table 5.1. Failure between B and D occurs every 10 seconds, and lasts 5 seconds each time. We rely on TCP for speed control (trying to send data out as fast as possible). In the table, d stands for the latency, wsize stands for TCP window size, Loop for recovery with loopback protection, and n-Loop for recovery without loopback protection.

  wsize                 64KB     100KB    640KB
  No Failure            59.7s
  d=100ms   Loop        60.4s    60.4s    60.2s
            n-Loop      60.3s    60.3s    60.3s
  d=200ms   Loop        60.00s   60.00s   59.96s
            n-Loop      60.33s   60.33s   60.40s
  d=300ms   Loop        60.59s   60.26s   60.20s
            n-Loop      64.00s   63.90s   63.30s
  d=500ms   Loop        60.8s    61.4s    60.5s
            n-Loop      66.3s    69.9s    68.7s
  d=700ms   Loop        68.9s    64.2s    63.8s
            n-Loop      72.2s    72.8s    63.5s

  Table 5.1: Time needed for transferring 700MB of data

By investigating these results and comparing the total time spent in all the different cases, we notice the following:

1. For a latency less than or equal to 500ms, Loop performs very well (very close to the 0-loss performance). For a latency less than or equal to 200ms, n-Loop also performs very close to the 0-loss case. Failure can be quickly detected (within 100ms) with our recovery algorithm, and hence, packet loss is small.

2. When the latency is low (100ms, 200ms) the performance of Loop and n-Loop is very close. This may imply that the effort of using a protection loop to reduce packet loss is not necessary for low-latency networks. Network failure is detected within 100ms.
A s analyzed i n Section 3.10, when the latency is  comparable w i t h , or close to, the delay of failure detection, loopback protection  73  does not make too m u c h difference.  3. T C P buffer size has no effect i n all the cases. T h i s is unexpected since a larger buffer size should p e r f o r m better, as we h a d expected i n Section 3.10.  The  answer lies i n T C P congestion control i n which T C P slows or stops sending out packets when a n acknowledgement is not received w i t h i n a expected m a x i m u m delay. Therefore, when a network failure happens, T C P does not continue to b l i n d l y send out packets u n t i l the buffer is full, but instead stops for a while until the next acknowledgement  is received from its peer (when a protection  tunnel is i n use).  4. W h e n the latency is 700ms, the L o o p mechanism performs worse t h a n it does for a latency of 500ms.  T h i s m a y at first glance contradict our expectation  that L o o p should p e r f o r m better for a higher latency. However, it is actually just a result of the cost of using a protection loop.  O u r average transport  speed is 9 0 M b s which is close to the network b a n d w i d t h (lOOmb E t h e r n e t ) . Since it would be i m p r a c t i c a l for us to implement resource allocation i n our testing of L i n u x routers, a traffic j a m arises when the same physical link is used by the bidirectional protection loop for a longer time! T h u s i n practice, although L o o p is more suitable for high latency network, consideration for the cost of network resources is still necessary.  O u r recovery algorithm (either L o o p or n-Loop) works well w i t h T C P . It reacts quickly to network failure so that packet loss can be prevented as soon as possible. U s i n g protection loop can help save extra packets from being lost, a n d expedite failure recovery for high latency networks. However, the resource c o n s u m p t i o n cost should be carefully calculated. Interestingly, the T C P congestion control mechanism  74  helps prevent the packet out-of-order p r o b l e m , a n d eliminate huge buffer requirement for otherwise h a n d l i n g out-of-order packets.  Experiment 2 In the second experiment, we try to measure total packet loss d u r i n g a one-time network failure. A comparison is given to our recovery a l g o r i t h m w i t h a n d without a protection loopback tunnel. A script using the tcpdump utility is used to measure the total number of ret r a n s m i t t e d packets, which is also total packet loss w i t h the S A C K option enabled in T C P . W e found that whatever values the T C P buffer size a n d network  latency  (from 100ms to 700ms) are, L o o p a n d n - L o o p get the same packet loss (2 packets)! T h i s just confirms our analysis i n the above, that T C P congestion control is i n effect i n this situation. W e can say that for our recovery L o o p buffering is not a n extra issue w i t h T C P , or it does not b r i n g i n extra b u r d e n for T C P buffering.  5.3.2  Using U D P  Since T C P has a mechanism for congestion control, it stops sending out new packets when it does not receive acknowledgements  from the other end for the  packets  already sent. T h i s implies that the sender, when a network failure happens, avoids sending more packets u n t i l the failure is fixed, or the traffic is rerouted to a backup tunnel.  T h i s is shown i n our former experiment that packet loss is the same for  Loop and n-Loop. 
T o more accurately measure a n d evaluate our recovery response time, a n d especially the performance of our L o o p mechanism, we would also like to know how  75  m a n y packets can be prevented from being lost d u r i n g a network failure, a n d how long it takes for our system to r e s p o n d to a failure. T h e U D P protocol is used i n this experiment.  A c o m m u n i c a t i o n p r o g r a m using U D P is written for measuring  packet loss between the two peers, P a n d Q. W e measured packet loss for our L o o p a n d n - L o o p d u r i n g a one-time network failure for different network latencies a n d source speeds. T a b l e 5.2 shows the results of the experiment. W e use the ratio of packet loss i n L o o p to that i n n - L o o p a n d call it the loss ratio. T h e lower the loss ratio, the better the saving effect. W e investigate the results of the source speed of 30Mbs, a n d find that the higher the network latency, the lower the loss ratio:  0.48, 0.82, 0.34, 0.23 a n d 0.13 for a network latency of  100ms, 200ms, 300ms, 500ms a n d 700ms, respectively. T h i s is also true for a l l other source speeds . 1  W e c a n briefly calculate response time R using the form: R = loss*8/[data speed]. For example, for a latency of 100ms, the response time is: R = 5 3 0 K B * 8 / 3 0 M b s = 140ms for L o o p , R = l . l M B * 8 / 3 0 M b s = 300ms for n - L o o p . T h e response delay arises from the delay for network failure detection a n d failure notification plus processing at routers. A s i m p l i e d i n the results, no more packets are lost after the traffic flow is redirected to the protection loop. F o r the latency of 100ms, the response time (140ms) is roughly the s u m of the delay of detection w h i c h is 40ms or so, a n d the latency of failure report which is 100ms. T h e results show that our recovery system can respond to a network failure w i t h i n 200ms.  Considering this has been u n d e r m i n e d by some limitations of the  T h e source speed control in our testing U D P program is subject to a Linux real time signal limit of 20ms, compared with an ideal situation that should send out packets continuously with a constant speed. This limitation influences some of our experiment data when the speed is slow: for example, loss ratio is 0.8 for network latency 700ms with source speed of IMbs. 1  76  D a t a Speed IMbs d=100ms d=200ms d=300ms d=500ms d=700ms  5Mbs  lOMbs  20Mbs  30Mbs 530KB  Loop  21KB  100KB  200KB  350KB  n-Loop  57KB  290KB  350KB  1.1MB  1.1MB  Loop  16KB  100KB  150KB  830KB  1.4MB  n-Loop  54KB  250KB  580KB  1.4MB  1.7MB  Loop  14KB  100KB  250KB  350KB  620KB  n-Loop  55KB  270KB  720KB  1.3MB  1.8MB  Loop  20KB  450KB  160KB  340KB  500KB  n-Loop  100KB  1.8MB  740KB  1.5MB  2.2MB  Loop  20KB  72KB  430KB  380KB  510KB  n-Loop  25KB  580KB  1.2MB  2.5MB  3.8MB  T a b l e 5.2: T o t a l number of bytes lost i n a single failure  L i n u x O S , the response time can be m u c h shorter i n real routers,  say less t h a n  100ms. T h e results also show, that L o o p is more suitable for higher latency networks, consistent w i t h the result i n the previous section a n d our analysis i n C h a p t e r 3. However, buffering is needed at the receiver side for h a n d l i n g the packet reordering problem. E s t i m a t e d buffer size should be 2 * L a t e n c y * B a n d w i d t h is the b a n d w i d t h reserved for the traffic).  
77  (Here, B a n d w i d t h  Chapter 6  Conclusion and Future Work 6.1  Contribution and Conclusion  T i m e l y response to network failure is a n important requirement for m a n y real-time applications i n order to avoid packet loss a n d quality degradation. In this thesis, we designed a fast recovery mechanism i n M P L S networks which can quickly respond to failures. layer.  F i r s t , network failure is detected using R S V P - H E L L O at the  MPLS  A failure c a n be detected w i t h i n milliseconds, a n d the time needed c a n be  shorter if detection occurs at the hardware or datalink layer.  Second, the failure  is reported to a n ingress router through a P A T H E R R message.. A t the same time the local node performs local recovery by redirecting the flow to a backup tunnel or to a reverse loopback tunnel if a n o r m a l b a c k u p tunnel does not exist. F i n a l l y , the ingress router makes the final recovery. W e implemented our algorithm, as well as a n M P L S control plane, over the L i n u x platform. R S V P - E R O extension for M P L S is used i n this thesis as the label signaling protocol. We  investigated  the recovery performance of our mechanism w i t h L i n u x  78  routers. T C P and UDP are the two end-to-end protocols used for our evaluation. The results show that our recovery mechanism can react to a failure very quickly (within 200ms). It also shows that for a high network latency, and with the support of end-to-end protocols like TCP, loopback protection can help expedite failure response and reduce packet loss.  6.2  Limitation and Future Work  Our recovery mechanism requires all protection tunnels, including the loopback one, to be pre-created before a failure happens. This may however, not be feasible in some cases where network resources are limited. We may choose to provide this protection only for those traffic flows that really need it. Our failure report, or notification, is done on the tunnel basis. This may result in scalability problems; there may exist many tunnels for which the failure needs to be reported to ingress routers. The scalability problems can be solved by using multi-commodity traffic, for example, merging several similar flows (with the same pair of ingress-egress routers and QoS requirement) into single commodity traffic. For simplicity, we only focused on link failure in this thesis. However, the recovery algorithm can be extended to cover node failure as well. Node failure can be acknowledged at an ingress edge router by combining multiple link failure reports from different local nodes. The intricacy may exist in that we will need to wait a little longer than we currently do for collecting these reports (to resist node failures) before making a recovery decision, and this will further have to somewhat increase our recovery response delay, which however, is a side-effect.  79  Bibliography [1] A n a t B r e m i e r - B r a r r , Y e h u d a Afek, H a i m K a p l a n , E d i t h C o h e n a n d M i c h a e l M e r r i t t , "Fast Recovery of M P L S Paths",  ACM SIGMETRICS,  J a n . 2001.  [2] A y a n Banerjee, J o h n Drake, J o n a t h a n L a n g , B r a d T u r n e r , D a n i e l A w d u c h e , L o u Berger, K i r e e t i K o m p e l l a a n d Y a h o v Rekhter, " A n Overview of Signalling Enhancements a n d Recovery Techniques",  IEEE Communications Magazine,  J u l y 2001.  [3] T a e H . O h , T h o m a s M . C h e n a n d Jeffery L . 
K e n n i n g t o n , "Fault Restoration a n d Spare C a p a c i t y A l l o c a t i o n w i t h Q o S Constraints for M P L S  IEEE Globecom 2000, N o v . [4] O S D N , " M P L S for L i n u x " ,  Networks",  2000, p p . 1731-1735.  http://sourceforge.net/projects/mpls-linux.  [5] R a d i m Bartos a n d M y t h i l i k a n t h R a m a n , " A Heuristic A p p r o a c h to Service Restoration i n M P L S Networks",  nications,  IEEE International Conference on Commu-  J u n e 2001, p p . 117-121.  [6] M u r a l i K o d i a l a m a n d T . V . L a k s h m a n , "Dynamic R o u t i n g of L o c a l l y Restorable B a n d w i d t h G u a r a n t e e d Tunnels using Aggregated L i n k Usage Information",  Proceedings of IEEE Infocom 2001, A p r . 2001,  80  p p . 376-385.  [7] B r u c e Davie and Yakov Rekhter,  " M P L S Technology a n d Applications",  Mor-  gan Kaufmann Publishers, 2000. [8] P i n g P a n E d , D e r - H w a G a n a n d et al, "Fast Reroute Extensions to  RSVP-  T E for L S P Tunnels", Internet Draft draft-ietf-mpls-rsvp-lsp-fastreroute-OO.txt, May  2002.  [9] V i s h a l S h a r m a , B e n - M a c k C r a n e a n d et al, "Framework for M P L S - b a s e d R e covery",  [10]  Internet Draft draft-ietf-mpls-recovery-frmwrk-03.txt,  M a y 2002.  D a n i e l O . A w d u c h e , L o u Berger a n d et al, " R S V P - T E : Extensions to R S V P for  L S P Tunnels", Internet Draft draft-ietf-mpls-rsvp-lsp-tunnel-08.txt, [11]  D . Awduche,  J . Malcolm, J . Agogbua,  M . O'Dell and J . M c M a n u s ,  ments for Traffic Engineering O v e r M P L S " ,  [12]  E . Rosen,  A.  Viswanathan  and  R.  Feb. 2001.  IETF RFC2702,  Callon,  "MPLS  Sept.  "Require1999.  Architecture",  IETF  RFC3031, Jan. 2001. IETF RFC2210,  [13]  J . Wroclawski, " R S V P w i t h Intserv",  Sept.  1997.  [14]  R . B r a d e n E d . , L . Z h a n g a n d et al, "Resource Reservation Protocol",  IETF  RFC2205, Sept, 1997. [15] T o n y M a n c i l l , "Linux Routers", Prentice Hall, 2001.  IETF RFC!93,  [16]  D A R P A , "Transmission C o n t r o l Protocol",  [17]  W.Stevens, " T C P Slow Start, Congestion Avoidance,Fast Recovery A l g o r i t h m " ,  [18]  V.Jacobson, mance",  IETF RFC2001,  R.Braden and  IETF RFC1323,  Jan.  D.Borman,  May  1992.  81  Sept.  1981.  R e t r a n s m i t a n d Fast  1997.  " T C P Extensions for H i g h  Perfor-  

