AN APPLICATION OF MULTIVARIATE ANALYSIS TO TIME OF DAY ROUTING IN TELECOMMUNICATION NETWORKS

by

ISABELLE SMITH

B.Eng. in Mechanical Engineering, McGill University, 1998

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, Faculty of Commerce and Business Administration

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

December 2000

© Isabelle Smith, 2000

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

The work presented in this thesis is a component of a larger project that is currently under way at the Centre for Operations Excellence. This project was initiated in May of 1999. The main goal of this project is to help Telus reduce their yearly investments in the non-toll network by optimizing the current use of their telephone network. We chose to accomplish this by determining a new set of routing rules that would make the most efficient use of their current network. Three main approaches were selected to obtain this new set of routing rules. The first was the development of a simulation that could serve as a test-bed to compare different sets of routing rules and to determine the optimal set. The second was the use of a linear program that would generate routing rules that minimize network utilization. Finally, we approached the challenge of optimizing network utilization by using multivariate analysis to help develop routing rules for time of day routing. This last approach is described in this thesis.

The objective of this thesis was to develop a methodology, using multivariate analysis, that would help us find whether there were possible alternate routes that could be added to the current ones, which would take advantage of excess capacity in certain regions of the network. This methodology was developed in SAS and included the use of Principal Components Analysis, cluster analysis and the development of an algorithm that searches for adjacent arcs with sufficient available capacity to serve as alternate routes during certain periods of the day. It was found that there are alternate routes that can be used during certain times of the day and that these routes can be found by using the methodology described in this thesis. This result is of great value to telephone companies, as it means that there is a way for them to use existing capacity more efficiently and therefore avoid or delay investments in adding capacity to the telephone network.
Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgment
1 Introduction
1.1 Value of time of day routing
1.2 Overview of the Telus project
1.3 Objective of this thesis
1.4 Scope of this thesis
2 Background
2.1 Telecom Background
2.1.1 Terminology
2.1.2 Assumptions
2.1.3 Routing rules
2.1.4 Demand Types
2.2 Technical Background
2.2.1 Multivariate Methods
2.2.2 Principal Components Analysis
2.2.3 Cluster analysis
3 Methodology
3.1 Overview of the approach
3.2 Overview of the methodology
3.3 Data preparation
3.4 Principal Components Analysis
3.5 Cluster analysis
3.6 Search for pairs of adjacent arcs
3.7 Feasibility test
4 Results and Analysis
4.1 Results of the Principal Components Analysis
4.1.1 Description of the results
4.1.2 How many principal components should be retained?
4.1.3 Screening the data using PCA
4.1.4 Interpretation of the principal components
4.2 Results of the cluster analysis
4.3 Results of the search for adjacent arcs
4.4 How will these routes be used?
5 Recommendations
References
Appendix A Description of some SAS procedures
A-1 Principal Components Analysis using the SAS PRINCOMP procedure
A-2 Cluster analysis using the SAS FASTCLUS procedure
A-3 Cluster analysis using the SAS CLUSTER procedure
A-4 Tree diagram using the SAS TREE procedure
Appendix B Results of the PCA performed on non-normalized data
Appendix C Results of the PCA using the correlation matrix
Appendix D Scatterplots of the first three principal components
Appendix E Time series of the outliers
Appendix F Clusters obtained from using the FASTCLUS procedure
Appendix G Clusters obtained using the Average Linkage Method
Appendix H Clusters obtained from using Ward's Method
Appendix I Clusters obtained from using the Centroid Method
Appendix J Results of performing cluster analysis on raw data
Appendix K Results
Appendix L SAS code

List of Tables

Table 1 Terminology
Table 2 Example of a routing table
Table 3 Common methods used for hierarchical clustering
Table 4 Sample of data used in this analysis
Table 5 Example of the search for pairs of adjacent arcs
Table 6 Adjacent arcs
Table 7 Example of the resulting new possible alternate routes

List of Figures

Figure 1 Terminology
Figure 2 Fixed Hierarchical Routing rules
Figure 3 Variation of the demand during the week
Figure 4 Variation of arc utilization with time of day
Figure 5 SCREE plot - Example
Figure 6 Example of a tree diagram
Figure 7 Example
Figure 8 Routing table with new alternate route
Figure 9 Overview of the methodology
Figure 10 Overview of algorithm used to find new routes
Figure 11 Calculation of available capacity
Figure 12 Simple statistics obtained from the PCA
Figure 13 Eigenvalues of the Covariance Matrix
Figure 14 Five first Principal Components
Figure 15 Principal Component Scores for the first observation
Figure 16 SCREE plot of eigenvalues
Figure 17 Graphical representation of first three principal components
Figure 18 Example 1: Interpretation of the principal components
Figure 19 Example 2: Interpretation of the principal components
Figure 20 Results of the cluster analysis using Average Linkage Method
Figure 21 Tree diagram obtained from using the Average Linkage Method
Figure 22 Eigenvalues of the correlation matrix
Figure 23 SCREE plot of the eigenvalues
Figure 24 First three normalized eigenvectors, $a_1$, $a_2$, $a_3$
Figure 25 Scatterplots of the first three components

Acknowledgment

I would like to take the opportunity to acknowledge the help and efforts of the people who made it possible for me to complete this thesis. First of all, I would like to thank my thesis supervisor, Professor Martin L. Puterman, for his advice and help in writing my thesis. His assistance and recommendations were greatly helpful. I would also like to thank Professor David Glenn for his encouragement, support and input as a member of my thesis examining committee. I extend sincere thanks to Jason Goto for his guidance and technical assistance. His help and suggestions were extremely valuable throughout this project. I would like to thank the Centre for Operations Excellence (COE) for providing myself and other master's students the opportunity to apply our skills and knowledge to industry challenges. In addition, the financial assistance provided by the COE was greatly appreciated. I would especially like to thank Stephen Jones for his assistance and support. Also, many thanks to all of the students and staff who were involved in this project. I would finally like to express my sincere gratitude to my family and friends for their encouragement and support.

1 Introduction

The demand on a telephone network, i.e. the capacity that is used, fluctuates throughout the day. This demand varies depending on the area of the network as well as on the period of the day. For example, regions of the network that are located in business areas are usually very busy during the day and relatively calm during the evening, while the opposite is true for suburban areas. This means that at any given time, there is excess capacity in certain regions of the network. There is therefore a great potential for savings if this idle capacity could be used. In an effort to use this excess capacity, time of day routing rules can be put in place in order to distribute the demand more evenly across the network. This could be done, for example, by redirecting some of the demand away from the business areas and through the residential areas during the day.

1.1 Value of time of day routing

Taking advantage of time of day routing is of great value to a telephone company, since it means that investments in adding capacity to the network can be avoided or delayed. Every year, telephone companies have to increase the capacity of their networks in order to respond to the constantly growing demand on the network. This growth in demand has accelerated in the past years due to increased Internet traffic, special offers such as long distance flat rate packages, and the arrival of new services such as call forwarding. This increase in traffic forces telephone companies to add capacity to their networks in order to sustain their level of service and remain competitive. Alternatively, telephone companies can find ways to use their currently available capacity more efficiently: this is the challenge that was presented to our group at the Centre for Operations Excellence (COE).

1.2 Overview of the Telus project

The work presented in this thesis is a component of a larger project that is currently under way at the COE. This project was initiated in May of 1999. The main goal of this project is to help Telus reduce their yearly investments in the non-toll network by optimizing the current use of their telephone network.
We chose to accomplish this by determining a new set of routing rules that would make the most efficient use of their current network. Three main approaches were selected to obtain this new set of routing rules.

The first one was the development of a simulation that could serve as a test-bed to compare different sets of routing rules and to determine the optimal set. This simulation, which was programmed in C++, reproduces the network activity and can therefore be used to test different sets of routing rules. The inputs to the simulation are the call length and frequency distributions, the arc capacities and the routing rules. The main call statistics that are collected while the simulation is running are the number of blocked calls and the network utilization. These call statistics can then be used as criteria to determine the optimal set of routing rules. The simulation also permits us to identify problem areas in the network, where it can be observed that the link capacity is not sufficient to handle all of the calls that are routed through that area.

The second approach was the use of a linear program that would generate routing rules that minimize network utilization. A multi-commodity flow model that analyses a static interval of network activity was developed. This linear program generates a set of routing rules that minimizes the network utilization (i.e. it makes the most efficient use of the network currently in place given a specific data set). This approach has led to further research into using non-linear programming.

Finally, we have approached the challenge of optimizing the network utilization by using multivariate analysis to help develop routing rules for time of day routing. This is the approach that is described in this thesis.

1.3 Objective of this thesis

The objective of this thesis was to develop routing rules for time of day routing. More precisely, the objective was to develop a methodology, using multivariate analysis, that would help us find whether there were possible alternate routes that could be added to the current ones, which would take advantage of excess capacity in certain regions of the network.

1.4 Scope of this thesis

The research for this thesis was performed on a 32-switch subset of the Telus circuit-switched non-toll network. Since Telus is moving toward using packet switching, the results of this thesis will only provide temporary benefits until packet switching is widely used. It is not possible to implement dynamic routing rules (i.e. routing rules which change dynamically based on the network utilization) in the subset of the network which was studied, due to the fact that it contains more than one type of switch (i.e. it is not a homogeneous network). Therefore, static routing rules were developed. We have limited the number of routing rule tables to two. This means that during one part of the day, one of the routing tables will be used, and during the other part, the other table will be used.

2 Background

2.1 Telecom Background

2.1.1 Terminology

The main terms that will be used in this thesis are explained below. Although the telecom industry is quite complex in nature, only a few terms and concepts need to be explained, as they are sufficient to understand this thesis. Note: the numbers in brackets refer to Figure 1.
Switch (or node) (1): A device that is used to connect circuits and make it possible for calls to be routed through the network.

Trunk (or circuit) (2): A communication link connecting two switches. It can carry only one call at a time.

Trunk group (or arc) (3): Consists of a group of trunks between two switches, where each trunk can carry only one call at a time.

Arc capacity: The number of trunks that compose an arc. For example, the arc joining switch 01 to switch 02 has a capacity of 2.

AZ pair: Refers to a pair of switches that are connected directly by an arc. In the figure below, there are three AZ pairs: 01-02, 01-03 and 02-03.

Arc utilization: Refers to the call volume that is carried on an arc (or trunk) during a certain period of time.

Centi-call seconds (or CCS): Measures arc or trunk utilization (or call volume) in volumes of 100 seconds of call time. For example, if 1 trunk is used for 200 seconds during an hour, then the trunk utilization during that hour was 2 CCS. Similarly, if 2 trunks on an arc were both used for 200 seconds during an hour, then the arc utilization on that arc was 4 CCS.

Routing rules: Rules that determine the different paths through which a call can be routed from its origin to its destination (see the routing rules section).

Blocked call: A call is blocked when all of the circuits on all of the possible routes are busy (i.e. are already in use).

Table 1 Terminology

Figure 1 Terminology (diagram of three switches, 01, 02 and 03, connected by arcs)
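To make these objects concrete, the sketch below shows one way AZ pairs could be represented as a small SAS data set. The variable names and values are hypothetical, invented for illustration; they are not taken from the thesis data.

/* Hypothetical illustration only: each record is one AZ pair (arc), with
   its capacity in trunks and one hour of utilization in CCS. */
data arcs;
   length a z $ 2;
   input a $ z $ capacity ccs;
   az_id = trim(a) || '-' || z;   /* e.g. "01-02" */
   datalines;
01 02 2 14.4
01 03 5 36.0
02 03 3 21.6
;
run;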
2.1.2 Assumptions

There are three main assumptions that are made in the Telus project. These assumptions are described below.

- Each phone number is assigned to the end-office switch which serves that region, and the telephone is connected to that switch through a smaller network composed of a series of devices and smaller switches. For the purpose of this thesis, we concentrate on the main telephone network composed of the end-office switches. Also, we assume that a call originates at the end-office switch where the call enters the main network. Similarly, we assume that it terminates at the end-office switch where the call leaves the network. To simplify the text, from now on the end-office switches will be referred to simply as switches, as the work in this thesis is focused solely on the non-toll network composed of end-office switches.

- In addition to the main network, there is an overlaying network which is responsible for directing the long distance calls. Similarly, we assume that a long distance call originates at the switch where it first enters the non-toll network and that it terminates where it exits the non-toll network. Therefore, a long distance call is treated as a local call based on the portion of the non-toll network that it uses.

- Switches also have a limited capacity: they cannot handle more than a certain number of calls at any given time. However, since capacity constraints usually occur due to trunk capacity limits, we assume that the switch capacities are unlimited.

2.1.3 Routing rules

A brief description of routing rules is given in this section. A more thorough description can be found in the master's thesis written by Darrel Braun (Efficient routing of telephone calls in a circuit-switched network, 2000), a former member of the COE Telus project.

Routing rules are rules that dictate which paths a call can use when it is routed from a specific origin to a specific destination across the network. All the routes that are possible paths between an origin and a destination switch are summarized in a routing table. The network that is being studied for this project is composed of 32 switches that are almost all interconnected by arcs. These switches are divided into a West and an East sector, depending on their location. Two of these switches, one in the East sector and one in the West sector, can serve as tandems. As shown in Figure 2, knowing the sector in which a switch is located is important because this determines how the calls will be routed.

The routing rules used in this 32-node network are based on fixed hierarchical routing. When a call is placed, it is routed to the destination switch through fixed hierarchical routing rules. These rules vary depending on whether the origin and the destination switches are located in the same sector (see Figure 2).

Routing between switches within a sector

If the two switches are in the same sector, then a call would be routed using the following rules. The call would first be routed from the origin switch to the destination switch using the Primary High Usage (PHU) route, i.e. the trunk group connecting the origin and the destination switches directly. If all of the circuits in this trunk group were busy, i.e. they were all carrying calls, then the call would be routed through the Intermediate High Usage (IHU). The IHU is the name given to the route joining the origin switch to the tandem switch in its sector and then to the destination switch. If all of the trunks on this last route were busy, then the call would be blocked (the person making the call would hear a fast-busy signal and would have to try calling later).

Routing between switches in different sectors

These routing rules are slightly different in the case where a call is routed between an origin switch and a destination switch located in different sectors. In this case, the call would once again first be routed through the PHU. However, if this first route were busy, the call would be routed through the tandem in the sector opposite to that of the origin switch. This second route is called the IHU. The last route through which the call could be routed is called the Alternate Final (AF). This rule routes the call from the origin, through the two tandems, and finally to the destination switch. All of these rules are summarized in Figure 2.

Figure 2 Fixed Hierarchical Routing rules (West and East sectors, each with a tandem switch)

The routing rules between an origin and a destination switch are then summarized in a routing table. In this table, the possible routes are listed in the order in which they should be tried. For example, looking at Figure 2, the possible routing rules to route calls from switch 01 to switch 04 would be summarized in the routing table as follows:

Origin | Tandem 1 | Tandem 2 | Destination
01 | - | - | 04 (first route)
01 | East Tandem | - | 04 (second route)
01 | West Tandem | East Tandem | 04 (third route)

Table 2 Example of a routing table

The routing table shown above, in Table 2, indicates that when calls are routed from the origin switch to the destination switch, the call would first be routed on the direct arc connecting the origin to the destination (the arc 01-04). If all of the lines on this first route were busy, then the call would be routed using the second route. This second route would direct the call from the origin switch to the east tandem and then from the east tandem to the destination switch (i.e. the route would be 01 - "east tandem" - 04). Finally, if this second route were busy as well, then the call would be routed using the third route.
This third route would direct the call from the origin to the west tandem, then from the west tandem to the east tandem, and finally from the east tandem to the destination (i.e. the route would be 01 - "west tandem" - "east tandem" - 04).

An alternative to these fixed hierarchical routing rules would be to use dynamic routing rules. This would mean that the possible routes would change depending on the time of the day. These routes take into account the network activity throughout the day and are used to direct the call traffic away from busy areas. These routing rules make more efficient use of the network, i.e. the amount of idle capacity is minimized. However, dynamic routing rules cannot be implemented on the Telus network, as it is not a homogeneous network (i.e. it has more than one type of switch). Dynamic routing requires that the network be composed of only one type of switch because it uses a switch function that is not compatible from one switch type to another.

Another alternative, and this is what we are investigating in this thesis, is to have more than one routing table. For example, one set of routing rules (or routing table) could be used during the morning and another one could be used for the rest of the day. These routes would still follow fixed hierarchical rules, but they would be adapted based on the usual network activity during that period of the day. These new routes may require that some switches, other than the two main tandem switches, be used as tandems. This is possible, as every switch has the capacity to tandem calls. The objective of this thesis was to develop these time of day routing rules.
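As an illustration of how a routing table such as Table 2, extended with a time-of-day alternate, might be stored for analysis, here is a small hypothetical SAS data set. The layout, names and the "period" flag are this sketch's assumptions, not the representation used in the project.

/* Hypothetical sketch: one record per route, in the order routes are tried.
   A lone period is read as a blank (no tandem); names are invented. */
data routing_table;
   length origin tandem1 tandem2 dest $ 11 period $ 8;
   input origin $ tandem1 $ tandem2 $ dest $ priority period $;
   datalines;
01 .           .           04 1 always
01 EastTandem  .           04 2 always
01 WestTandem  EastTandem  04 3 always
01 07          .           04 4 morning
;
run;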
2.1.4 Demand Types

The demand on a network, i.e. the number of calls carried at a certain point in time, fluctuates continuously. The demand across the network varies with the time of day, the day of the week, and with holidays. For example, Mother's Day has the highest demand of the year. The demand also varies based on the origin and the destination switch. An example of how this demand varies is shown in Figure 3. This figure represents the variation of the demand, at an aggregate level for a subset of the network, over 16 days. In this example, we can see that the weekdays are generally busier than the weekend (except for the first Tuesday, which is Remembrance Day).

Figure 3 Variation of the demand during the week (aggregate CCS by day of week over 16 days)

Figure 4 Variation of arc utilization with time of day (hourly CCS from 7:00 to 22:00 for AZ pairs 04-13 and 23-29)

In the top graph in Figure 4, the arc utilization is very high in the morning, relatively high in the afternoon and low in the evening. These two switches are located in business areas. The opposite pattern is shown in the bottom graph of Figure 4, where the arc utilization is low in the morning and in the afternoon, but high in the evening. The switches 23 and 29 are located in residential areas. On each of these arcs, we can see that there is excess capacity during some periods of the day. As mentioned earlier, the objective of this thesis was to find routing rules that would take advantage of these differences in arc utilization.

2.2 Technical Background

2.2.1 Multivariate Methods

Some multivariate methods are used to identify similarities among observations in a data set with many variables. For example, in the case of this thesis, multivariate methods are used to find similarities among the demand patterns of the arcs in the network. Multivariate methods are especially useful in cases where the data sets consist of very large numbers of observations and contain many variables. This is the case for the data analyzed in this thesis: there were 460 observations representing pairs of switches on which the arc utilization had been recorded for 24 consecutive hours. Two multivariate methods are used in this thesis: principal components analysis and cluster analysis. These two methods are defined in the following sections.

2.2.2 Principal Components Analysis

Objectives of the PCA

The main objective of the PCA is to identify new variables that will help determine the true dimensionality of the data set while keeping most of the information. These new variables can then be used for further analysis, such as the cluster analysis used in this thesis. When a PCA is performed, the number of principal components created is equal to the number of original variables. However, these new variables are now uncorrelated. In addition, the first principal component is the most important, since it explains most of the variability. Each subsequent principal component is less important, as it explains less and less of the variability. The number of variables is reduced by eliminating the principal components that have little variability.

Definitions behind the PCA

Briefly, performing a PCA "involves a mathematical procedure that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components" (Johnson, p. 3). These principal components are obtained as follows. In the following definitions, $x$ is a vector chosen randomly from a population, where $x = [x_1, x_2, \ldots, x_p]^T$, and the vector population mean is designated by $\mu$.

The first principal component is defined by $y_1 = a_1^T(x - \mu)$. This equation can be rewritten in the following form:

$y_1 = a_{11}(x_1 - \mu_1) + a_{12}(x_2 - \mu_2) + \cdots + a_{1p}(x_p - \mu_p)$,

where the vector $a_1$ is chosen such that the variance of $a_1^T(x - \mu)$ is the greatest, and $a_1$ is subject to the constraint $a_1^T a_1 = a_{11}^2 + a_{12}^2 + \cdots + a_{1p}^2 = 1$.

The variance of the first principal component is equal to the largest eigenvalue, $\lambda_1$, of the variance-covariance matrix, $\Sigma$. This relation can be written as $\mathrm{Var}(y_1) = \mathrm{Var}(a_1^T(x - \mu)) = \lambda_1$.

The second principal component is defined similarly by $y_2 = a_2^T(x - \mu)$, where $a_2$ is chosen such that the variance of $a_2^T(x - \mu)$ is the greatest, except that now the vector $a_2$ is subject to the two following constraints: $a_2^T a_2 = a_{21}^2 + a_{22}^2 + \cdots + a_{2p}^2 = 1$ and $a_2^T a_1 = 0$. The second constraint means that the second principal component is uncorrelated with the first. The variance of the second principal component is equal to the second largest eigenvalue, $\lambda_2$, of the variance-covariance matrix, $\Sigma$: $\mathrm{Var}(y_2) = \mathrm{Var}(a_2^T(x - \mu)) = \lambda_2$. The subsequent principal components are defined in a similar manner.
This definition can be generalized as follows. The $j$th principal component ($j = 3, 4, \ldots, p$) is defined by $y_j = a_j^T(x - \mu)$, where $a_j$ is chosen such that the variance of $a_j^T(x - \mu)$ is the greatest and the vector $a_j$ is subject to the two following constraints: $a_j^T a_j = a_{j1}^2 + a_{j2}^2 + \cdots + a_{jp}^2 = 1$ and $a_j^T a_k = 0$, where $k = 1, 2, \ldots, j-1$. The variance of the $j$th principal component is equal to the $j$th largest eigenvalue, $\lambda_j$, of the variance-covariance matrix, $\Sigma$: $\mathrm{Var}(y_j) = \mathrm{Var}(a_j^T(x - \mu)) = \lambda_j$.

Therefore, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ represent the eigenvalues of $\Sigma$, and $a_1, \ldots, a_p$ represent the normalized eigenvectors of $\Sigma$. As the eigenvectors $a_j$ are normalized to have a length of 1, the sum of the squares of all of the coefficients or weights composing each vector is equal to 1.

In order to determine how much of the variability is captured by each principal component variable, we need to evaluate the ratio $\lambda_j/\mathrm{tr}(\Sigma)$, where $j = 1, 2, \ldots, p$, $\lambda_j$ is the variance accounted for by the $j$th principal component, and $\mathrm{tr}(\Sigma) = \lambda_1 + \lambda_2 + \cdots + \lambda_p$. Therefore, the ratio $\lambda_j/\mathrm{tr}(\Sigma)$ represents the ratio of the variance accounted for by the $j$th component with respect to the total variability.

Principal Components Scores

The principal component scores are defined as the values of the principal component variables for each observation and are calculated as follows:

$y_{rj} = a_j^T(x_r - \mu)$, for $j = 1, 2, \ldots, p$ and $r = 1, 2, \ldots, N$,

where $y_{rj}$ is the $j$th principal component score for the $r$th observation.

The principal components can be calculated either from the correlation matrix or from the variance-covariance matrix of the data. However, these can give extremely different results. The variance-covariance matrix should not be used unless the units for each variable are comparable and the variables have been standardized in some way. If the correlation matrix is used, the equations described above remain the same, except that the mean of the variables, $\mu$, is not included in the calculations.
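The definitions above can be summarized compactly. The following is the standard eigendecomposition view of PCA, stated here as a supplement consistent with the text rather than as something taken from the thesis:

\[
  \Sigma a_j = \lambda_j a_j, \qquad a_j^T a_j = 1, \qquad
  \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0,
\]
\[
  \operatorname{Var}(y_j) = a_j^T \Sigma a_j = \lambda_j, \qquad
  \frac{\lambda_j}{\operatorname{tr}(\Sigma)}
  = \frac{\lambda_j}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}
  \;\; \text{(share of the total variance)}.
\]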
How many principal components should be retained?

As we wanted to reduce the number of variables in our data set, we needed to determine how many principal components should be kept. The first few principal components usually capture most of the variability, while the rest of them do not bring any more valuable information about the data. Therefore, we must keep only meaningful principal components, as we are trying to reduce the dimensionality of the data. There are three main criteria that can be used to decide how many principal components to keep: the eigenvalue-one criterion, the SCREE test and the proportion of variance that is accounted for by each principal component. The eigenvalue-one criterion should only be used if the computations to obtain the principal components were carried out on the correlation matrix. The two other criteria can be used in either case.

A. Eigenvalue-one criterion

Using this criterion, only the principal components with eigenvalues greater than 1.00 are kept. The reason for doing this is that components with eigenvalues less than 1.00 do not account for much of the variance and are therefore considered trivial and dropped. This criterion should never be used when performing the PCA on the raw data or the variance-covariance matrix.

B. SCREE test

Creating a SCREE plot can help in making this decision. This plot is composed of the eigenvalues (on the y-axis) versus the eigenvalue number (on the x-axis). This is illustrated in Figure 5. From this plot, we can see that the first three points have large eigenvalues while the rest of them are small and tend to level off. These last points can be dropped, as their eigenvalues are small and therefore they do not explain much of the variability. When using a SCREE plot, it is assumed that the true dimensionality of the data is equal to the number of points that are not dropped. In this example, the SCREE plot would therefore suggest that the true dimensionality of the data is equal to three.

Figure 5 SCREE plot - Example

C. Proportion of the variance accounted for

As explained earlier, the ratio $\lambda_j/\mathrm{tr}(\Sigma)$ (i.e. the variance accounted for by a specific principal component divided by the total of the eigenvalues of the variance-covariance matrix) can be used to determine how much of the variance is captured by each of the principal components. This ratio can be used as a selection criterion. For example, a researcher could decide to retain only the principal components which account for more than 5% of the total variance.

Using PCA for screening data

It is possible to use Principal Components Analysis (PCA) to screen the data. PCA can help identify if there are any abnormalities, special patterns or outliers within the data (i.e. observations whose measured variables may seem inconsistent compared to those of other observations). These peculiarities about the data may not be obvious when looking at the measured variables. However, once a PCA is performed, they may become quite evident. In brief, while performing a PCA, new variables called principal component scores are created. In turn, the principal component scores can be used to create graphs that will enable the researcher to determine if there are any abnormalities with the data, as these will affect further analyses. In the case where outliers are found, they should be removed prior to pursuing the analysis.

Using the results of PCA for clustering

In addition to being very useful for screening data, PCA is very helpful when performing clustering (i.e. dividing the observations from a data set into subgroups that show similar characteristics). Instead of using all of the variables provided in the data set, the dimensionality of the data set can be reduced by using PCA, and the new principal component scores obtained from the PCA can then be used as inputs to the cluster analysis.

2.2.3 Cluster analysis

Objectives of Cluster Analysis

Cluster analysis is a method used to make sense out of large and complicated data sets which contain a large number of observations with many measured variables. Cluster analysis is used to separate observations into subgroups based on the similarities of their measured variables. If meaningful clusters are formed, then it becomes possible to analyze the behavior displayed by each group. This method is therefore very useful, since it reduces the analysis effort by decreasing the number of observations to a few groups. Clustering methods can be divided into two main categories, depending on whether they are hierarchical or non-hierarchical in nature.

Non-hierarchical Clustering Methods

Non-hierarchical clustering methods are initiated by first selecting a set of points that will serve as cluster seed points and then assigning each observation point to its closest seed point. The distances between each observation point and the seed points are calculated using one of the dissimilarity or distance measures described further in this thesis.
Once this first step is finished, the size of each cluster and the distances between each pair of clusters are evaluated. Based on specified criteria, when a cluster is too large it is split, and when some clusters are too close to each other, they are joined. The FASTCLUS procedure in SAS (see the Appendices for more details) is an example of a non-hierarchical clustering method. The way this procedure works can be summarized as follows:

- A set of cluster seed points is selected as initial approximations of the means of the clusters.
- Each point in the data set is temporarily assigned to its closest seed.
- The seed points are replaced by the means of the temporary clusters.
- The last two steps are repeated until the clusters do not change anymore.

Although these are good sorting methods, they are very dependent on the initial set of seed points and on the specified number of clusters. Therefore, many different solutions could be obtained from the same data set.

Hierarchical Clustering Methods

When a hierarchical clustering method is used, each observation initially belongs to a cluster by itself (i.e. if there are 40 observations, there will be 40 clusters). Then, the two closest clusters are joined to form a new cluster (the two original clusters are therefore removed). This step is repeated until all of the clusters are merged into one single cluster. The hierarchical methods differ in the way that the distance between the clusters is calculated (see the next section, Distance or Dissimilarity Measures). The SAS CLUSTER procedure makes it possible to choose among different hierarchical clustering methods. The more common methods are summarized in Table 3 (SAS/STAT, volume 1, pp. 530-636, and Johnson, pp. 322-327). The distance between two clusters is defined as follows for each method:

- Average Linkage: the average distance between all the possible pairs of observations, one in each cluster.
- Centroid Method: the Euclidean distance (or squared Euclidean distance) between the centroids, or means, of the two clusters.
- Complete Linkage: the distance between their furthest members.
- Single Linkage (nearest neighbour method): the distance between their closest members.
- Ward's Minimum Variance Method: (square of the distance between the cluster means) / (sum of the reciprocals of the number of points within each cluster).

Table 3 Common methods used for hierarchical clustering

Distance or Dissimilarity Measures

The main measures used to determine the similarities or distances between points are the Euclidean distance and the squared Euclidean distance. The Euclidean distance is simply the length of the straight line which joins two points. It is calculated as follows:

- The distance between a point $P_1(x_1, y_1)$ and a point $P_2(x_2, y_2)$ is equal to $\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
- The distance between two points in a multidimensional space is equal to $\sqrt{\sum_i (p_{1i} - p_{2i})^2}$, for $i = 1$ to the number of dimensions, where $p_{1i}$ represents the coordinate of point 1 in dimension $i$ and $p_{2i}$ represents the coordinate of point 2 in dimension $i$.

How many clusters should be retained?

Hierarchical clustering methods require that we decide how many clusters to keep. The total number of clusters created by hierarchical clustering methods is equal to the number of observations minus one. This number must be reduced to a number of clusters that is neither too small (which would oversimplify the data) nor too large (which would separate observations that are relatively similar into several groups, thus creating more groups than necessary). In order to determine the appropriate number to retain, the Cubic Clustering Criterion (CCC), the pseudo Hotelling's $T^2$ test (PST2) and the pseudo $F$ statistic (PSF) can be examined.
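In SAS, these statistics can be requested directly from the CLUSTER procedure. The following is a minimal sketch; the data set and variable names are hypothetical:

/* Minimal sketch: request CCC, pseudo F and pseudo T-squared while building
   an average-linkage cluster history. Names are hypothetical. */
proc cluster data=scores method=average ccc pseudo outtree=tree;
   var prin1 prin2 prin3;   /* e.g. principal component scores */
   id az_id;
run;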
In order to determine the appropriate number to retain, the Cubic Clustering Criterion (CCC), the Pseudo Hotelling's f Test (PST2) and the Pseudo F statistics (PSF) results can be examined. The values of PST2 are a good indication of whether or not the combination of the two clusters should have been done. When the value of PST2 for a particular combination is 21 small relatively to the other values, this indicates that the clusters should be combined. If it is large, they should probably not. The PST2 is qualified as small when it is small relative to the others. The C C C can help determine the number of clusters to retain when it is plotted against the number of clusters. On this plot, the peaks that have values of C C C that are greater than 3 indicate that they correspond to suitable number of clusters. Another criterion that can be used to judge the number of clusters in a data set is to look at the pseudo F statistic (PSF). Relatively large values indicate that the number of clusters is probably appropriate. Another way of determining the optimal number of clusters is to look at the tree diagram generated from the cluster analysis results. From this diagram, it is sometimes possible to determine visually into how many clusters the observations seem to fall. An example of a tree diagram is shown in Figure 6. This diagrams shows how the clusters were formed. The first level of the diagram shows as many clusters as there are observations. As we look higher into the graph, we can see how the different clusters were merged sequentially until all of the clusters are merged into one. This tree diagram shows that there are three main clusters. These are circled on the diagram. 22 1 .50 "I Figure 6 Example of a tree diagram 23 3 Methodology 3.1 Overview of the approach The following example, illustrated in Figure 7, gives an overview of the approach used in this thesis. Suppose that we have an arc between switches A and B which is highly utilized during the afternoon, but which is less busy during the morning. We are looking for two adjacent arcs that display a complementary demand pattern, i.e. arcs which are less utilized during the afternoon. Hour Figure 7 Example Once these arcs have been identified, we can then investigate to see if there is sufficient capacity on the pair of adjacent arcs to justify adding them as a new alternate route. If there is enough capacity, then this new alternate route can be added to the routing table during the period of the day when it is needed. For example, suppose that we find that arcs 01-07 and 04-07 could be used as an alternate route during the morning between the origin switch 01 and the destination switch 04. In other words, suppose that these arcs have enough available capacity to carry the traffic that would normally be blocked 24 between 01 and 04, then this new route can be added to the routing table as a last alternate route (see Figure 8). This new alternate route would only be used in the event that all the circuits are busy on the first three routes. Origin Tandem 1 Tandem 2 Destination 01 - - 04 01 East Tandem - 04 01 West Tandem East Tandem 04 01 07 - 04 First route Second route Third route New alternate route Figure 8 Routing table with new alternate route The methodology used to obtain these new alternate routes can be divided into four main steps. The first one is to prepare the data set so that it is meaningful and it can be analyzed. The second one is to perform a PCA. 
This step is performed to reduce the dimensionality of the problem. Once we have decided how many principal components to keep, these new variables are used as inputs to the next step: the cluster analysis. The cluster analysis is performed to obtain groups which display similar demand patterns (e.g. group together A Z pairs which have a busy morning compared to a low afternoon and low evening). This step is necessary to find pairs of arcs that could be used as new alternate routes. These arcs have to be adjacent and have a demand pattern that is complementary to that of the Primary High Usage arc. The last step is to determine if there is enough capacity on these new routes to carry the extra traffic. All these steps were implemented by code programmed in SAS (see Appendix L ). 25 3.2 Overview of the methodology The methodology used in this thesis was developed in SAS (Statistical Analysis Software). An overview of the numerous steps involved in finding new alternate routes is shown in Figure 9 and Figure 10. These steps are described more thoroughly in the next sections. This step requires manual input This step requires manual input 1. Read in call data 2. Normalize the data 3. Perform Principal Components Analysis 4. Determine how many principal components should be kept 5. Perform Cluster Analysis 6. Determine how many clusters should be created -7. Find adjacent pairs of arcs 8. Verify that there is enough available capacity on these arcs to use them as an alternate route \ J ' n ' i— v-r-1 9. Output a list indicating which routes can be used as alternate routes and when they can be used These two steps are shown in more detail in Figure 10. Figure 9 Overview of the methodology 26 i is equal to 1 and/ is equal to 1 No Form group 1 with the observations in cluster i Form group2 with the observations in cluster j Create a new data set named match that only contains observations where adjacent arcs For each observation (AZ pair) in group 1: if arcs in group2 which start with the same "A" node are found, merge them with the observation in group 1. This new data set is named start a. For each observation (AZ pair) in group 1: if arcs in group2 which finish with the same "A" node are found, merge them with the observation in group 1.This new data set is named end_a. Append data sets start_a and enda. This new data set is named match a For each observation (AZ pair) in match_a: if arcs in group2 which start with the same "Z" node are found, merge them with the observation in match a. This new data set is named start z. For each observation (AZ pair) in match_a: if arcs in group2 which end with the same "Z" node are found, merge them with the observation in match a. This new data set is named end z. Append data sets start_z and end_z. This new data set is named match z Yes Increment j by one (/' =j +1) No End of algorithm Verify that there is enough available capacity on these arcs to use them as an Create new data set named match that contains only the observation where adjacent arcs are found (i.e. two of the arcs have a node in common) Figure 10 Overview of algorithm used to find new routes 27 3.3 Data preparation The data used for this analysis were a subset of the data that were provided to us by Telus. The data included summarized call volumes (in call seconds) on an hourly basis for each A Z pair. This analysis was performed on data obtained for a particular day, July 31st 2000, but it could easily be repeated for data of any other day. 
In order to protect the confidentiality of the data, the name of the 32 switches were replaced by numbers ranging from 0 to 31. The data set was then transposed as shown in Table 4. This table represents the measured variables for one of the 460 observations (AZ pairs). For each A Z pair, the total call volumes in CCS, i.e. the total traffic for both directions, was added and summarized on an hourly basis. For example, the value under H8, i.e. 2498 .4, for the A Z pair 0-1 represents the sum of all the traffic carried from 0 to 1 and from 1 to 0 during the 8:00AM to 9:00AM period. Obs A z _ N A M E : •HO : • , H1 H 2 . A . M H 3 ^ H 4 I M i H ' , 1 00-01 0 1 C C S 110.8 97.9 92.8 114.0 172 .5 2 4 0 . 5 Dbs_._ Hfi . H7 H8 H 9 mo ... H U _ H 1 2 _ H U H 1 4 H I 5 . 1 384 .4 966 .7 2498 .4 4217 .3 4419 .4 4568 .2 3356 .7 3695 .8 3928 .0 3696 .8 O b s . . .. H ± 6 _ _ H 1 8 IH2U. ... II2iliiili| 122E-4 H23 1 2443 .2 962 .6 551 .7 406 .5 373 .3 372 .7 243 .4 199.9 Table 4 Sample of data used in this analysis The data contained in Table 4 were then normalized for each A Z pair. The hourly call volumes were scaled for each A Z pair so that the values of the call volumes would range from 0 (for the minimum hourly call volume) to 1 (for the maximum hourly call volume). This was done so that it would be possible to compare the demand patterns of each A Z pairs even though they did not have the same level of utilization. We are interested in the shape of the demand patterns, not in the volume itself. In addition, only the data for the time period between 7:00AM and 22:00PM were kept as the demand on the network is 28 generally low for the rest of the time and there is therefore no need to look for alternate route. 3.4 Principal Components Analysis The first step that we took in our analysis of the data was to screen the data using Principal Components analysis. This analysis was performed on the normalized data using the PRINCOMP procedure in SAS. The reason why this analysis was performed on the normalized data is that our goal is to find similar demand patterns in the data, rather than finding similar call volume levels. The difference in arc utilization from one arc to another during the same hour can be as large as 27530.8 CCS. Therefore, performing a PCA on the non-normalized data might only try to differentiate arcs based on the arc utilization (i.e. arcs that generally have a high arc utilization would be grouped together while arcs with smaller capacities and therefore lower arc utilization would be grouped together). A PCA was performed on the non-normalized data to show that this was the case (see Appendix B ). Once the PCA was done, we decided how many principal components needed to be kept based on the eigenvalue-one criterion, the SCREE test and the proportion of variance that is accounted for by each principal component (see Background section for more details about these criteria). The results obtained from the PCA and the evaluation of the criteria are shown in the Results section. 3.5 Cluster analysis The cluster analysis was then performed using the FASTCLUS and the CLUSTER procedures in SAS (see Appendix A for more details about these procedures). These 29 analyses used as inputs the variables (the principal components) that were obtained from the PCA. Clustering using the FASTCLUS procedure: Using the FASTCLUS procedure was a difficult task as it requires that the user defines how many clusters to form and this was not obvious from looking at the data. 
The cluster analysis using the FASTCLUS was repeated several times, specifying different number of clusters, but it was difficult to conclude from this analysis what was the ideal number of clusters that should be used. Clustering using the CLUSTER procedure The clustering was performed again using the CLUSTER procedure in SAS. In order to determine the appropriate number of clusters, the Cubic Clustering Criterion (CCC), the Pseudo Hotelling's T2 Test (PST2) and the Pseudo F statistics (PSF) results were examined (these criteria are described in the Background section). 3.6 Search for pair of adjacent arcs After the A Z pairs were grouped using the cluster analysis, those groups were used to search for pairs of adjacent arcs. This search is illustrated as follows. Suppose that we are analyzing the two groups shown in Table 5. We first proceed by finding, for each A Z pair in group 1, arcs in group 2 which either start or end with the same node as the "A" node of the A Z pair. Then we search, once again for each A Z pair in group 1, for arcs in group 2 which either start or end by the same node as the "Z" node of the A Z pair. By doing this, a table similar to that in Table 6 is obtained. 30 Groupl: Observation A Z AZ_id 1 01 07 01-07 2 02 08 02-08 3 04 11 04-11 G r o u p 2 : Observation A Z AZ_id 1 01 11 01-11 2 01 17 01-17 3 04 05 04-05 4 05 11 05-11 Table 5 Example of search for pairs of adjacent arcs. Observation A Z AZ_id Starts or ends with A Starts or ends with Z 1 01 07 01-07 01-11 2 01 07 01-07 01-17 3 02 08 02-08 4 04 11 04-11 04-05 05-11 Table 6 Adjacent arcs The next step is to determine if any of the arcs which start or end with the "A" node of the A Z pair have a node in common with any of the arcs which start or end with the "Z" node of the A Z pair. In the example above, there is a match only for observation 4, i.e. the A Z pair 04-11. This means that the arcs 04-05 and 05-11, belonging to group 2, are a pair of adjacent arcs that could possibly be used as an alternate route for the A Z pair 04-11. However, before adding these new routes to the routing tables, we first need to 31 determine if and when they have enough available capacity in order to carry the extra traffic offered by the A Z pair. 3.7 Feasibility test Once the file containing all of the adjacent pairs of arcs was created, we then verified whether or not there was sufficient available capacity on the new alternate route to carry the extra traffic. This was done by calculating the minimum available capacity. Since there were no accurate data available concerning the capacity of each arc at the time when the code for this thesis was developed, the capacity of each arc was assumed equal to the maximum arc utilization during the day that was studied. It was then assumed that an arc could only be used as part of an alternate route if there was more than 30% of its capacity available. This seemed like a reasonable assumption as we want to make sure that we will not be incurring any blockage on the arcs forming the new routes: this would change the network flow and may create other problem areas. The hourly available capacity was therefore calculated by subtracting the hourly arc utilization from 70% of the maximum daily arc utilization (see Figure 11). 
Arc utilization Z Maximum daily arc utilization (CCS) Available capacity maximum daily arc utilization 70% of Hour Figure 11 Calculation of available capacity 32 Since we are using maximum daily arc utilization instead of using the actual capacity, this calculation provided a lower bound to the available capacity. However, the maximum daily arc utilization should not be too far from the actual arc capacity since arcs are designed such that there is overflow. Therefore, the maximum arc utilization should be fairly close to the actual capacity. The available capacity for the new routes was calculated by taking the minimum of the available capacity of the two arcs composing that route. Once this hourly available capacity was calculated for each route, we then tried to determine which routes had sufficient available capacity to be used as new alternate routes. The following criterion was used to determine this: If there is sufficient available capacity on the route to carry 5% of the maximum daily arc utilization on the A Z pair of interest, then this route can be used as an alternate route. This value, 5%, can be increased depending on the percentage of blocked calls (i.e. calls that could not be made because all of the circuits on all of the possible routes were busy) that are noticed. 33 4 Results and Analysis 4.1 Results of the Principal Components Analysis 4.1.1 Description of the results The principal components can either be calculated from the correlation matrix or from the variance-covariance matrix. Since the data used for the principal component analysis had been normalized, the PCA was first performed using the covariance-variance matrix. The results obtained from the PCA are shown in Figures 12 to 14. The first figure, Figure 12, shows the mean and the standard deviation for each of the 16 variables (the normalized arc utilization on an hourly basis for hours 7:00 to 22:00). The table in Figure 13 has four columns containing the PCA results. The first column lists the eigenvalues. The second column indicates the difference between two successive eigenvalues. For example, the difference between the first and the second eigenvalue is 0.37805360 - 0.11056362, which is equal to 0.26748998. The third column indicates the amount of variance accounted for by each eigenvalue. For example, the proportion of the variance that is accounted for by the first eigenvalue is equal to Wtr(Z) = XXI{XX + X2 + ...+ he) = 0.378054/0.595029 Therefore, Xx/tr(L) = 0.6354. The last column in Figure 13 indicates the cumulative variance that has been explained by the eigenvalues. For example, the proportion of the variance that was accounted for by the three first eigenvalues is equal to: X{/tr(L) + X2/tr(Z) + A.3/tr(S) = 0.6354 + 0.1858 + 0.0504 = 0.8716 34 Figure 14 shows the elements of the five first principal components. For example, the first principal component (or eigenvector ai) is equal to: a, = [-0.024492, -0.063914, -0.088346, 0.511045, 0.391758]7 It can be verified that every eigenvector is normalized to have a length of 1 by taking the square root of the sum of squares of each element that compose an eigenvector. For example, the length of the first eigenvector is equal to: Length ofay = ^(-0.024492)2 + (-0.063914)2 +... + (0.391758)2 = 1 Finally, Figure 15 shows the principal component scores for the first three components of the first observation (the complete table includes the 16 principal component scores for each of the 458 observations). 
For example, the first principal component score for the A Z pair 00-01 is equal to -1.308. This can be calculated as follows: yji = ai(x; - ju), where a, = [-0.024492, -0.063914, -0.088346, 0.511045, 0.391758]7 x, =[0.167, 0.521,0.919, ...,0.030, 0.000]T, and ft = [0.007730199, 0.349277637, 0.772263555, 0.66555774, 0.457278907]T Therefore, v/y = -0.024492 (0.167 - 0.007730199) + . . . + 0.391758(0.000 - 0.457278907) = -1.308 35 O b s e r v a t i o n s 4 5 8 V a r i a b l e s 16 S i m p l e S t a t i s t i c s N7 N8 N9 N10 N11 N12 Mean 0 . 0 0 7 7 3 0 1 9 9 0 3 4 9 2 7 7 6 3 7 0 . 7 7 2 2 6 3 5 5 5 0 9 1 1 5 2 1 6 1 7 0 8 8 4 1 7 0 . 7 2 1 6 6 S t D 0 . 0 2 8 7 3 9 1 2 1 0 1 0 5 3 6 4 6 2 3 0 . 1 3 2 6 8 5 2 4 9 0 1 1 8 7 8 2 9 2 2 0 1 3 3 4 3 0 . 1 4 2 8 4 N13 N14 N15 N16 N17 N18 Mean 0 . 7 2 7 7 4 5 0 7 1 0 7 4 1 3 0 4 5 3 4 0 . 7 3 8 4 7 7 4 1 6 0 7 0 2 1 8 2 7 5 7 0 5 4 7 8 9 0 . 5 1 7 4 S t D 0 . 1 4 4 0 9 8 7 7 1 0 1 5 1 0 2 5 5 8 1 0 . 1 4 6 2 0 9 1 1 6 0 1 5 1 0 9 7 1 5 1 0 1 9 7 1 4 0 . 2 2 1 5 8 N19 N20 N21 N22 Mean 0 . 5 6 6 0 9 5 7 5 9 0 6 3 1 9 8 0 5 5 9 0 . 6 6 5 5 5 7 7 4 0 4 5 7 2 7 8 9 0 7 S t D 0 . 2 6 2 0 1 6 6 7 5 0 2 9 5 7 3 3 5 6 4 0 . 3 2 3 6 9 1 9 1 2 0 2 7 6 6 8 3 3 2 3 Figure 12 Simple statistics obtained from the PCA T h e PR INCOMP P r o c e d u r e E i g e n v a l u e s o f t h e C o v a r i a n c e M a t r i x E i g e n v a l u e D i f f e r e n c e P r o p o r t i o n C u m u l a t i v e 1 0 3 7 8 0 5 4 0 2 6 7 4 9 0 0 . 6 3 5 4 0 . 6 3 5 4 2 0 1 1 0 5 6 4 0 0 8 0 5 6 3 0 . 1 8 5 8 0 8 2 1 2 3 0 0 3 0 0 0 0 0 0 1 6 1 6 8 0 . 0 5 0 4 0 8 7 1 6 4 0 0 1 3 8 3 3 0 0 0 4 4 4 4 0 . 0 2 3 2 0 8 9 4 8 5 0 0 0 9 3 8 8 0 0 0 1 0 7 0 0 . 0 1 5 8 0 9 1 0 6 6 0 0 0 8 3 1 9 0 0 0 0 4 3 8 0 . 0 1 4 0 0 9 2 4 6 7 0 0 0 7 8 8 1 0 0 0 1 9 8 8 0 . 0 1 3 2 0 9 3 7 8 8 0 0 0 5 8 9 2 0 0 0 0 3 4 9 0 . 0 0 9 9 0 9 4 7 7 9 0 0 0 5 5 4 3 0 0 0 0 4 8 2 0 . 0 0 9 3 0 9 5 7 1 10 0 0 0 5 0 6 2 0 0 0 0 2 8 7 0 . 0 0 8 5 0 9 6 5 6 11 0 0 0 4 7 7 5 0 0 0 0 4 0 8 0 . 0 0 8 0 0 9 7 3 6 12 0 0 0 4 3 6 6 0 0 0 0 4 8 4 0 . 0 0 7 3 0 9 8 0 9 13 0 0 0 3 8 8 2 0 0 0 0 3 7 4 0 . 0 0 6 5 0 9 8 7 4 14 0 0 0 3 5 0 8 0 0 0 0 0 9 6 0 . 0 0 5 9 0 9 9 3 3 15 0 0 0 3 4 1 2 0 0 0 2 8 6 3 0 . 0 0 5 7 0 9991 16 0 0 0 0 5 4 9 0 0 0 0 9 0 0 3V 0 0 0 0 Figure 13 Eigenvalues of the Covariance Matrix E i g e n v e c t o r s P r i n l P r i n 2 P r i n 3 P r i n 4 P r i n 5 N7 - 0 . 0 2 4 4 9 2 - 0 . 0 0 4 0 3 7 - 0 . 0 1 0 8 2 6 0 . 0 2 8 3 8 9 - 0 . 0 3 2 1 8 3 N8 - 0 . 0 6 3 9 1 4 0 . 1 3 4 5 4 8 0 . 2 0 5 9 7 9 0 . 3 4 9 6 2 2 - 0 . 1 3 2 6 1 6 N9 - 0 . 0 8 8 3 4 6 0 . 2 0 0 6 0 5 0 . 3 3 5 3 1 2 0 . 5 1 0 7 9 7 0 . 3 5 2 5 6 4 N10 - 0 . 0 6 9 6 3 4 0 . 2 2 7 6 4 1 0 . 1 3 0 8 7 6 0 . 1 0 5 0 0 1 0 . 5 4 7 4 1 7 N11 - 0 . 0 6 8 6 6 2 0 . 3 1 4 4 0 5 0 . 0 0 1 5 0 4 0 . 0 1 7 6 7 2 0 . 1 2 4 4 4 4 N 1 2 0 . 0 0 2 1 7 2 0 . 3 8 0 9 1 4 - 0 . 0 7 3 2 6 6 0 . 0 4 3 5 6 2 - 0 . 0 2 7 3 9 1 N 1 3 - 0 . 0 2 3 6 4 2 0 . 3 7 3 9 4 9 - 0 . 1 3 6 4 3 9 0 . 2 0 7 1 8 4 - 0 . 2 1 6 1 2 2 N 1 4 - 0 . 0 4 6 4 2 5 0 . 3 9 5 6 9 2 - 0 . 1 4 3 3 7 5 0 . 0 4 2 4 0 3 -0 . 3 8 7 3 5 N 1 5 - 0 . 0 0 8 8 7 4 C . 3 8 4 2 - 0 . 1 1 8 7 8 6 - 0 . 0 4 1 0 6 6 - 0 . 1 9 1 5 2 3 N 1 6 0 . 1 2 0 6 2 6 0 . 3 2 5 3 9 9 - 0 . 0 0 4 1 6 7 - 0 . 2 5 5 4 4 3 0 . 0 9 9 4 8 2 N 1 7 0 . 2 6 5 2 6 2 0 . 2 4 2 8 4 0 . 0 2 1 9 9 2 - 0 . 3 3 5 4 5 3 0 . 2 7 5 3 0 5 N18 0 . 3 2 3 9 7 8 0 . 1 4 9 9 9 7 0 . 1 5 3 3 5 6 - 0 . 4 0 2 0 6 6 0 . 1 6 6 7 4 1 N 1 9 0 . 4 0 0 0 0 9 0 . 0 5 1 8 5 0 . 3 3 8 2 5 5 - 0 . 0 6 9 7 6 - 0 . 2 8 4 9 6 8 N 2 0 0 . 
4 5 8 9 3 8 - 0 . 0 4 0 4 7 5 0 . 3 2 6 1 6 9 0 . 2 3 0 1 8 - 0 . 2 5 6 8 0 1 N21 0 . 5 1 1 0 4 5 - 0 . 0 9 6 2 6 9 -C . 0 3 0 5 3 0 . 2 8 9 2 0 6 0 . 1 4 0 5 6 9 N 2 2 0 . 3 9 1 7 5 8 - 0 . 0 3 0 0 5 7 - 0 . 7 2 3 5 2 4 0 . 2 7 4 7 2 6 0 . 1 6 5 3 4 3 Figure 14 Five first Principal Components Obs A Z _ I D N 7 N 8 N 9 N 1 0 N 1 1 N 1 2 N 1 3 N 1 4 N 1 5 1 0 0 - 0 1 0 . 1 6 7 0 . 5 2 1 0 . 9 1 9 0 . 9 6 6 1 . 0 0 0 0 . 7 2 0 0 . 7 9 8 0 . 8 5 2 0 . 7 9 9 N 1 6 N 1 7 N 1 8 N 1 9 N 2 0 N 2 1 N 2 2 T R I N T " P R I N 2 P R I N 3 0 . 5 0 9 0 . 1 6 6 0 . 0 7 1 0 . 0 3 8 0 . 0 3 0 0 . 0 3 0 0 . 0 0 0 - 1 . 3 0 8 0 . 0 4 3 \ - 0 . 0 4 3 Figure 15 Principal Component Scores first the first observation 4.1.2 How many principal components should be retained? We determined how many principal components we needed to keep by looking at the SCREE plot and by looking at how much variance each principal components accounted for. The principal component analysis was also repeated using the correlation matrix. 37 The result of the PCA using the correlation matrix (instead of the variance-covariance matrix) is shown in Appendix C . Once again, we looked at the three criteria in order to determine how many principal components to keep. This time, when looking at the eigenvalues (see Appendix C ), we found that the three first eigenvalues were greater than one. This suggested that three principal components should be kept. Then, we looked at the proportion of the variance explained by each of the eigenvalues. From this criterion, we decided to keep the three first principal components as the three first eigenvalues each capture more than 5% of the variance. Finally, from the SCREE plot of the eigenvalues, Figure 16, we could see that value of the eigenvalues tends to level off after the third eigenvalue. This suggests that the dimensionality of the problem can probably be reduced to 3: principal component 1 (PRTN1), principal component 2 (PRTN2) and principal component 3 (PRJN3). Figure 16 SCREE plot of eigenvalues 38 Since all three criteria were suggesting that we should keep the first three components, we pursued the analysis only using these new variables instead of using the 16 original variables. 4.1.3 Screening the data using PCA The scatterplots of PRIM vs. PRIN2, PRTN2 vs. PRIN3 and PRIN2 vs. PRTN3 were then plotted to screen the data and determine if there are any outliers (see Appendix D ). From the graph of PRTN2 vs. PRIN3 and PRTN2 vs. PPJN3 it was apparent that there were two outliers. These outliers correspond to the demand on A Z pairs 01-04 and 11-27. The reason why these two A Z pairs were outliers was because they had an unusually high morning demand (see Appendix E ). These were removed from the data before performing the next cluster analysis so they would not influence the statistics4. The PRINCOMP procedure was run with the new data (without the two outliers) and once again, the scatterplots were graphed. This time, there were no apparent outliers. There were no easily identified clusters on these new scatterplots so another method was used to group the A Z pairs. However, the 3 principal components scores obtained from the principal components analysis were still useful as these new variables can be used instead of the 16 original variables. 4.1.4 Interpretation of the principal components The first three principal components are shown in Figure 17. 
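The screening step amounts to a PRINCOMP call followed by pairwise plots of the scores. A minimal sketch, assuming the Appendix L data set Tnorm_01 (one row per AZ pair, with the normalized hourly variables N7 to N22) and an illustrative output name pca_out:

   proc princomp data = Tnorm_01 cov out = pca_out;
      var N7-N22;   * normalized hourly arc utilization, hours 7:00 to 22:00;
   run;

   symbol1 v = dot i = none h = 0.5;
   proc gplot data = pca_out;
      plot prin1*prin2 prin1*prin3 prin2*prin3;   * screen the score plots for outlying AZ pairs;
   run;
   quit;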
4.1.4 Interpretation of the principal components

The first three principal components are shown in Figure 17. These components can be interpreted as follows:

• Component 1: weighs highly the AZ pairs which have very low arc utilization in the morning and in the afternoon and a high utilization in the evening.
• Component 2: weighs highly AZ pairs that have high arc utilization in the afternoon and low arc utilization in the morning and in the evening.
• Component 3: weighs highly AZ pairs that have high arc utilization in the morning, low arc utilization in the afternoon and moderate arc utilization in the evening.

[Figure 17 — Graphical representation of the first three principal components: the weight given by components 1, 2 and 3 to each hour from 7 to 21.]

Based on these interpretations, we would expect an AZ pair whose demand pattern is similar to the pattern of the weights given to each hour by a certain principal component to have a high score for that component. These interpretations are illustrated in the following examples.

Example 1: The demand for the AZ pair 02-29 is shown in Figure 18. From this time series of the normalized arc utilization, we can see that the demand in the morning is relatively low compared to that of the afternoon, and that the demand in the evening is relatively high compared to the demand in the afternoon. Based on the interpretations described above, we can expect the first principal component score to be very high, the second to be very low and the third to be very low. This was the case. The first principal component scores range from -6.0584 to 4.83846 and this AZ pair obtained a score of 4.68528, which is a high score and means that its shape resembles that of principal component one. The second principal component scores range from -8.66150 to 5.61533 and this AZ pair obtained a score of -6.30489. Finally, the third component scores range from -2.55006 to 3.2398 and this AZ pair obtained a score of -2.34666.

[Figure 18 — Example 1: Interpretation of the principal components; normalized arc utilization (ccs_n) by hour for AZ pair 02-29.]

Example 2: The AZ pair shown in Figure 19 shows a high demand in the morning relative to a lower demand in the afternoon, and a very low demand in the evening. We can therefore expect the score for the first principal component to be very low, as this pattern is exactly the opposite of that depicted by principal component one. The second principal component score will be low, and the third principal component score will also be low, as the demand in the evening is not high. This AZ pair obtained a score of -5.56782 for the first principal component, -1.15865 for the second and -0.8837 for the third.

[Figure 19 — Example 2: Interpretation of the principal components; normalized arc utilization (ccs_n) by hour for AZ pair 00-30.]
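Plots like Figures 18 and 19 come directly from the normalized series. A minimal sketch, reusing the norm_01 data set built in Appendix L (the pair 02-29 in the WHERE statement is just the example above):

   symbol1 i = join v = plus;
   proc gplot data = norm_01;
      where az_id = '02-29';   * any AZ pair of interest;
      plot ccs_n * hour;       * normalized utilization by hour of day;
   run;
   quit;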
4.2 Results of the cluster analysis

The clustering was first performed using the FASTCLUS procedure. This analysis was done on the first three principal components. The clusters obtained from this procedure are shown in Appendix F. The procedure seemed to give good results, as the clusters were sufficiently large (none of them contained only 1 or 2 observations). Our main concern with this procedure is that it is very dependent on the user-specified number of clusters and on the selection of the initial set of seed points. From this clustering procedure alone, we did not know what an appropriate number of clusters would be. We therefore looked at hierarchical clustering methods, since they offer criteria that can help determine how many clusters should be kept. We used the CLUSTER procedure in SAS and then selected the clustering method.

We first performed the clustering using the average linkage method. The cluster analysis was performed on the first 3 principal component scores5. Figure 20 shows the output for the last 15 steps of the cluster history (NCL = 15 down to 1). Each row represents the statistics calculated after two clusters have joined. The first column, NCL, represents the number of clusters left. In the complete table, NCL in the first row would be equal to the total number of observations minus 1. The next two columns show which clusters were joined during that step. For example, when the number of clusters was 11, cluster 240 (i.e. a group of AZ pairs that was renamed cluster 240) was merged with the AZ pair 07-27 (this AZ pair was still in a cluster by itself before this step). The next column shows the frequency, or number of observations, in the merged group. Finally, the remaining columns contain statistics calculated on these clusters.

In order to determine the appropriate number of clusters, the Cubic Clustering Criterion (CCC), the pseudo Hotelling's T² test (PST2) and the pseudo F statistic (PSF) were examined (see Figure 20). These criteria are explained in more detail in the background section. As mentioned earlier in this thesis, a small value of PST2 indicates that the two clusters joined at that step can reasonably be combined. Looking at the PST2 values in Figure 20, this statistic indicates that the appropriate number of clusters should be either 5, 9 or 12, since the PST2 values for these mergers are relatively small compared to the others.

5 We also attempted to perform the cluster analysis directly on the raw data (instead of on the retained principal components), but these clusters were less interpretable. Therefore, we continued our analysis using the principal components as inputs to the cluster analysis. See Appendix J.

The CCC statistic is not very helpful in this case, since only two of the values are positive and none of them is greater than three. Therefore, the CCC statistic cannot be used as an indication of the number of clusters to use. Another criterion that can be used to judge the number of clusters in a data set is the pseudo F statistic (PSF); relatively large values indicate that the number of clusters is probably appropriate. Reading the PSF values in Figure 20, this criterion indicates that the ideal number of clusters is possibly 2, 9 or 11. From these three criteria, it seems that the number of clusters should be between 2 and 9 (forming more than 9 clusters is not very useful for this analysis, as we are trying to simplify the data).
Average Linkage Cluster Analysis — Cluster History

                                                                    Norm RMS
NCL  Clusters Joined   FREQ  SPRSQ   RSQ    ERSQ   CCC    PSF  PST2    Dist
15   CL32   CL85        48   0.0029  0.880  0.904  -6.4   233  16.7   0.487
14   CL15   CL52        56   0.0049  0.875  0.899  -6.2   240  22.4   0.490
13   CL14   CL24       109   0.0181  0.857  0.893  -8.7   223  71.4   0.515
12   CL20   CL59         6   0.0012  0.856  0.887  -7.2   241   4.1   0.530
11   CL240  07-27        3   0.0009  0.855  0.879  -5.6   264  36.8   0.557
10   CL17   CL18       177   0.0312  0.824  0.871  -9.6   233  94.1   0.560
 9   CL16   CL33        51   0.0042  0.820  0.861  -8.1   256  12.6   0.560
 8   CL22   CL10       231   0.0385  0.781  0.849  -12    230  85.6   0.622
 7   CL27   CL9         62   0.0108  0.771  0.834  -11    252  28.5   0.642
 6   CL19   CL44        47   0.0060  0.765  0.815  -6.6   294  46.9   0.653
 5   CL11   CL12         9   0.0026  0.762  0.788  -3.3   363   5.6   0.678
 4   CL6    CL13       156   0.0546  0.707  0.748  -4.6   366  147    0.729
 3   CL7    CL5         71   0.0126  0.695  0.682   1.4   518  22.1   0.800
 2   CL8    CL3        302   0.1158  0.579  0.548   2.2   627  179    0.891
 1   CL4    CL2        458   0.5790  0.000  0.000   0.0    .   627    1.298

Figure 20 Results of the cluster analysis using the Average Linkage Method

Another way of determining the optimal number of clusters is to look at the tree generated by the cluster analysis, shown in Figure 21. From this tree, it seems that there are three main clusters, which are then subdivided to form a total of 7 or 8 clusters (two of which include only a few observations). This value is within the range found using the three criteria described above. Therefore, we continued the analysis using 7 clusters. These 7 clusters are shown in Appendix G; each graph represents the box plot of the normalized time series of the AZ pairs belonging to that cluster.

[Figure 21 — Tree diagram obtained from using the Average Linkage Method: the height of each merge plotted over the AZ pairs.]

Which hierarchical method should be used? We repeated this cluster analysis using Ward's and the centroid clustering methods. The clusters obtained from these methods are shown in Appendix H and Appendix I. The centroid method tended to identify outliers and group them by themselves. The average linkage method also tended to form small clusters (see clusters 5 and 6). The clusters obtained using Ward's minimum variance method seemed more reasonable: the sizes of the clusters were relatively similar, and these group sizes are probably the most useful for the rest of the analysis. Therefore, we continued the analysis using Ward's method. This result is similar to that obtained by Puterman and Dunn (1986) in their cluster analysis of data concerning children with low birthweight. In their study, they compared several hierarchical methods, including nearest neighbour, farthest neighbour, centroid, median and Ward's methods, and found that the results obtained using Ward's method corresponded the most closely with their objective.
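In SAS, the three hierarchical runs differ only in the METHOD= option. A minimal sketch of the average-linkage run with the diagnostics discussed above, patterned on the Ward's-method call in Appendix L (METHOD=average is substituted here for illustration; the appendix listing shows method=ward):

   proc cluster data = swork.pca_out method = average pseudo ccc outtree = tree;
      var prin1 prin2 prin3;   * cluster on the three retained component scores;
      id az_id;
   run;

   proc tree data = tree out = treeout nclusters = 7;   * cut the tree at 7 clusters;
      copy prin1 prin2 prin3;
      id az_id;
   run;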
4.3 Results of the search for adjacent arcs

The groups found using the cluster analysis were then used to find adjacent arcs. Using the methodology described in section 3.6 and illustrated in Figure 10, we were able to find new alternate routes that can be used during certain periods of the day. For this thesis, the methodology was used to find possible alternate routes in each of the other groups: for example, for each of the AZ pairs in the first group, we searched for possible alternate routes in groups 2 to 7. The table shown in Appendix K shows a sample of the resulting new routes.

This table summarizes which nodes can be used as a tandem for a certain AZ pair. In addition, it indicates when these nodes can be used as tandems ("Y" indicates that the node can be used as a tandem during that hour and "N" indicates that it cannot). For instance, in the example shown in Table 7, the first line indicates that the node 00 could be used as a tandem between 7AM and 9AM and between 4PM and 11PM for the AZ pair 02-05. This new alternate route would direct calls from the origin switch 02, through the tandem switch 00, and finally to the destination 05. The next line shows that node 01 could also be used as a tandem for the AZ pair 02-05 during these hours.

AZ_id  Tandem  H7 H8 H9 H10 H11 H12 H13 H14 H15 H16 H17 H18 H19 H20 H21 H22
02-05    00    Y  Y  N  N   N   N   N   N   N   Y   Y   Y   Y   Y   Y   Y
02-05    01    Y  Y  N  N   N   N   N   N   N   Y   Y   Y   Y   Y   Y   Y

Table 7 Example of the resulting new possible alternate routes

However, in this example, the node 00 should not be used as a tandem during the period between 9AM and 4PM, since at least one of the arcs 02-00 and 00-05 lacks sufficient available capacity. Similarly, at least one of the arcs 02-01 and 01-05 lacks sufficient available capacity for that route to be used as an alternate route between 9AM and 4PM.
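The search for candidate tandems reduces to a handful of joins on the endpoint columns. Below is one of those joins, condensed from the %makefile macro in Appendix L; group1 and group2 each hold the AZ pairs of one cluster, with endpoint columns a and z:

   proc sql;
      * arcs in group 2 that share the origin node "a" of a group-1 pair;
      create table start_a as
      select group1.a, group1.z, az_id1, az_id2a
      from group1,
           group2(keep = a az_id2 rename = (az_id2 = az_id2a))
      where group1.a = group2.a;
   quit;

Three analogous joins cover the remaining endpoint combinations; keeping only the rows where the two matched arcs share a common node yields the tandem switch of each candidate route.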
If it does not, then the methodology described in this thesis could be automated. In addition, the code could be adjusted so that instead of searching for all of the possible new alternate routes, the code only searched for possible alternate routes for a specific A Z pair. This adjusted code could be incorporated into a tool (with a visual interface) that could be used by Telus. This tool would be very useful to Telus since if they start to see blocked calls on a certain A Z pair, they could use this tool and determine if there are any potential alternate routes that could carry extra traffic. This would permit Telus to avoid or delay adding capacity to that A Z pair. 48 References Braun, D., Efficient routing of telephone calls in a circuit-switched network, unpublished M.Sc. thesis, University of British Columbia, Faculty of Commerce and Business Administration, 2000. Cook, D., Principal Component Analysis, November 23, 2000, http://www.public.iastate.edU/~dicook/stat501/97/lectures/2.6.html Hatcher, L. , Stepanski, E. J., A Step-by-Step Approach to Using the SAS® System for Univariate and Multivariate Statistics, Cary, NC:SAS Institute Inc. 1994, 552 pp. Johnson, D. E. , Applied Multivariate Methods for Data Analysts, Duxbury Press, 1998. Puterman, M . L. , Dunn, H. G., 'Statistical Analysis of Mild Brain Dysfunctions', In: Henry G. Dunn (Ed.), Sequelae of Low Birthweight: The Vancouver Study, London: Mac Keith Press, 1986, pp.114-125. SAS Institute Inc., SAS/STAT® User's Guide, Version 6, Fourth Edition, Volume 1, Cary, NC: SAS Institute Inc., 1989. 943 pp. SAS Institute Inc., SAS/STAT® User's Guide, Version 6, Fourth Edition, Volume 2, Cary, NC: SAS Institute Inc., 1989. 846 pp. SAS Institute Inc., SAS® User's Guide.Basics, Version 5 Edition, Cary, NC: SAS Institute Inc., 1985. 1290 pp. SAS Institute Inc., SAS Macro Language: Reference, First Edition, Cary, NC: SAS Institute Inc., 1997. 304 pp. 49 Appendices Appendix A Description of some SAS procedures A-1 Principal Components Analysis using the S A S PRINCOMP procedure The procedure PRINCOMP in SAS can be used to perform PCA. The syntax for this procedure is the following6: P R O C PRINCOMP <options>; B Y variables; FREQ variables; PARTIAL variables; V A R variables; WEIGHT variables; J required statement optional statements In our analysis, we have used the following options: • COVARIANCE (COV): indicates that the principal components should be computed from the covariance matrix. If this option had not been specified, the principal components would have been calculated using the correlation matrix. The COV option should not be used unless the units for each variable are comparable and the variables have been standardized in some way. The variables in this data set are all measured in centi-call seconds (CCS) and they have been normalized. Therefore, it is possible to use the COV option. • D A T A : indicates which data set will be analyzed. • OUT: creates a data set that contains, in addition to all of the original data, the principal component scores. • OUTSTAT: creates a data set that contains the means, standard deviations, number of observations, correlations or covariances, eigenvalues and eigenvectors. We have only used the following optional statement: • VAR: lists the variables that will be analyzed. 6 More information about this procedure can be found in SAS/STAT® User's Guide, volume 2, pp.1243 to 1263. 
A-2 Cluster analysis using the SAS FASTCLUS procedure

The procedure FASTCLUS in SAS can be used to perform cluster analysis. The syntax for this procedure is the following7:

PROC FASTCLUS MAXCLUSTERS=n RADIUS=t <options>;   (required statement)
  VAR variables;      (optional)
  ID variables;       (optional)
  FREQ variables;     (optional)
  WEIGHT variables;   (optional)
  BY variables;       (optional)

As part of the required statement, either the maximum number of clusters allowed (MAXCLUSTERS) or the minimum distance allowed between cluster seeds (RADIUS) must be specified.

In our analysis, we have used the following options:
• MAXITER: indicates the maximum number of iterations allowed for recomputing the cluster seeds.
• DATA: indicates the name of the data set containing the data to be clustered.
• OUT: creates a new data set that contains, in addition to all of the original data, the variables CLUSTER and DISTANCE.

We have also used the following optional statement:
• VAR: indicates which variables to use in the cluster analysis.

7 More information about this procedure can be found in SAS/STAT® User's Guide, Volume 1, pp. 825 to 850.

A-3 Cluster analysis using the SAS CLUSTER procedure

The procedure CLUSTER in SAS can be used to perform cluster analysis. The syntax for this procedure is the following8:

PROC CLUSTER METHOD=name <options>;   (required statement)
  BY variables;       (optional)
  COPY variables;     (optional)
  ID variables;       (optional)
  RMSSTD variables;   (optional)
  VAR variables;      (optional)
  FREQ variables;     (required when the RMSSTD statement is used, otherwise optional)

As part of the required statement, the clustering method to be used needs to be specified. The methods available in SAS are: Average, Centroid, Complete, Density, EML, Flexible, McQuitty, Median, Single, Twostage and Ward. These clustering methods are described in more detail in the background section of this thesis.

In our analysis, we have used the following options:
• DATA: indicates the name of the data set containing the data to be clustered.
• CCC: prints the Cubic Clustering Criterion. This option should not be used in conjunction with METHOD=Single.
• OUTTREE: creates a new data set that can be used by the TREE procedure to draw a tree diagram.
• PSEUDO: prints the pseudo F and t² statistics. This option is effective only when the data are coordinates or METHOD=Average, Centroid or Ward. It should not be used in conjunction with METHOD=Single.

We have also used the following optional statement:
• VAR: indicates which variables to use in the cluster analysis.

8 More information about this procedure can be found in SAS/STAT® User's Guide, Volume 1, pp. 519 to 614.
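For reference, this is the FASTCLUS call used in this project (from the program in Appendix L):

   proc fastclus maxclusters = 5 maxiter = 10
                 data = swork.pca_out out = swork.clus_01;
      var prin1 prin2 prin3;   * cluster on the three retained principal component scores;
   run;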
A-4 Tree diagram using the SAS TREE procedure

The procedure TREE in SAS can be used to print tree diagrams. This can be used to view how clusters were formed at different levels. The syntax for this procedure is the following9:

PROC TREE <options>;   (required statement)
  NAME variables;     (optional)
  HEIGHT variables;   (optional)
  PARENT variables;   (optional)
  BY variables;       (optional)
  COPY variables;     (optional)
  FREQ variables;     (optional)
  ID variables;       (optional)

In our analysis, we have used the following options:
• DATA: indicates the name of the data set containing the data used to define the tree.
• OUT: creates a new data set that lists all of the observations in the tree, as well as variables describing which clusters each observation belongs to. If this option is used, either the NCLUSTERS or the LEVEL option must be used.
• NCLUSTERS: specifies the number of clusters to keep.

We have also used the following optional statements:
• COPY: indicates which variables to copy to the output file.
• ID: indicates which variable should be used to identify the objects in the tree in the output file.

9 More information about this procedure can be found in SAS/STAT® User's Guide, Volume 2, pp. 1613 to 1631.

Appendix B Results of the PCA performed on non-normalized data

The results of this analysis can be summarized as follows. The SCREE plot showed that only one principal component should be used. This principal component attributed essentially the same weight to each variable: each element of the first eigenvector was around 0.25, which corresponds to giving an equal weight to each variable. The value of 0.25 can be explained as follows: the 16 elements (representing hours 7 through 22) were given a similar magnitude, so each of them has a value of about $\pm 1/\sqrt{16} = \pm 0.25$. Thus, this principal component only captured the variance due to the volume of calls, not the pattern of the variation of the demand throughout the day. This supports using normalized data for the analysis.

       Eigenvalue  Difference  Proportion  Cumulative
PRIN1  15.3705     14.9277     0.960657    0.96066
PRIN2   0.4428      0.3305     0.027674    0.98833
PRIN3   0.1123      0.0797     0.007019    0.99535
PRIN4   0.0326      0.0183     0.002037    0.99739
PRIN5   0.0143      0.0086     0.000896    0.99828

Figure 22 Eigenvalues of the correlation matrix

[Figure 23 — SCREE plot of the eigenvalues.]

Hour   PRIN1      PRIN2       PRIN3
 7     0.241845   -0.266401    0.772399
 8     0.247786   -0.307951    0.281299
 9     0.250951   -0.239003   -0.142073
10     0.251656   -0.193613   -0.248267
11     0.251852   -0.193085   -0.237668
12     0.253404   -0.141858   -0.138891
13     0.252909   -0.172962   -0.084318
14     0.25269    -0.175234   -0.119929
15     0.253603   -0.125743   -0.116743
16     0.25443    -0.017332   -0.136451
17     0.25285     0.160231   -0.095562
18     0.249955    0.273859   -0.061689
19     0.247392    0.349123   -0.004427
20     0.246958    0.361664    0.015152
21     0.245921    0.386733    0.053651
22     0.245398    0.320195    0.311235

Figure 24 First three normalized eigenvectors, $a_1, a_2, a_3$
Appendix C Results of the PCA using the correlation matrix

Eigenvalues of the Correlation Matrix

      Eigenvalue   Difference   Proportion   Cumulative
 1    6.32892724   1.16051626   0.3956       0.3956
 2    5.16841098   4.16272880   0.3230       0.7186
 3    1.00568218   0.23363881   0.0629       0.7815
 4    0.77204337   0.24657065   0.0483       0.8297
 5    0.52547272   0.15731555   0.0328       0.8625
 6    0.36815718   0.03922067   0.0230       0.8855

Eigenvectors
        Prin1       Prin2       Prin3       Prin4       Prin5
N7     -0.218068   -0.107553   -0.131231    0.767449    0.53781
N8     -0.221336    0.127624    0.562158    0.370364   -0.441
N9     -0.242742    0.156286    0.579287   -0.025605    0.13178
N10    -0.225897    0.221177    0.193285   -0.396466    0.53323
N11    -0.221985    0.282967   -0.107083   -0.092848    0.25468
N12    -0.107069    0.371136   -0.140696   -0.002884    0.02445
N13    -0.15081     0.344498   -0.152119    0.167608   -0.1139
N14    -0.178797    0.333567   -0.239056    0.089484   -0.198
N15    -0.119134    0.357717   -0.183128    0.047417   -0.2091
N16     0.107078    0.367847   -0.046361   -0.058797   -0.041
N17     0.266625    0.284289    0.026095    0.008205    0.10824
N18     0.313069    0.216859    0.113953    0.050463    0.12271
N19     0.337932    0.158114    0.210605    0.115269    0.09699
N20     0.354248    0.116765    0.210209    0.090069    0.07887
N21     0.367845    0.097499    0.090043    0.110395    0.09446
N22     0.329076    0.108541   -0.190454    0.171397   -0.0139

Appendix D Scatterplots of the first three principal components

[Scatterplots of PRIN1 vs. PRIN2, PRIN2 vs. PRIN3 and PRIN1 vs. PRIN3, with observations marked by cluster (1 to 5).]

Figure 25 Scatterplots of the 3 first components

Appendix E Time series of the outliers

[Time series plots of the hourly demand for the outlying AZ pairs (AZ_ID=01-04 and 11-27).]

Appendix F Clusters obtained from using the FASTCLUS procedure

[Box plots of the normalized arc utilization (CCS_N) by hour, 07:00 to 22:00, for each cluster.]

Appendix G Clusters obtained using the Average Linkage Method

[Box plots of the normalized time series (ccs_n by hour) for each of the 7 clusters.]

Appendix H Clusters obtained from using Ward's Method

[Box plots of the normalized time series (ccs_n by hour) for each of the 7 clusters.]

Appendix I Clusters obtained from using the Centroid Method
[Box plots of the normalized time series (ccs_n by hour) for each cluster.]

Appendix J Results of performing cluster analysis on raw data

The cluster analysis described in the results was performed using the principal components as inputs. However, we also tried a variation of the methodology described in section 3: the cluster analysis was performed on the normalized raw data (instead of on the principal components obtained from the principal component analysis). The clusters obtained were similar to the ones obtained from performing the analysis on the principal components. This comparison was done by comparing the box plots of the 7 clusters created by both analyses, which showed that the shapes of the clusters were comparable. The final results of both methods were also compared; most of the new alternate routes obtained from the two analyses were the same. However, even though the results seemed similar, we continued our analysis using the principal components, as these gave more interpretable results. For example, the observations could be plotted in a three-dimensional space and we could see visually whether there were any peculiarities about the data or about how the clusters were formed.

Appendix K Results

AZ_id  Tandem  H7 H8 H9 H10 H11 H12 H13 H14 H15 H16 H17 H18 H19 H20 H21 H22
01-02    16    Y  N  N  N   N   N   N   N   N   N   N   Y   N   Y   Y   Y
01-03     5    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-03     8    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-03    10    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-03    14    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   N   Y   Y
01-03    16    Y  N  N  N   N   N   N   N   N   N   N   Y   Y   Y   Y   N
01-03    20    Y  Y  N  N   N   N   N   N   N   N   N   N   N   Y   N   Y
01-03    23    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-03    24    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-03    25    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-03    26    Y  N  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-03    28    Y  N  N  N   N   N   N   N   N   N   Y   N   N   Y   Y   Y
01-03    30    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-03    31    Y  Y  N  N   N   N   N   N   N   Y   Y   Y   Y   Y   Y   Y
01-05     0    Y  Y  N  N   N   N   N   N   N   Y   Y   Y   Y   Y   Y   Y
01-07     0    Y  Y  N  N   N   N   N   N   N   Y   Y   Y   Y   Y   Y   Y
01-08     0    Y  Y  N  N   N   N   N   N   N   Y   Y   Y   Y   Y   Y   Y
01-09     0    Y  Y  N  N   N   N   N   N   N   Y   Y   Y   Y   Y   Y   Y
01-10     0    Y  Y  N  N   N   N   N   N   N   Y   Y   Y   Y   Y   Y   Y
01-10    16    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-10    28    Y  N  N  N   N   N   N   N   N   N   N   N   Y   Y   Y   Y
01-11    16    Y  N  N  N   N   N   N   N   N   N   N   Y   N   N   Y   Y
01-11    26    Y  N  N  N   N   N   N   N   N   N   Y   N   Y   Y   Y   Y
01-12     0    Y  Y  N  N   N   N   N   N   N   Y   Y   Y   Y   Y   Y   Y
01-13     5    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-13     7    Y  Y  N  N   N   N   N   N   N   N   Y   Y   N   Y   Y   N
01-13     8    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-13    10    Y  Y  N  N   N   N   N   N   N   N   N   Y   Y   Y   Y   Y
01-13    25    Y  N  N  N   N   N   N   N   N   N   Y   N   Y   N   Y   Y
01-13    27    N  N  N  N   N   N   N   N   N   N   N   N   N   N   N   Y
01-13    31    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-14     0    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y
01-15     0    Y  Y  N  N   N   N   N   N   N   Y   Y   Y   Y   Y   Y   Y
01-16     0    Y  Y  N  N   N   N   N   N   N   N   Y   Y   Y   Y   Y   Y

Appendix L SAS code

*****************************************************************;
* Filename: Thesis - Application of Multivariate Analysis to    ;
*           Time of Day Routing                                  ;
* Author:   Isabelle Smith                                       ;
*****************************************************************;
options pagesize = 28 ls = 100;
options pagesize = 32 ls = 120 nomprint;
options obs = max;

*******************************************************************************;
%let st_hr = 7;
%let end_hr = 22;

* The cutoff variable is used to determine the level at which an arc is busy or not;
* This value is arbitrarily chosen;
%let cutoff = 0.7;

*%let path1 = p:\telus\analysis\mva\analysis;
%let path1 = C:\WINDOWS\Desktop\courses\TELUS\analysis;
libname swork "&path1.\SASWORK";

*******************************************************************************;
* Macro definitions;
*******************************************************************************;
* Macro used to print;
%macro pr(n);
   proc print data = &n;
   run;
%mend pr;

* Macro used to append two nodes;
%macro PAIR_ID(VAR1=, VAR2=, CBN_VAR=);
   length &CBN_VAR. $5.;
   if &var1 lt 10 then aa = compress('0' || put(&var1, 2.));
   else aa = put(&var1, 2.);
   if &var2 lt 10 then zz = compress('0' || put(&var2, 2.));
   else zz = put(&var2, 2.);
   if &VAR1 < &VAR2 then do;
      &CBN_VAR. = compress(put(aa, 2.) || "-" || put(zz, 2.));
   end;
   else do;
      &CBN_VAR. = compress(put(zz, 2.) || "-" || put(aa, 2.));
   end;
%mend PAIR_ID;

*******************************************************************************;
* Read in the file containing AZ pair, Hour and CCS;
*******************************************************************************;
* The following file contains data for the 31st of July, 2000;
*filename fn1 'p:\telus\analysis\personal\isabelle\cs.txt';
filename fn1 'C:\WINDOWS\Desktop\courses\TELUS\analysis\saswork\cs.txt';

data swork.cs;
   infile fn1 delimiter = ',';
   input az_id $ hour1 CS;
   if az_id = 'az_id' then delete;
run;

data swork.data_01(drop = hour1 cs);
   format a z 2.;
   length a z 8.;
   set swork.cs;
   hour = hour1 - 24;  * the input data had the hours numbered from 24-47;
   CCS = CS / 100;     * the input data had call volume in call seconds;
   a = substr(left(az_id), 1, 2);
   z = substr(left(az_id), 4, 2);
   if &st_hr <= hour <= &end_hr;
run;

*******************************************************************************;
* Plot the time series of the arc utilization for each AZ pair;
*******************************************************************************;
/*
symbol1 i = join v = plus;
proc gplot data = swork.data_01;
   plot ccs * hour;
   by az_id;
run;
quit;
*/

*******************************************************************************;
* Normalize the CCS for each AZ pair to obtain CCS_n between 0 and 1;
*******************************************************************************;
* Find max and min ccs;
proc summary data = swork.data_01 nway missing;
   class az_id;
   var ccs;
   output out = data_01b max = max_ccs min = min_ccs;
run;

proc sort data = swork.data_01; by az_id; run;
proc sort data = data_01b; by az_id; run;

data data_01c;
   merge swork.data_01(in = have) data_01b(in = want);
   by az_id;
   if have;
run;

* remove the two outliers that were found: 01-04 and 11-27;
data data_01d;
   set data_01c;
   where az_id <> '01-04' and az_id <> '11-27';
run;

* Normalize CCS;
data norm_01;
   set data_01d;
   ccs_n = (ccs - min_ccs) / (max_ccs - min_ccs);
run;
*%pr(norm_01);

/*
* Plot the time series of the normalized arc utilization for each AZ pair;
proc gplot data = norm_01;
   plot ccs_n * hour;
   by az_id;
run;
quit;
*/

*******************************************************************************;
* Prepare data for Principal Components Analysis;
*******************************************************************************;
proc transpose data = norm_01 out = Tnorm_01 prefix = N;
   by az_id a z;
   var ccs_n;
   id hour;
run;

*******************************************************************************;
* Principal Components Analysis;
*******************************************************************************;
proc princomp data = Tnorm_01 out = swork.pca_out outstat = swork.pca_stat;
   var n&st_hr--n&end_hr;
run;

/* scatterplots of the principal components */
/*
symbol1 v = dot i = none h = 0.5;
proc gplot data = swork.pca_out;
   plot prin1 * prin2;
run;
quit;

symbol1 v = dot i = none h = 0.5;
proc gplot data = swork.pca_out;
   plot prin2 * prin3;
run;
quit;

symbol1 v = dot i = none h = 0.5;
proc gplot data = swork.pca_out;
   plot prin1 * prin3;
run;
quit;

proc factor data = swork.pca_out nfactors = 3 scree;
   var n&st_hr--n&end_hr;
run;

proc gchart data = swork.pca_out;
   vbar prin1;
run;
quit;

proc sort data = swork.pca_out; by az_id; run;
proc sort data = norm_01; by az_id; run;

data test5;
   merge swork.pca_out(in = want keep = az_id prin1 prin2 prin3) norm_01(in = have);
   by az_id;
   if have;
run;
%pr(test5);

data test6;
   set test5;
   *where prin1 < -4 and prin2 < -2 and prin3 > 0;
   where prin1 < -4 and prin2 < -2;
run;

symbol1 i = boxtj;
proc gplot data = test6;
   plot ccs_n * hour / overlay;
run;
quit;
*/

*******************************************************************************;
* Cluster Analysis - Using fastclus procedure;
*******************************************************************************;
proc fastclus maxclusters = 5 maxiter = 10 data = swork.pca_out out = swork.clus_01;
   var prin1 prin2 prin3;
run;

/*
symbol1 v = dot i = none h = 0.5;
proc gplot data = swork.clus_01;
   plot prin1 * prin2 = cluster;
run;
quit;

symbol1 v = dot i = none h = 0.5;
proc gplot data = swork.clus_01;
   plot prin2 * prin3 = cluster;
run;
quit;

symbol1 v = dot i = none h = 0.5;
proc gplot data = swork.clus_01;
   plot prin1 * prin3 = cluster;
run;
quit;
*/

*******************************************************************************;
* Print CCS vs hour for each cluster;
*******************************************************************************;
data data_pr1;
   set Tnorm_01(keep = az_id n&st_hr--n&end_hr);
run;

proc sort data = data_pr1; by az_id; run;
proc transpose data = data_pr1 out = data_pr2; by az_id; run;

data data_pr3;
   set data_pr2(rename = (col1 = ccs_n));
   if (substr(left(_name_), 2, 2) < 10) then hour = compress(0 || substr(left(_name_), 2, 1));
   else hour = substr(left(_name_), 2, 2);
run;

proc sort data = data_pr3; by az_id; run;
proc sort data = swork.clus_01; by az_id; run;

data clus_02;
   merge data_pr3(in = have) swork.clus_01(in = want keep = az_id cluster);
   by az_id;
   if have and want;
run;

proc sort data = clus_02; by cluster; run;

symbol1 i = boxtj;
proc gplot data = clus_02;
   by cluster;
   plot ccs_n * hour;
run;
quit;

*******************************************************************************;
* Cluster Analysis - Using cluster procedure;
*******************************************************************************;
%let numclus = 7;

proc cluster data = swork.pca_out method = ward pseudo ccc outtree = tree;
   var prin1 prin2 prin3;
   id az_id;
run;

/*
symbol1 v = plus i = none;
axis1 label = (angle = 90);
proc gplot data = tree;
   plot _ccc_ * _ncl_ / vaxis = axis1 haxis = 0 to 16 by 2;
run;
quit;
*/

proc tree data = tree out = treeout nclusters = &numclus;
   copy prin1 prin2 prin3;
   id az_id;
run;

*******************************************************************************;
* plot normalized time series by cluster;
*******************************************************************************;
proc sort data = data_pr3; by az_id; run;
proc sort data = treeout; by az_id; run;

data clus_03;
   merge data_pr3(in = want) treeout(in = have keep = az_id cluster);
   by az_id;
   if have;
run;

symbol1 i = boxtj;
proc sort data = clus_03; by cluster; run;
proc gplot data = clus_03;
   by cluster;
   plot ccs_n * hour;
run;
quit;

*******************************************************************************;
* find 2 adjacent complementary arcs;
*******************************************************************************;
/*
%let gr1 = 1;
%let gr2 = 2;
*/
%let outfile = outcl5;
*filename myfile "p:\telus\analysis\mva\analysis\saswork\&outfile";
filename myfile "C:\WINDOWS\Desktop\courses\TELUS\analysis\saswork\&outfile";

%macro makefile(gr1, gr2);

* Form two groups to be used to find adjacent pairs of arcs;
proc sort data = clus_03;
   by az_id cluster;
run;

proc summary data = clus_03 nway missing;
   by az_id cluster;
   output out = clus_04;
run;

data group1(drop = az_id);
   format a z $2.;
   set clus_04(keep = az_id cluster);
   a = substr(left(az_id), 1, 2);
   z = substr(left(az_id), 4, 2);
   az_id1 = az_id;
   where cluster = &gr1;
run;

data group2(drop = az_id cluster);
   format a z $2.;
   set clus_04(keep = az_id cluster);
   a = substr(left(az_id), 1, 2);
   z = substr(left(az_id), 4, 2);
   az_id2 = az_id;
   where cluster = &gr2;
run;

* Find arcs in group 2, for each AZ pair in group 1, that start with the same "a" node;
proc sort data = group1; by a; run;
proc sort data = group2; by a; run;
proc sql;
   create table start_a as
   select group1.a, group1.z, az_id1, az_id2a
   from group1, group2(keep = a az_id2 rename = (az_id2 = az_id2a))
   where group1.a = group2.a;

* Find arcs in group 2, for each AZ pair in group 1, that end with the same "a" node;
proc sort data = group1; by a; run;
proc sort data = group2; by z; run;
proc sql;
   create table end_a as
   select group1.a, group1.z, az_id1, az_id2a
   from group1, group2(keep = z az_id2 rename = (az_id2 = az_id2a z = a))
   where group1.a = group2.a;

* Append the table containing arcs that start with "a" with the table containing arcs that end with "a";
data match_a;
   set start_a end_a;
run;
proc sort data = match_a; by a z; run;

* Find arcs in group 2, for each AZ pair in match_a, that start with the same "z" node;
* Note: we only need to look at AZ pairs in match_a, not all of the AZ pairs in group1;
proc sort data = match_a; by z; run;
proc sort data = group2; by a; run;
proc sql;
   create table start_z as
   select match_a.a, match_a.z, az_id1, az_id2a, az_id2z
   from match_a, group2(keep = a az_id2 rename = (az_id2 = az_id2z a = z))
   where match_a.z = group2.z;

* Find arcs in group 2, for each AZ pair in match_a, that end with the same "z" node;
proc sort data = match_a; by z; run;
proc sort data = group2; by z; run;
proc sql;
   create table end_z as
   select match_a.a, match_a.z, az_id1, az_id2a, az_id2z
   from match_a, group2(keep = z az_id2 rename = (az_id2 = az_id2z))
   where match_a.z = group2.z;

* Append the table containing arcs that start with "z" with the table containing arcs that end with "z";
data match_z;
   set start_z end_z;
run;
proc sort data = match_z; by a z; run;

* Keep the observations for which the az_id2a and the az_id2z are adjacent;
data match(drop = M);
   set match_z;
   M = 0;
   if (substr(left(az_id2a), 1, 2) = substr(left(az_id2z), 1, 2) or
       substr(left(az_id2a), 1, 2) = substr(left(az_id2z), 4, 2) or
       substr(left(az_id2a), 4, 2) = substr(left(az_id2z), 1, 2) or
       substr(left(az_id2a), 4, 2) = substr(left(az_id2z), 4, 2)) then M = 1;
   if M = 1;
run;

*******************************************************************************;
* find if the available capacity on the adjacent arcs is sufficient;
*******************************************************************************;
data avail_01;
   set norm_01;
   avl_ccs = max_ccs - ccs;  * use capacity instead of max_ccs ...;
run;

* merge available capacity and max_ccs on AZ pair;
proc sort data = avail_01; by az_id hour;
proc sort data = match; by az_id1;
proc sql;
   create table match_01 as
   select hour, match.a, match.z, match.az_id, avl_ccs, max_ccs, az_id2a, az_id2z
   from match(rename = (az_id1 = az_id)), avail_01(keep = avl_ccs max_ccs az_id hour)
   where match.az_id = avail_01.az_id;

* merge available capacity and max_ccs on first arc of alternate route;
proc sort data = match_01; by az_id2a hour;
proc sql;
   create table match_02 as
   select match_01.hour, match_01.a, match_01.z, match_01.az_id, avl_ccs, max_ccs,
          match_01.az_id2a, avl_ccsa, max_ccsa, az_id2z
   from match_01,
        avail_01(keep = avl_ccs az_id hour max_ccs
                 rename = (max_ccs = max_ccsa avl_ccs = avl_ccsa az_id = az_id2a))
   where match_01.az_id2a = avail_01.az_id2a and match_01.hour = avail_01.hour;

* merge available capacity on second arc of alternate route;
proc sort data = match_02; by az_id2z hour;
proc sql;
   create table match_03 as
   select match_02.hour, match_02.a, match_02.z, match_02.az_id, avl_ccs, max_ccs,
          match_02.az_id2a, avl_ccsa, max_ccsa, match_02.az_id2z, avl_ccsz, max_ccsz
   from match_02,
        avail_01(keep = max_ccs avl_ccs az_id hour
                 rename = (max_ccs = max_ccsz avl_ccs = avl_ccsz az_id = az_id2z))
   where match_02.az_id2z = avail_01.az_id2z and match_02.hour = avail_01.hour;

* available capacity on the alternate route is equal to the minimum available capacity between;
* the first arc and the second arc: min(avl_ccsa, avl_ccsz);
* the High demand flag indicates "Y" when the AZ pair is using more than 70% of its maximum arc utilization of that day;
* the Sufficient capacity flag indicates "Y" when the two arcs forming the new route are each using less than 70% of their;
* maximum arc utilization of that day and the available capacity on that route is sufficient;
data match_04(drop = a z az_id2a az_id2z);
   set match_03;
   avl_ccsr = min(avl_ccsa, avl_ccsz);
   High_dem = "N";
   Suff_cap = "N";
   if ((substr(left(az_id2a), 1, 2) = substr(left(az_id2z), 1, 2)) or
       (substr(left(az_id2a), 1, 2) = substr(left(az_id2z), 4, 2)))
      then tandem = substr(left(az_id2a), 1, 2);
   if ((substr(left(az_id2a), 4, 2) = substr(left(az_id2z), 1, 2)) or
       (substr(left(az_id2a), 4, 2) = substr(left(az_id2z), 4, 2)))
      then tandem = substr(left(az_id2a), 4, 2);
   if avl_ccs < 0.3 * max_ccs then High_dem = "Y";
   if (avl_ccsa > max_ccsa * 0.3) and (avl_ccsz > max_ccsz * 0.3) and
      (avl_ccsr > max_ccs * 0.05) then Suff_cap = "Y";
   *if High_dem = "Y" and Suff_cap = "Y";
run;

proc sort data = match_04; by az_id tandem;
proc transpose data = match_04 out = swork.list&gr1&gr2 prefix = H;
   by az_id tandem;
   var Suff_cap;
   id hour;
run;

data _null_;
   set swork.list&gr1&gr2;
   file myfile mod;
   tmp1 = compress(AZ_id || "," || tandem || ",");
   put (tmp1) ($9.) (H&st_hr--H&end_hr.) ($3. ',');
run;

%mend makefile;

%macro mkfiles;

%macro header;
   %do i = &st_hr %to &end_hr;
      %if &i = &st_hr %then %do; "H&i" %end;
      %else %do; ",H&i" %end;
   %end;
%mend;

data _null_;
   file myfile;
   put 'AZ_id,tandem,' %header;
run;

%do i = 1 %to &numclus;
   %do j = 1 %to &numclus;
      %if (&i ne &j) %then %do;
         %makefile(&i, &j);
      %end;
   %end;
%end;

%mend mkfiles;
%mkfiles;

*******************************************************************************;
* Determine how many hours to keep;
*******************************************************************************;
/*
* This section was used to determine during which periods the AZ pairs were;
* using more than a certain percentage (cutoff point) of their capacity;
proc summary data = swork.data_01 nway missing;
   class az_id;
   var ccs;
   output out = test1 max = max_ccs min = min_ccs;
run;

proc sort data = swork.data_01; by az_id; run;
proc sort data = test1; by az_id; run;

data test2;
   merge swork.data_01(in = have) test1(in = want keep = az_id min_ccs max_ccs);
   by az_id;
   if have;
run;

data test3;
   set test2;
   flag = 1;
   if ccs < (&cutoff * max_ccs) then flag = 0;
   if flag = 1;
run;

proc print data = test3;
   where hour < 8;
run;

* Findings: We can find that generally, AZ pairs are busiest between 8:00AM and 23:00PM;
*/
