... where such an $s$ exists, and $c_{r,j} = 0$ otherwise. There will be $c_{r,j}$ nodes in $R$ which have increased their level by $j + 1$, and the remaining $r - c_{r,j}$ will have increased their level by $j$. Furthermore, the nodes in the region formed after the $j$-th r-step will have weights, in order,

$(w_{c_{r,j}+1} + \cdots + w_{c_{r,j}+2^j}),\ (w_{c_{r,j}+2^j+1} + \cdots + w_{c_{r,j}+2^{j+1}}),\ \ldots,\ (w_{c_{r,j}+(m-1)2^j+1} + \cdots + w_{c_{r,j}+m2^j}),\ (w_1 + \cdots + w_{c_{r,j}} + w_{c_{r,j}+m2^j+1} + \cdots + w_r)$,

where $m = \lfloor r/2^j \rfloor - 1$.

Proof: The proof of this theorem mirrors the one given for Theorem 3.1. We know that in any r-step where the number of nodes present is odd, the smallest two nodes will combine to form a new node, and this new node will then combine with the largest node present, as was pictured in the last figure. We also know that this new node will be the sole largest node present in the next region. It was this observation which showed that any node would have its level increased by 2 during at most one r-step; after this, it would be a descendant of a largest node, whose levels only go up by one every r-step.

[Figure: nodes $v_1, v_2, \ldots, v_{19}$]

Chapter 4. Speeding up Region-based Methods

The formal proof is by induction on $s$. The base case of $s = 1$ is simple; $r$ is a power of two, and so there are never any leftover nodes to combine after any r-step. Assume that the theorem is true for all $s$ up to $m - 1$. We need to show it works for $s = m$. Consider $a_1$.

If $a_1 = 0$, we know that there are an odd number of nodes present. Hence the two smallest nodes present will have their levels increased by 2 in the first r-step. The number of nodes which have their levels increased by two will thus be 2 plus twice the number of nodes in the next level which have their levels increased by two. This is because every node at the next level represents two in the original level. In the case $m > 2$, by induction this number is $2 + 2(2(2^{a_2-1} + 2^{a_2-1+a_3} + 2^{a_2-1+a_3+a_4} + \cdots$
${}+ 2^{a_2-1+a_3+\cdots+a_s}))$ $= 2(2^0 + 2^{a_2} + 2^{a_2+a_3} + 2^{a_2+a_3+a_4} + \cdots + 2^{a_2+a_3+\cdots+a_s}) = c_{r,j}$.

If $m = 2$, note that $s = 1$ (otherwise there are not enough nodes to combine) and so the number of nodes which increase their level by 2 is $2 + 0 = 2(2^{a_1}) = c_{r,j}$.

If $a_1 > 0$ then there are an even number of nodes present, and so in the next r-step every node will have its level increased by exactly one. Thus the number of nodes which have their levels increased by two will be twice the number of nodes at the next level that do. Thus, by induction,

$2(2(2^{a_1-2} + 2^{a_1-2+a_2} + \cdots + 2^{a_1-2+a_2+\cdots+a_s})) = 2(2^{a_1-1} + 2^{a_1-1+a_2} + \cdots + 2^{a_1-1+a_2+\cdots+a_s}) = c_{r,j}$

nodes have their levels increased by two.

It remains to show how the nodes are partitioned. We will use induction on the number of r-steps to do this. The statement about partitioning is obvious for one r-step, regardless of whether $r$ is even or odd. Assuming it is true for $j - 1$ r-steps, we shall prove it for $j$. Let the nodes in order be $v_1, \ldots, v_r$. If $a_1 = 0$, then there are an odd number of nodes present. After one r-step the nodes in order are $v_{3,4}, v_{5,6}, \ldots, v_{r-2,r-1}, v_{(1,2),r}$. By induction these nodes get partitioned as pictured in Figure 4.10.

[Figure 4.10: Partition according to inductive hypothesis]

Observe that $c_{r,j} = 2 + 2c_{\lfloor r/2\rfloor, j-1}$. From the diagram and the inductive hypothesis, we see that the first $c_{r,j}$ nodes and the $r - (c_{r,j} + m2^j)$ largest nodes in $R$ share a common ancestor in the region formed after $j$ r-steps. It is also clear that the $m2^j$ nodes in between are partitioned in sections of size $2^j$. If $a_1 > 0$, then $r$ is even and it is easy to see that $c_{r,j} = 2c_{r/2, j-1}$ and the partition of the nodes in $R$ is as stated. □

Having shown how to predict the nodes formed after $j$ r-steps, we now need to find out when it is economical to do so.

Theorem 4.2 Let the region currently being processed be $R$, and assume it contains $r$
nodes. Let the gap height between $R$ and its lowest adjacent neighbor be $h$. We can determine the region $S$ formed after $h - 1$ r-steps in $O(r\lg(\frac{r}{2^h}))$ time.

Proof: There are approximately $\frac{r}{2^h}$ nearly evenly spaced values to find in $R$. Using the linear time selection algorithm we can find the middle one in $O(r)$ time. Split the list on this value into two approximately equal halves and again find the middle value in each; this will take $O(\frac{r}{2}) + O(\frac{r}{2}) = O(r)$ time. After $O(\lg(\frac{r}{2^h}))$ steps, we will know the partitioning of $R$ according to ancestors in the new region $S$, at a cost of $O(r\lg(\frac{r}{2^h}))$ time, and so we can accurately form the nodes in $S$ by just summing the weights in each node's partition of $R$. □

Now use the conventional means of combining lmcps in $S$ until all nodes have entered the nearest adjacent region and we are finished. This only costs an additional $O(\frac{r}{2^h}\lg(\frac{r}{2^h}))$ steps, including the time necessary to sort the list of entering nodes, so the entire running time is still $O(r\lg(\frac{r}{2^h}))$.

4.2 The Constant Factor Case

We now outline the linear time algorithm for weights within a constant factor; that is, for the case where there is a constant $C$ such that $\max_i\{w_i\}/\min_i\{w_i\} \leq C$. We make use of a new linear time generalized selection algorithm discovered by Maria Klawe. Because of the complexity of the result and its recent discovery we only provide an outline of the method, and refer the reader to a forthcoming (as of August, 1992) paper of the author and Maria Klawe.

As before, our objective is to determine the levels of the leaf nodes in the lmcp tree. The underlying idea is an extension of the techniques in the case where the input weights are within a factor of two. Specifically, we use a region-based method to process the weights region by region in increasing order by category number until we are left with a single region of crossable nodes.
At this point we apply Theorem 3.1 to determine the lmcp tree levels of the nodes in this final region, and then work backwards from this information to find the lmcp tree levels of the original weights.

There are a number of fundamental problems that arise in implementing this strategy. The first is that given a region of $r$ nodes that may be either crossable or noncrossable, finding the lmcps that will be formed out of the region can take $\Omega(r\lg r)$ time, since we may be forced to sort the crossable nodes. Thus we cannot afford to actually determine which nodes pair together in lmcps, nor the weights of the lmcps formed. Instead we only obtain coarser information about the structure of the lmcp tree. In particular, we will determine groups of nodes that pair together in the sense that the lmcp partner of each node in the group is also in the group. As this procedure is iterated, we must be able to "process" regions where we only have a coarse description of the region. Specifically, the description of a region is an ordered list of node-groups, where each noncrossable node is a singleton node-group, and each crossable node-group, $G$, is described by a node-group list representing the children of the nodes in $G$. For each node-group we will know the number of nodes in the group, and the largest and two smallest weights of the nodes in the node-group.

Working with node-group lists introduces a number of technical challenges. For example, given a node-group list for a region, $R$, we must construct a node-group list for $L$, the set of lmcps containing nodes in $R$. If the number $r$ of nodes in $R$ is even, because the regions adjacent to $R$ have higher category numbers, it is easy to show that the lmcp partner of each node of $R$ is in $R$. This makes constructing the node-group list for $L$ fairly trivial (i.e. it consists of a single node-group), except for the problem of determining the largest and smallest weights of the nodes in $L$.
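To make the coarse description concrete, the following is a minimal Python sketch of one possible node-group representation and of the easy even-$r$ construction just described. The class and field names are hypothetical illustrations, not the thesis's notation, and the extremal weights of $L$ are deliberately left unfilled, since computing them is exactly the hard part the text goes on to discuss.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NodeGroup:
    # Coarse description of a set of nodes whose lmcp partners lie within
    # the set: only the statistics the algorithm needs are kept, not the
    # individual nodes.
    size: int                             # number of nodes in the group
    largest: Optional[float] = None       # largest node weight, if known
    smallest: Optional[float] = None      # smallest node weight, if known
    second_smallest: Optional[float] = None
    children: List["NodeGroup"] = field(default_factory=list)

def lmcp_groups_even(region: List[NodeGroup]) -> List[NodeGroup]:
    # Even case: every node of R finds its lmcp partner inside R, so L is
    # described by a single node-group whose children are the groups of R.
    # The largest and two smallest weights of L are left as None here;
    # obtaining them requires the separate approximation argument described
    # in the text.
    r = sum(g.size for g in region)
    assert r % 2 == 0, "this easy construction applies only when r is even"
    return [NodeGroup(size=r // 2, children=list(region))]
```

When $r$ is odd this construction does not apply: one node (the wallflower) pairs outside $R$, which is why the largest and two smallest weights must be tracked per group at all.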
The reason we need to know the largest and smallest weights of the nodes in each node-group arises from handling the case when $r$ is odd. In this case, $R$ has exactly one node, a wallflower, whose lmcp partner is not a node in $R$. We need to know the largest weight in each node-group in order to find the wallflower, and we will need to know the two smallest weights in order to know, at the next stage, which node will be the lmcp partner of the wallflower. Our general strategy for efficiently computing the largest weight among the lmcps that will be produced from a region is to construct (in linear time) a list of weighted nodes for which we can obtain all the lmcps in linear time, and to prove that the largest weight lmcp from this list is the same as that from the region. We use a similar method to find the two smallest weights. The choice of these approximate lists is fairly natural, but proving that they possess the appropriate extremal lmcps requires establishing a number of tools to deal with how modifying work lists affects the extremal lmcps.

Another problem that arises is in applying Theorem 3.1 to the final region of crossable nodes. In the case where the original weights differ by at most a factor of two, we know the actual weight of each crossable node in the region, and hence can apply the usual linear-time selection algorithm to find the $k$ smallest weight nodes for the appropriate value of $k$. In the constant factor case, however, we only know the largest and two smallest weights in each node-group, which makes selecting the $k$ smallest weight nodes much more difficult. What we need is a linear-time technique for producing a node-group list that represents the $k$ smallest weight nodes, and a node-group list representing the remaining nodes in the region. From these we can work backwards to deduce the lmcp tree levels of each of the original weights.

The linear-time algorithm for selection in node-groups consists of two pieces.
The first piece is a generalized selection algorithm which, given a collection of sets and a "fast" selection algorithm for each set, does "fast" selection on the union of the sets. Here the term "fast" selection algorithm for a set $A$ means that for any $\alpha$ between 0 and 1, in time linear in the number of nodes in $A$ we can partition $A$ into $A_1 \cup A_2$ according to the following conditions. First, we have "fast" selection algorithms for both $A_1$ and $A_2$; second, every element in $A_1$ is less than or equal to every element in $A_2$; and third, $|A_1| = \lceil \alpha|A| \rceil$. The other piece of the node-group selection algorithm is a technique that does the following task: given "fast" selection algorithms for each node-group in a node-group list for a region $R$ and an integer $k$, produce a node-group list that represents the $k$ smallest weight nodes in $L$, the set of lmcps constructed from $R$, as well as a node-group list representing the remaining nodes in $L$. By iteratively interleaving these two pieces we eventually obtain an algorithm which performs the desired selection in the final region in time linear in the number of weights in the original list (though, as might be expected, the constant depends on $C$, the bound on the ratio between the original weights).

4.3 The Exponentially Separated Case

In this section we show that the basic region processing method yields an $O(n)$ time algorithm for the case where the input weights are exponentially spread out. Formally, we say that the input weight sequence $w_1, \ldots, w_n$ is exponentially separated if there exists a constant $C$ such that for all $n$,

$|\{i : \lfloor \lg w_i \rfloor = k\}| \leq C$ for all $k \in \mathbb{Z}$.

Theorem 4.3 If the input weights $w_1, \ldots, w_n$ are exponentially separated then we can build an optimal alphabetic binary tree on them in $O(n)$ time.

Proof: It is straightforward to observe that there are at most $2C$ nodes in any intermediate region we process. Since this is the case, we can easily process every region of size $r$ in $O(r)$ time.
We can therefore afford to use the basic region processing method to build the entire lmcp tree in $O(n)$ time. In an additional $O(n)$ time we can build the optimal alphabetic binary tree from the leaf-levels in the lmcp tree. □

Chapter 5

Hardness Results

We begin with a simple hardness result that shows constructing the intermediate lmcp tree produced by Hu-Tucker based algorithms in any model of computation is at least as difficult as sorting in that model. We also give a more complicated reduction from sorting to constructing the optimal alphabetic tree by means of a region-based method. At this point we are uncertain what model of computation is realistic to use for this problem, so we do not state explicit lower bounds for a given model, but rather show that the problems stated are at least as hard as sorting in whatever model is used. Given an unsorted list of $n$ numbers, we show that in $O(n)$ time we can transform the problem of sorting the list into a problem about constructing a tree, and, from the information recorded in the structure of the tree produced, compute the sorted order of the numbers in $O(n)$ time. Hence, if an $\omega(n)$ time lower bound exists for sorting in the model, then we have a non-trivial lower bound on constructing the tree.

One easy way of getting a trivial lower bound is counting the number of possible outputs the algorithm must generate. In any binary decision tree model in which outputs appear at distinct leaves, the algorithm must be able to differentiate each of the different output cases, and so any decision tree must have maximum depth at least $\lg s$, where $s$ is the number of different solutions to the problem. A trivial $\Omega(n)$ time lower bound follows easily by showing that the number of possible alphabetic trees built on a sequence of $n$ weights is $2^{\Omega(n)}$. To see this, let $c_n$ be the number of alphabetic binary trees with $n$ leaves.
As the root of the tree can split the input list in any of $n - 1$ places, it follows that $c_n = c_1c_{n-1} + c_2c_{n-2} + \cdots + c_{n-1}c_1$ for $n > 1$, with $c_1 = 1$. The solution of this recurrence is given by the Catalan sequence, $\{C_n\}$ [9], with $c_n = C_{n-1} = \frac{1}{n}\binom{2n-2}{n-1}$.

5.1 Finding the lmcp tree is as Hard as Sorting

Lemma 5.1 Let $x_1, x_2, \ldots, x_n$ be distinct real numbers drawn from $[2,4)$. Let $y_i = x_{\lceil i/2 \rceil}$, for $i = 1, \ldots, 2n$. If $(y_1, \ldots, y_{2n})$ is given as input to any lmcp finding algorithm, the set of the first $n$ lmcps found, disregarding order, will be $\{(y_1, y_2), (y_3, y_4), \ldots, (y_{2n-1}, y_{2n})\}$.

Proof: First note that 1