"Science, Faculty of"@en . "Mathematics, Department of"@en . "DSpace"@en . "UBCV"@en . "Lastman, Gary Joseph"@en . "2011-11-21T18:58:43Z"@en . "1963"@en . "Master of Arts - MA"@en . "University of British Columbia"@en . "In this thesis we investigate arithmetic subroutines and round-off procedures. An error analysis of single operations in normalized floating point arithmetic leads us to the construction of an improved form of addition-subtraction subroutine. In addition the properties of several types of round-off procedures are examined (adding \u00C2\u00BD; adding random digits; dropping digits).\r\nThe experimental work (using the above mentioned subroutines) with the system x\u00E2\u0080\u00A2 = y, y\u00E2\u0080\u00A2 = -x shows that the systematic accumulation of round-off error observed by Huskey is due to the type of rounding-off procedure used. Furthermore, Hartree's explanation of this effect is found to be inadequate because carrying extra digits throughout the calculations does not eliminate the systematic round-off."@en . "https://circle.library.ubc.ca/rest/handle/2429/39178?expand=metadata"@en . "ROUNDING ERRORS IN DIGITAL COMPUTER ARITHMETIC SUBROUTINES t y GARY JOSEPH LASTMAN B.A.Sc., The U n i v e r s i t y of B r i t i s h Columbia, 1961 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS i n the Department of MATHEMATICS We accept t h i s , t h e s i s as conforming to the r e q u i r e d standard THE UNIVERSITY .OF BRITISH COLUMBIA March,.1963 In presenting t h i s t h e s i s i n p a r t i a l f u l f i l m e n t o f the requirements f o r an advanced degree a t the U n i v e r s i t y of B r i t i s h Columbia, I agree t h a t the L i b r a r y s h a l l make i t f r e e l y a v a i l a b l e f o r reference and study. I f u r t h e r agree that permission f o r e xtensive copying of t h i s t h e s i s f o r s c h o l a r l y purposes may be granted by the Head o f my Department or by h i s r e p r e s e n t a t i v e s . I t i s understood t h a t copying or p u b l i c a t i o n of t h i s t h e s i s f o r f i n a n c i a l g a i n s h a l l not be allowed without my w r i t t e n permission. Department of M ATH EN\A TIC S The U n i v e r s i t y of B r i t i s h Columbia, Vancouver 8, Canada. Date 8 APRIL IU3 i A b s t r a c t In t h i s t h e s i s we i n v e s t i g a t e a r i t h m e t i c subroutines and round-off procedures.. An e r r o r a n a l y s i s of s i n g l e operations i n normalized f l o a t i n g p o i n t a r i t h m e t i c leads us to the c o n s t r u c t i o n of an improved form of a d d i t i o n -s u b t r a c t i o n subroutine. I n . a d d i t i o n , ; t h e p r o p e r t i e s of s e v e r a l types of round-off procedures are examined (adding \; adding random d i g i t s ; dropping d i g i t s ) . The experimental work (using the above mentioned subroutines) w i t h the system k = y, y = -x shows th a t the systematic accumulation of round-off e r r o r observed by Huskey i s due t o the type of rounding-off procedure used. Furthermore, Hartree's explanation of t h i s \u00E2\u0080\u00A2 e f f e c t . i s found to be inadequate because c a r r y i n g e x t r a d i g i t s throughout the c a l c u l a t i o n s does not e l i m i n a t e the systematic round-off. i i i Acknowledgement I wish to thank Dr. T. E. H u l l and Dr. C h a r l o t t e Froese f o r t h e i r h e l p f u l suggestions i n the p r e p a r a t i o n of t h i s manuscript. The f i n a n c i a l a s s i s t a n c e of the Defence Research Board of Canada (grant number DRB-95^0-03) i s g r a t e f u l l y acknowledged. 1 1 Table of Contents Page Chapter I I n t r o d u c t i o n , ]. Chapter I I Computing Schemes and Round-off Procedures 2 Chapter I I I E r r o r .Analysis of Normalized F l o a t i n g P o i n t A r i t h m e t i c 6 Chapter IV A S p e c i a l Subroutine and Round-off Procedure il+ Chapter V 20 Table I D e s c r i p t i o n s of P a r t i c u l a r A r i t h m e t i c Subroutines 21 Table I I Execution Times of the Subroutines 22 Table I I I Experimental R e s u l t s 23 Figure I 2k Figure I I 2k Chapter VI Conclusions 28 B i b l i o g r a p h y 29 '. - Chapter . I . . For the numerical s o l u t i o n of a mathematical problem on a d i g i t a l computer we should provide some s o r t of e r r o r a n a l y s i s to account f o r round-off e r r o r s , which are u s u a l l y present. The propagated round-off e r r o r may be given i n terms of a maximum upper bound or an expected value and v a r i a n c e : the s t a t i s t i c a l estimate i s much more s a t i s f a c t o r y . H e n r i c i \}\~\ s t r o n g l y advocates the usage of s t a t i s t i c a l methods, but Huskey cautions against t h e i r a p p l i c a t i o n on the b a s i s of some experimental r e s u l t s , > i t i s our i n t e n t i o n t o t r y to e x p l a i n Huskey's r e s u l t s I b y examining v a r i o u s types of a r i t h m e t i c subroutines and rounding-off procedures; i n a d d i t i o n , . we w i l l show tha t Hartree's e x p l a n a t i o n [j3 \u00C2\u00B0^ Huskey's r e s u l t s i s not s u f f i c i e n t to e x p l a i n the observed e f f e c t . For t h i s purpose we f i r s t d i s c u s s computing schemes and rounding-off procedures 'in Chapter I I . In Chapter I I I we give an e r r o r a n a l y s i s of s i n g l e operations i n normalized f l o a t i n g p o i n t a r i t h m e t i c . A s p e c i a l form of subroutine and a s p e c i a l rounding-off procedure are described i n Chapter IV, while i n Chapter V we present some experimental r e s u l t s . Chapter I I In g e n e r a l , e r r o r s are introduced when any mathematical problem i s solved by numerical computation on a d i g i t a l computer. We wish to determine the e f f e c t of round-off e r r o r s on the numerical s o l u t i o n to a problem, so henceforth we s h a l l be concerned w i t h the a r i t h m e t i c operations of a d d i t i o n , s u b t r a c t i o n , m u l t i p l i c a t i o n and d i v i s i o n . \u00E2\u0080\u00A2Since these operations are dependent upon the manner i n which the computations are performed, we f i r s t d i s c u s s v a r i o u s computing schemes. Numbers i n a d i g i t a l computer can be represented as an ordered p a i r (m,e) where m i s the mantissa of word le n g t h H d i g i t s . a n d e i s the exponent; -both are expressed i n the number base b, of the machine. A machine number (m, e) i s a c t u a l l y m.be. I f the exponent e i s a predetermined constant f o r a l l c a l c u l a t i o n s we have the system known.as f i x e d p o i n t a r i t h m e t i c . I f we a l l o w e t o vary between f i x e d l i m i t s , say -a ^ e i a , then we have f l o a t i n g p o i n t a r i t h m e t i c . With f i x e d p o i n t a r i t h m e t i c we u s u a l l y deal w i t h & - d i g i t numbers of the form (m,.o), w i t h \u00E2\u0080\u00A2?! < m 1 . This system i s very i n f l e x i b l e since one must r e s c a l e numbers so t h a t the r e s u l t s of numerical computations w i l l l i e w i t h i n the open i n t e r v a l ( - l , l ) . A l s o , the programmer must keep t r a c k of the decimal p o i n t throughout the c a l c u l a t i o n s . But, by u s i n g a f l o a t i n g p o i n t a r i t h m e t i c the programmer i s not r e q u i r e d to l o c a t e the decimal p o i n t at each step i n a program, or to r e s c a l e . Because of these advantages over f i x e d p o i n t , f l o a t i n g p o i n t a r i t h m e t i c i s now used i n almost a l l computers. In c o n s t r u c t i n g a f l o a t i n g p o i n t a r i t h m e t i c we become aware of two f a c t s : . ( l ) , t h a t although we can perform c a l c u l a t i o n s using numbers whose magnitudes vary over a c o n s i d e r a b l e range, the maximum number of d i g i t s we can r e t a i n at any stage i s the word le n g t h of m, the mantissa of our f l o a t i n g 3-p o i n t number; (2)^ the r e p r e s e n t a t i o n f o r a number x, (m,e), i s not unique(*)-For uniqueness we must t u r n t o a n o r m a l i z a t i o n convention such as b | m | < 1. Even so, the r e p r e s e n t a t i o n f o r zero i s not unique; hut we may define zero as (o, -a),' where -a i s the lower bound f o r e . By r e s t r i c t i n g |m| to the range b\"^ ^ |m| \u00C2\u00AB\u00C2\u00A3. 1 we r e t a i n a maximum number of s i g n i f i c a n t d i g i t s at each computation step. However, now we are u n c e r t a i n as to the number of t r u l y s i g n i f i c a n t d i g i t s r e t a i n e d i n a long s e r i e s of c a l c u l a t i o n s . This l o s s of in f o r m a t i o n can be p a r t i a l l y e l i m i n a t e d by the use of an unnormalized f l o a t i n g p o i n t or s i g n i f i c a n t d i g i t a r i t h m e t i c , \u00E2\u0080\u00A2 But now we must f o r f e i t the obvious procedural.conveniences of normalized f l o a t i n g p o i n t a r i t h m e t i c . Because of t h i s the l a t t e r system i s more commonly used i n computers. In summary: f l o a t i n g p o i n t a r i t h m e t i c i s extremely, u s e f u l f o r comput-a t i o n s where the magnitudes of q u a n t i t i e s may vary widely. With normalized f l o a t i n g p o i n t i t i s p o s s i b l e to represent numbers i n a computer by s t o r i n g i n memory only the f i r s t JL s i g n i f i c a n t d i g i t s and an exponent. The f l o a t i n g p o i n t system has two obvious disadvantages: \u00E2\u0080\u00A2 ( l ) , more complicated computer hardware i s needed to handle both the mantissa and the exponent, and ( 2 ) , 'fewer s i g n i f i c a n t d i g i t s can be c a r r i e d because a p o r t i o n of a word must be used f o r the exponent. In s p i t e of t h i s , f l o a t i n g p o i n t i s an e f f e c t i v e and e f f i c i e n t method of re p r e s e n t i n g numbers w i t h i n a computer. Wow tha t we have some p r a c t i c a l a r i t h m e t i c s t h a t can be harnessed to do (*) This i s immediately evident since we cou l d have mb e= m*be* w i t h m f , m* arid e / e*. h. our numerical work, we are i n a p o s i t i o n to give a q u a l i t a t i v e d e s c r i p t i o n of rounding e r r o r s . The process of rounding-off i s t h a t procedure by which low-order d i g i t s are e l i m i n a t e d i n a number. Consequently round-off e r r o r i s the d i f f e r e n c e \u00E2\u0080\u00A2 and the corresponding i n f i n i t e p r e c i s i o n number. Bearing i n mind the computer and i t s a r i t h m e t i c u n i t we c l a s s i f y as round-off e r r o r s those e r r o r s induced by s p e c i a l r e p r e s e n t a t i o n s of numbers and a l s o those e r r o r s r e s u l t i n g from a r i t h m e t i c operations. Both are caused by rounding-off. S e v e r a l schemes have been proposed to implement the round-off process. In each a number x i s changed to x, i t s rounded form. Let R denote the rounding-off operator (R: X > x) : i f x > o we round x i t s e l f ; i f x < o we round the absolute value of x, |x| ; i f x = o we do not round at a l l . Let S be the e r r o r i n the l e a s t s i g n i f i c a n t d i g i t r e t a i n e d i n |x| , a f t e r rounding-off; 6\" g i v e s a measure of the accuracy, of a rounding procedure. For every d e f i n i t i o n of R we o b t a i n a d i f f e r e n t rounding scheme : ( i ) ,Add b/2 to |x| at the p o s i t i o n which i s to be dropped.(*) A c a r r y ( i i ) Make the lowest-order r e t a i n e d d i g i t of |x| equal t o b/2 r e g a r d l e s s of the c o r r e c t value of t h a t d i g i t and the d i g i t s i n lower orders. ( i i i ) Add a \"0\" o r , n l \" randomly i n t o the lowest-order r e t a i n e d d i g i t regard-between a f i n i t e p r e c i s i o n number ( a f t e r the low-order d i g i t s are e l i m i n a t e d ) w i l l r e s u l t i f t h i s d i g i t i s b/2 or greater. l e s s of the values of the d i g i t s of |x| D i s c a r d low-order d i g i t s . The e r r o r i s (*) b i s the number base of the machine. ' 5 -Every one of the preceding schemes has i t s own m e r i t s : the f i n a l s e l e c t i o n of a p a r t i c u l a r rounding procedure i s a compromise between the d e s i r e d accuracy f o r a r e s u l t and the computer time r e q u i r e d t o achieve t h i s accuracy. For example, procedure ( i v ) i s the l e a s t time-consuming but g i v e s a l a r g e b i a s e d e r r o r , w h i le ( i ) has the minimum e r r o r but takes longer. 6. Chapter I I I Regardless of the numerical scheme that we choose to perform our computations, the rounding-off process may introduce errors. For.a quantitative description of these errors we now give an error analysis of single operations i n normalized f l o a t i n g point arithmetic. The error analysis w i l l t e l l us the accuracy, of the indi v i d u a l arithmetic operations. In the discussion, r e a l arithmetic operations w i l l be denoted by t h e i r usual symbols +, -,.x or . , * ; the corresponding machine operations w i l l t>e (i) > \u00C2\u00A9> @> G) > also,.x,y,z s h a l l be machine numbers. The machine base w i l l be b and the mantissas w i l l have JL d i g i t s . Zero w i l l be of the form (0, -a) where -a i s the lower l i m i t on the range of values of the exponents. Errors 6j_ > \u00C2\u00A3jL w i l l be errors i n the least s i g n i f i c a n t d i g i t of the resulting Ji - d i g i t mantissa. In the following error analyses we s h a l l assume that neither exponential underflow nor exponential overflow occur. The area of computer memory i n which a numerical r e s u l t . i s formed i s known as an accumulator. In the error analyses two different types of accumulators are used: the single-length - d i g i t accumulators, and the \"e x t r a - d i g i t \" accumulators. In the usual JL - d i g i t accumulator the \"decimal\" point i s considered to be immediately to the l e f t of the f i r s t (most s i g n i f i c a n t ) d i g i t . An {Ji +l) - d i g i t accumulator i s formed from t h i s by providing an extra d i g i t position to the l e f t of the \"decimal\" point. This form of extra d i g i t accumulator i s used f o r addition, subtraction and d i v i s i o n . . A second form of the extra-digit accumulator i s the double-length one used i n mu l t i p l i c a t i o n ; . m u l t i p l i c a t i o n of two JL - d i g i t numbers gives a product of 2JL d i g i t s . We form t h i s double-length accumulator by using two single length accumulators. A. A d d i t i o n - S u b t r a c t i o n X \u00C2\u00A9 Y = ( m , , e , ) 0 ( m 2 , e 2 ) = ( m 3 , e 3 ) I f x = -y.then x(+)y = (o, - a ) ; suppose e 2 > e-^ ^ e 2 - e l +^> then x0y - (x+y) =-x. For j e 2 \" e i | ^ X -1 r 1. ( X + l ) - d i g i t accumulator (a) We must s h i f t m-j_ to /.align the decimal p o i n t s : we form a new mantissa V, of JL d i g i t s . V - m , b + 6, b (V has j e ^ - e 2 | s i g n i f i c a n t zeros) (b) Add mg and V : m 2 + V (c) Form the H-digit mantissa o f the r e s u l t , ( i ) No c a r r y on the a d d i t i o n of m 2 and V m 3 = {m2+v)b f = o ; i , 2 , . where j i s the number of s i g n i f i c a n t zeros before n o r m a l i z a t i o n . 8. Thus X \u00C2\u00A9 Y = b (m, + v)b m X \u00C2\u00A9 Y - ( X + Y ) ^ 6Ab D Notice that we do not have an e r r o r bound i n terras of the exponent of the f i n a l r e s u l t . I f we wish to use we have that -JL 6 + -j-x\u00C2\u00A9 Y - ( X + Y)| < \u00C2\u00A3Ab b3 But we do not know . A l l we can say i s th a t ^ - JL ~ I , and th e r e f o r e X \u00C2\u00A9 Y -(X+Y) 6 \u00C2\u00A3 A b e 3-However, f o r almost a l l a d d i t i o n s and s u b t r a c t i o n s , t h i s bound g r o s s l y over-estimates the e r r o r . We conclude t h a t the only r e a l i s t i c bound i s -I e X \u00C2\u00A9 Y - ( X + Y ) | * \u00C2\u00A3AB bZ . I f we had performed the e r r o r a n a l y s i s u s i n g 6, > Q^. we would have obtained an e r r o r bound X \u00C2\u00A9 Y - ( X + Y ) | i \u00C2\u00A3A\u00C2\u00A3 b' . . Therefore, the most general e r r o r bound i n t h i s case i s | X \u00C2\u00A9 Y - ( X + Y ) \u00C2\u00A3 \u00C2\u00A3A b b ( i i ) Carry on the a d d i t i o n of m 2 and V Since (raz + V ) has JL +1 d i g i t s we must s h i f t i t one d i g i t to the r i g h t . m3 = 15' (m2 + v) + \u00C2\u00A32.t 9-X 0 Y = x + Y + bZ b 3 [e, + b e 2 X \u00C2\u00A9 Y - ( X + Y ) | ^ CA^V3 ' [l + b] Here the e r r o r i s given e x p l i c i t l y i n terms of the exponent of the f i n a l r e s u l t . 2. X - d i g i t accumulator. ( i ) Without c a r r y , the r e s u l t s are the same as f o r the A+1 d i g i t accumulator. ( i i ) Carry Now we must s h i f t both m-^ and nLp (assume e]_) (a) S h i f t m2 and form Vg IT1 CZ v 2 = nn2D +\u00E2\u0080\u00A2 e , b (b) S h i f t mj and -.-Align decimal p o i n t s ,-' 1e l-e 1 .JL v, = m , b b + e 2 b (c) Add V-L and V\"2 ; both have only & d i g i t s . m 3 = v, + v 2 e 3 = e a + 1 X \u00C2\u00A9 Y = x + Y + b ^ t D 3 [e, X \u00C2\u00A9 Y -(X + Y) i 2 \u00C2\u00A3 A \u00C2\u00A3*l> 3 I f we compare t h i s e r r o r bound to the corresponding e r r o r bound i n 1. above, we see t h a t t h i s bound i s l a r g e r by a f a c t o r of J> = For b > 2 , I < j> <: 2 . b = 2 j ? ^ 4/3 -1.33 \u00E2\u0080\u00A2 b = IO j> = 20/) I -1 .82 Therefore, i f p o s s i b l e one should use an (Z+l) d i g i t accumulator f o r a d d i t i o n and s u b t r a c t i o n . B. M u l t i p l i c a t i o n X \u00C2\u00AE Y ( m , , e , ) \u00C2\u00AE ( m 2 , e 2 ) = ( m 3 , e 3 ) Let I S J L I * 6 M , \lx\ \u00C2\u00B1 ; JL - 1,2 I f e i t h e r or both x, y are zero then X 0 Y - ( 0 , \" a) 1 . A .2A-digit accumulator (a) Form the product rn ,*m 2 -2 - 1 (b) i f b \u00C2\u00A3 m r m 2 | * b _ m 3 = b (rri, . m a ) \u00E2\u0080\u00A2\u00C2\u00BB- 6 , b e 3 = e, + e , - i e X \u00C2\u00AE Y - X Y + E . b b |X\u00C2\u00AEY - X-Y| \u00C2\u00A3 e M b b ( c) i f * | m . - \u00E2\u0084\u00A2 2 | ^ i m 3 = m . - m j , + e 2 b e 3 - e, + e 2 X \u00C2\u00AE Y = X-Y + \u00C2\u00A3 2 b b X \u00C2\u00AE Y - X-Y | \u00C2\u00B1 \u00C2\u00A3 M b b 11. In both cases we have the e r r o r bound i n terms of the exponent of the r e s u l t . The most general e r r o r bound i s 4,e3 | X \u00C2\u00AE Y - X-Y | \u00C2\u00A3 6 M B b Ah X - d i g i t accumulator (a) Form v 2 = m 2 + e 2 b \u00C2\u00B0 2 where c, + c 2 = JL I f JL i s even, take C ( - C 2 I f JL i s odd, take C, = ( - C - 0 /2 , C 2 = (-0 + 0 /2 (b) M u l t i p l y . V-^ and Vg to get an ^ - d i g i t number (c) l - T i | V.-Va. | ^ b*' m 3 ^ b ( v , - v 2 ) x \u00C2\u00AE r = X Y + b \u00C2\u00AB 3 + l m 2 \u00C2\u00A7 , S V m , ! ^ z + 6 , 6 2b X 0 Y - X Y | ^ e M b ^b + b + e M b y (d) v , v 2 | < I m 3 = v , . v 2 e , + e 2 X - Y + 3 [no 26, b C ' m , e z b 2 + e . l * b 12. X \u00C2\u00A9 Y - X - Y ie3 T L ^ rc* - ,-4/2 L e3 In each of (c) and (d) the e r r o r bound .is p r o p o r t i o n a l to D P ; f o r a double-length accumulator the e r r o r i s p r o p o r t i o n a l to D D -C l e a r l y a double-length accumulator i s the b e t t e r of the two. but C- D i v i s i o n X \u00C2\u00A9 Y = ( m , , e.) \u00C2\u00A9 ( m 2 1 ee ) = ( m 3 , e 3 ) We must have y f o. I f x = o then X \u00C2\u00A9 Y \"=\u00E2\u0080\u00A2 ( 0 , - Cl) Let l e . j i eD , | e , | * e 0 ; -L * 1,2 1. Jh + 1 d i g i t accumulator (a quotient .of ^+1 d i g i t s ) (a) I ^ | m , / m 2 | < b m 3 = b ' (rc\/rn 2 ) 4- e ( b ^ e 3 = e, - e 2 +.1 X \u00C2\u00A9 Y = ( X * Y ) + 6 , b D 00 | X \u00C2\u00A9 Y - ( X T Y ) | \u00C2\u00A3 \u00C2\u00A3 D b b m i / m 2 - | * 1 nn 3 = r n , / m 2 +\u00E2\u0080\u00A2 \u00C2\u00A3 2 b ~0 6 3 - \u00C2\u00A91 ~ ^ 2 x \u00C2\u00A9 Y - (x r Y ) + e2 b ^ b e ' X \u00C2\u00A9 Y - ( X * Y ) | ^ e ^ b * 3 The general e r r o r bound i s | X \u00C2\u00A9 Y - ( X 4 Y ) | < \u00C2\u00A3 D \u00C2\u00A3 b 13-2. An Jl - d i g i t accumulator (a) i \u00C2\u00B1 | m , / m 2 | <- b S h i f t m-|_ and form V-j_ v, = b m , + \u00C2\u00A3, b m 3 = ( v , . T m J + \u00C2\u00A3 , & ' e 3 = (i + e . ) - ez X \u00C2\u00A9 Y = ( X * Y ) + B D + + | x \u00C2\u00A9 Y - ( X - Y ) | ^ E * l ? a [ f i 0 +\u00E2\u0080\u00A2 b g j (b) b ' ^ | m , / m 2 | ^ 1 This i s the same as the corresponding case f o r d i g i t s . The preceding e r r o r bounds c h a r a c t e r i z e the s p e c i f i c machine a r i t h m e t i c operation. In the next chapter we examine f u r t h e r c h a r a c t e r i s t i c s of these machine operations. Ik. Chapter IV Automatic programming languages must use permanent machine language subroutines t o perform a r i t h m e t i c operations. Since the accuracy of c a l c u l a t i o n s i s v i t a l l y dependent upon these a r i t h m e t i c subroutines, we should use r o u t i n e s whose maximum e r r o r i s as small as p o s s i b l e . Let us consider the normalized f l o a t i n g p o i n t a r i t h m e t i c procedures discussed in-Chapter I I I (those f o r e x t r a - d i g i t accumulators). Define t as f o l l o w s : (x ^ Y) - (x - Y ) X ~ Y where x-^y \u00C2\u00A3 o ' d e n o t e s a r e a l a r i t h m e t i c o p e r a t i o n , /&* denotes the corresponding machine operation. A d d i t i o n - S u b t r a c t i o n (a) No c a r r y -JL t A I - e A b rna/K 'A I \" whe ,rrwxfe.,e,.) - e 3 b ID = fc,b e 3 = r r u w c ( e , , e z ) - f (b) , Carry ^ t., -. \u00C2\u00A3 A ( i + b ) b = k,b M u l t i p l i c a t i o n D i v i s i o n t M = k3 b\" 15. The maximum r e l a t i v e e r r o r , t , i s p r o p o r t i o n a l to b f o r m u l t i p l i c a t i o n , d i v i s i o n and a d d i t i o n - s u b t r a c t i o n w i t h c a r r y . \"No c a r r y \" on a d d i t i o n - s u b t r a c t i o n can produce a l a r g e value of t ( p r o p o r t i o n a l to b ) . From t h i s evidence we conclude t h a t the u s u a l procedures for- a d d i t i o n - s u b t r a c t i o n are so inaccurate t h a t we must develop a procedure which always gives a value of t p r o p o r t i o n a l to b\"-^ . Suppose t h a t we have a (2JL +l) - d i g i t accumulator a v a i l a b l e , w i t h one d i g i t p o s i t i o n to the l e f t of the decimal p o i n t . The a n a l y s i s i s s i m i l a r to those given i n Chapter I I I . Suppose 6, < Q,z i f e 2 > e , + i + i ,then X 0 Y -{x-+y) =-x . F o r | e z - e,| ^ z \u00E2\u0080\u00A2 1. S h i f t m Form V, = m, b v, has Z+ | e , - e j d i g i t s Add |6( - \u00C2\u00A3,2 J ' n o n - s i g n i f i c a n t zeros onto the end of m^, thus Vg = ^2 V 2 has > \u00C2\u00A3 + | e , - e 2 | d i g i t s 3. Add V-, and V 0 1 2 2. Form V \u00C2\u00A3 6 \u00E2\u0080\u0094 6 j 16. ( i ) Carry m 3 b'(v, + v 2 ) + e,S* e 3 = e z +. - i e ( i i ) No c a r r y m 3 = b (V, + V Z ) + \u00C2\u00A3 2 b ; f\u00C2\u00ABo,i,...^ X \u00C2\u00A9 Y = X + Y + \u00C2\u00A3 2 b b E r r o r Bounds I vl M*)?* ( i ) C a r r y X \u00C2\u00A9 Y - ( X + Y ) | \u00C2\u00A3 \u00C2\u00A3 A b D ( i i ) No c a r r y | x \u00C2\u00A9 Y - ( X + Y ) | ^ 6 A D V Advantages of t h i s method 1. We avoid the rounding e r r o r i n m-^ before the a d d i t i o n operation. 2. -The f i n a l e r r o r bounds.are e x p l i c i t l y i n terms of the f i n a l exponent. 3- For t h i s method we obta i n This s a t i s f i e s the requirement on t . 17 The machine language a r i t h m e t i c subroutines, f o r use i n c o n j u n c t i o n w i t h an automatic programming language, should be those w i t h a (2-& +l) d i g i t accumulator f o r a d d i t i o n - s u b t r a c t i o n , a 2\u00C2\u00A3 d i g i t accumulator f o r m u l t i p l i c a t i o n , and an (JL+l) d i g i t accumulator f o r d i v i s i o n . We are l e a d to these conclusions before we have considered the choice of a rounding-off procedure. Let us now t u r n t o the problem of choosing a round-off procedure. In Ghapter'II we gave some examples of t y p i c a l rounding-off methods: the most commonly used are (1) .the dropping of d i g i t s , and (2) the a d d i t i o n of i n the l a s t r e t a i n e d d i g i t p o s i t i o n . We s h a l l d i s c u s s t h i s choice i n s o f a r as i t a f f e c t s the more s i g n i f i c a n t problem of g i v i n g e r r o r estimates f o r a sequence of c a l c u l a t i o n s . For example,.if we are performing a r e p e t i t i v e c a l c u l a t i o n i n which a round-off e r r o r at one step a f f e c t s the r e s u l t s at successive steps we can give an estimate of the propagated e r r o r i n terms of the rounding e r r o r at each step. The step e r r o r s are expressed as a combination ( u s u a l l y l i n e a r ) of the i n d i v i d u a l a r i t h m e t i c e r r o r s (the \u00C2\u00A3's ). By u s i n g the above maximum e r r o r bounds f o r i n d i v i d u a l a r i t h m e t i c operations we are able to o b t a i n an.error estimate at every step i n a s e r i e s of c a l c u l a t i o n s . But t h i s i s a maximum e r r o r bound and i s u n s a t i s f a c t o r y because r a r e l y does the propagated e r r o r become equal to i t s upper bound. I t has been suggested t h a t one give a s t a t i s t i c a l estimate f o r the propagated e r r o r : we would a a l c u l a t e an expected value and a v a r i a n c e . The v a l i d i t y of the s t a t i s t i c a l estimate u l t i m a t e l y r e s t s on whether the \u00C2\u00A3's can be considered as independent random v a r i a b l e s . H u s k e y ^ 7 j give s an example where, f o r a p a r t i c u l a r a r i t h m e t i c subroutine and rounding-off procedure, the e r r o r s s y s t e m a t i c a l l y b u i l d up. Forsythe \u00C2\u00A33_J has suggested t h a t one c o u l d avoid e f f e c t s s i m i l a r to systematic round-off by adding random d i g i t s to the d i g i t p o s i t i o n s which are to be dropped. We 1 8 . s h a l l examine such a procedure. Our r e s u l t s are somewhat d i f f e r e n t from those given \"by Forsythe. Let m be the absolute value of a f l o a t i n g p o i n t mantissa m which we wish to round-off at the k t h d i g i t . Define a new number M such t h a t M = b m Consider M - [ M] = V where [M] i s the gr e a t e s t i n t e g e r l e s s than or equal t o M. \u00E2\u0080\u00A2 Obviously O i V < 1 ; V c o n s i s t s of those d i g i t s o f m which are to be dropped a f t e r rounding-off. We consider V to be the p r o b a b i l i t y of rounding up (adding 1 i n t o the k t h d i g i t of m) and 1-V to be the p r o b a b i l i t y of rounding down (dropping the k + 1 , k + 2 , . . . d i g i t s ) . Now suppose we add a random number , ( 0 ^ l ) i n t o the k + 1 , k + 2 , ... d i g i t s of m. Let be u n i f o r m l y d i s t r i b u t e d on [ 0 , l ) . The e r r o r \u00C2\u00A3 i s now a random v a r i a b l e , and i s given by 6 = [M] + I - M = I -V (rounding up) \u00C2\u00A3 = [ M ] - M = ~V (rounding down) The p r o b a b i l i t y d e n s i t y f o r \u00C2\u00A3 i s \>(C) = 1 - 6 (rounding .up, 6 >0 ) |3(\u00C2\u00A3) = I + \u00C2\u00A3 (rounding down, t \u00C2\u00A3 0 ) Expected v a l u e : E(\u00C2\u00A3) = Je(i + e)de + je(i-\u00C2\u00A3)cLs = 6 Variance: V(\u00C2\u00A3) = \u00C2\u00A3*(\u00E2\u0080\u00A2+\u00C2\u00A3)<*\u00C2\u00A3 + ( \u00C2\u00A3 l ( l - \u00C2\u00A3 ) d U = g (The p r o b a b i l i t y t h a t |\u00C2\u00A3| * \u00C2\u00A3 ; i s 3/4) 19. A subroutine u s i n g random d i g i t s , w h i l e being t h e o r e t i c a l l y i d e a l , would be, i n p r a c t i c e , q u i t e c o s t l y i n computing time. The procedure by which we add (-g-)^ would be f a s t e r and may give j u s t as good r e s u l t s . In the f o l l o w i n g chapter we present some experimental r e s u l t s t h a t we obtained u s i n g d i f f e r e n t types of a r i t h m e t i c subroutines;.of these, two inco r p o r a t e the s p e c i a l p r o p e r t i e s described i n the f i r s t p a r t of t h i s chapter, and another two use random d i g i t s f o r rounding. 20. Chapter V Our experimental w o r k - w i l l be w i t h a r i t h m e t i c subroutines w r i t t e n . i n the IBM 1620 (machine number base, i s 10) machine language, and to be used w i t h the automatic programming language, FORTRAN. F o r t r a n uses an.eight d i g i t f l o a t i n g p o i n t mantissa. A l a t e r v e r s i o n of F o r t r a n , FORTRAN 2, has the advantage of v a r i a b l e word le n g t h (2 to 28) f o r f l o a t i n g p o i n t mantissas. In Table I we l i s t the p r o p e r t i e s of the subroutines: Table I I contains the execution .times ( i n m i l l i - s e c o n d s ) , o f the subroutines. 21. Table I Type of F o r t r a n .A d d i t i o n - S u b t r a c t i o n M u l t i p l i c a t i o n D i v i s i o n (1) Drops the low order Rounds to 8 Forms a 9 FORTRAN d i g i t s on the smaller s i g n i f i c a n t d i g i t s d i g i t quotient 8 d i g i t s number before adding, by\u00E2\u0080\u00A2adding \ i n t o Takes the most I f there i s a c a r r y , the 8th p l a c e . s i g n i f i c a n t 8 .the lowest .order d i g i t d i g i t s . i s dropped. ( l6 d i g i t accumulator) (9 d i g i t accumulator) (9 d i g i t .accumulator) (2 ) -X where As i n ( l ) but v a r i a b l e Takes the most . As i n ( l ) 1 = 2,3,-. .28 \u00C2\u00A3 s i g n i f i c a n t X but v a r i a b l e ^ FORTRAN 2 d i g i t s of product. Drops the lowest order d i g i t s . (3) Subroutine i s the Same as i n ( l ) Rounds by FORTRAN s p e c i a l one described above. adding \ to 8 d i g i t s i n Chapter IV. \u00E2\u0080\u00A2Rounds the 8th by adding \ i n t o the s i g n i f i c a n t 8th p l a c e . d i g i t of the quotient. (4) Same as (3) above Retains only, the As i n ( l ) FORTRAN except that low order most s i g n i f i c a n t above. 8 d i g i t s d i g i t s are dropped. 8 d i g i t s ; drops others. (5) Rounds the smaller Same as ( l ) Same as (3) FORTRAN number by adding \ i n 8 d i g i t s the l a s t r e t a i n e d d i g i t . I f c a r r y , rounds r e s u l t by adding \ i n t o 8th d i g i t . (6) Same as (5) above Same as (5) above Same as (3) FORTRAN except t h a t we add except t h a t we add 8 d i g i t s random d i g i t s . random d i g i t s . (7) Same as (3) above Same as (3) above Same as (3) FORTRAN except t h a t we add except that we add 8 d i g i t s random d i g i t s . random d i g i t s . 22. The random d i g i t s used w i t h subroutines (6) and (7) were obtained from a random number generator of the form x n = 101 xT1_j_ + C (mod 10^) (see L6J )\u00E2\u0080\u00A2 Table I I Subroutine 'Ad d i t i o n - S u b t r a c t i o n M u l t i p l i c a t i o n D i v i s i o n Time Time Time (1) 8.29 15-8U2 50.518 (2)-8 9.512 16.572 1+9.51I+ (2)-9 9.778 19.868 59-712 (2)-10 10.0M+ 23-530 70.950 (2)-12 IO.576 31-952 96. 5I+6 (2)- l6 11.640 53-188 160.218 (3) .12.80 15-8U2 55-906 \h) .II . 8 5 15.0I+2 50.518 (5) ,11.44 15.81+2 55-906 (6) 15-322 20.1+62 55.906 (7) 17-121 21.392 55-906 D e s c r i p t i o n of the Experiment Choice of experimental problem was motivated by the i n t e r e s t i n g r e s u l t s observed by Huskey ^jj i n the i n t e g r a t i o n of the system by.the Heun method x ! = x j - i + h y j -c. . + (-x* J - l 3 over the range .5200 \u00C2\u00B1 t \u00C2\u00B1 -5290 w i t h step s i z e , . h , of 2x10\"^. Huskey found th a t the round-off e r r o r s s y s t e m a t i c a l l y accumulated to such an extent as to c o n t r a d i c t the assumption t h a t the i n d i v i d u a l e r r o r s (the \u00C2\u00A3 ' s) were independent random v a r i a b l e s . \u00E2\u0080\u00A2 X = y \u00E2\u0080\u00A2 y \u00E2\u0080\u00A2 = -X X* = 3 = X. 3 -1 + y3 y! = - y j + h ( -X J = = X j - l + l ( y . y o = = y J - l + *(- : 2 We performed a s e r i e s of experiments which c o n s i s t e d of i n t e g r a t i n g the given system u s i n g the subroutines ( l ) to (7)- At every f i f t h i n t e g r a t i o n step the computed r e s u l t f o r x was compared to a 25 d i g i t r e s u l t obtained w i t h F o r t r a n u s i n g the same i n t e g r a t i o n procedure. An e r r o r a n a l y s i s of the method, by Rademacher [^ J and quoted by Huskey \u00C2\u00A37^ , gives f o r the propagated e r r o r , uniform d i s t r i b u t i o n d i s t r i b u t i o n f o r random d i g i t s 'assuming the i n d i v i d u a l rounding e r r o r s are independent.random v a r i a b l e s u n i f o r m l y d i s t r i b u t e d between -a and a ; n i s the number of i n t e g r a t i o n steps, and \(f . i s the standard d e v i a t i o n of I V . Table I I I Observed R e s u l t s T h e o r e t i c a l Subroutine E r r o r at 450th step E r r o r Range x l O 8 E r r o r of Maximum Absolute Value Occurs at Step n Standard D e v i a t i o n < r ( r n ) (1) (2)-8 (2)-9 (2)-10 (2)-12 (2)-l6 (5) -223xl0\"8 -223x10\" 0 -25.6x10\"\u00C2\u00B0 -2.85xl0 - 8 -223xl0 _ 12 -225xl0 - 1^ -1.36x10\"\u00C2\u00B0 0,-223 0,-223 0,-25.6 0,-2.85 0,-.0223 0,-225xl0 - 8 2-05,-5-3 1+50 i+50 1+50 1+50 1+50 1+50 330 \u00E2\u0080\u00A2 12.2x10\"\u00C2\u00B0 12.2x10\";: 1.22x10\"\u00C2\u00B0 .122xl0 - 8 12.2x10 12.2x10\"16 5-25xlO\"8 (3) 00 (6) (7) -1.36xl0\"8 -227xl0\"8 2.05,-5-3 0,-227 3.17,-11-9 10.6,-8.8 330 H-50 5.25xlO\"8 12.2x10\"8 at n=4-50 8.66xlO\"8 8.66xiO - 8 Figure I I o 100 zoo 300 400 Number of I n t e g r a t i o n Steps 25: Comments on Experimental R e s u l t s Relevant experimental r e s u l t s appear i n Table I I I and i n Figures I and I I . F i g u r e s I and I I are graphs of the observed e r r o r s , rn = ( X - d i g i t r e s u l t a f t e r n steps) - (25 d i g i t r e s u l t a f t e r n steps) 1. Rounding by dropping d i g i t s (see F i g u r e s I and I I ) . We obtained almost i d e n t i c a l r e s u l t s w i t h subroutines ( l ) , (h) and (2)-8; a l l three e x h i b i t e d the systematic accumulation of round-off e r r o r . In t h e i r c o n s t r u c t i o n subroutines ( l ) and (2)-8 d i f f e r only, i n the m u l t i p l i c a t i o n o p e r a t i o n : a computed r e s u l t i n ( l ) i s rounded by adding i> i n s t e a d of dropping the lower-order d i g i t s . This i n d i c a t e s t h a t the dominant c o n t r i b u t i o n to rounding e r r o r accumulation was due to the a d d i t i o n operation. The use of s p e c i a l subroutine (k) d i d not give us any advantage over the us u a l subroutines. 2. Rounding by adding (see Figure i ) . Subroutine (3) (of the type described i n Chapter IV) and subroutine (5) produced i d e n t i c a l r e s u l t s i n which systematic round-off was absent. Of the two r o u t i n e s , (3) i s the b e t t e r since i t has the lower value of the \" t \" parameter (see Chapter IV). . Comparing the r e s u l t s of (3) and (5) w i t h those obtained w i t h the corresponding r o u t i n e s (h) and (2)-8, which round by dropping d i g i t s , we conclude t h a t the e l i m i n a t i o n of systematic round-off i n the former r e s u l t s was due to the rounding-off process of adding \. The choice of round-off process i s o b v i o u s l y c r i t i c a l . 3- Rounding by adding random d i g i t s . We performed s e v e r a l experiments w i t h d i f f e r e n t i n i t i a l v a lues X Q and constants c f o r the random number generator X n +2_ 2 1015^ +c (mod 10^-2). 8 8 The maximum absolute value of observed e r r o r was 11.9x10\" and 10.6x10\" 26. f o r subroutines (6) and (7) r e s p e c t i v e l y . In a l l cases, n e i t h e r r o u t i n e gave r e s u l t s which e x h i b i t e d the systematic round-off. Further experimental work w i t h other types of problems i s necessary to f u l l y evaluate these subroutines. Longer word lengths (see F i g u r e s I and I I ) . Subroutines (2)-9, (2)-10, ('2)-12, and (2) - l6 a l l e x h i b i t e d systematic round-off. The r e s u l t s obtained w i t h the 8 - d i g i t r o u t i n e s , (3) and (5), were g e n e r a l l y b e t t e r than those of (2)-9 and (2)-10. This seems to i n d i c a t e t h a t by c a r r y i n g one or two e x t r a d i g i t s throughout the c a l c u l a t i o n s we may not counteract the e f f e c t s o f such a poor round-off process as the dropping of the lower order d i g i t s . Hartree Q7I a t t r i b u t e s systematic round-off t o the f a c t t h a t the -\"leading d i g i t rounded o f f remains the same i n a number of successive c o n t r i b u t i o n s to the i n t e g r a l \" ; on t h i s b a s i s he develops a c r i t e r i o n to determine whether systematic round-off i s l i k e l y to occur. According to h i s a n a l y s i s , f o r step s i z e of 2x10 y and .52 * t \u00C2\u00A3 . 529> w e might avoid systematic round-off i f we take the word length g r e a t e r than 12. Our experimental r e s u l t s showed a systematic accumulation of round-off e r r o r f o r both the 12 d i g i t and 16 d i g i t word lengths (subroutines (2)-12 and ( 2 ) - l 6 ) . We conclude t h a t Hartree's analysis i s not s u f f i c i e n t t o account f o r the systematic b u i l d - u p of e r r o r s . The e f f e c t i s due to the type of round-off process (dropping d i g i t s ) . S t a t i s t i c a l e s t i m a t i o n of e r r o r s . Table I I I shows t h a t the t h e o r e t i c a l standard d e v i a t i o n ,C T ( r n ) , does not give an accurate e r r o r estimate f o r those subroutines which rounded by dropping d i g i t s (the maximum observed e r r o r was l a r g e r by a f a c t o r of approximately. 1 8 ). On the other hand, the s t a t i s t i c a l estimate was s u f f i c i e n t l y p r e c i s e f o r the r o u t i n e s which rounded by adding \ or by 27. adding random d i g i t s . T his shows t h a t a s t a t i s t i c a l treatment of* round-off e r r o r can give reasonable e r r o r estimations, provided the round-off process i s accurate enough. Chapter VI Conclusions Our experimental r e s u l t s emphasize the f a c t t h a t a designer of an a r i t h m e t i c subroutine should approximate a r e a l a r i t h m e t i c operation as c l o s e l y as p o s s i b l e . The form of subroutine best f i t t i n g these requirements was the one which u t i l i z e d a double p r e c i s i o n ( 2 - & - d i g i t s ) product area, an {Z + l ) -d i g i t q uotient area, and a {2.JL + l ) - d i g i t area f o r sums and d i f f e r e n c e s ^ ) , i n c o n j u n c t i o n w i t h . a rounding-off process of adding \ i n t o t h e ' l a s t d i g i t p o s i t i o n r e t a i n e d . The systematic accumulation of round-off e r r o r was found to be caused by the type of rounding-off process: the e f f e c t was observed.in r e s u l t s obtained w i t h those subroutines which rounded by dropping d i g i t s , but d i d not occur w i t h r o u t i n e s t h a t used \ or random d i g i t s f o r rounding. S t a t i s t i c a l methods gave us accurate e r r o r estimates f o r data obtained w i t h the l a t t e r subroutines. As a consequence, we conclude t h a t s t a t i s t i c a l methods may be a p p l i e d to the propagation .of round-off e r r o r provided the a r i t h m e t i c subroutines are s u f f i c i e n t l y accurate. (*) As d e s c r i b e d i n Chapter IV. 29-B i b l i o g r a p h y 1. Ashenhurst, R.L., and M e t r o p o l i s , - N., \"Unnormalized F l o a t i n g - p o i n t \u00E2\u0080\u00A2 A r i t h m e t i c \" , J . Assoc. CM., 6 (1959), .1+15-U28. 2.\" Car r , J.W. , I I I . \" E r r o r A n a l y s i s of F l o a t i n g - p o i n t A r i t h m e t i c \" , Comm. of A.CM. . 2 (1959), -10-15. 3. Forsythe, G.E., \"Reprint of a Note on Rounding-off E r r o r s \" , S.I.A.M. Review 1 (1959), 66-67. 1+. H e n r i c i , P., D i s c r e t e V a r i a b l e Methods i n Ordinary D i f f e r e n t i a l . ' Equations, J . Wiley & Sons, New York, 1962. 5- Householder, A.S., \"Generation o f Errors i n D i g i t a l Computation\", B u l l . Amer. Math-Soc. 6\u00C2\u00A3 (195*4 )> 23*4-21+9. 6. H u l l , T.E., and Dobell,. A.R., \"Random Number Generators\", S.I.A.M. Review *+ (1962), 23O-25I+. 7. Huskey, H.D., \"On the P r e c i s i o n o f a C e r t a i n Procedure of Numerical I n t e g r a t i o n \" , With an appendix by D.R. Hartree, J . Research of Nat. Bur, of Stand. *+2 , (l9*+9), 57-62. 8. von Neumann, J . , and G o l d s t i n e , H.H., \"Numerical I n v e r t i n g of Ma t r i c e s of High Order\", Bull-Amer.Math.Soc.. 53 (19*47), 1021-1099. 9. Rademacher, H., \"On the Accumulation of E r r o r s i n Processes of I n t e g r a t i o n on High-Speed C a l c u l a t i n g Machines\", Annals Comput. Labor. Harvard Univ., 16, (19*4-8), 176-I87. 30. B i b l i o g r a p h y - cont: 10. Richards, R.K., . A r i t h m e t i c Operations i n D i g i t a l : Computers, van Nostrand-Co. Inc.,. 1955\u00C2\u00BB 11. W i l k i n s o n , J.H., \u00E2\u0080\u00A2. \" E r r o r A n a l y s i s of F l o a t i n g Point-Computation\" Num.Math. 2 (i960), 319-3^0. "@en . "Thesis/Dissertation"@en . "10.14288/1.0080564"@en . "eng"@en . "Mathematics"@en . "Vancouver : University of British Columbia Library"@en . "University of British Columbia"@en . "For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use."@en . "Graduate"@en . "Rounding errors in digital computer arithmetic subroutines."@en . "Text"@en . "http://hdl.handle.net/2429/39178"@en .