A REINFORCEMENT LEARNING ALGORITHM FOR OPERATIONS PLANNING OF A HYDROELECTRIC POWER MULTIRESERVOIR SYSTEM

by

ALAA EATZAZ ABDALLA

B.Sc., Ain Shams University, 1984
M.A.Sc., Katholieke Universiteit Leuven, 1990

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (CIVIL ENGINEERING)

THE UNIVERSITY OF BRITISH COLUMBIA

April 2007

© Alaa Eatzaz Abdalla, 2007

ABSTRACT

The main objective of reservoir operations planning is to determine the optimum operation policies that maximize the expected value of the system resources over the planning horizon. This control problem is challenged by several sources of uncertainty that a reservoir system planner has to deal with. In the reservoir operations planning problem, there is a trade-off between the marginal value of water in storage and the electricity market price. The marginal value of water is itself uncertain and depends largely on the storage in the reservoir as well as on storage in other reservoirs. The challenge is how to deal with this large-scale multireservoir problem under the encountered uncertainties. This thesis presents a novel methodology that applies Reinforcement Learning (RL) to establish a good approximation of the optimal control of a large-scale hydroelectric power system. RL is an artificial intelligence approach to machine learning that offers key advantages in handling problems that are too large to be solved by conventional dynamic programming methods. In this approach, a control agent progressively learns the optimal strategies that maximize rewards through interaction with a dynamic environment. This thesis introduces the main concepts and computational aspects of using RL for the multireservoir operations planning problem.
A scenario generation moment-matching technique was adopted to generate a set of scenarios for the random variables: natural river inflows, electricity load, and market prices. In this way, the statistical properties of the original distributions are preserved. The developed reinforcement learning reservoir optimization model (RLROM) was successfully applied to the BC Hydro main reservoirs on the Peace and Columbia Rivers. The model was used to derive optimal control policies for this multireservoir system, to estimate the value of water in storage, and to establish the marginal value of water/energy. The RLROM outputs were compared to the classical method of optimizing reservoir operations, namely stochastic dynamic programming (SDP), and the results for one- and two-reservoir systems were identical. The results suggest that the RL model is much more efficient at handling large-scale reservoir operations problems and can give a very good approximate solution to this complex problem.

TABLE OF CONTENTS

ABSTRACT ii
TABLE OF CONTENTS iii
LIST OF FIGURES vi
LIST OF TABLES x
LIST OF ABBREVIATIONS xi
GLOSSARY xiii
ACKNOWLEDGEMENTS xiv
1. INTRODUCTION 1
1.1. BACKGROUND 1
1.2. PROBLEM STATEMENT 4
1.3. GOALS AND OBJECTIVES 5
1.4. ORGANIZATION OF THE THESIS 7
2. LITERATURE REVIEW 8
2.1. MODELING APPROACHES 8
2.2. OPTIMIZATION TECHNIQUES 10
2.2.1. Deterministic Optimization Models 10
2.2.2. Stochastic Optimization Models 14
2.2.3. Heuristic Models 23
2.3. SAMPLING TECHNIQUES 26
2.3.1. Time Series Models 27
2.3.2. Conditional Sampling 27
2.3.3. Sampling from Specified Marginals and Correlations 28
2.3.4. Moment Matching 28
2.3.5. Path Based Methods 29
2.3.6. Optimal Discretization 29
2.3.7. Scenario Reduction 30
2.3.8. Interior Sampling Methods 30
2.4. RL APPROACH 30
2.5. CONCLUSIONS 32
3. THE REINFORCEMENT LEARNING APPROACH 34
3.1. INTRODUCTION 34
3.2. REINFORCEMENT LEARNING APPROACH VERSUS DYNAMIC PROGRAMMING 35
3.3. REINFORCEMENT LEARNING PROBLEM FORMULATION 37
3.3.1. RL Basic Elements 38
3.3.2. Exploration/Exploitation 39
3.3.3. Return Functions 40
3.3.4. Markovian Decision Process (MDP) 41
3.4. REINFORCEMENT LEARNING ALGORITHMS 42
3.4.1. Dynamic Programming Methods (DP) 44
3.4.2. Monte Carlo Methods (MC) 45
3.4.3. Temporal Difference (TD) 47
3.5. FUNCTION APPROXIMATION 54
3.6. ON-LINE AND OFF-LINE 55
3.7. SUMMARY 56
4. REINFORCEMENT LEARNING MULTIRESERVOIR OPTIMIZATION MODEL (RLROM) DEVELOPMENT 57
4.1. INTRODUCTION 57
4.2. MATHEMATICAL FORMULATION 59
4.2.1. SDP Formulation 61
4.2.2. RL Formulation 62
4.2.3. State Space 65
4.2.4. Action (Decision Variable) 65
4.2.5. Exploration/Exploitation Rate 66
4.2.6. Learning Rate 66
4.2.7. Rewards 67
4.2.8. Stochastic Modeling of Random Variables 68
4.2.9. Lookup Tables and Function Approximation 70
4.2.10. Model of the Environment 74
4.3. SOLUTION ALGORITHM 80
4.3.1. RLROM Model Algorithm Using Lookup Tables 81
4.3.2. The RL Model Algorithm Using Function Approximation 86
4.3.3. Convergence 90
4.4. SUMMARY 91
5. MODEL TESTING AND IMPLEMENTATION 92
5.1. INTRODUCTION 92
5.2. SINGLE RESERVOIR MODEL - TEST CASE 93
5.2.1. Problem Description 93
5.2.2. SDP Mathematical Formulation 95
5.2.3. RL Mathematical Formulation 97
5.2.4. Solution Steps of Q-Learning Algorithm 99
5.2.5. Establishing the Parameters of the RL Algorithm 101
5.2.6. Results 105
5.3. TWO RESERVOIR MODEL - TEST CASE 109
5.3.1. System Modeling 109
5.3.2. Results and Analysis 111
5.3.3. Efficiency of the RL Algorithm 114
5.4. RLROM MULTIRESERVOIR MODEL IMPLEMENTATION - CASE STUDY 116
5.4.1. Description of the Case Study 117
5.4.2. Input Data 118
5.4.3. Scenario Generation 126
5.4.4. Results of the RLROM Model Runs 132
5.5. SUMMARY 163
6. CONCLUSIONS AND RECOMMENDATIONS 165
6.1. SUMMARY 165
6.2. CONCLUSIONS 167
6.3. FUTURE WORK 170
REFERENCES 171
APPENDIX A: Q-LEARNING NUMERICAL EXAMPLE 183

LIST OF FIGURES

FIGURE 3.1. REINFORCEMENT LEARNING AND DYNAMIC PROGRAMMING COMPARISON (ADOPTED FROM GOSAVI, 2003) 37
FIGURE 3.2. AGENT-ENVIRONMENT INTERACTION IN RL 38
FIGURE 3.3. COMPARISON OF DIFFERENT RL APPROACHES (ADOPTED FROM SUTTON AND BARTO, 1998) 53
FIGURE 4.1. RL AGENT-ENVIRONMENT INTERFACE IN LEARNING MODE 63
FIGURE 4.2. LEARNING RATE AND EXPLOITATION RATE 67
FIGURE 4.3. PWL APPROXIMATION OF STORAGE VALUE FUNCTION 72
FIGURE 4.4. 3-D VIEW OF THE PWL VALUE OF STORAGE FUNCTION AS A FUNCTION OF GMS AND MICA STORAGE 73
FIGURE 4.5. 2-D VIEW OF THE PWL VALUE OF STORAGE FUNCTION AS A FUNCTION OF GMS AND MICA STORAGE 73
FIGURE 4.6. PWL VALUE OF STORAGE FUNCTION AS A FUNCTION OF GMS STORAGE AT MICA TARGET STORAGE 74
FIGURE 4.7. PIECEWISE LINEAR HYDROPOWER PRODUCTION FUNCTION 76
FIGURE 4.8. RLROM MODEL ALGORITHM USING LOOKUP TABLES 85
FIGURE 4.9. RLROM MODEL ALGORITHM USING FUNCTION APPROXIMATION 88
FIGURE 4.10. SCHEMATIC REPRESENTATION OF RLROM MODEL ALGORITHM USING FUNCTION APPROXIMATION 89
FIGURE 5.1. EFFECT OF VARYING THE EXPLOITATION RATE 102
FIGURE 5.2. CONVERGENCE OF Q-LEARNING SOLUTION APPLYING DIFFERENT EXPLORATION-EXPLOITATION POLICIES 103
FIGURE 5.3. PERFORMANCE OF DIFFERENT ε-GREEDY POLICIES 104
FIGURE 5.4. PERFORMANCE OF DIFFERENT LEARNING RATES 105
FIGURE 5.5. CONVERGENCE OF THE VALUE FUNCTION 106
FIGURE 5.6. VALUE FUNCTION OBTAINED BY RL AND SDP AT TIME PERIODS 6 AND 12 107
FIGURE 5.7. OPTIMAL POLICIES DERIVED BY THE RL AND SDP MODELS AT TIME PERIODS 6 AND 12 108
FIGURE 5.8. CONVERGENCE OF THE RL MODEL 112
FIGURE 5.9. GMS STORAGE VALUE FUNCTION AS A FUNCTION OF MCA STORAGE 113
FIGURE 5.10. SIMULATED OPTIMAL OPERATION PLANNING POLICY 113
FIGURE 5.11. COMPARISON OF THE RL AND SDP CPU TIME 115
FIGURE 5.12. AVERAGE MONTHLY RESERVOIRS LOCAL NATURAL INFLOWS 120
FIGURE 5.13. NORMALIZED CUMULATIVE MONTHLY DISTRIBUTION OF THE AVERAGE ANNUAL RESERVOIRS LOCAL NATURAL INFLOWS 120
FIGURE 5.14. AVERAGE MONTHLY MARKET PRICE IN EACH SUB-TIME STEP 121
FIGURE 5.15. HISTORICAL MONTHLY DISTRIBUTION OF % OF SYSTEM LOAD 122
FIGURE 5.16. CORRELATION BETWEEN THE DALLES INFLOW VOLUME AND THE MONTHLY PEACE AND COLUMBIA INFLOWS 128
FIGURE 5.17. PROBABILITY DISTRIBUTION FUNCTION OF FORECASTED SYSTEM LOAD 129
FIGURE 5.18. TESTING THE QUALITY OF THE DIFFERENT NUMBERS OF GENERATED SCENARIOS 130
FIGURE 5.19. COMPARISON BETWEEN THE PEACE INFLOW HISTORICAL DATA AND THE GENERATED SCENARIOS STATISTICS 131
FIGURE 5.20. STORAGE VALUE FUNCTION IN DECEMBER 134
FIGURE 5.21. STORAGE VALUE FUNCTION IN MAY 134
FIGURE 5.22. STORAGE VALUE FUNCTION IN AUGUST 134
FIGURE 5.23. MICA STORAGE VALUE FUNCTION IN DECEMBER 135
FIGURE 5.24. MONTHLY GMS STORAGE VALUE FUNCTION WITH MICA RESERVOIR STORAGE LEVEL AT 50% 136
FIGURE 5.25. MCA MARGINAL VALUE OF ENERGY IN DECEMBER 139
FIGURE 5.26. GMS MARGINAL VALUE OF ENERGY IN DECEMBER 140
FIGURE 5.27. GMS MARGINAL VALUE OF ENERGY IN DECEMBER 141
FIGURE 5.28. MCA MARGINAL VALUE OF ENERGY IN MAY 142
FIGURE 5.29. GMS MARGINAL VALUE OF ENERGY IN MAY 142
FIGURE 5.30. GMS MARGINAL VALUE OF ENERGY IN AUGUST 143
FIGURE 5.31. MCA MONTHLY MARGINAL VALUE OF ENERGY FOR A LOW STORAGE LEVEL AT MCA (MCA STORAGE = 25,000 CMS-D) 144
FIGURE 5.32. MCA MONTHLY MARGINAL VALUE OF ENERGY FOR A HIGH STORAGE LEVEL AT MCA (MCA STORAGE = 285,000 CMS-D) 144
FIGURE 5.33. MARGINAL VALUE OF KNA STORAGE CONSTRAINT 145
FIGURE 5.34. GMS MONTHLY MARGINAL VALUE OF ENERGY FOR AN EMPTY GMS RESERVOIR STORAGE (GMS STORAGE = 55,000 CMS-D) 146
FIGURE 5.35. GMS MONTHLY MARGINAL VALUE OF ENERGY FOR A FULL GMS RESERVOIR STORAGE (GMS STORAGE = 475,000 CMS-D) 146
FIGURE 5.36. GMS MONTHLY MARGINAL VALUE OF ENERGY FOR A LOW STORAGE LEVEL AT MCA (MCA STORAGE = 125,000 CMS-D) 147
FIGURE 5.37. GMS MONTHLY MARGINAL VALUE OF ENERGY FOR A HIGH STORAGE LEVEL AT MCA (MCA STORAGE = 285,000 CMS-D) 148
FIGURE 5.38. GMS MONTHLY MARGINAL VALUE OF ENERGY FOR A LOW STORAGE LEVEL AT MCA (MCA STORAGE = 125,000 CMS-D) 148
FIGURE 5.39. GMS MONTHLY MARGINAL VALUE OF ENERGY FOR A HIGH STORAGE LEVEL AT MCA (MCA STORAGE = 285,000 CMS-D) 149
FIGURE 5.40. JANUARY NET MARKET TRANSACTIONS 152
FIGURE 5.41. AUGUST NET MARKET TRANSACTIONS 152
FIGURE 5.42. MONTHLY NET MARKET TRANSACTIONS FOR GMS STORAGE OF 55,000 CMS-D 153
FIGURE 5.43. MONTHLY NET MARKET TRANSACTIONS FOR GMS STORAGE OF 475,000 CMS-D 154
FIGURE 5.44. MONTHLY NET MARKET TRANSACTIONS FOR MCA STORAGE OF 125,000 CMS-D 155
FIGURE 5.45. MONTHLY NET MARKET TRANSACTIONS FOR MCA STORAGE OF 285,000 CMS-D 155
FIGURE 5.46. PLANTS GENERATION IN JANUARY AT DIFFERENT MCA STORAGE INCREMENTS WITH GMS STORAGE AT 55,000 CMS-D 156
FIGURE 5.47. PLANTS GENERATION IN JANUARY AT DIFFERENT MCA STORAGE INCREMENTS WITH GMS STORAGE AT 475,000 CMS-D 157
FIGURE 5.48. PLANTS AVERAGE TURBINE RELEASES IN JANUARY AT DIFFERENT MCA STORAGE INCREMENTS WITH GMS STORAGE AT 55,000 CMS-D 157
FIGURE 5.49. PLANTS AVERAGE TURBINE RELEASES IN JANUARY AT DIFFERENT MCA STORAGE INCREMENTS WITH GMS STORAGE AT 475,000 CMS-D 158
FIGURE 5.50. MONTHLY PLANTS GENERATION 159
FIGURE 5.51. MARKET PRICE AND MARGINAL VALUE OF ENERGY 160
FIGURE 5.52. LOAD RESOURCE BALANCE 161
FIGURE 5.53. ANNUAL DISTRIBUTION OF THE FIVE MAIN PLANTS GENERATION 162
FIGURE 5.54. NET MARKET TRANSACTIONS FOR THREE PRICE SCENARIOS 163

LIST OF ABBREVIATIONS

MDP Markovian decision process
MM Moment matching
MRE Mean relative error
MW Megawatt
MVW Marginal value of water
NDP Neuro-dynamic programming
NFM Network flow model
NLP Non-linear programming
pdf Probability density function
PWL Piecewise linear function
RL Reinforcement learning
RLROM Reinforcement learning reservoir optimization model
RP Reliability programming
RTC Real-time control
SDP Stochastic dynamic programming
SDDP Stochastic dual dynamic programming
SLP Stochastic linear programming
SLPR Stochastic linear programming with recourse
STOM Short-term optimization model
TD Temporal difference
TPM Transition probability matrix
TRM Transition reward matrix
WK Weekday
WKPK Weekday peak load hours
WKLO Weekday light load hours
WKHI Weekday high load hours
WE Weekend

GLOSSARY

Agent: A learner, controller, or decision maker.
Age of the agent: The number of iterations performed by the agent.
Capacity: The rated power output of a power plant, normally measured in MW.
Demand: The rate at which electric energy is delivered to or by a system, generally expressed in MW, or averaged over a period of time (e.g., MWh).
Energy: The amount of electricity produced or used over a period of time, usually in MWh.
Environment: The thing that the agent interacts with to provide a control signal. It could be a real system or a model of the real system.
Episode: Trial, iteration, or cycle. Represents the planning period.
Each episode ends at a terminal time period T.
Greedy policy: A policy that always follows the best action.
ε-greedy policy: A policy that follows the best action with probability ε while exploring new actions with probability 1 − ε.
Planning horizon: The number of stages/time periods.
Operating reserve: A specific level of reserve power that should be available at all times to ensure reliable electricity grid operation.
Step size: Learning rate parameter α.
Sub-time step: A subdivision of the time step used to capture the variation in certain parameters or variables at a shorter time increment.
Time step: The planning period T is subdivided into a number of periods called the time steps or stages t.
Target: A desirable direction to move in the next step.
Terminal stage: The last period in an iteration (the end of the planning horizon).

ACKNOWLEDGEMENTS

I greatly acknowledge the support and invaluable insights of my thesis advisor, Dr. Ziad Shawwash. Without his guidance this thesis would probably never have been completed. My understanding of topics relating to this research has been shaped by his guidance throughout the course of this research. I wish to extend my special thanks to Dr. S.O. Denis Russell for his wise advice and suggestions. I would also like to express my sincere thanks to Dr. A.D. Russell, Dr. J.A. Meech, Dr. R.G. Millar, and Dr. T.A. Sayed for their helpful comments and feedback throughout the course of this research. Special appreciation is due to Dr. Thomas Siu for his continuous help and support. This thesis has benefited from his expertise, deep understanding, and explanations of operations planning of BC Hydro reservoir systems. I am also thankful to my colleagues at BC Hydro: D. Smith, D. Druce, Dr. G. Nash, H. Louie, W. Cheng, and Dr. G. Sreckovic for their valuable assistance and support. Interactions with UBC students have also been fruitful. I would like to acknowledge J. Li, Y. Tang, and R. Mazariegos for their help. Special thanks are due to Dr. B. Abdulhai. My understanding of Reinforcement Learning benefited from the discussions I had with him and from his knowledge of artificial intelligence. I am very grateful to my wife, Dalai, for her love, patience, and support during the entire period of my study. Thanks to my son Aly for bringing me great joy in stressful times. Lastly, and most importantly, I wish to thank my parents, Zeinab and Eatzaz, for their great care and encouragement during my studies and throughout my life.

1. INTRODUCTION

1.1. Background

In British Columbia, there is a large number of dams distributed throughout the different basins and watersheds of the province. Most of the large dams in B.C. serve hydroelectric power purposes. The British Columbia Hydro and Power Authority (BC Hydro) operates a hydro-dominated power generation system that includes 30 hydroelectric generating stations providing approximately 90 per cent of the system installed capacity (11,100 MW). These plants can be placed into four main groups: the Peace system (2 plants) producing about 34% of the energy requirements, the Columbia system (3 plants) producing 31%, the Kootenay Canal and Seven Mile generating stations producing 13%, and the remaining 23 plants that supply about 16% of the energy production. The balance of energy requirements is supplied from two thermal generating facilities and from energy purchases. The BC Hydro integrated system produced about 54,000 gigawatt-hours in 2004. The main hydroelectric storage facilities in the BC Hydro system are the Williston reservoir (39.4 billion m³) on the Peace River and the Kinbasket reservoir (14.8 billion m³) on the Columbia River. The two reservoirs provide multi-year storage and, accordingly, they are used for the strategic and long-term operations planning of the BC Hydro system. The reservoir systems in B.C.
provide many important benefits in addition to hydroelectric power production. These include domestic and industrial water supply, flood control, and recreation. During the operation of these reservoir systems, conflicts may arise between some of these uses, particularly in periods of sustained drought or storage deficits. Also, new pressures have been imposed on the reservoir system due to expanding energy needs, increasing water demands, and growing public awareness of environmental issues. In addition, publicly owned reservoir systems often have to deal with complex legal agreements (e.g., the Columbia River Treaty), fisheries and non-power release requirements, Federal/Provincial regulations (e.g., water licenses), and pressures from different interest groups. Recently, increased attention has been given to improving the economic and operating efficiency of existing reservoir systems. The electrical transmission network in British Columbia is interconnected with the Alberta and Western US systems. These interconnections allow for the purchase and sale of electricity in the wholesale electricity market, and therefore complicate the reservoir operations planning problem, since it must incorporate maximizing the profit from energy transactions in these two markets. The complexities of multipurpose, multiple reservoir systems generally require that release decisions be determined by an optimization and/or simulation model. Computer simulation models have been applied for many years as powerful tools in reservoir systems management and operation. These models are descriptive of the behavior of the system (e.g., releases, storage) given a specified scenario (e.g., the sequence of flow data, storage, demands, priorities, and constraints), and they are able to illustrate the changes resulting from alternative scenarios.
They are also useful in examining the long-term reliability of a proposed operating strategy when combined with Monte Carlo analysis. However, they are not well suited to developing the optimal strategy, particularly for large-scale systems. Prescriptive optimization models, on the other hand, offer the capability to systematically derive an optimal solution given a specific objective and a set of constraints. The operating policies developed with optimization procedures often need to be checked with a simulation model in order to ensure their feasibility. In BC Hydro, the operations planning process is carried out using several simulation and optimization models. This modeling system is divided in terms of time horizons. The cascaded operations planning modeling hierarchy in BC Hydro is categorized as follows:

- Long-term operations planning (strategic planning, 1-6 years in monthly time steps).
- Medium-term operations planning (strategic/tactical planning, up to a 12-month period in hourly to weekly time steps).
- Short-term operations planning (tactical planning, for one week in hourly to daily time steps).
- Real-time operations planning (1 hour to 1 week in hourly time steps).

For long-term planning, a marginal cost model (MCM) was developed in-house at BC Hydro by Druce (1989, 1990) and is kept up to date. The model applies Stochastic Dynamic Programming (SDP) to calculate the monthly marginal value of water stored in the Williston reservoir, the largest reservoir in the system, over a planning period of 4 to 6 years. The SDP model takes into account the uncertainties in inflows and market prices. The calculated marginal value of water stored in the Williston reservoir is then used as a proxy for the long-term system marginal cost. As well, the model develops a probabilistic forecast of BC Hydro system price signals, reservoir storage volumes, and expected spills. Shawwash et al.
(2000) developed a short-term optimization model (STOM) that is used by BC Hydro operations engineers to determine the optimal hourly generation and trading schedules while meeting system demands. The model applies a linear programming technique in which the inflows, prices, and system loads are deterministic within the short-term period. The model has been modified and extended to the medium-term generalized optimization model (GOM), which is capable of handling longer planning periods ranging from hourly to monthly time steps. The model also allows for variable time steps that can be specified on daily, weekly, and monthly intervals, and includes sub-time steps within a time step. Hence, GOM has the advantage of capturing variations in load, market prices, and generation schedules for shorter periods within the time step during weekdays or weekends. Nash (2003) developed a stochastic optimization model for the Columbia and Peace reservoir systems (aggregated to two reservoirs). The model consists mainly of two sub-models. In the first, long-term model, the monthly storage value function and the marginal values of water derived by a DP-LP based model are passed to the second model. The second, shorter-term model applies a stochastic linear programming with recourse (SLPR) algorithm in which the inflows, prices, and demands are defined by a scenario tree. The storage value curves generated by the long-term model are used as terminal value functions in the shorter-term model. The outputs of the shorter-term model for the two aggregated reservoir systems are the refined marginal values over the shorter periods and the operating decisions for each period. The present research work builds upon, extends, and enhances the above techniques and develops a new approach to solve the operations planning problem for multireservoir systems.

1.2. Problem Statement

The main objective of reservoir operations planning is to determine the release policies that maximize the expected value of the system resources over the planning period. This objective is generally achieved by maximizing the value of the energy produced while meeting firm domestic demands and taking into consideration the value of stored water at the end of the planning horizon. The reservoir operating strategy should provide answers to these questions at every decision point in time: when and how much water to store or release, and when, where, and what quantity of energy to trade in the competitive wholesale electricity market while meeting the firm domestic demands and non-power requirements. This control problem is challenged by three main sources of uncertainty with which a reservoir system operator has to deal. The main source of uncertainty is the inflows to the reservoir system. The system operation policy has to protect against situations of shortage where the system is unable to meet the demands in a dry year. It also has to be capable of storing any sudden increase in the inflow while avoiding wasteful water spills. In addition, there is uncertainty in forecasting electricity demand, which will affect the amount of energy generated, since the firm demand must be met regardless of cost. There is also uncertainty in market prices, which vary seasonally, daily, and between weekdays and weekends. The marginal value of water represents the value of an extra increment of storage in a given reservoir ($/m³). In the reservoir operations optimization problem, there is a trade-off between the marginal value of water and the present market price. If the market price is higher than the marginal value of water, then it is more profitable to sell energy. On the other hand, when the market price is lower than the marginal value of water, then it is more profitable to purchase energy from the market.
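This trade-off can be sketched as a simple decision rule. The sketch below is illustrative only; the function name, prices, and units are hypothetical and not taken from the thesis:

```python
def trade_decision(market_price, marginal_value_of_water):
    """Illustrative dispatch rule: compare the electricity market price
    against the marginal value of stored water (both in $/MWh)."""
    if market_price > marginal_value_of_water:
        # Stored water is worth less than the current market price, so
        # releasing water to generate and sell energy is more profitable.
        return "generate/sell"
    elif market_price < marginal_value_of_water:
        # Energy can be imported for less than the value of the water it
        # displaces, so it is more profitable to buy and keep water in storage.
        return "store/buy"
    return "indifferent"

# Hypothetical prices, for illustration:
print(trade_decision(60.0, 45.0))  # market above water value: sell
print(trade_decision(30.0, 45.0))  # market below water value: buy
```

In practice, of course, the marginal value of water is not a known constant but a function of storage in all reservoirs, which is precisely what the RLROM model estimates.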
The marginal value of water is equal to the derivative of the value of water function with respect to storage. The optimal use of the water in the reservoir corresponds to the point that minimizes the sum of the immediate and future costs. This is also where the derivatives of the immediate cost function and the future cost function with respect to storage become equal (Pereira et al. 1989). To derive a set of realistic control policies and better estimate the value of system resources, there is a need to integrate BC Hydro's main reservoir systems into a single model that takes into consideration the uncertainties in the inflows, market prices, and the system load. This integration is essential to account for the effect of interactions between the different reservoir systems on the amount of total energy produced by the system. The value of water in a given reservoir is a function of the storage level in that reservoir and those in other reservoirs. Thus, the value of water in storage in any reservoir cannot be established unless assumptions are made about the other storage variables in the system. In this research, the main hydroelectric power generation facilities of the BC Hydro system are included in a long/medium-term stochastic optimization model. These hydroelectric power generation facilities are located on several independent river systems, for example, the Peace River (G.M. Shrum and Peace Canyon) and the Columbia River (Mica, Revelstoke, and Keenleyside).

1.3. Goals and Objectives

The main goal of this research work is to develop and implement a long/medium-term stochastic optimization model that serves as a decision support tool for the operations planning of BC Hydro's reservoir systems.
The model should be able to:

- Handle the dimensionality problem of this large-scale stochastic problem in an efficient manner,
- Provide forecasts of the expected value of revenues, energy generation, and expected market transactions (imports/exports),
- Model several river systems within the BC Hydro system,
- Address the main uncertainties in the operations planning problem (inflow, market price, and demand),
- Provide the marginal value of water for the major reservoirs, and
- Deal with variable time steps and be able to address different objective functions.

To achieve these goals, several objectives were identified:

1. Acquire an in-depth understanding and knowledge of the reservoir operations planning problem in general and of BC Hydro's reservoir systems in particular. This was achieved by thoroughly investigating the modeling environment at BC Hydro and other hydroelectric power generation entities. Special attention was given to integrating the generalized optimization model (GOM) with the stochastic optimization model.
2. Carry out an extensive review of the literature on reservoir optimization techniques, with a particular emphasis on stochastic optimization techniques. This literature review was extended to include state-of-the-art techniques developed and applied in the fields of machine learning and artificial intelligence. The aim of the literature review was to assess the merits and drawbacks of the different optimization techniques and their potential to handle the complexity and dimensionality problems of the large-scale, stochastic, multiperiod, multireservoir operations planning problem.
3. Formulate the stochastic optimization model that addresses the uncertainties in inflow, system load, and market prices.
4. Investigate and develop the Reinforcement Learning (RL) technique with function approximation and sampling techniques, an approach that is becoming popular and offers the promise of handling large-scale stochastic problems.
5. Test the performance of the RL model and develop an algorithm to implement it for the optimization of the operation of BC Hydro's main reservoir systems.

1.4. Organization of the Thesis

This chapter presented the motivation, goals, and focus of the thesis. Chapter 2 reviews the different techniques and approaches to handling the reservoir optimization problem cited in the literature. Chapter 3 introduces the main concepts and computational aspects of using reinforcement learning (RL). Chapter 4 describes the methodology and mathematical formulation adopted in the development of the RLROM model. Chapter 5 presents the results of applying the RL solution methodology to a single reservoir and testing the extended two-reservoir problem. The model was then implemented on BC Hydro's main reservoir system on the Peace and Columbia Rivers. Chapter 6 provides conclusions and recommendations for future research work.

2. LITERATURE REVIEW

The literature review carried out in this research is presented herein from two perspectives: modeling approaches and optimization techniques. The modeling approaches commonly applied to the reservoir optimization problem can be grouped into two main categories: implicit stochastic optimization and explicit stochastic optimization.
In terms of the optimization techniques, reservoir optimization models can be classified as:

- Deterministic models, including: linear programming, network flow, multiobjective, non-linear programming, and dynamic optimization models,
- Stochastic optimization models, including: dynamic, linear, dual dynamic programming, chance-constrained, and reliability programming,
- Heuristic models, including: genetic algorithms, artificial neural networks, fuzzy programming, etc.

2.1. Modeling Approaches

Deterministic analysis of reservoir operational problems has several computational advantages over stochastic analysis. Ignoring the stochasticity of the system simplifies the model, resulting in more efficient performance; however, this simplification introduces a bias in the results. Loucks et al. (1981) state that: "Deterministic models based on average or mean values of inputs, such as stream flows, are usually optimistic. System benefits are overestimated, and costs and losses are underestimated if they are based only on the expected values of each input variable instead of the probability distributions describing those variables". Reservoir optimization models can be useful tools for identifying and evaluating the impacts of various alternative system operations. Yet, these models are not likely to be very useful unless they consider the uncertain conditions affecting the future performance of those systems. This includes the uncertain future demands imposed on those systems, the uncertain future costs of those systems, and the uncertain quantities and qualities of the flow within those systems. Assumptions made regarding each of these sources of uncertainty can have a major impact on system planning and operation. These facts have served to motivate the development of stochastic models, models that take into consideration at least some of the important sources of uncertainty and their impacts on system design and operation.
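The optimism that Loucks et al. describe can be illustrated numerically: whenever the benefit function is concave in inflow (e.g., generation saturates at plant capacity), the benefit of the mean inflow exceeds the mean of the benefits over the inflow distribution. The sketch below uses hypothetical numbers and a deliberately simple saturating benefit function, purely to demonstrate the bias:

```python
import random

random.seed(42)

def benefit(inflow, capacity=100.0):
    """Concave benefit: usable energy value saturates once inflow
    exceeds the plant's capacity (hypothetical units)."""
    return min(inflow, capacity)

# Synthetic inflow scenarios around a mean of 100 units.
inflows = [random.gauss(100.0, 40.0) for _ in range(100_000)]
mean_inflow = sum(inflows) / len(inflows)

# Deterministic estimate: evaluate the benefit at the mean inflow, f(E[X]).
deterministic_estimate = benefit(mean_inflow)

# Stochastic estimate: average the benefit over all scenarios, E[f(X)].
expected_benefit = sum(benefit(q) for q in inflows) / len(inflows)

# The mean-flow estimate overstates the benefit, as Loucks et al. warn.
print(f"f(E[X]) = {deterministic_estimate:.1f}, E[f(X)] = {expected_benefit:.1f}")
```

This is Jensen's inequality at work: for concave f, f(E[X]) ≥ E[f(X)], so a model driven by average inflows systematically overestimates system benefits.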
Implicit stochastic reservoir optimization models optimize over series of random variables assuming perfect knowledge of the future. Multiple regression analysis of the results of the deterministic optimization model can then be applied to generate operational rules. However, Labadie (1998) claims that regression analysis can result in poor correlations that may invalidate the operating rules. Explicit stochastic optimization models deal with the probabilistic nature of random variables directly, rather than dealing with deterministic sequences. Accordingly, the optimization is performed without assuming perfect knowledge of future events. The benefit of this approach is that it better quantifies the impact of the uncertainty in the random variables and consequently yields better reservoir management decisions. However, this approach is more computationally expensive than the implicit optimization approach. Most of the explicit stochastic optimization models assume that unregulated inflows are the dominant source of uncertainty in the system and can be represented by appropriate probability distributions. These may be parametric or nonparametric, based on frequency analysis. Other random variables that may be defined include market prices and demands. Unregulated natural inflows may be highly correlated spatially and/or temporally. Explicit stochastic models use probability distributions of stream flow. This requires two main simplifications to keep the dimensionality of the problem manageable: first, discretization of the probability data, and second, the use of relatively simple stochastic models (e.g., the lag-1 Markov model). Most inflow sequences show serial correlation (Pereira et al., 1999), which is represented in modeling inflows by a lag-1 autoregressive or multivariate model.

2.2. Optimization Techniques

A broad array of mathematical models has been applied to the optimization of reservoir systems operations and management.
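Before turning to the optimization techniques themselves, the lag-1 autoregressive inflow representation mentioned above can be sketched in a few lines. This is an illustrative sketch only: the mean, persistence coefficient, and noise scale below are invented and not fitted to any real river.

```python
import random

def ar1_inflows(n_periods, mean=100.0, phi=0.6, sigma=15.0, seed=42):
    """Generate synthetic inflows with lag-1 serial correlation:
    q[t] - mean = phi * (q[t-1] - mean) + noise, noise ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    q = [mean]  # start the sequence at the long-term mean
    for _ in range(n_periods - 1):
        noise = rng.gauss(0.0, sigma)
        q.append(mean + phi * (q[-1] - mean) + noise)
    return q
```

A wet period tends to be followed by another wet period because a fraction `phi` of the previous deviation from the mean carries over; setting `phi = 0` recovers independent sampling.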
The choice of the modeling technique relies largely on the characteristics of each application. The following sections briefly review optimization methods that are widely used to solve the reservoir system optimization problem, with a focus on the techniques that are applied in multireservoir systems. Yeh (1985) and Labadie (1998, 2004) presented comprehensive, in-depth, state-of-the-art reviews of the optimization techniques used in reservoir operation and management.

2.2.1. Deterministic Optimization Models

2.2.1.1. Linear Programming Models

One of the most favored optimization techniques in reservoir system models is linear programming (LP). LP requires that all the constraints and the objective function be linear or be "linearizable" by applying one of the available linearization techniques, such as piecewise linearization or Taylor series expansion. LP models guarantee convergence to global optimal solutions. In addition, for large-scale reservoir optimization problems where the numbers of variables and constraints are large, decomposition techniques such as the Dantzig-Wolfe or Benders decomposition technique can be used to accelerate the solution process (Yeh, 1985). LP problem formulation is easy, and LP problems can be readily solved by applying commercially available LP solvers. Turgeon (1987) applied a monthly mixed integer LP model for site selection of hydropower plants to be built. Hiew et al. (1989) applied the LP technique to an eight-reservoir system in northern Colorado. Shawwash et al. (2000) presented an LP short term optimization model (STOM), which has subsequently been developed to determine the optimal short term schedules that meet the hourly load and maximize the return to BC Hydro resources from spot transactions in the Western U.S. and Alberta energy markets. Currently, the model is used in BC Hydro by the generation operations engineers to optimize the scheduling of the main reservoir systems.
The authors state that using the model to develop the optimized schedule contributed between 0.25-1.0% in the form of additional revenue from sales and in the value of additional stored water. The authors indicate that one of the major benefits of using LP is the sensitivity analysis data that can be derived from the simplex dual variables. As an example, the dual variable of the load-resource balance equation provides the system incremental cost at each time step. This information can be used in planning spot trading schedules and in real time operation of the system. Other applications of the LP technique to the reservoir operations problem include: Martin (1986), Piekutowski et al. (1993), and Crawley and Dandy (1993).

2.2.1.2. Network Flow Models

Network flow models (NFM) have been applied in a broad range of water resource applications, as they are easy to formulate and efficient to solve. A reservoir system is represented as a network of nodes that are linked by arcs. Nodes could represent storage or non-storage points of confluence or diversion, and arcs represent releases, channel flows, and carryover storage. This representation also has the advantage of easily defining piecewise linear functions through the specification of multiple links between nodes. Flow bounds and unit costs are defined by the flow limits and slopes of each linear piece (Labadie, 1997). Lund and Ferreira (1996) applied a network flow algorithm to the Missouri River multireservoir system. The multireservoir system is optimized for a period of 90 years in monthly time steps. The authors concluded that system operation rules could be inferred from deterministic optimization applying a long hydrologic sequence. Shawwash et al.
(2000) observed that some of the methodology's limitations were encountered when using arcs to describe flow patterns in a complex system, such as BC Hydro's, that includes a combination of very large and very small reservoir systems.

2.2.1.3. Multiobjective Optimization Models

Labadie (1997) presented two approaches for dealing with multiobjective optimization problems. In the first approach, the primary objective is represented in the objective function, while the other objectives are treated as constraints at desired target levels (epsilon method). The second approach assigns weights to each objective (weighting method). Can and Houck (1984) applied a preemptive goal programming (PGP) approach to a four-reservoir multipurpose system in the Green River Valley, Kentucky. In their comparative study with other LP models, the authors concluded that PGP allows the flexible expression of policy constraints as objectives, and it performed well compared with a more data intensive LP optimal operating model. The basic concept of PGP is to set aspiration levels (targets) for each objective and prioritize them. Attainment of the goals is sought sequentially. A significant advantage of PGP is that it does not require any penalty-benefit function, reducing the need for a detailed economic analysis. However, one drawback of PGP is that it does not allow trading a small degradation in a high priority objective for a large improvement in a lower priority objective (Loganathan and Bhattacharya, 1990). As goal programming (GP) relies on achieving predetermined target values, the global optima for objectives may not be explored.

2.2.1.4. Nonlinear Programming Models

Nonlinear programming (NLP) is not as popular as LP and dynamic programming (DP) in solving reservoir systems optimization problems, basically because the optimization process is slow and can return inferior, non-optimal solutions.
However, in cases where a problem cannot be realistically linearized, it may be solved as an NLP problem, particularly with the inclusion of hydropower generation in the objective function and/or the constraints. Labadie (1997) indicates that the most powerful and robust NLP algorithms available to solve reservoir system optimization problems include: Sequential Linear Programming (SLP), Sequential Quadratic Programming (SQP), the Method of Multipliers (MOM), and the Generalized Reduced Gradient Method (GRG). Recent applications of NLP to hydropower reservoir operations include: Tejada-Guibert et al. (1990), Arnold et al. (1994), and Barros et al. (2003). Barros et al. (2003) applied an NLP model to a large scale multireservoir system in Brazil. This multiobjective optimization problem was solved applying LP and SLP using a Taylor series expansion. The authors concluded that the NLP model is more accurate and more suitable for real-time operations than the LP model.

2.2.1.5. Dynamic Programming Models

Dynamic programming (DP) is another powerful optimization technique that has been used extensively to solve reservoir system optimization problems. Unlike LP and NLP techniques that simultaneously solve the problem for all time periods, DP algorithms decompose the original problem into sub-problems that are solved sequentially over each stage (time period). DP formulation requires the definition of a set of state variables to describe the system state at the beginning of each stage and a set of decisions that transform the current stage state to the next one. DP has the advantage of handling nonlinear, nonconvex, and discontinuous objective and constraint functions. However, a major problem that limits the application of DP to large-scale multireservoir systems is the exponential growth in computation time as the number of discretized state variables increases. This is widely known as the curse of dimensionality (Bellman, 1957).
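To make the stage-wise decomposition concrete, the following is a minimal backward-recursion sketch for a single reservoir with known inflows and a discretized storage grid. It is an illustration only: the concave benefit function, the grid, and the inflow numbers are invented and not taken from the thesis.

```python
def solve_dp(inflows, benefit, s_levels):
    """Deterministic DP: value[t][i] is the best total benefit obtainable
    from period t onward when starting that period at storage s_levels[i]."""
    T, n = len(inflows), len(s_levels)
    value = [[0.0] * n for _ in range(T + 1)]   # terminal value of water set to zero
    policy = [[0] * n for _ in range(T)]        # index of the chosen end-of-period storage
    for t in range(T - 1, -1, -1):              # backward over stages
        for i, s in enumerate(s_levels):
            best, best_j = float("-inf"), 0
            for j, s_next in enumerate(s_levels):
                release = s + inflows[t] - s_next   # mass balance
                if release < 0:
                    continue                        # cannot store more water than is available
                v = benefit(release) + value[t + 1][j]
                if v > best:
                    best, best_j = v, j
            value[t][i] = best
            policy[t][i] = best_j
    return value, policy

# With a concave benefit (diminishing returns), the optimal policy carries
# water over so that the two releases are equal rather than releasing all at once.
value, policy = solve_dp([10.0, 10.0], lambda r: r ** 0.5, [0.0, 5.0, 10.0])
```

Starting full (storage index 2), the first-period decision is to draw down to the middle grid level, splitting the available water evenly across the two periods, which is exactly what the concavity of the benefit function rewards.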
One of the earliest applications of deterministic DP to reservoir operation was by Young (1967), who studied a finite horizon, single reservoir operation problem. Various extensions have been developed to overcome the curse of dimensionality in applying dynamic programming to reservoir operation. Larson (1968) introduced incremental dynamic programming (IDP). The IDP procedure starts with a trial solution, and the recursive DP equation examines adjacent states around the current solution. If better values are obtained, then the current solution is updated. Jacobson and Mayne (1970) developed a Differential Dynamic Programming (DDP) technique that uses an analytical solution rather than discretization of the state space. Murray and Yakowitz (1979) extended this approach to constrained problems. Johnson et al. (1993) introduced the Coarse Grid Interpolation technique, which relies on using larger discretization intervals. Solution accuracy is retained by interpolating within the coarser grid structure.

2.2.2. Stochastic Optimization Models

2.2.2.1. Stochastic Dynamic Programming Models

Stochastic dynamic programming (SDP) is a powerful tool for studying multireservoir system operation because the stochastic nature of inflows and the nonlinear energy generation functions can be modeled explicitly. Interestingly, Yakowitz (1982) found that the first application of SDP preceded the application of deterministic DP by more than a decade. Lamond and Boukhtouta (1996, 2001) and Lamond (2003) presented surveys of stochastic reservoir optimization methods and models. A multireservoir, multiperiod SDP model is formulated by considering the multiperiod optimization in stages, where each stage corresponds to one period. Release decisions are made to maximize current benefits plus the expected benefits from future operation, which are represented by a recursively calculated cost-to-go function.
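The recursion just described can be sketched as follows for a single reservoir: at each stage and storage level, the best release is the one maximizing the current benefit plus the expected value-to-go over the inflow scenarios. The scenarios, grids, and linear benefit below are invented for illustration, and water above the top of the storage grid is treated as spill.

```python
def sdp(T, s_levels, releases, scenarios, benefit):
    """Backward SDP for one reservoir. scenarios is a list of
    (probability, inflow) pairs; value[t][i] is the expected value-to-go
    from period t onward, starting at storage s_levels[i]."""
    n = len(s_levels)
    value = [[0.0] * n for _ in range(T + 1)]
    policy = [[0.0] * n for _ in range(T)]

    def nearest(x):  # snap a continuous storage back onto the discrete grid
        return min(range(n), key=lambda k: abs(s_levels[k] - x))

    for t in range(T - 1, -1, -1):
        for i, s in enumerate(s_levels):
            best, best_r = float("-inf"), releases[0]
            for r in releases:                      # candidate release decisions
                expected = 0.0
                for p, q in scenarios:              # expectation over inflow scenarios
                    avail = s + q
                    rel = min(r, avail)             # cannot release more than is available
                    s_next = min(avail - rel, s_levels[-1])   # overflow spills
                    expected += p * (benefit(rel) + value[t + 1][nearest(s_next)])
                if expected > best:
                    best, best_r = expected, r
            value[t][i] = best
            policy[t][i] = best_r
    return value, policy
```

For n reservoirs with m storage levels each, the outer state loop would have to visit every combination of per-reservoir levels, i.e. m^n states, which is precisely the curse of dimensionality discussed in this section.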
Solution of the SDP model for a multireservoir system yields the "cost-to-go" function and a release policy decision rule for each time period as a function of the system state variables. Since optimization is performed conditionally on all discrete combinations of the state vector, the specter of the curse of dimensionality arises: for a multireservoir model with m discretization levels for each of n reservoirs, computational time and storage requirements are proportional to m^n. The state variable typically includes the volume of water in reservoirs and sometimes a description of current or forecasted hydrological conditions (Kelman et al., 1990). A periodic Markovian process typically describes reservoir inflows in SDP models. The choice of the inflow state variable in an SDP model depends on the system's characteristics as well as the information available for decision-making. In addition, computational considerations often influence how hydrologic information is represented in SDP. Huang et al. (1991) applied four types of representations of the inflow state variable to the Feitsui reservoir SDP optimization model in Taiwan. The authors found that using previous period inflows resulted in superior performance compared to the use of the present period inflows. Piccardi and Soncini (1991) found that policies derived from an SDP model without a hydrologic state variable resulted in simulated performance similar to that of policies derived using the previous period's inflow, although the SDP and simulation results agreed more closely when the previous period's inflow was employed. Tejada-Guibert et al. (1995) examined the value of hydrologic information in SDP multireservoir models by using different hydrological state variables for the Shasta-Trinity subsystem of the Central Valley project in California. The SDP policies were then compared using a simulation model assuming that the current inflows were known.
The authors applied four types of models with different inflow state variables and concluded that the value of using sophisticated inflow forecasting depends on several system characteristics, including the relative magnitude of water and power demands and the severity of the penalties on shortages. Turgeon (2005) applied a parametric approach in which the inflows are represented by a linear autoregression (AR) model, used to solve the SDP reservoir management problem. Instead of using the traditional lag-1 models, the author stated that there are many advantages in considering the multilag autocorrelation of inflows. To avoid an increase in the state space, the multilag autocorrelation of inflows was represented by the conditional mean of the daily inflow. The use of SDP to optimize multireservoir systems is usually accompanied by the assumption that the various natural inflows are not cross-correlated. This results in solutions that provide a rough estimate of the optimal design or operation policy. To handle this problem, Yeh (1985) suggested the separation of the DP optimization and stream flow generation, or the use of aggregation/decomposition methods similar to those proposed by Turgeon (1980).

a) Dynamic Programming with Successive Approximation

The Dynamic Programming with Successive Approximation (DPSA) method consists of breaking up the original multi-state variable problem into a series of one-state variable sub-problems, in such a manner that the sequence of optimizations over the sub-problems converges to the solution of the original problem. Davis et al. (1972) used DPSA to determine a local feedback policy for each reservoir in a network of reservoir-hydroplant complexes in parallel. Pronovost and Boulva (1978) used Davis's method to obtain an open-loop policy, which gives near-optimal results, rather than a local feedback policy, to eliminate the drawback of this method.
Turgeon (1980) concluded that to obtain an open-loop policy solution, the successive approximation method must be re-solved at the beginning of each period, which may be computationally costly.

b) Aggregation and Decomposition SDP

Turgeon (1980) introduced the aggregation and decomposition method, which consists of breaking up the original n-state variable stochastic problem into n stochastic sub-problems of two state variables that are solved by SDP. The final result of this method is a suboptimal global feedback operating policy for the system of n reservoirs. Furthermore, Turgeon (1980) assumed that the electrical energy generated by any plant is a constant times the turbine releases. Accordingly, instead of utilizing the reservoir storage as a state variable, a potential energy term is created for treating the nonlinearity of the power generation function. Turgeon (1980) applied the DPSA and the aggregation/decomposition methods to a network of 6 reservoir-hydroplant complexes. In his comparative study he concluded that the latter gives a better operating policy with the same time and computer memory requirements. In addition, the computational effort of the aggregation/decomposition method increases linearly with the number of reservoirs, since for each additional reservoir only one additional two-state DP problem has to be solved. Valdes et al. (1992) applied this technique to the four-reservoir lower Caroni hydropower system in Venezuela. Disaggregation was performed both spatially and temporally, resulting in daily operational policies derived from the monthly equivalent reservoir policies. Saad et al. (1994) incorporated neural networks to improve the disaggregation process and to account for nonlinear dependencies between the system elements. The method was successfully applied to finding long-term operational policies for Hydro-Quebec's five-reservoir hydropower system on the La Grande River.
Labadie (1997) indicated that the main problem with the use of state aggregation/decomposition methods is the loss of information that occurs during the aggregation process.

c) The Principle of Progressive Optimality

Turgeon (1981) presented an algorithm based on the principle of progressive optimality of Howson and Sancho (1975), for which the state variables do not have to be discretized. He applied the technique to determine the optimal short-term scheduling of a multireservoir system consisting of four hydropower plants in series. The algorithm does not have the recursive equation in terms of the optimal value function and might be considered a multidimensional continuous version of the IDP procedure. This approach has the advantage of dealing with discontinuous return functions and with hydropower production functions that do not have to be linearized or approximated by convex linear functions. Also, there is no problem of dimensionality, since only one trajectory of the reservoir storage must be stored in computer memory. As this iterative procedure is sensitive to the selected initial solution, Turgeon (1981) proposed the use of DPSA with a very coarse grid of the state variables to obtain a good trial trajectory before using this approach, which can then be solved by a direct search method.

d) SDP with Function Approximation

The discretized "cost-to-go" function can be approximated by a continuous function. Since a value for the cost-to-go function is only calculated at the discretized state values, the value of the function at other points can be estimated by interpolating nearby grid points. Several authors explored the reduction in computational effort possible when multivariate polynomials or piecewise polynomial functions are employed in SDP algorithms (Foufoula-Georgiou and Kitanidis (1988); Johnson et al.
(1989); Johnson et al. (1993); Tejada-Guibert et al. (1993); and Lamond (2003)). Tejada-Guibert et al. (1993) concluded that computational savings are possible, mainly because: (1) the improved accuracy of higher order functions results in good solutions even with a coarse state space discretization, and (2) efficient gradient-based optimization algorithms can be used to compute better approximations to the optimal solutions. Johnson et al. (1993) applied a high order spline function to approximate the cost-to-go function so that a coarse discretization of the state space could be used. The spline is constructed of individual multivariate cubic polynomials, each defined over a sub-region of the state space domain. The spline coefficients were determined by requiring that the spline interpolate the cost function values at each state space grid point. This approach proved successful in reducing the solution time for systems with two to five reservoirs. Tejada-Guibert et al. (1993, 1995) applied these piecewise cubic functions to approximate the cost-to-go function for the five hydropower plants of the Shasta-Trinity system in northern California. They also recommended the sampling SDP algorithm suggested by Kelman et al. (1990) as an attractive approach to describing the distributions and temporal correlations of inflows. Lamond (2003) applied a piecewise polynomial approximation of the future value function for a single hydroelectric reservoir model, concluded that the adopted method is faster than both discrete DP value iteration and a continuous DP method using splines on a fixed grid, and suggested that spline approximation is not well suited when the rewards are piecewise linear. Lamond and Boukhtouta (2005) applied the neuro-dynamic programming (NDP) approach of Bertsekas and Tsitsiklis (1996) to approximate the cost-to-go function by a neural network.
They applied the NDP approach to compute an approximate optimal policy for the control of a single hydroelectric reservoir with random inflows and concave, piecewise linear revenues from electricity sales, taking into account head variations and turbine efficiency. Their NDP approach is based on backward induction with a feedforward neural network, consisting of an input layer, a hidden layer, and a single output layer, used to approximate the future value function. Lamond and Boukhtouta (2005) concluded that the NDP approximation architecture gives very smooth approximate functions, which allowed the use of a coarse discretization of the state and inflow variables in the training step of the neural functions. Their findings confirm Bertsekas's (2001) claim that NDP can be impressively effective in problems where the traditional DP methods would be hardly applicable.

2.2.2.2. Stochastic Linear Programming

Stochastic linear programming (SLP) deals with uncertainty in the model parameters by considering a number of scenarios. Each scenario describes the values of the uncertain parameters and their probability of occurrence. The primary advantage of scenario-based stochastic models is the flexibility they offer in modeling the decision process and in defining scenarios, particularly when the state dimension is high. However, the difficulty with this modeling approach is that a large number of scenarios results in a large scale linear programming problem, which in turn requires special solution algorithms that rely mainly on decomposition approaches. Stochastic linear programming with recourse (SLPR) utilizes scenarios to represent the uncertainty in model parameters in the form of stages. The SLPR in its simple form subdivides the problem into two stages. The first stage decisions are proactive or planning decisions, which are made with certainty, while the second stage decisions are reactive or operating decisions.
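The two-stage structure can be illustrated with a deliberately small numerical sketch: committing firm energy is the first stage ("here and now") decision, and buying or selling on the spot market after the inflow is observed is the recourse. All prices, capacities, and scenarios below are invented, and the deterministic equivalent is solved by brute-force enumeration rather than by an LP solver.

```python
def expected_profit(x, scenarios, firm_price=50.0, spot_buy=80.0,
                    spot_sell=30.0, cap=100.0):
    """Expected profit of committing x units of firm energy before the
    inflow is known. scenarios is a list of (probability, inflow) pairs."""
    profit = firm_price * x                     # first stage: firm sale revenue
    for p, q in scenarios:                      # second stage, per scenario
        g = min(q, cap)                         # generation limited by inflow
        if g >= x:
            profit += p * spot_sell * (g - x)   # sell surplus at the lower spot price
        else:
            profit -= p * spot_buy * (x - g)    # cover the shortfall at a premium
    return profit

scenarios = [(0.5, 40.0), (0.5, 80.0)]          # dry and wet, equally likely
best_x = max(range(0, 101, 10), key=lambda x: expected_profit(x, scenarios))
```

With these numbers the optimal commitment equals the dry-scenario generation (40 units): committing more earns the firm price but risks covering shortfalls at the higher spot price, which is the hedging trade-off the recourse formulation captures.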
Accordingly, SLPR models support the "here and now" decision, while providing a number of "wait and see" strategies depending on which scenario unfolds. These models are non-anticipative: in each stage, decisions must be made without knowledge of the realization of random variables in future stages. When each inflow scenario is treated deterministically, the deterministic variables represent the set of first stage decisions, and the stochastic variables represent future release decisions corresponding to a specific scenario. It should be noted that only the first stage decisions are actually implemented, since the future decisions are not known with certainty. Following the implementation of the first stage decisions, the problem is then reformulated starting with the next period's decisions and solved over the planning horizon. The first applications of two-stage and multi-stage SLP to reservoir management (Pereira and Pinto, 1985, 1991) used the Benders decomposition method (Benders, 1962; Van Slyke and Wets, 1969). This method is powerful because it allows a large-scale problem to be solved iteratively. Moreover, using this technique in a nested form allows multi-stage problems to be decomposed by both scenario and decision period (Birge, 1985). Jacobs et al. (1995) applied SLP using Benders decomposition to a three-reservoir hydropower system in Northern California. Decomposition of the linear programming problem into smaller network flow optimization problems resulted in significant computational savings over attempts at direct solution. Dantzig and Infanger (1997) combined Benders decomposition (Benders, 1962) with the importance sampling technique to reduce the variance of the sample estimates. The dual of the multistage formulation measures the impact of future responses, which is fed back to the model's present time in the form of cuts. These cuts are sequentially added at different stages of the multi-stage dynamic system.
Dantzig and Infanger (1997) indicated that these cuts constitute a set of generated rules that guide the control problem to balance present and future costs and drive the system away from future infeasibilities and towards optimality. Kracman et al. (2006) developed a multistage, multiobjective SLP reservoir planning model for the Highland Lakes system in Texas, applying scenarios generated using a quantile sampling technique. The authors state that the adopted scenario generation technique preserves the spatial correlation of the random inflows.

2.2.2.3. Stochastic Dual Dynamic Programming

Stochastic Dual Dynamic Programming (SDDP), developed by Pereira (1989), represents an interesting mix of stochastic linear and dynamic programming optimization techniques. SDDP solves a multidimensional stochastic dynamic programming problem by approximating the future cost function (FCF) as a piecewise linear function. Unlike conventional SDP, which discretizes the state space and solves the FCF for all points, SDDP samples the state space and solves the DP problem iteratively. The SDDP approach, as presented by Pereira et al. (1999), is described in the following paragraphs. The first phase starts with a backward recursion calculation of the FCF. The slope of the FCF around a given state is calculated by solving a series of one-stage LPs for each inflow scenario. The slopes of the FCF at the different states are estimated from the dual variables of the mass balance constraint, as these multipliers represent the change in the objective function value with respect to storage (df/dS). The resulting cost-to-go function, which is based on the highest value in each state (convex hull), represents a lower bound for the actual FCF. In the second phase, a Monte Carlo simulation is performed in a forward pass, which simulates the system operation using the release policy calculated so far.
Similar to the backward recursion calculations, a set of one-stage LP problems has to be solved for each inflow scenario. The upper bound of the FCF is estimated as the mean of the Monte Carlo simulation results. To address the uncertainty around the true expected value of the cost function, Pereira et al. (1999) used the 95% confidence intervals to estimate the confidence limits around the true values. The optimal solution is obtained if the lower bound of the FCF lies inside the confidence limits. If not, a new iteration with backward and forward calculations has to be performed, adding additional sets of states (additional cuts or planes to the FCF). The states that the simulation passes through are used in the new backward recursion. It should be noted that the planes obtained in each iteration are retained, and the new planes are added to the set generated so far. This is in contrast to conventional SDP, where a new cost-to-go table has to be developed in each stage. Pereira (1989) and Pereira and Pinto (1991) applied SDDP to a hydrothermal scheduling problem in Brazil. Rotting and Gjelsvik (1992) applied SDDP to seasonal planning for a system of 35 reservoirs in 28 river systems, which represents a part of the Norwegian hydropower system. The system is operated to minimize the thermal operating costs while considering the terminal value of water storage. They concluded that the SDDP procedure is successful, and convergence of the algorithm is obtained with a saving in run time over the basic SDP approach by a factor of 16. Halliburton (1997), however, states that: "convergence is questionable for both the U.S. Bonneville Power Administration (BPA) and New Zealand systems".
He summarized the difficulties with SDDP as: non-convergence, long CPU time, difficulties in setting the large number of interacting penalties, and the inability to handle certain types of constraints (nonconvex, applying across a number of time periods, etc.).

2.2.2.4. Chance-Constrained Programming and Reliability Programming

Chance-Constrained Programming (CCP) considers probability conditions on constraints. Typically, the probability of satisfying a constraint is required to be greater than a threshold value. These constraints have the impact of tightening the restrictions on reservoir releases at the desired risk levels, thereby encouraging more conservative operational strategies. CCP converts a stochastic-type problem to a deterministic-type one, and then solves the deterministic equivalent. Loucks and Dorfman (1975) concluded that the use of chance constraints leads to overly conservative rules for reservoir operation. Takeuchi (1986) invoked a CCP model to solve a real-time reservoir operation problem, with the chance constraints set on the probability of the reservoir becoming empty. Changchit et al. (1989) combined CCP with goal programming to operate a multiple reservoir system. Yeh (1985) concluded that CCP formulations neither explicitly penalize constraint violations nor provide recourse actions to correct them. Hogan et al. (1981) warned that the practical usefulness of CCP is seriously limited and that it should not be regarded as a substitute for stochastic programming. Labadie (1997) indicated that CCP does not represent the true risk, which must be estimated by performing Monte Carlo analyses on the proposed operational policies. Colorni and Fronza (1976) initiated the application of reliability programming (RP), regarded as an extension of CCP, to the reservoir management problem.
In their model, risk was accounted for by choosing different probability values that constrain the degree of satisfying the contracted release. Reznicek and Cheng (1991) expressed the probabilities of the constraints as decision variables, which were therefore incorporated in the objective function.

2.2.3. Heuristic Models

Heuristic methods are criteria, methods, or principles for deciding which among several alternative courses of action is the most effective in achieving certain goals (Pearl, 1984). Heuristic algorithms cannot guarantee globally optimal solutions; however, they are well suited to problems that are difficult to formulate and solve by applying algorithmic methods (e.g., nonlinear, nonconvex functions). Genetic algorithms (GA) and artificial neural networks are the most commonly used heuristic methods for the reservoir operations planning problem. Recently, Ant Colony Optimization (ACO) algorithms, which are evolutionary methods based on the foraging behavior of ants, have been successfully applied to a number of benchmark combinatorial optimization problems (Dorigo et al., 2000). ACO was inspired by the behavior of ants in finding the shortest route between their nest and a food source. Jalali et al. (2006) applied ACO to the Dez reservoir in Iran for a finite horizon and a single time series of inflows. The authors concluded that the ACO algorithm provided improved release policies as compared with another GA model. The same conclusion, of ACO outperforming the GA algorithm, was reached by Kumar and Reddy (2006), who applied ACO to a multipurpose reservoir in India.

2.2.3.1. Genetic Algorithm

The genetic algorithm (GA) is a powerful population-oriented search method based on the principle of Darwinian natural selection and survival of the fittest. GA performs optimization through a process analogous to "the mechanics of natural selection and natural genetics" (Goldberg, 1989).
Genetic algorithms deal with a population of individual candidate solutions (strings/chromosomes) which undergo changes by means of the genetic operations of reproduction through selection, crossover, and mutation. These solutions are ranked according to their fitness with respect to the objective function. Based on their fitness values, individuals (parents) are selected for reproduction of the next generation by exchanging genetic information to form children (crossover). The parents are removed and replaced in the new population by the children to keep a stable population size. The result is a new population (offspring) with normally better fitness. After a number of generations, the population is expected to evolve artificially, and the (near) optimal solution will be reached. The global optimum solution, however, cannot be guaranteed since the convexity of the objective function cannot be proven. The GA adjusts populations of release rule structures based on the fitness (objective) function values according to the hydrologic simulation model results. Wardlaw and Sherif (1999) successfully applied GA to a four-reservoir system in which a global optimum was achieved. The authors concluded in their evaluative study that GA provides robust and acceptable solutions and could be satisfactorily used in real-time operations with stochastically generated inflows. Huang et al. (2002) applied a genetic-algorithm-based stochastic dynamic programming approach to cope with the dimensionality problem in a parallel multireservoir system in northern Taiwan to derive a joint long-term operation policy. Huang et al. (2002) concluded that although the use of GA-based SDP may be time consuming as it proceeds from generation to generation, the model could overcome the "dimensionality curse" in searching for solutions. Reis et al. (2005) proposed a hybrid genetic algorithm and linear programming approach for multireservoir operation planning.
Their model handled the stochastic future inflows by a three-stage tree of synthetically generated inflows. They applied their approach to a hypothetical hydrothermal four-reservoir system and compared the results with an SDDP model. The authors concluded that the hybrid scheme offers some computational advantages over the SDDP model; however, it is computationally more time consuming.

2.2.3.2. Artificial Neural Networks

An artificial neural network (ANN) is a model inspired by the structure of the brain that is well suited to complicated tasks such as pattern recognition, data compression, and optimization. In neural network terminology, a formal neuron simulates the behavior of the biological neuron, whose dendrites collect the energy from its input signals and whose axon transmits a signal to other neurons. In the formal neuron, the energy from the dendrites is represented by a weighted sum of the input variables, and the axon transmission is represented by applying a transfer function to the weighted sum of inputs. The training of the ANN is usually performed using a gradient-type back-propagation procedure, which determines the values of the weights on all interconnections that best explain the input-output relationship. ANN has been used within SDP models to approximate the "cost-to-go" function with fewer sampling points. Saad et al. (1994) applied an ANN to the long-term stochastic operation of the hydroelectric multireservoir system of Quebec's La Grande River. The neural network was trained to disaggregate the aggregated storage level of the system into storage levels for each reservoir of the system. The inputs to the network are the aggregated storage levels determined by SDP for the aggregated reservoirs. The neural network is trained by applying a large number of equally likely streamflow sequences. Saad et al. (1994) concluded that, in comparison with the principal components approach, ANN is more efficient.
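The formal neuron just described (a weighted sum of inputs passed through a transfer function) can be sketched in a few lines of Python. This is a generic illustration, not the network of Saad et al. (1994); the logistic transfer function and the example inputs and weights are assumptions made for the sketch.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Formal neuron: a weighted sum of the input signals (the
    'dendrites') passed through a sigmoid transfer function
    (the 'axon' transmission)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # logistic transfer function

# Hypothetical example: two input signals with equal weights.
out = neuron([0.5, -0.2], [1.0, 1.0])
```

In a trained network, back-propagation would adjust the weights and bias to minimize the error between the neuron's output and the target response.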
Raman and Chandramouli (1996) used ANN to obtain optimal release rules based on initial storage, inflows, and demands for the Aliyar reservoir in Tamil Nadu, India. The ANN was trained by applying the results of a deterministic DP model. Raman and Chandramouli (1996) concluded that simulation of operation with rules obtained by the trained ANN outperformed those produced by linear regression analysis, as well as optimal feedback rules obtained from explicit stochastic optimization using SDP.

2.2.3.3. Fuzzy Programming

Several researchers have used fuzzy set theory and fuzzy logic to deal with uncertainties associated with the reservoir operation problem. Fuzzy set theory is generally used to describe imprecision and vagueness. In fuzzy logic, variables are partly represented by several categories, and the degree of belongingness to a set or category can be described numerically by a membership number between 0 and 1.0. Russell and Campbell (1996) used fuzzy logic to derive operating rules for a hydroelectric plant, where the inflow and price of energy can vary. Tilmant et al. (2001) developed a fuzzy SDP approach with fuzzy objectives and fuzzy intersections between the consequences of immediate and future release decisions. Mousavi et al. (2004) developed a technique called fuzzy-state stochastic dynamic programming (FSSDP) for reservoir operation that treats the uncertainties in the hydrologic variables and the imprecision due to variable discretization as fuzzy variables. The transition probabilities are considered by defining fuzzy Markov chains.

2.3. Sampling Techniques

In the application of stochastic programming models to the reservoir optimization problem, we are usually faced with the problem of how to represent the random variables (inflow, demand, prices). The problem becomes rather complex with multivariate random vectors, particularly if these vectors are correlated.
Generation of data trajectories or scenarios represents the most challenging and time-consuming part of the solution of stochastic optimization problems. The objective is to generate trajectories or scenarios that best approximate the given distribution of the random variables in a way that is computationally manageable in the optimization model. Different sampling-based approaches have been proposed to handle the problem of generating scenarios. A number of these methods have been presented by Kaut and Wallace (2003). The following is a brief overview of the generation of data trajectories and sampling methods.

2.3.1. Time Series Models

Time series models are intended to replicate the spatial and temporal structure of the random variables. Examples of time series models include autoregressive models, moving average models, and the Bayesian Vector Autoregression model (VAR). Many of the reported applications of SDP in reservoir management models use lag-1 autoregressive or multivariate models. The use of time series to generate data trajectories involves selecting a model and estimating its parameters. These two steps add to the uncertainty of the analysis. Vogel and Stedinger (1988) have documented that errors arising from parameter estimation often overwhelm issues of model choice.

2.3.2. Conditional Sampling

Because of its simplicity, conditional sampling is the most popular method for generating scenario trees in stochastic programming. It is based on approximating probability measures by empirical ones generated by random samples. Because of computational restrictions, these samples cannot be very large, so the empirical measures can be poor approximations of the original ones. Pennanen and Koivu (2002) show that modern integration quadratures provide a simple and attractive alternative to random sampling.
These quadratures are designed to give good approximations of given probability measures by a small number of quadrature points. Loretan (1997) applied principal component analysis to reduce the dimensionality of the scenario tree. Sampling from principal components allows correlated random vectors to be obtained.

2.3.3. Sampling from Specified Marginals and Correlations

Many techniques are available to generate random distributions (Devroye, 1986). These techniques sample from univariate distributions and are not applicable to sampling multivariate vectors, particularly if they are correlated. In the case of multivariate distributions, some algorithms were developed assuming the correlation matrix and marginal distributions (beta, lognormal, Pearson, etc.) are fully specified (e.g. Cario and Nelson, 1997). Other algorithms sample correlated random variables applying partially specified multivariate distributions (e.g. Lurie and Goldberg, 1998); in this case the user specifies the marginal moments. The various algorithms also differ in the degree to which dependencies among variables are specified. Most algorithms require only the correlation matrix, but a few require higher-order product moments. Parish (1990) presented a method for generating random variables from the multivariate Pearson distribution, given knowledge of all product moments up to the fourth order.

2.3.4. Moment Matching

This method relies on describing the marginal distributions by their moments (mean, variance, skewness, kurtosis, etc.) as well as a correlation matrix, and possibly other statistical properties. Hoyland and Wallace (2001) developed a scenario generation algorithm, which constructs multi-dimensional scenario trees with specified moments and correlations by solving a single, very large, least squares problem. To improve the speed of the solution procedure, Hoyland et al.
(2003) introduced a new algorithm that speeds up the procedure by decomposing the least squares problem into n univariate random variables, each satisfying a specification for the first four moments. The different marginal distributions are then combined so that the joint distribution satisfies the specified correlations and moments, by applying a Cholesky decomposition and a cubic transformation in an iterative procedure. Lurie and Goldberg (1998) applied a similar multivariate decomposition approach, but starting with parametric marginal distributions instead of the marginal moments. Although Hoyland et al. (2003) could not guarantee convergence of their proposed procedure, they concluded that their experience shows it converges if the moment specifications are feasible and there are enough scenarios. They also stated that a potential divergence, or convergence to a wrong solution, is easy to detect; accordingly, there is no risk of ending up using an incorrect tree in the optimization procedure. In terms of computer time, they found that generating trees with 1000 scenarios representing 20 random variables took less than one minute.

2.3.5. Path Based Methods

These methods start by generating several data paths (or fans), which can be done through the use of parametric or nonparametric methods as suggested by Dupacova et al. (2000). In many application areas there exist advanced continuous- and discrete-time stochastic models, and historical time series that serve to calibrate these models. A global scenario generation can be achieved with the calibrated model by simulating many sample paths. These models employ a specified type of probability distribution. Nonparametric methods can be applied to large families of probability distributions which cannot be indexed by a finite-dimensional parameter (distribution-free methods). The next step is to delineate the initial structure of the scenario tree, i.e.
the number of stages and the branching scheme. The additional step to build the scenario tree includes applying ad hoc methods, by cutting and pasting the data paths in an intuitive way. The other possibility, as proposed by Birge and Mulvey (1996), is to apply cluster analysis in a multi-level clustering or bucketing scheme that exploits the whole sequences of observed/simulated data.

2.3.6. Optimal Discretization

Pflug (2001) developed a method for constructing a scenario tree with optimal discretization on the basis of a simulation model of the underlying stochastic process, using a stochastic approximation technique. This method differs from the methods described earlier in that it constructs the whole scenario tree at one time. However, the method deals only with univariate processes.

2.3.7. Scenario Reduction

This method involves developing a much smaller number of scenarios: it determines a scenario subset of prescribed cardinality or accuracy, and a probability measure based on this set that is closest to the initial distribution in terms of a natural probability metric. All deleted scenarios have probability zero. Romisch and Heitsch (2003) presented two new algorithms for approximately computing optimally reduced probability measures. One advantage of the reduction concept is its generality. No requirements are imposed on the stochastic data processes (e.g. the dependency or correlation structure of the scenarios, the scenario probabilities, or the dimension of the process).

2.3.8. Interior Sampling Methods

Interior sampling is another class of sampling methods in which several samples are used at different steps of a particular optimization procedure, for example to estimate function values, gradients, optimality cuts, or bounds corresponding to the second-stage expected value function. Higle and Sen (1991) suggested stochastic decomposition methods.
Infanger (1994) applied importance sampling, which generates samples within the L-shaped algorithm for stochastic linear programming. Importance sampling is typically presented as a method for reducing the variance of the estimate of the expected value of a stochastic variable by carefully choosing a sampling distribution.

2.4. RL Approach

Conventional optimal control methods, dynamic programming for instance, suffer from the 'curse of dimensionality', wherein the large dimensionality of the system at hand and the exponential growth of its possible states prohibit the attainment of an optimal solution even using the fastest computers available today, and most likely in the future. The literature survey conducted has revealed that this area of research is still very active, as new solution techniques are being investigated and developed. One possible angle from which the problem can be tackled is through the use of machine learning techniques from the field of artificial intelligence (AI), particularly Reinforcement Learning (RL). RL has two key advantages over conventional control methods: the potential for learning how to control a larger system in a shorter time, and the ability to do so with or without a formal model of the system. Reinforcement learning (RL) has adapted key ideas from various disciplines, namely machine learning, operations research, control theory, psychology, and neuroscience, to produce some very successful engineering applications (Sutton and Barto, 1998). RL overcomes the curse of dimensionality through the use of function approximation, which allows RL to use much larger state spaces than classical sequential optimization techniques such as dynamic programming. In addition, using sampling, RL can be applied to large-scale problems where it is too complicated to explicitly evaluate and enumerate all the state transition probabilities.
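The sampling idea can be illustrated with the standard tabular Q-learning update, which needs only sampled transitions (s, a, r, s') and never touches a transition probability matrix. The two-state toy problem below is a hypothetical illustration, not part of the reservoir model; the learning rate and discount factor are likewise assumed values.

```python
import random

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s,a) toward the sampled
    target r + gamma * max_a' Q(s', a'), with no explicit model."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

# Hypothetical 2-state, 2-action chain: taking action 1 in state 0
# pays 1 and stays in state 0; every other transition pays 0.
random.seed(0)
Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
s = 0
for _ in range(2000):
    a = random.choice((0, 1))                 # pure exploration
    r = 1.0 if (s == 0 and a == 1) else 0.0
    s_next = 0 if (s == 0 and a == 1) else random.choice((0, 1))
    q_update(Q, s, a, r, s_next)
    s = s_next
```

After enough sampled transitions, the learned Q-values favor the rewarding action in state 0, even though the transition probabilities were never enumerated.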
Modern reinforcement learning can be applied both to trial-and-error learning without a formal model of the environment, and to planning activities with a formal model of the environment, where an estimate of the state-transition probabilities and immediate expected rewards can easily be evaluated. Sutton and Barto (1998) and Bertsekas and Tsitsiklis (1996) state that: "RL has become popular as an approach to artificial intelligence because of its simple algorithms and mathematical foundations and also because of a series of successful applications". Sutton (1999) concluded that this approach has already proved to be very effective in many applications, as it has produced the best of all known methods for playing backgammon (Tesauro, 1995), dispatching elevators (Crites et al., 1996), job-shop scheduling (Zhang and Dietterich, 1996), and assigning cellular-radio channels (Singh and Bertsekas, 1996). Ernst et al. (2003) applied RL for power systems stability control. Abdulhai et al. (2003) applied RL for true adaptive traffic signal control. In the water resources sector the application of this approach has been very limited. Wilson (1996) applied the RL technique in the real-time optimal control of hydraulic networks. Bhattacharya et al. (2003) successfully applied the RL technique in real-time control (RTC) of the Delfland water system in the Netherlands, which includes Delft, The Hague, and part of Rotterdam, covering an area of about 367 km² and consisting of about 60 polders with 12 pumping stations. Bhattacharya et al. (2003) concluded that in all applications involving some sort of control function (urban drainage systems, polder water level maintenance, and reservoir operation), RL has substantial potential. RL is a machine learning approach that can be used to derive an optimal control strategy. RL concerns the problem of a learning agent interacting with its environment to achieve a goal (Sutton, 1999).
The agent continuously maps situations to actions so as to maximize a reward signal. The learner is not told what to do, as in most forms of machine learning techniques, but instead must discover which actions yield the most rewards by trying them (Sutton, 1999). These two characteristics, trial and error search and delayed reward, are the two most important distinguishing features of reinforcement learning.

2.5. Conclusions

The literature review carried out shows that this area of research is still very active and that different optimization approaches and modeling techniques are being tried for dealing with the reservoir systems optimization problem. The review shows that employing an explicit stochastic optimization approach would be the most advantageous, since it provides the best representation of this complex problem. The main obstacle that needs to be addressed and resolved, however, is the high dimensionality of the problem. From this literature review, it can be concluded that DP algorithms remain a very powerful technique for handling the nonlinear, stochastic, large-scale reservoir optimization problem. Among the numerous efforts attempted to alleviate the curse of dimensionality problem, which is aggravating the large-scale SDP method, function approximation techniques and/or sampling techniques resulted in some successful applications to the multireservoir hydropower generation operations planning problem. One promising approach addressing the possibility of combining these two techniques (function approximation and sampling techniques) within an SDP formulation is the Reinforcement Learning (RL) technique. The following chapter presents the main concepts and computational aspects of RL techniques.

3. THE REINFORCEMENT LEARNING APPROACH

3.1.
Introduction

Reinforcement learning (RL) is a computational approach for learning from interactions with an environment and from the consequences of actions, in order to derive optimal control strategies. RL has adapted key ideas from various disciplines, namely machine learning, operations research, control theory, psychology, and neuroscience (Sutton and Barto, 1998). RL has become popular as an approach to artificial intelligence because of its simple algorithms and mathematical foundations (Bertsekas and Tsitsiklis, 1996) and because of a number of successful applications in different domains, e.g. control problems, robot navigation, economics and management, networking, games, etc. (Sutton, 1999). The successful applications of RL surveyed, and the key advantages that RL offers in handling large-scale problems, provided the motivation to research the possibility of applying this approach to solve the large-scale problem of operations planning of multireservoir systems. The following sections of this chapter introduce the main concepts and computational aspects of RL methods and present the distinguishing features and elements of the RL approach, including the trial and error learning of policies, the concept of delayed rewards, and the exploration and exploitation of policies. The chapter also focuses on the three main classes of methods to solve the RL problem, namely: (1) dynamic programming algorithms and their relation to the Markovian decision process (MDP) and the Bellman principle of optimality, (2) Monte Carlo methods, and (3) Temporal-Difference learning methods. More advanced RL methods that unify the basic ideas of the above three methods are also described; these include eligibility traces, function approximation, and generalization. More comprehensive reviews of RL can be found in Sutton and Barto (1998), Kaelbling et al. (1996), and Bertsekas and Tsitsiklis (1996).

3.2.
Reinforcement Learning Approach versus Dynamic Programming

Dynamic Programming (DP) is a very powerful technique for handling sequential, nonlinear, and stochastic optimization problems. DP guarantees the attainment of optimal solutions to MDPs. However, DP requires that the values of the transition probabilities and the transition rewards of the system be calculated. The transition probabilities and the expected immediate rewards are often known as the theoretical model of the system. For large-scale systems that involve several stochastic variables, constructing the model of the environment is quite a difficult task. Gosavi (2003) states that: "Evaluating the transition probabilities often involves evaluating multiple integrals that contain the probability density functions (pdfs) of random variables. It is for this reason that DP is said to be plagued by the curse of modeling". Compared with DP methods, linear programming (LP) methods can also be used to solve MDPs. Sutton and Barto (1998) indicated that LP becomes impractical at a much smaller state space (by a factor of about 100) and concluded that for the largest problems, DP methods are the only feasible and practical solution methods. For DP problems, assuming a system with m state discretizations per reservoir and n reservoirs, the computational time and storage requirement is proportional to m^n. Consider the case of a system with one hundred state discretizations for each of two reservoirs: the number of possible state combinations in one period is 100^2 = 10^4. This exponential increase in the state space is often known as the curse of dimensionality (Bellman, 1957). Assuming that there are five possible actions for each state, the transition probability matrices would consist of 5 x 10^4 x 10^4 = 5 x 10^8 elements. As this simple example shows, it is obvious that for such problems with a large state space, storage of the transition matrices will be a difficult task indeed.
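The arithmetic of this example is easy to reproduce; the short sketch below (a generic illustration, not tied to any particular reservoir system) tabulates the m^n state combinations and the resulting size of the transition probability matrices.

```python
def dp_sizes(m, n, n_actions):
    """State-space and transition-matrix sizes for a DP model with
    m storage discretizations per reservoir and n reservoirs."""
    n_states = m ** n                        # curse of dimensionality
    matrix_elems = n_actions * n_states ** 2 # one n_states x n_states
                                             # matrix per action
    return n_states, matrix_elems

# The two-reservoir example from the text: 100 discretizations per
# reservoir and 5 possible actions per state.
states, elems = dp_sizes(m=100, n=2, n_actions=5)
```

Adding a third reservoir at the same discretization multiplies the state count by another factor of 100, which is why tabular DP becomes infeasible so quickly.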
Obviously, the two main problems limiting the capabilities of DP are the excessive memory needed to store large tables and the very long computation time required to fill those tables. One possibility to tackle these problems is through the use of machine learning techniques from the field of artificial intelligence (AI), particularly Reinforcement Learning (RL). RL offers two key advantages in handling problems that are too large to be solved by conventional control methods:

1. The ability to solve MDPs with or without the construction of a formal model of the system. By using sampling (simulation), RL can be applied to large-scale problems that are too complicated to explicitly evaluate and enumerate all the transition probabilities and the expected immediate rewards of the system. In this way RL provides a means to avoid the curse of modeling.

2. The potential for learning how to control a larger system. RL can overcome the curse of dimensionality through the use of function approximation methods. For small-scale problems, RL stores the elements of the value function in lookup tables called Q-tables (tabular methods). However, as the state space increases, RL can use function approximation methods, which require only a limited number of parameters to approximate the value function over a large number of states.

The following figure highlights the difference in methodology between DP and RL.

[Figure 3.1. Reinforcement Learning and Dynamic Programming comparison (adopted from Gosavi, 2003). Both approaches start from the inputs (distributions of the random variables). RL uses no model (a simulator or the real system) and applies an RL algorithm (Q-Learning/SARSA) to reach an approximate, near-optimal solution; traditional DP builds a model (generating the transition probability and transition reward matrices) and applies a DP algorithm (policy iteration/value iteration or LP) to reach the optimal solution.]

3.3.
Reinforcement Learning Problem Formulation

RL concerns the problem of a learning agent that relies on experience gained from interacting with its environment to improve the performance of a system over time. The learner (or decision maker) is called the agent, and the object it interacts with is called the environment. The environment could be a simulator of the system or the real system. Together, the agent and the environment constitute a dynamic system. Unlike supervised learning techniques, which require examples of input-output pairs of the desired response to be provided explicitly by the teacher, the learning agent is not told what to do. Instead, the agent must continuously learn from the consequences of its actions. The agent perceives how well its actions perform by receiving a reward signal. This reward signal indicates what the agent should do, or how to modify its behavior, without specifying how to do it. The agent uses this signal to determine a policy that leads to achieving a long-term objective. This trial and error interaction with the environment and the delayed rewards are the two main distinguishing features of the reinforcement learning method.

3.3.1. RL Basic Elements

The main components of an RL algorithm are: an agent, an environment, and a reward function. The interaction between the agent and its environment can be modeled within the framework of Markov decision processes. The agent and the environment interact in a sequence of discrete time steps, t = 0, 1, 2, 3, .... At each step the agent receives some indication of the current state of the environment, s_t ∈ S, where S is the set of all states, and then it selects an action a_t ∈ A, where A is a finite set containing all possible actions. The agent interacts with the environment and receives feedback in the form of a stochastic transition to a new state s_{t+1} and a numerical reward r(s_t, a_t) as defined by the reward function.
Through this delayed reward and guided search process, the agent learns to take appropriate actions that maximize the cumulative rewards over time (Sutton, 1999). A schematic representation of the agent-environment interaction is presented in Figure 3.2.

[Figure 3.2. Agent-Environment interaction in RL. The agent (decision maker) selects an action a_t; the environment (simulator/real system) responds with the new state s_t and a reward.]

3.3.2. Exploration/Exploitation

One of the key aspects of reinforcement learning is that the learner needs to explicitly explore its environment in its search for good rewards. The feedback that the agent receives from its environment indicates how good the action was, but it does not indicate whether it was the best or the worst action possible (Sutton and Barto, 1998). Accordingly, two conflicting objectives arise during the action selection process. One objective is to achieve high-valued short-term rewards by selecting actions that are already known to be good (exploitation). On the other hand, the agent has to explore new actions for better action selections in the future (exploration). Two popular methods for balancing exploration and exploitation in RL are the ε-greedy and softmax action selection rules. In the ε-greedy method, the learner behaves in a greedy way most of the time (by selecting the action with the highest estimated value), while a non-greedy exploratory action is taken every now and then with a small probability ε (by selecting a random action uniformly, regardless of its value). The advantage of the ε-greedy method is that, in the limit and as the number of trials increases, every action is sampled an infinite number of times, and the probability of selecting the optimal action converges to a value greater than 1-ε (Sutton and Barto, 1998). The disadvantage of the ε-greedy method, however, is that in the exploration process it chooses equally among all actions, regardless of the estimated value of the chosen action.
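A minimal sketch of the ε-greedy rule just described, illustrating the uniform random choice among all actions during exploration; the action-value estimates and ε used here are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick uniformly among ALL actions
    (exploration, regardless of estimated value); otherwise pick
    the action with the highest estimated value (exploitation)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

random.seed(1)
q = [0.2, 0.9, 0.5]          # hypothetical estimated action values
picks = [epsilon_greedy(q, 0.1) for _ in range(1000)]
```

With ε = 0.1, the greedy action (index 1 here) is selected on roughly 93% of the steps, while the remaining exploratory picks are spread evenly over all three actions.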
The result is that the learner could choose equally between the worst action and the second-best action in the exploration process. In the softmax action selection method, the action selection probabilities are ranked according to their estimated values. The greedy action is selected as the one with the highest action selection probability, and all other actions are weighted proportionally to their estimated action values. Frequently, the softmax method uses a Gibbs or Boltzmann distribution, choosing the probability P(a) of an action a as:

P(a) = exp(Q(a)/τ) / Σ_b exp(Q(b)/τ)    (3.1)

where Q(a) and Q(b) are the estimated action values and τ is a temperature parameter that controls the probability distribution. Higher temperatures result in actions being selected with nearly equal probability (exploration). On the contrary, low temperatures cause a greater difference in the probability of selecting actions according to their estimated values (exploitation). Gradually, the temperature τ is decreased over time to limit the exploration process.

3.3.3. Return Functions

One can distinguish two main types of RL tasks: episodic and continuous tasks. In episodic tasks, the horizon represents a finite number of steps in the future, and there is a terminal state where the episode ends. On the other hand, in continuous tasks, the horizon represents an infinite sequence of interactions between the environment and the agent. The goal of the agent is to maximize the accumulated future rewards, and the return function R_t is a long-term measure of such rewards. In the case of finite-horizon tasks, the return is the sum of the rewards from the beginning to the end of the episode:

R_t = r_{t+1} + r_{t+2} + r_{t+3} + .... + r_T    (3.2)

where T is the number of stages in the episode and R_t is the return received after t time steps. For continuous tasks, the infinite-horizon discounted model takes the long-term rewards into account by discounting the rewards received in the future by a discount factor γ, where 0 ≤ γ < 1.
The return function then becomes:

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ....    (3.3)

where k in the general term γ^k r_{t+k+1} is the number of time steps in the future. The discount rate determines the present value of future rewards. If γ = 0, the agent is said to be myopic and only considers immediate rewards. As γ increases, the returns increase and the agent gives more consideration to future rewards. In the mathematical formulation of the reservoir operations planning model presented in chapters 4 and 5, it is assumed that the current period's rewards are realized at the end of each time period; accordingly, the discount factor γ is applied to both the present period and the future rewards. In the following sections the focus will be on discounted rewards, as this approach is more appropriate to reservoir operation problems.

3.3.4. Markovian Decision Process (MDP)

RL relies on the assumption that the system dynamics can be modeled as a Markovian decision process (MDP). An environment is said to satisfy the Markov property if the signal from its state completely captures and summarizes the past outcomes in such a way that all relevant information is retained. In general, the response of the environment at time t+1 to an action taken at time t depends on past actions. The system dynamics can be defined by specifying the complete probability distribution:

Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t-1}, a_{t-1}, ...., r_1, s_0, a_0}    (3.4)

for all s', r, and all possible values of the past events: s_t, a_t, r_t, ...., r_1, s_0, a_0. The MDP framework has the following elements: the state of the system, actions, transition probabilities, transition rewards, a policy, and a performance metric (return function). An MDP involves a sequence of decisions in which each decision affects what opportunities are available later. The Markov property means that the outcome of applying an action to a state depends only on the current state.
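The MDP dynamics and the discounted return of Equation (3.3) can be combined in a short sketch: a small hypothetical finite MDP (not the reservoir system; its states, transitions, and rewards are invented for illustration) is sampled for a few steps and the discounted return is accumulated.

```python
import random

# Hypothetical finite MDP: P[s][a] lists (probability, next_state,
# reward) triples for each state-action pair.
P = {
    0: {0: [(1.0, 0, 1.0)], 1: [(0.5, 0, 0.0), (0.5, 1, 2.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.5)]},
}

def step(s, a, rng):
    """Sample one Markovian transition from state s under action a:
    the outcome depends only on the current state and action."""
    u, acc = rng.random(), 0.0
    for prob, s_next, r in P[s][a]:
        acc += prob
        if u <= acc:
            return s_next, r
    return P[s][a][-1][1:]  # numerical safety for rounding

def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rng = random.Random(42)
s, rewards = 0, []
for _ in range(10):
    s, r = step(s, a=0, rng=rng)   # always take action 0
    rewards.append(r)
R = discounted_return(rewards, gamma=0.9)
```

Note that with gamma = 0 the return collapses to the first reward only, matching the "myopic" agent described in the text.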
If the state and action spaces in an MDP are finite, then it is called a finite Markov decision process. If the state signal has the Markov property, then the environment's response at t+1 depends only on the state and action representations at t, and the environment's dynamics can be defined by the transition probability:

p^a_{ss′} = Pr{s_{t+1} = s′ | s_t = s, a_t = a}    (3.5)

where p^a_{ss′} is the probability of moving to state s′ from a given state s under action a, and by the expected value of the immediate reward:

R^a_{ss′} = E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′}    (3.6)

These two quantities, p^a_{ss′} and R^a_{ss′}, completely specify the most important aspects of the dynamics of an MDP.

3.4. Reinforcement Learning Algorithms

The goal of RL algorithms is either to evaluate the performance of a given policy (prediction problems) or to find an optimal policy (control problems). In prediction problems, the value function (state-value function) for a given policy π is estimated as follows:

V^π(s) = E_π{R_t | s_t = s} = E_π{Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s}    (3.7)

where the policy π is a mapping from states s ∈ S to the probability of selecting each possible action; π(s,a) is the probability of taking action a when the agent is in state s. In control problems, the value function (action-value function) for policy π is defined as follows:

Q^π(s,a) = E_π{R_t | s_t = s, a_t = a} = E_π{Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a}    (3.8)

where Q^π(s,a) is defined as the expected return starting from state s, taking action a, and thereafter following policy π.
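For a finite MDP, the quantities p^a_{ss′} and R^a_{ss′} of equations (3.5) and (3.6) can be held in simple tables. The two-state example below is purely illustrative (states, actions, and numbers are invented, loosely evoking a reservoir that can release or store):

```python
# MDP[s][a] is a list of (probability, next_state, expected_reward)
# triples, i.e. a tabular form of p^a_{ss'} and R^a_{ss'}.  Invented numbers.
MDP = {
    "low":  {"release": [(1.0, "low", 5.0)],
             "store":   [(0.7, "high", 0.0), (0.3, "low", 0.0)]},
    "high": {"release": [(1.0, "low", 10.0)],
             "store":   [(1.0, "high", 1.0)]},
}

def expected_reward(s, a):
    """Probability-weighted expected immediate reward of the pair (s, a)."""
    return sum(p * r for p, _, r in MDP[s][a])
```

Each row's probabilities sum to one, which is what makes the table a valid transition model.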
The recursive relationship between the value of state s and the values of its possible successor states is:

V^π(s) = E_π{R_t | s_t = s} = Σ_a π(s,a) Σ_{s′} p^a_{ss′} [R^a_{ss′} + γ V^π(s′)], ∀ s ∈ S    (3.9)

Equation (3.9) is the Bellman equation, which states that the value of state s must equal the discounted value of the expected next state plus the reward expected along the way. The value function V^π(s) is the unique solution to the Bellman equation. A policy π is better than a policy π′ if V^π(s) ≥ V^{π′}(s) for all s ∈ S. The optimal policy, which has a better value function than all other policies, is defined by:

V*(s) = max_π V^π(s), ∀ s ∈ S    (3.10)

The optimal action-value function Q*(s,a), in terms of V*(s), is:

Q*(s,a) = E{r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a} = Σ_{s′} p^a_{ss′} [R^a_{ss′} + γ V*(s′)]    (3.11)

Once we have the optimal value function V* for each state, the actions that appear best after a one-step search will be optimal actions. Hence, a one-step-ahead search yields the long-term optimal actions. With Q*, the agent does not even have to do a one-step-ahead search: for any state s, it can simply find any action that maximizes Q*(s,a). The action-value function effectively memorizes, or stores, the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair (Sutton and Barto, 1998). The following sections describe the three fundamental classes of methods for solving the RL problem: dynamic programming (DP), Monte Carlo techniques (MC), and temporal difference learning methods (TD). Dynamic programming is a planning method, which requires a model of the environment (model-based), whereas the MC and TD methods are learning methods, which can learn solely from experience without using a model of the environment (model-free).

3.4.1.
Dynamic Programming Methods (DP)

Dynamic programming provides the theoretical foundation of RL algorithms. DP methods are used to solve MDPs, assuming a perfect knowledge of the model of the environment. The key idea of DP methods is the use of value functions and the Bellman equation of optimality, applied recursively, to guide the search for optimal policies. DP methods are bootstrapping methods, as they update one estimate of the value function based on the estimates of successor states. The two most widely used DP methods for calculating the optimal policies and value functions are policy iteration and value iteration. The following is a brief overview of these two methods. Policy evaluation refers to the iterative computation of the value function V^π for a given policy π. Initial values V_0(s) are assumed for all states, and successive approximations of the value function are obtained by applying the Bellman equation as a recursive update rule:

V_{k+1}(s) ← Σ_{s′} p^a_{ss′} [R^a_{ss′} + γ V_k(s′)], ∀ s ∈ S, a = π(s), k = 0, 1, 2, ...    (3.12)

In practice, a stopping criterion for the iterative process is commonly applied once the term max_{s∈S} |V_{k+1}(s) − V_k(s)| is sufficiently small (Sutton and Barto, 1998). The estimated action values are used as a basis to find a better policy. If the action value Q^π(s,a) > V^π(s) for some a ≠ π(s), then action a is taken and the policy is changed to π′, where π′(s) = a. This process of taking greedy actions with respect to the current policy is called policy improvement. The new greedy policy π′ is given by:

π′(s) ← argmax_a Σ_{s′} p^a_{ss′} [R^a_{ss′} + γ V^π(s′)]    (3.13)

where argmax_a denotes the action a at which the value function is maximized. Combining the policy evaluation step with the policy improvement step yields the policy iteration algorithm. Thus we can obtain a sequence of improved policies and value functions: π_0 →(E) V^{π_0} →(I) π_1 →(E) V^{π_1} →(I) ... π* →(E) V*.
Here E denotes policy evaluation and I denotes policy improvement. Another way of solving MDPs is the value iteration algorithm. Similar to the policy iteration method, value iteration also combines the policy improvement step and the policy evaluation step. However, in the value iteration algorithm the policy evaluation step is truncated after one sweep over the state space and is followed by a policy improvement step. Value iteration estimates the optimal policy directly, with the maximum taken over all actions:

V_{k+1}(s) ← max_a Σ_{s′} p^a_{ss′} [R^a_{ss′} + γ V_k(s′)], ∀ s ∈ S    (3.14)

The policy iteration and value iteration methods converge in the limit to the optimal value function V*(s) due to the contraction property of the operator in (3.14) (Bertsekas and Tsitsiklis, 1996).

3.4.2. Monte Carlo Methods (MC)

As stated earlier, DP methods require that a model of the environment be available, including the transition probabilities and the expected immediate rewards. However, in many cases the exact model of the system is not known, and in other cases, such as large-scale systems, constructing the model could be a difficult task indeed. In such cases, learning the value function and the optimal policies directly from experience could be more efficient. MC methods estimate the value function from the experience of the agent. This experience is gained by sampling sequences of states, actions, and rewards from on-line or simulated interaction with an environment. MC methods use the mean return of many random samples to estimate the expected value function V^π and the action-value function Q^π. As more samples are observed, their average return converges to the expected value of the value function. A sample return is the sum of the rewards received starting from state s, or from a state-action pair (s, a), and following policy π until the end of a learning episode.
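The value iteration update of equation (3.14) can be sketched on a toy problem. The two-state transition tables below (`P[s][a]` as lists of `(probability, next_state, reward)` triples) are invented for illustration, not taken from the thesis:

```python
# Invented two-state, two-action MDP for demonstration purposes.
P = {
    "low":  {"release": [(1.0, "low", 5.0)],
             "store":   [(0.7, "high", 0.0), (0.3, "low", 0.0)]},
    "high": {"release": [(1.0, "low", 10.0)],
             "store":   [(1.0, "high", 1.0)]},
}

def value_iteration(P, gamma=0.9, tol=1e-8):
    """Sweep the update of equation (3.14) over all states until the
    largest per-sweep change falls below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
                        for outs in P[s].values())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```

The contraction property cited above is what guarantees this sweep terminates and that the final error is bounded by tol·γ/(1−γ).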
As complete returns can only be obtained at the end of a learning episode, MC methods can only be defined for finite horizon (episodic) tasks. One can design MC control methods by alternating between policy evaluation and policy improvement over complete episodes. Observed returns at the end of the episode are used for policy evaluation and then for improving the policy at all states visited in the episode. There is one complication that arises in this method, however: the experience gained by interaction with the environment contains samples only for the actions that were generated by policy π, and the values of all other possible actions are not included in the estimate. Those values are needed for comparing alternatives in the policy improvement step. Therefore, maintaining sufficient exploration is a key issue in MC control methods. There are two approaches to ensure that the agent selects all actions sufficiently often, namely on-policy and off-policy control methods. In the on-policy control method, the agent uses a soft stochastic policy, meaning that π(s,a) > 0 for all s ∈ S and all a ∈ A(s), to evaluate and improve the performance of that same policy. The agent commits to continuous exploration and tries to find the best policy in the process. The other approach is the off-policy method: the agent uses one policy, called the behavior policy, to interact with the environment. Another policy, unrelated to the behavior policy, is evaluated and improved; it is called the estimation policy. An advantage of this approach is that the agent learns a deterministic (e.g., greedy) optimal policy (the estimation policy) while following an arbitrary stochastic policy (the behavior policy), thereby ensuring sufficient exploration in the process. MC methods differ from DP methods in two main ways (Sutton and Barto, 1998). First, they operate on sample experience.
Therefore, they can be used for direct learning from interaction with the environment, without a model. Second, they do not bootstrap; i.e., they do not build their value estimate for one state on the basis of the estimates of successor states.

3.4.3. Temporal Difference (TD)

Temporal difference (TD) learning methods represent a central and key idea in RL. TD methods are a class of incremental learning procedures in which credit is assigned based on the difference between temporally successive predictions (Sutton, 1988). TD methods combine the ideas of the DP and MC methods. Similar to MC methods, the TD approach can learn directly from real or simulated experience without a model of the environment. TD methods share with DP the bootstrapping feature in estimating the value function (Sutton and Barto, 1998). However, TD methods have some advantages over the DP and MC methods. TD methods use sample updates instead of the full updates of DP methods: the agent observes only one successor state while interacting with the environment, rather than using the values of all possible successor states weighted according to their probability distribution. Accordingly, learning the optimal policy from experience does not require constructing a model of the environment's dynamics. In addition, unlike MC methods, TD methods do not need to wait until the end of the episode to update their estimate of the value function. The simplest versions of such algorithms are usually referred to as one-step TD, or TD(0). TD(0) waits only for the next time step to update its estimate of the value function, based on the value of the observed transition state V(s_{t+1}) and the immediate reward r_{t+1} received from the environment. The following update is performed at each time step:

V(s_t) ← V(s_t) + α_t [r_{t+1} + γ V(s_{t+1}) − V(s_t)]    (3.15)

where α_t is a step-size parameter, 0 < α_t < 1, which represents the learning rate.
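The TD(0) update of equation (3.15) can be sketched as a short tabular routine; a minimal sketch with a constant step size α, not from the thesis:

```python
def td0_evaluate(episodes, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha*[r + gamma*V(s') - V(s)].
    `episodes` is a list of episodes, each a list of (s, r, s_next)
    transitions; s_next is None at a terminal state (value taken as 0)."""
    V = {}
    for episode in episodes:
        for s, r, s_next in episode:
            v_next = V.get(s_next, 0.0) if s_next is not None else 0.0
            V[s] = V.get(s, 0.0) + alpha * (r + gamma * v_next - V.get(s, 0.0))
    return V
```

Repeating a one-step episode with reward 1, for example, drives V of that state toward 1, since the bracketed error term shrinks geometrically with each update.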
The TD(0) update rule computes a stochastic approximation of V^π. It states that:

New estimate ← Old estimate + Step size · [Target − Old estimate]    (3.16)

The target for the TD update is r_{t+1} + γ V(s_{t+1}). The term [Target − Old estimate] represents the error in the estimate, or the temporal difference between two successive evaluations of the value function. This error is reduced by taking a step toward the target. The step size can be defined as 1/n, where n is the number of samples generated. This stochastic approximation algorithm, which produces a sequence of estimates of V(s_t) whose error → 0, is based on an old algorithm (Robbins and Monro, 1951). For any fixed policy π, the TD algorithm is proven to converge to V^π in the limit with probability 1 if the step size decreases according to the stochastic approximation conditions: Σ_{n=1}^{∞} α_n = ∞ and Σ_{n=1}^{∞} α_n² < ∞. The first condition guarantees that the steps are large enough to overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence (Sutton and Barto, 1998). In the case of the control problem, i.e., estimation of the action values Q(s,a), TD methods are used for the evaluation or prediction part of the policy iteration algorithm. As with the MC methods, sufficient exploration is required to assure convergence, which again can be achieved by applying either the on-policy or the off-policy approach.

3.4.3.1. SARSA

The name SARSA refers to the update rule applied in this algorithm, which follows the sequence of events: current state (s_t), current action (a_t), resulting reward (r_{t+1}), next state (s_{t+1}), and next action (a_{t+1}). SARSA is an on-policy TD control algorithm that learns and improves Q^π for the policy π that selects and follows its actions; π is ε-greedy with respect to the estimated Q^π so far.
The update rule performed at each time step is:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]    (3.17)

This update is done after every transition from a non-terminal state s_t. If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) is defined as zero. Similar to TD(0), SARSA converges with probability 1 to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy.

3.4.3.2. Q-Learning

Q-learning, first introduced by Watkins (1989), is regarded as one of the breakthroughs in RL. The simplest form of the Q-learning algorithm, one-step tabular Q-learning, is based on the temporal difference method TD(0). In this method, the elements of the estimated action-value function are stored in a so-called Q-table. The agent uses the experience from each state transition to update one element of the table (Sutton, 1999). The Q-table has an entry Q(s,a) for each state s and action a. After taking action a_t, the system is allowed to transition from state s_t to the next state s_{t+1}. The immediate reward received as feedback from the environment is used to update the Q-table for the selected action. The next action-value estimate Q(s_{t+1}, a_{t+1}) used in the update is selected according to the ε-greedy policy: most of the time the next state-action pair is selected as the one with the maximum estimated value, and with a small probability (1 − ε) a random exploratory action is selected instead. The following is a procedural list of the Q-learning algorithm:

Initialize Q(s,a) arbitrarily (to any feasible values)
Repeat for each episode:
    Initialize s
    Repeat for each step in the episode:
        Choose a from s using a policy derived from Q (e.g.
ε-greedy)
        Take action a, observe r, s′
        Q(s,a) ← Q(s,a) + α [r + γ max_{a′} Q(s′,a′) − Q(s,a)]
        s ← s′
    until s is terminal

If every state-action pair is visited infinitely often and the learning rate is decreased over time, the Q-values converge to Q* with probability 1 (Sutton and Barto, 1998). Q-learning is an off-policy algorithm in the sense that the agent learns the value of the optimal policy while following an arbitrary stochastic policy that is independent of the policy being learned. An example including sample numerical calculations using the Q-learning method is presented in Appendix A.

3.4.3.3. Eligibility Traces

Monte Carlo methods perform updates based on the entire sequence of observed rewards until the end of the episode. On the other hand, one-step TD methods use the immediate reward and the sampled next-state estimate to perform the update. In between the one-step and full-episode backups there are n-step backups, based on n steps of discounted truncated returns and the discounted estimated value V_t(s_{t+n}) of the nth next state. The n-step return is defined as:

R_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})    (3.18)

The n-step backups are still TD methods, as they still change an estimate based on an earlier estimate:

V_{t+1}(s_t) ← V_t(s_t) + α [R_t^{(n)} − V_t(s_t)]    (3.19)

One step further is to compute the updates of the estimated value function from an average of several n-step returns. This type of learning is denoted the TD(λ) algorithm, where λ is an eligibility trace parameter (trace-decay parameter), 0 ≤ λ ≤ 1. The TD(λ) algorithm averages the n-step backups, each weighted proportionally to λ^{n−1}. The resulting λ-return is defined as:

R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)}    (3.20)

It is obvious from the above equation that by setting λ = 1 we get the MC updates, whereas by setting λ = 0 we get the one-step TD(0) updates.
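The n-step and λ-returns of equations (3.18) and (3.20) can be sketched for an episodic task. This is a minimal sketch, not from the thesis: for an episode of T steps the infinite sum in (3.20) is truncated, with the remaining weight λ^{T−1} falling on the full Monte Carlo return.

```python
def n_step_return(rewards, v_end, gamma, n):
    """Equation (3.18): n discounted rewards plus the discounted value
    estimate v_end of the state reached after n steps."""
    return sum(gamma ** k * rewards[k] for k in range(n)) + gamma ** n * v_end

def lambda_return(rewards, values, gamma, lam):
    """Equation (3.20) for an episode of T = len(rewards) steps;
    values[n] is the current estimate V(s_{t+n}), and the value of the
    terminal state is taken as 0."""
    T = len(rewards)
    total = sum((1 - lam) * lam ** (n - 1) *
                n_step_return(rewards, values[n], gamma, n)
                for n in range(1, T))
    return total + lam ** (T - 1) * n_step_return(rewards, 0.0, gamma, T)
```

Setting λ = 0 reproduces the one-step TD(0) target, and λ = 1 the Monte Carlo return, as noted above.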
The corresponding update is:

V_{t+1}(s_t) ← V_t(s_t) + α [R_t^λ − V_t(s_t)]    (3.21)

A simpler and more efficient way of implementing the TD(λ) method is the backward-view TD(λ) learning algorithm. This algorithm introduces an additional memory variable associated with each state at each time step, called the eligibility trace e_t(s). An eligibility trace is a temporary record of the occurrence of an event, such as visiting a state or taking an action. This variable specifies the eligibility of a particular event for updating the value function. At each time step, the eligibility traces of all states decay by γλ, except for the state visited at that time step, whose trace is also incremented by 1:

e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t    (3.22)

In other words, eligibility traces can be thought of as weighted adjustments to predictions occurring n steps in the past; more recent predictions make greater weight changes. The TD(0) error, or temporal difference, at time step t, denoted δ_t, is calculated as follows:

δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)    (3.23)

On every time step, all the visited states are updated according to their eligibility traces:

V_{t+1}(s) ← V_t(s) + α δ_t e_t(s)    (3.24)

Although eligibility traces require more computation than TD(0) methods, they offer significantly faster learning, particularly when rewards are delayed by many steps. TD(λ) is proven to converge under the stochastic approximation conditions with probability 1 (Tsitsiklis, 1994). Figure 3.3 presents a schematic comparison of the DP, MC, and TD backup methods and the calculation of the value function.

Figure 3.3. Comparison of Different RL Approaches (adapted from Sutton and Barto, 1998).

3.5. Function Approximation

When the state space is finite, the most straightforward DP approach is to use a lookup table to store the value function for each state or state-action combination. In reality, the state space could be quite large or even infinite and could include continuous variables.
By using model-free RL methods one can avoid the need to construct the transition probability matrix. However, this does not solve the problem completely, as the large memory and the long time needed to fill in the elements of the lookup tables still present a problem. In such cases, using lookup tables does not yield practical results. RL overcomes this problem by applying generalization and function approximation techniques. Here, estimating the Q-values for unvisited state-action pairs requires generalization from those states that have already been visited. Function approximation can be done in a number of ways, such as: (1) function fitting (neural networks and regression), (2) function interpolation (k-nearest-neighbors and kernel methods), and (3) state aggregation (Gosavi, 2003). Watkins (1989) used the Cerebellar Model Articulation Controller (CMAC), and Tesauro (1995) used back-propagation for learning the value function in backgammon. As an example, consider applying function fitting for an MDP with A actions in each state; for state s ∈ S:

Q(s,a) ≈ f_a(s)    (3.25)

The idea is to store the Q-factors for a given action as a function of the state index. Assuming s is a scalar, the function f_a(s) can be approximated by:

f_a(s) = A + B·s + C·s²    (3.26)

Thus, instead of storing each Q-value for action a, we only need to store the values of the parameter set A, B, C. The Q-value for state s and action a is represented by the value f_a(s). Obviously, less storage space is needed, and a large state space can thus be handled.

3.6. On-Line and Off-Line

RL methods can be implemented in two modes: on-line and off-line. The on-line mode consists of using an RL-driven agent directly on the real system. This mode is particularly interesting when it is difficult to model the system or when some phenomena are difficult to reproduce in a simulation environment.
Moreover, as the agent is learning continuously, it can adapt quickly to changing operating conditions. The main drawback of the on-line mode is that the agent may jeopardize system stability, because at the beginning of the interaction no experience is available to the RL agent to adequately control the system. One solution to this problem is to first let the agent interact with a simulation environment (off-line mode). The RL agent can then be implemented on the real system, where it benefits from the experience it acquired in the simulation environment and can improve its behavior through interaction with the real system. Alternatively, one may extract the off-line learned policies and implement them in the real system without any further learning. The on-line and off-line implementation of RL should be differentiated from on-policy and off-policy RL algorithms. In an on-policy algorithm such as SARSA, the agent learns and improves the same policy that it follows in selecting its actions. On the other hand, with an off-policy algorithm such as Q-learning, the agent gains useful experience even while exploring actions that may later turn out to be non-optimal.

3.7. Summary

This chapter provided an overview of RL and its main algorithms. The classification of the methods presented is intended to give the reader an idea of the different dimensions of RL. By using sampling and function approximation techniques, RL has the potential to be applied to larger systems than any other classical optimization technique. Modern reinforcement learning methods can be applied both to trial-and-error learning without a formal model of the environment, and to planning activities with a formal model of the environment, where estimates of the state-transition probabilities and the immediate expected rewards can be easily evaluated. The introduction of RL into the water resources systems domain is relatively new.
However, the advantages that RL offers in dealing with large-scale problems make it a promising area of research in that field. An RL-based approach is adopted in this research work to develop a stochastic optimization model for the solution of the multireservoir operations planning problem, as described in the following chapter.

4. REINFORCEMENT LEARNING MULTIRESERVOIR OPTIMIZATION MODEL (RLROM) DEVELOPMENT

4.1. Introduction

In general, the reservoir operations planning problem for a hydro-dominated power system, such as the BC Hydro system, is to find the optimal operation strategy for each time period during the planning horizon that maximizes the long-term value of resources while serving the domestic load in British Columbia, thereby minimizing the cost of electricity to the ratepayers in the Province. This objective can be accomplished by coordinating and optimizing the use of all available generation resources while taking advantage of market opportunities. Releasing more water now could result in high immediate benefits, but with less water left in storage there would be less benefit in the future. On the contrary, releasing less water now could result in gaining more benefits in the future. Accordingly, the decisions taken in any given planning period will affect both the present return and the opportunities that could be available in the future. The challenge then is to link the present decisions with their future consequences. The approach followed in this research work for solving this problem relies on the concept of the marginal value of water (MVW). By definition, the MVW represents the incremental value of water in storage, expressed in dollars per cubic-meter-second-day ($/cms-day).
Optimal dispatch from hydro plants is established when the trade-offs between the present benefits, expressed as revenues from market transactions, and the potential expected long-term value of resources, expressed as the marginal value of water stored in the reservoirs, are equal. Accordingly, as long as the value of releasing water is higher than the value of storing water in the reservoir, the operator's optimal planning decision is to continue generation. Another implication of applying the water value concept is that hydro plants are dispatched based on the expected value of the water in their storage. Within the BC Hydro system, the multiyear storage capability of the GM Shrum Dam on the Peace River and the Mica Dam on the Columbia River, together with their large production capabilities, dictates a much higher level of coordination in planning the operation of these two particular basins than is required for smaller projects. Ignoring inter-basin and system dependencies on the Peace and Columbia basins could lead to unrealistic operations modeling for the entire system. For this reason, it is important to model the Peace and Columbia basins and their interaction with the electricity markets in detail, to truly reflect an optimal integrated system operation. The operation of the two reservoirs cannot be optimized separately, as the benefits obtained from the operation of one reservoir cannot be specified as a function of the storage level in that reservoir alone; rather, they are a function of both plants (GM Shrum and Mica). The challenge then is to model this large-scale multireservoir system in an integrated manner while addressing the different sources of uncertainty in a way that a large-scale stochastic program can handle. To solve this large-scale problem, a stochastic optimization model for the operations planning of a multireservoir hydroelectric power generation system is proposed.
In this research work, the methodology adopted follows a reinforcement learning (RL) artificial intelligence approach. The following sections describe the methodology followed in the development of the proposed Reinforcement Learning Reservoir Optimization Model (RLROM), along with the details of the mathematical formulation and the solution algorithm of the reservoir optimization problem.

4.2. Mathematical Formulation

The primary objective of the proposed RLROM model is to develop a set of system operations planning strategies that maximize the overall value of BC Hydro resources in every time period over a planning horizon, given the uncertainty in the forecasted domestic load, the forecasted market prices, and the random natural inflows. The main operations planning decisions consist of the timing and quantities of imports and exports, in addition to the timing, location, and quantity of water to store in, or draft from, the reservoirs. Another significant outcome of the operations planning model is to establish the MVW in storage for the main multiyear storage reservoirs in the BC Hydro system. The established MVW obtained from the RLROM model can potentially be used for estimating target storage values in medium-term optimization models. Moreover, the calculated marginal values can be used in making trade-off decisions against the clearing prices for short-term wholesale energy market transactions. Marginal values of water are determined from the derivative of the value function with respect to storage ($/cms-d).
The following notation is used in the mathematical formulation of the RL model:

J    Number of reservoirs included in the RL model, where j ∈ {1, ..., J},
Ω    Number of scenarios of the random variables, where ω ∈ {1, ..., Ω},
T    Final time period in a cycle (iteration or episode),
N    Iteration number, or the age of the agent,
N_max    Maximum number of iterations,
{S_j}    Set of discretized storage volumes for reservoir j, where S_j = {s_j¹, s_j², ..., s_j^{n_j}}, in cms-d,
n_j    Number of state discretizations for reservoir j,
d    Number of actions,
{A}    Set of actions, where A_t = {a_1, a_2, ..., a_d},
t    Current time period (stage) in a cycle (iteration or episode), where t ∈ {1, ..., T},
s_{j,t}    Storage volume in reservoir j at the beginning of period t, in cms-d,
s_{j,t+1}    Storage volume in reservoir j at the end of period t, in cms-d,
s^t    State vector representing a possible combination of the storage increments of the different reservoirs within the state space, where s^t = (s_1^{z_1}, s_2^{z_2}, ..., s_J^{z_J}), with z_j ∈ {1, ..., n_j} for j = 1, ..., J, in cms-d,
a_t    Action (decision variable): forward sales (pre-sales) in time period t, where a_t ∈ {a_1, a_2, ..., a_d}, in MWh,
I_{j,t}^ω    Local natural inflow to reservoir j in period t for scenario ω, where ω ∈ {1, ..., Ω}, in cms,
L_t^ω    System load in time period t for scenario ω, in MWh,
P_t^ω    Market price in time period t for scenario ω, in Canadian $/MWh,
Q_{j,t}^T    Turbine releases from reservoir j during period t, in cms,
Q_{j,t}^S    Spill flow from reservoir j during period t, in cms,
r_t(s^t, a)    Reward of taking action a and transitioning to state s^{t+1} at the end of period t, in Canadian $,
Q_t(s^t, a)    Q-value function when the system state is s^t and action a is selected at the beginning of time period t, in Canadian $,
f_t(s^t)    Expected value function when the system state is s^t at the beginning of time period (stage) t, in Canadian $,
γ    Monthly discount factor for future rewards,
α_n    Learning rate (step-size) parameter in iteration n, where n ∈ {1, ..., N},
e_t    Vector of random events that
influence the operation of the multireservoir system, where e_t^ω = (I_t^ω, L_t^ω, P_t^ω),
ε_n    Exploitation rate in iteration n; a random action is selected with probability (1 − ε_n) in the ε-greedy policy,
π    Policy, or decision-making rule,
π(s^t)    Action taken in state s^t under the deterministic policy π, and
π(s^t, a)    Probability of taking action a in state s^t under the stochastic policy π.

4.2.1. SDP Formulation

One possible way to solve this reservoir optimization Markovian decision problem (MDP) is to apply one of the traditional SDP techniques, such as the value iteration technique. This formulation involves a sequence of decisions in which each decision affects what opportunities are available in the future. For any given time period and at any starting state, the outcome of applying an action to a state depends only on the current state. Also, the effects of the outcomes are not deterministic. The basic structure of the SDP algorithm can be characterized by:

t    Discrete time,
s^t    State variable,
a_t    Control; decision to be selected from a given set,
r_t    Reward signal,
e_t    Set of random variables, and
T    Horizon.

The mathematical formulation of the reservoir optimization problem can be expressed as follows.

Objective function:

f_t(s^t) = max_{a_t} E{γ [r_t(s^t, a_t, e_t) + f_{t+1}(s^{t+1})]}    (4.1)

Constraints:

s_{j,t+1} = s_{j,t} + Δt [I_{j,t} − Σ_{j′} (C_{j′j}^T Q_{j′,t}^T + C_{j′j}^S Q_{j′,t}^S)], ∀ j ∈ J    (4.2)

s_{j,t}^min ≤ s_{j,t} ≤ s_{j,t}^max, ∀ j ∈ J    (4.3)

a^l ≤ a_t ≤ a^u, ∀ a ∈ A, ∀ s ∈ S    (4.4)

where Δt is the length of period t in days, and C_{j′j}^T and C_{j′j}^S are the elements of the matrices representing the hydraulic connection between the reservoirs in terms of the turbine outflow and the spill outflow, respectively, from reservoir j′ to reservoir j: C_{j′j}^T and C_{j′j}^S = 0 if there is no physical hydraulic connection between the reservoirs, C_{j′j}^T and C_{j′j}^S = 1 if j = j′, and C_{j′j}^T and C_{j′j}^S = −1 for j ≠ j′ where reservoir j is physically connected to reservoir j′.
For any given system state, the objective is to maximize the expected long-term value of resources resulting from the reservoir operating decisions taken in the current period and from the decisions that could be made in the future. Equation (4.2) represents the set of continuity-of-flow constraints for each reservoir j, considering the physical setting of any reservoir j′ within the river systems. In case the storage exceeds the maximum limit, the reservoir spills. The minimum storage limit is treated as a soft constraint, and storage below it is penalized. Equation (4.3) represents the upper and lower limits on storage. Equation (4.4) represents the upper and lower limits on the decision variable.

4.2.2. RL Formulation

Similar to the conventional SDP methods, the mathematical formulation of the RL model is cast as a Markovian decision problem. The theoretical basis of the RL approach, as presented in the previous chapter, formed the foundation for the methodology adopted in the development of the reinforcement learning multireservoir optimization model (RLROM). In the proposed algorithm, the agent (controller) first interacts with a simulation environment that models the real reservoir system, to learn the optimal control policies (Figure 4.1). Once the agent has learned the optimal value function, it can use its knowledge at any time period and in any given system state to control the system or to estimate the MVW. Alternatively, the agent can keep learning continuously as it controls the system; it can then adapt quickly to changing operating conditions and keep updating and enhancing its knowledge of the different operating conditions of the system.

[Figure: the agent (RL algorithm) applies an action a_t to the environment (GOM), which is driven by scenarios of the random variables, and observes the feedback r_t(s^t, a) and the transition state s^{t+1}.]

Figure 4.1.
RL agent-environment interface in learning mode

In this research, a reinforcement learning (Q-Learning) algorithm that relies on the stochastic approximation of value iteration is adopted. The main idea of the proposed RL formulation is to apply a sampling technique while performing the iterations, rather than computing the expected rewards and the transition probabilities. The transition state s'_{t+1} and the rewards r_t(s', a) are generated from the state-action pair (s'_t, a_t) by simulation. This RL formulation can be regarded as a combination of value iteration and simulation. Accordingly, rather than estimating the value function for each state (Equation 4.1), one can compute a value for each state-action pair, known as the Q-value or Q-factor, applying the following Q-Learning formula:

Q^N_t(s', a) = Q^{N-1}_t(s', a) + \alpha_N \{ \gamma [ r_t(s', a) + \max_{a'} Q^{N-1}_{t+1}(s'_{t+1}, a') ] - Q^{N-1}_t(s', a) \}    (4.5)

where N is the iteration number and t is the time period (stage). It should be noted that the discounting is applied to both the current and the future rewards, as it is assumed in the model formulation that the rewards of each time period are realized at the end of that period. The details of the RLROM model components and the proposed solution algorithm are described in the following sections.

The proposed RLROM model is linked to a Generalized Optimization Model (GOM) for hydroelectric reservoir optimization, as described in section 4.2.10. At each time period, the agent sends a signal to GOM describing the state of the system s'_t and the action it has taken a_t. GOM returns an estimate of the reward r_t(s', a) that informs the agent how well it did in selecting the action, and it generates the set of transition states s'_{t+1}. The RL agent controls the forward sales (presales) decisions, and based on these decisions, GOM determines the optimal system operation in terms of the release and generation strategies.
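The Q-Learning update of Equation (4.5) can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the thesis implementation (which is written in AMPL); the dictionary-based Q-table and the function name are assumptions for illustration only.

```python
def q_update(Q, key, reward, next_q_values, alpha, gamma):
    """One Q-Learning step per Equation (4.5).

    Q is a dict mapping (stage, state, action) keys to Q-values;
    next_q_values holds Q_{t+1}^{N-1}(s', a') for all actions a'.
    Note the discount gamma multiplies the current reward as well,
    since rewards are assumed realized at the end of each period."""
    old = Q.get(key, 0.0)
    target = gamma * (reward + max(next_q_values))
    Q[key] = old + alpha * (target - old)
    return Q[key]
```

With an empty table, a reward of 10, next-stage Q-values of [0, 2], a step size of 0.5, and a discount of 0.9, the target is 0.9·(10 + 2) = 10.8 and the new Q-value is 0 + 0.5·(10.8 − 0) = 5.4.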
This agent-environment interaction process is performed at each time period over the planning horizon and for a set of generated scenarios that models the uncertainty in electricity load, market prices, and natural inflows. The stochastic modeling of these random variables is described in detail in section 4.2.8. While the main concern is to establish the optimal control strategies and the MVW in the two main multiyear storage reservoirs, linking the RL model with the GOM model allowed the inclusion of more reservoirs in the optimization process. The model optimizes the operation of the five main plants within the BC Hydro system, namely: GM Shrum, Peace Canyon, Mica, Revelstoke, and Keenleyside. Therefore, the operation planning decisions developed by the RL algorithm encompass the major plants within the BC Hydro system.

4.2.3. State Space

At any time period, the state of the system is typically expressed as the amount of available storage in each reservoir at the beginning of that time period. The reservoir storage variable used to represent the system state is defined by a J-dimensional space. The continuous storage of reservoir j is discretized to n_j discrete storage values. The state space can be represented as the Cartesian product of all state combinations as follows:

\{ s_1^1, ..., s_1^{n_1} \} \otimes ... \otimes \{ s_J^1, ..., s_J^{n_J} \}

It is assumed that the number of discretized states for reservoir j is constant during the T periods of the planning study.

4.2.4. Action (Decision Variable)

The decision variable considered in developing the RL model is the forward sales (pre-sales) to/from the US and Alberta markets. The decision variable a_t is discretized at each time period into d_t decisions. For each time period, the forward sales are subdivided into three categories, namely: peak, high, and low, to capture the effects of price variations during the heavy load hours (HLH) and light load hours (LLH).
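The Cartesian-product construction of the discretized state space can be sketched as follows; this is an illustrative sketch only, and the function name and grid values are assumptions, not part of the thesis model.

```python
from itertools import product

def build_state_space(levels_per_reservoir):
    """Enumerate the discretized J-dimensional state space as the
    Cartesian product {s_1^1..s_1^{n1}} x ... x {s_J^1..s_J^{nJ}}.

    levels_per_reservoir is a list of J lists, one per reservoir,
    each holding that reservoir's discrete storage grid points."""
    return list(product(*levels_per_reservoir))

# Two reservoirs with 3 and 2 storage grid points -> 3 x 2 = 6 states.
states = build_state_space([[0.0, 0.5, 1.0], [0.2, 0.8]])
```

The example illustrates why lookup tables become impractical: the number of states is the product of the per-reservoir grid sizes, so it grows exponentially with the number of reservoirs.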
Traditionally, the turbine release from each reservoir is the common choice of decision variable in stochastic reservoir optimization models. However, in this form of simulation-based optimization with the reinforcement learning model, the turbine releases are calculated during the interaction process between the agent and the environment (GOM). In the proposed RL model formulation, the turbine releases from the different reservoirs are based on the current state of the agent, the presale decision, and the adopted scenario of the random variables.

One of the main advantages of the choice of presales as the decision variable is that it reduces the dimensionality of the problem. This is mainly due to the fact that presale is a system variable which is independent of the number of reservoirs involved in the RL model. In comparison, if the turbine releases were used as decision variables in the RL framework, the dimensionality of the decision space would increase by many orders of magnitude.

4.2.5. Exploration/Exploitation Rate

One possibility for the agent is to pick the action with the highest Q-value for the current state; this can be defined as exploitation. As the agent can learn only from the actions it tries, it should, some of the time, try different actions ignoring what it thinks is best; this can be defined as exploration. At the initial stages of the learning process, exploration makes more sense, as the agent does not yet know much. The equation adopted in the algorithm developed herein for estimating the probability that the learning agent should select the best action was proposed by Michael Gasser (2004):

\varepsilon = 1 - e^{-N/\xi}    (4.6)

where ε is the exploitation rate, N is the number of iterations (the age of the agent), and ξ is the exploitation parameter. It is clear that as the number of iterations increases, the exploitation rate increases.
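The ε-greedy action selection described above can be sketched in Python. Note that, in the thesis convention, ε is the *exploitation* rate (the probability of the greedy action), which is the reverse of the usual RL textbook convention; the function name and data layout are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Select an action from a dict {action: Q-value}.

    With probability epsilon (the exploitation rate), return the
    greedy action (highest Q-value); otherwise explore by returning
    a uniformly random action."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return max(actions, key=lambda a: q_values[a])  # exploit
    return rng.choice(actions)                           # explore
```

With ε near 0 (a young agent) the choice is almost always random; as ε approaches 1 with increasing agent age, the choice is almost always the greedy one.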
Eventually, as the number of iterations gets larger, the agent will tend to select the greedy actions, and these actions represent the policy learned by the agent. Figure (4.2) presents the exploitation rate as a function of the agent's age.

4.2.6. Learning Rate

The learning rate α controls the step size of the learning process. In effect, it controls how fast the estimates of the action-value function are modified. Usually, the iterations start with a high learning rate that allows fast changes in the Q-values, and then the rate gradually decreases as time progresses, as shown in Figure (4.2). This is a basic condition to assure the convergence of the Q-Learning approach, as identified by Sutton and Barto (1998). The formulation adopted in the developed model is a polynomial learning rate, as given by Even-Dar and Mansour (2003):

\alpha = 1 / N^{\psi}    (4.7)

where N is the number of iterations (the age of the agent) and ψ is a parameter, with ψ ∈ (0.50, 1.0).

Figure 4.2. Learning rate and exploitation rate

4.2.7. Rewards

The reward function r_t(s', a) is very important in the communication process between the environment and the agent. It helps the agent learn how to perform actions that will achieve its goals. Depending on the reward signal received from the environment, the agent perceives how well it did in selecting the action. Generally, the objective is to maximize the sum of the reinforcements received from the environment over time. The multireservoir problem we are dealing with is an infinite time horizon problem; therefore a discount factor, γ, is introduced. In the multireservoir model, the agent should learn the operation policy π(s') that maximizes the benefits from taking action a when the agent is in state s'.
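The two schedules of Equations (4.6) and (4.7) can be sketched together; the parameter values (ξ = 30, ψ = 0.8) are illustrative assumptions, and the exponential form of Equation (4.6) is a reconstruction from the garbled source.

```python
import math

def exploitation_rate(n, xi=30.0):
    """Exploitation rate of Eq (4.6): epsilon = 1 - exp(-N/xi).

    Rises from 0 toward 1 as the agent's age N grows, so the agent
    explores early and exploits late."""
    return 1.0 - math.exp(-n / xi)

def learning_rate(n, psi=0.8):
    """Polynomial step size of Eq (4.7): alpha = 1/N^psi,
    with psi in (0.5, 1.0]. Decays toward 0 as N grows."""
    return 1.0 / n ** psi
```

The two curves move in opposite directions, as in Figure (4.2): the step size shrinks while the exploitation probability grows, which is the pattern required for Q-Learning convergence.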
R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}    (4.8)

At each time period, and based on the current state of the system and the agent's decision regarding forward sales, the optimization model (GOM) determines the optimal control strategy that maximizes the benefits. The reward function, which is selected to maximize the value of resources, is calculated at the end of each time period. The details of the GOM optimization model are presented in section 4.2.10.

4.2.8. Stochastic Modeling of Random Variables

In modeling a multireservoir hydroelectric power generation system, we are interested in establishing the optimal operation planning strategies and the expected value/marginal value of the resources, given the uncertainty in natural inflows, electricity demand, and market prices. This problem is complicated because of the multidimensional probability distribution of the random variables and their interdependencies. One approach to fitting the marginal distributions of the random variables is to use time series autoregressive models, in which the autocorrelation of the random variables is modeled by a continuous Markov decision process. The methodology adopted in this research relies instead on approximating the continuous distributions and stochastic processes by discretization into a number of scenarios, rather than on a Markov description of the stochastic variables.

To represent the inflow, electricity load, and market price random variables, a moment matching (MM) scenario generation approach developed by Hoyland et al. (2003) has been adopted. In this approach, the marginal distributions are described by their first four moments (mean, variance, skewness, and kurtosis). In addition, the correlation matrix is also specified. The method generates a discrete joint distribution that is consistent with the specified values of the marginal distributions and the correlation matrix.
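The first four moments that the moment matching method takes as input can be estimated from a historical sample as sketched below; this is a generic statistics sketch (population moments, with kurtosis not excess kurtosis), not the Hoyland et al. (2003) algorithm itself.

```python
def four_moments(xs):
    """Return (mean, variance, skewness, kurtosis) of a sample.

    These are the four marginal-distribution targets fed to the
    moment matching scenario generation model. Population (1/n)
    formulas are used; kurtosis is the raw fourth standardized
    moment (3.0 for a normal distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    devs = [x - mean for x in xs]
    var = sum(d * d for d in devs) / n
    std = var ** 0.5
    skew = sum(d ** 3 for d in devs) / n / std ** 3
    kurt = sum(d ** 4 for d in devs) / n / var ** 2
    return mean, var, skew, kurt
```

In the thesis setting this computation would be applied to each of the twenty four monthly inflow variables, alongside their correlation matrix, before calling the MM model.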
The moment matching algorithm first decomposes the multivariate problem into univariate random variables that satisfy the first four moments of each random variable. It then applies an iterative procedure involving simulation, decomposition, and various transformations to preserve the original marginal moments and the correlation matrix. The details of the algorithm are given by Hoyland et al. (2003).

Historical streamflow records for the Peace and the Columbia rivers are used to estimate the distribution properties of the inflow random variable. Considering the case of the Peace and the Columbia River systems and assuming a monthly time step in the RLROM model, the number of random inflow variables for one year is twenty four, as the inflow in each month for each of the two rivers is considered a random variable. First, the first four moments and the correlation matrix of the twenty four variables are calculated. Then this information is fed as input to the MM model. The generated inflow scenarios, I_t^w, represent the cross and serial correlation between the inflows at the different time periods for the two river systems for one year. The outcome of this process is a reduced (manageable) number of inflow scenarios that preserve the properties of the original distributions.

In the Pacific Northwest, electricity price variations are correlated with runoff volumes. In dry years, power production falls and prices increase accordingly, while in wet years, there is more power available and prices decrease proportionally. Also, prices vary to a large degree across the day. Therefore, the average forecasted market prices are adjusted to reflect the relationship between runoff volumes and market prices. This relationship is represented by a regression relationship between the Dalles monthly runoff volume in the US and the Mid-Columbia market price.
The following approach was adopted to represent the price variability in the RLROM model. First, a stochastic variable for the annual Dalles runoff volume was introduced in the moment matching scenario generation model, in addition to the twenty four inflow variables of the Peace and Columbia inflows. Then the scenarios generated for the Dalles runoff were correlated to the heavy load hour (HLH) and light load hour (LLH) price factors (multipliers) using polynomial regression relationships. Finally, the average price forecast for the Mid-Columbia market was multiplied by the HLH and LLH price multipliers to generate the HLH and LLH prices for the scenarios.

BC Hydro's system load forecast was estimated using a Monte Carlo energy model. The load forecast is mainly impacted by: (1) the economic growth measured by the gross domestic product (GDP), (2) electricity prices billed to BC Hydro's customers, (3) the elasticity of the load with respect to electricity prices and economic growth, and (4) energy reduction due to demand side management (DSM). The continuous distribution of the BC Hydro system load forecast is represented by the values of the median (P50) and the two quantiles (P5 and P95). Starting from these three points, the expected value and the variance of the forecasted load are estimated using the method of Pearson and Tukey (1965). Then a Monte Carlo simulation was carried out, using a log-normal probability distribution, to generate several thousand scenarios. The generated scenarios were then aggregated in a histogram with a specified number of intervals representing the required probability distribution. Based on the generated data, the first four moments and the correlations of the electricity load with the inflow and price variables were estimated and used in the moment matching algorithm.
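The three-point load-forecast step above can be sketched as follows. The Pearson-Tukey weights (0.185, 0.63, 0.185) and the σ ≈ (P95 − P5)/3.25 rule are the commonly cited forms of that approximation; the function names and the seed are illustrative assumptions, and the log-normal draw is a stand-in for the thesis Monte Carlo energy model.

```python
import math
import random

def pearson_tukey(p5, p50, p95):
    """Pearson-Tukey three-point approximation of the mean and
    standard deviation from the 5th, 50th and 95th percentiles."""
    mean = 0.185 * p5 + 0.63 * p50 + 0.185 * p95
    std = (p95 - p5) / 3.25
    return mean, std

def lognormal_scenarios(mean, std, n, seed=1):
    """Monte Carlo draws from a log-normal distribution with the
    given arithmetic mean and standard deviation (moment-matched
    to the underlying normal parameters mu and sigma)."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma2 ** 0.5) for _ in range(n)]
```

For a symmetric forecast (P5 = 80, P50 = 100, P95 = 120) the approximation returns a mean of exactly 100, and the generated scenarios can then be histogrammed and passed to the moment matching algorithm.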
The details of the data used, the results, and an analysis of the scenario generation process are presented in chapter 5.

4.2.9. Lookup Tables and Function Approximation

During the experimental and testing phase of the RLROM model development, the state space was discretized into a small number of points. The state space in this case is the reservoir storage levels of the different reservoirs. For this limited number of state space variables, the learning process for the Q-values was implemented using lookup tables. Lookup tables are used to store and update the function representing the value of water in storage at each time period. At each time period, and for each discrete point on the state space, there exists a row of entries in the lookup table. The elements of each row are: the reservoir storages s', the action a_t, the reward value r_t(s', a), and the Q-value Q_t(s', a).

In real applications of multireservoir optimization, the number of grid points grows exponentially as the number of reservoirs increases. This results in a much larger state space, and the use of lookup tables becomes impractical. Accordingly, some form of function approximation is required, which enables the calculation of the future value function f_{t+1}(s') at any point within the state space without the need to store every value. Two properties of this function need to be considered in the development process. First, the storage value function is typically a concave nonlinear function. Second, the target storage (end of period storage) of each reservoir is a function of the storage levels in the other reservoirs. In this research work, two alternative techniques for function approximation were investigated. The first is a function fitting approach using polynomial regression. An alternative function fitting approach was tested using a linear interpolation technique.
However, instead of using one function to approximate the entire state space, the state space was divided into a finite number of segments or pieces using a piecewise linear (PWL) interpolation technique. Figure 4.3 illustrates an example of approximating the concave nonlinear value of water in storage function with four linear pieces.

Figure 4.3. PWL approximation of storage value function

As an example, s_{1,3,t} represents the third breakpoint of reservoir 1 storage at time t, for a given storage level of the other reservoirs. Similarly, m_{1,3,t} is the slope of the third segment of the PWL function of reservoir 1 at time t, and s_{0,3,t} represents the intercept of the curve with the storage axis. During the testing and development phase, the advantages of the PWL formulation over a nonlinear formulation were noticeable in terms of more stable results and a faster implementation in the RLROM model. The faster implementation is mainly attributed to the way the AMPL mathematical programming language handles PWL functions. In AMPL, PWL functions are defined by a set of breakpoints (grid points of the state space), the slopes of the different segments, and the intercept.

In the course of the iterations, for any target storage point in the state space, the PWL function approximation is used to calculate the storage value function for the multidimensional state space, as presented in Figure 4.4 in a 3-D view. The PWL function deals with one state variable at a time. As an example, consider the case of the value function being a function of the storage in two reservoirs (for example, GMS and Mica). First, PWL functions are constructed for the different storage grid points of Mica as a function of the different storage levels in GMS, as shown in Figure 4.5. Then, at the target storage of Mica, a PWL curve is constructed as a function of the different GMS storage grid points.
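The one-dimensional PWL evaluation described above can be sketched as follows; this is an illustrative Python sketch of linear interpolation between breakpoints, not the AMPL piecewise-linear construct used in the thesis.

```python
def pwl_value(breakpoints, values, s):
    """Evaluate a piecewise linear storage-value function.

    breakpoints: sorted storage grid points (the AMPL-style
    breakpoints); values: the storage value at each breakpoint.
    Storages outside the grid are clamped to the end values."""
    if s <= breakpoints[0]:
        return values[0]
    for x0, x1, y0, y1 in zip(breakpoints, breakpoints[1:],
                              values, values[1:]):
        if s <= x1:
            slope = (y1 - y0) / (x1 - x0)   # segment slope m
            return y0 + slope * (s - x0)
    return values[-1]
```

For a concave value function the segment slopes decrease with storage (here 10, then 5), mirroring a diminishing marginal value of water; the multidimensional case is handled one state variable at a time, as described above for the GMS/Mica example.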
This PWL function is then used to evaluate the value of water in storage, as shown in Figure 4.6.

Figure 4.4. 3-D view of the PWL value of storage function as a function of GMS and Mica storage

Figure 4.5. 2-D view of the PWL value of storage function as a function of GMS and Mica storage

Figure 4.6. PWL value of storage function as a function of GMS storage at Mica target storage

4.2.10. Model of the Environment

The RL agent needs to interact with an environment while learning to control the system. This environment could be either the real system or a model of the system. In this research work, the BC Hydro generalized optimization model, GOM, was used as the simulation environment of the real system. The GOM model was adapted from the short term optimization model (STOM) developed by Shawwash et al. (2000). GOM, which incorporates the basic optimization formulation of STOM, was developed to give its user the flexibility of a more generalized form, so that it can be used over longer time horizons and at various time resolutions. The model is currently used at BC Hydro as an analytical tool that assists the operations planning engineers in simulating and optimizing the operation planning of the integrated BC Hydro system. The primary objective of the model is to develop an optimal system operation schedule, given a deterministic forecast of system inflows, domestic load, and market prices, while maximizing the value of BC Hydro resources.

GOM is a variable time step model with the capability to model sub-time steps in detail. The time steps may be hourly, daily, weekly, sub-monthly and/or monthly.
The GOM system includes a detailed hydraulic simulation that calculates the hydraulic balance and computes generation and turbine limits for each time step. Sub-time steps may be used to further divide the time step into shorter time periods to reflect different load conditions within a time step, as derived from the load-duration curves. For example, for a time step that is greater than or equal to a week, the load-duration curves are used to represent both weekday and weekend load shapes. The sub-time step thus provides a more detailed view of the load and resource balance, and of the market trade-offs under different load conditions (i.e., super peak load, peak load, heavy load, shoulder load, and light load hours). The load-resource balance and trade-off optimization is performed for each sub-time step.

The nonlinear power generation function is represented in GOM as a piecewise linear surface function, where the generation is calculated as a function of the forebay level FB_{k,t}, the turbine discharge Q^T_{k,t}, and the unit availability U_{k,t}: G_{k,t} = f(FB_{k,t}, Q^T_{k,t}, U_{k,t}). This function was developed with an optimal unit commitment and loading assumption. Accordingly, each point on the piecewise linear surface function represents the maximum generation attainable given the set of turbine discharge, forebay level, and number of units available for commitment. The procedure followed to prepare these plant production functions for the BC Hydro plants is described in detail in Shawwash (2000). Figure 4.7 illustrates an example of the PWL function for a typical hydroelectric power generating plant.

Figure 4.7. Piecewise linear hydropower production function (forebay level in m, discharge in m³/s)

The formulation of the GOM model was modified to run interactively with the RL model as detailed below.
At each time step, GOM receives information from the RL model on the initial state of the system; it then updates the system constraints, solves the one stage reservoir optimization problem, and passes the optimized results back to the RL model. During the learning phase of the RL agent, the information passed to the RLROM model constitutes the transition state and the reward signal. In the final iteration, the optimal operation policies are derived, including the plant generation, market transactions (import/export), and turbine releases. Hence, linking the RLROM with GOM has the advantage of capturing the diurnal variation in load, market prices, and generation schedules for the shorter periods within the time step, either during weekdays/weekends or heavy load hours (HLH)/light load hours (LLH). The following is a description of the GOM model formulation, including the decision variables, constraints, and the objective function:

Decision Variables

Q^T_{k,t,h}      Turbine release from reservoir k at time period t and sub-time step h, in cms,
Q^S_{k,t,h}      Spill (non-power) release from reservoir k at time period t and sub-time step h, in cms,
G_{k,t,h}        Generation from plant k at time period t and sub-time step h, in MWh,
Spot_{US,t,h}    Spot transaction (import/export) to the US market at time period t and sub-time step h, in MWh, and
Spot_{AB,t,h}    Spot transaction (import/export) to the Alberta market at time period t and sub-time step h, in MWh.

Constraints

Hydraulic continuity equation:

s_{j,t+1} = s_{j,t} + I_{j,t} - \sum_k \sum_h ( Q^T_{k,t,h} C^T_{kj} + Q^S_{k,t,h} C^S_{kj} ) H_{t,h} / 24,   ∀ j, t    (4.9)

and at t = T:

s_{j,T+1} = s_{j,1}    (4.10)

where C^T_{kj} and C^S_{kj} are the elements of the matrices representing the hydraulic connection between the reservoirs in terms of the turbine outflow and spill outflow, respectively, from reservoir k to reservoir j.
C^T_{kj} and C^S_{kj} = 0 if there is no physical hydraulic connection between the reservoirs, C^T_{kj} and C^S_{kj} = 1 if j = k, and C^T_{kj} and C^S_{kj} = -1 for all j ≠ k where reservoir j is physically connected to reservoir k. H_{t,h} is the number of hours in sub-time step h of time step t, with h ∈ (1, 2, ..., h_n), and H_t is the number of hours in time step t.

Storage bounds constraint:

S^{min}_{k,t} - \Delta S_{k,t} \le S_{k,t} \le S^{max}_{k,t},   ∀ k, t    (4.11)

where the storage is expressed as a PWL function of the reservoir forebay level Fb_k, S_{k,t} = f(Fb_{k,t}). This function, which is not part of the optimization model, is used within GOM to relate the storage volume to the reservoir elevation (forebay). ΔS_{k,t} is a variable representing the deviation from the minimum storage limit, which is penalized in the objective function. In case the storage exceeds the maximum limit, the reservoir spills. The minimum storage is treated as a soft constraint, and storage below the minimum storage limit, ΔS_{k,t}, is penalized in the objective function.

Power generation constraint:

G_{k,t,h} = f( FB_{k,t,h}, Q^T_{k,t,h}, U_{k,t,h} ),   ∀ k, t, h    (4.12)

Total plant generation constraint:

G^T_{k,t,h} = G_{k,t,h} + G_{k,t,h} \cdot OR_k,   ∀ k, t, h    (4.13)

where OR_k is the percentage of operating reserve from plant k. The operating reserve is a specified level of reserve power that must be available at all times to ensure reliable electricity grid operation. G^T_{k,t,h} is the total of plant generation and operating reserve from plant k at time period t and sub-time step h, in MWh.

Load-resource balance (LRB) constraint:

\sum_k G^T_{k,t,h} + G'_{t,h} - G^f_{t,h} - Spot_{US,t,h} - Spot_{AB,t,h} = L_{t,h},   ∀ t, h    (4.14)

where G'_{t,h} is the fixed and shaped generation from other small hydro and thermal plants, not included as decision variables in the optimization problem; G^f_{t,h} is the forward sales, passed at each time period from the RL model; and L_{t,h} is the load at time period t and sub-time step h, in MWh.
Spot US transactions constraint:

T^{min}_{US} \le Spot_{US,t,h} \le T^{max}_{US},   ∀ t, h    (4.15)

where T^{max}_{US} is the inter-tie transmission limit from BC to the US and T^{min}_{US} is the inter-tie transmission limit from the US to BC.

Spot Alberta transactions constraint:

T^{min}_{AB} \le Spot_{AB,t,h} \le T^{max}_{AB},   ∀ t, h    (4.16)

where T^{max}_{AB} is the inter-tie transmission limit from BC to Alberta and T^{min}_{AB} is the inter-tie transmission limit from Alberta to BC.

Generation limits constraint:

G^{min}_{k,t,h} \le G^T_{k,t,h} \le G^{max}_{k,t,h},   ∀ k, t, h    (4.17)

Objective Function (MaxRev):

Maximize:

\sum_h \{ ( Spot_{US,t,h} + Spot_{AB,t,h} ) H_{t,h} P_{t,h} + G^f_{t,h} H_{t,h} P^f_t \} + \sum_k ( s_{k,t+1} - s^{N-1}_{k,t+1} ) MVW^{N-1}_{k,t} - \sum_k c_k \Delta S_{k,t}    (4.18)

where P^f_t is the forward market price. The MaxRev objective function maximizes the value of power generation at each time period t, given a target reservoir level s^{N-1}_{k,t+1} and the marginal value of water MVW^{N-1}_{k,t} estimated by the RLROM model. The objective function consists of four terms. The first and second terms represent the sum of the revenues from the spot transactions to both the US and the Alberta markets and from the forward sales, where P_{t,h} and P^f_t are the spot and forward market prices at time step t and sub-time step h. The third term accounts for the trade-off between the short and long term values of water in storage, expressed as the difference between the target (end of period) storage calculated in the current iteration and the target storage calculated in iteration N-1, multiplied by the marginal value of water calculated in iteration N-1. The fourth term penalizes the violation of the minimum and maximum storage limits (ΔS_{k,t}), where c_k is the penalty for violating the storage limits specified in the RL model for reservoir k.

4.3. Solution Algorithm

This section presents a detailed description of the solution algorithm of the RLROM model outlined in the previous sections. First, a description of the RLROM solution algorithm using the tabular form for storing the Q-values is presented.
This is followed by a description of the function approximation algorithm that can be used for larger state space problems. The RLROM model is implemented in the AMPL mathematical programming language (Fourer et al., 2003). The CPLEX solver (ILOG Inc.), implementing the simplex algorithm, is used to solve the linear optimization problem of the GOM model.

4.3.1. RLROM Model Algorithm Using Lookup Tables

A flow chart illustrating the different steps of the model algorithm is presented in Figure 4.8. A detailed description of the implementation of the RL algorithm in a procedural form follows:

- Divide the state space {S} into a finite number of discretized states covering the range between the minimum and maximum reservoir storage.
- Define the number of stages T and the set of actions {A_t}.
- Use a graphical user interface (GUI) to process the data sets required for the model runs in the specified time steps and sub-time steps. These data include the load, the prices, and the transmission limits.
- Run a batch of GOM model jobs for each point on the state space grid and for each stage in the planning period, and store the results of the model in tabular form. These lookup tables inform the agent, for each starting state and each action, of the transition state and the corresponding rewards.
- Run the moment matching technique and the Monte Carlo simulation described earlier to generate a specified number of scenarios Ω of the random variables: the natural inflow I_t^w, the forecasted load L_t^w, and the forecasted market price P_t^w.
- At the start of the process, initialize the value function and the Q-values arbitrarily, or alternatively set the values to zero for all states and state-action pairs. The RL agent moves forward in each iteration from stage 1 to the last stage T.
- To estimate the Q-values Q_t(s', a), apply the Q-Learning update rule (Equation 4.5); the value function f_{t+1}(s'_{t+1}) = max_a Q^{N-1}_{t+1}(s'_{t+1}, a) at the end of period state s'_{t+1}, the rewards r_t, and the previous estimate of the Q-values Q^{N-1}_t(s', a) are known from the previous iteration.
- Set the model parameters, including the discount factor γ and the initial values of the learning rate α and the exploitation rate ε, according to the formulas described earlier in this chapter.
- Set the number of cycles to N_max, where N_max is chosen to be a large number that satisfies the convergence criteria, as described in detail in the following section.
- Starting at the first stage, initialize the algorithm by randomly sampling the state space (i.e., randomly choose a point in the state space grid s').
- Randomly sample the stochastic variables from the scenarios generated from their probability distributions.
- In the first iteration, the agent chooses the action a_t randomly, as it has not yet learned any information about the Q-values. In subsequent iterations, the agent chooses the action a_t using the ε-greedy policy π_t(s') derived from the learned Q-values Q^{N-1}_t(s', a), where:

\pi_t(s') \in \arg\max_{a \in A} Q^{N-1}_t(s', a)

- The agent interacts with the environment (the GOM model results stored in lookup tables) and receives a signal, depending on the chosen action and on the sampled scenario, in the form of the next stage transition state s'_{t+1} and a numerical reward r_t(s', a).
- Apply the Q-Learning update rule (Equation 4.5) to estimate a new value of the Q-values, which can be expressed in the general form:

New Estimate ← Old Estimate + Step Size [Target - Old Estimate]    (4.19)

Store the new estimate of these Q-values as Q_t(s', a).
- The agent moves to the selected state in the next stage, setting the target state s'_{t+1} of the first time period to be the initial state s'_t of the second time period.
Repeat the procedure of sampling the action a_t using the ε-greedy policy and determining the reward signal and the transition state until the agent reaches the final stage.
- At the final stage T, the agent is at state s'_T. The agent repeats the same procedure as in the previous stages, receives the reward signal r_T(s', a), and moves to the transition state s'_1. The Q-Learning update for the estimate of the action-value function at the final stage applies the following equation for estimating the future value function of the next stage:

f_{T+1}(s') = \max_a Q^{N-1}_1(s', a)    (4.20)

- The agent then starts a new iteration. In this new iteration and in subsequent iterations, until the termination of the algorithm, the agent always uses the information it has learned so far to update the future value function estimates (i.e., to reinforce its learning of the Q-values). In the beginning, the agent tends to explore more frequently, with a probability of (1-ε), to gain more knowledge about how good or bad its actions are. Later on, as the age of the agent increases, the agent becomes greedier and chooses the best action, max_a Q_{t+1}(s', a), with a probability of ε. However, the agent also explores actions other than the greedy ones with a probability of (1-ε). As part of the convergence requirements (Sutton and Barto, 1998), the exploitation rate increases with the age of the agent while the step size decreases (Figure 4.2).
- The above steps are repeated until convergence is achieved. The optimal value function f(s') and the optimal generated policies π*(s') ∈ argmax_{a∈A} Q_t(s', a) are stored for all elements of the state space.
Distribution Choose -J-Greedy Policy I Move to Next Stage State Lookup Environment Model Results T Apply Q-Learning Algorithm to Update Q-Estimates t=t + \ No N=N + \ No Generate Optimal Solution T ( Figure 4.8. R L R O M Stop ) model algorithm using lookup tables 85 4.3.2. The RL Model Algorithm Using Function Approximation The algorithm presented above has the advantage o f avoiding the calculation and storage o f the transition probability matrix. However, as the state space and the decision variables increase, the use o f the lookup tables to store the action-value function becomes impractical. This is mainly due to the fact that the memory required to store the Q-values becomes very large. Accordingly, the algorithm is modified to allow for the use o f a larger number o f state and decision variables. Function approximation using a piecewise linear function ( P W L ) approximation technique is used to overcome problems with the storage o f the value functions in the R L R O M model. In this case, the state space is a continuous functional surface rather than a J dimensional grid. One advantage o f this method is that the target storage at every time period is not restricted to the grid points o f the state space. Rather, it can be any value within the state space surface. The other advantage o f this method is that it allows linking the G O M model to interact with the R L model on an on-line basis at each time step. A t each time step, the agent passes the sampled scenarios o f the random variables and the sampled state-action pair to the G O M model. The G O M model optimizes the operation o f the reservoirs and sends back to the agent a signal i n the form o f a next stage transition and the rewards corresponding to the selected action. This flexibility, o f having the target storage take any value, increases the chances o f finding a feasible solution in the G O M model runs. 
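The PWL idea can be sketched with a minimal interpolator. This is an illustrative sketch, not the RLROM implementation: the breakpoint grid and the value numbers below are invented for demonstration, whereas in the model the breakpoint values would come from the learned value function.

```python
from bisect import bisect_right

# Hypothetical coarse storage grid (cms-day) and value-function estimates at
# the breakpoints; both arrays are illustrative, not model output.
BREAKPOINTS = [260_000, 290_000, 320_000, 350_000, 380_000, 410_000]
VALUES = [1.00e8, 1.12e8, 1.21e8, 1.27e8, 1.31e8, 1.33e8]

def pwl_value(storage):
    """Piecewise linear interpolation of the storage value function, so a
    target storage need not coincide with a grid point of the state space."""
    s = min(max(storage, BREAKPOINTS[0]), BREAKPOINTS[-1])  # clamp to range
    i = min(bisect_right(BREAKPOINTS, s), len(BREAKPOINTS) - 1)
    lo, hi = BREAKPOINTS[i - 1], BREAKPOINTS[i]
    w = (s - lo) / (hi - lo)
    return VALUES[i - 1] + w * (VALUES[i] - VALUES[i - 1])
```

A target storage of, say, 305,000 cms-day falls between two breakpoints and still receives a value estimate, which is exactly the flexibility described above.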
This is unlike the case requiring that an optimal solution be found at the grid points, which in some cases results in infeasible runs for the GOM model. To overcome that problem, a much finer state space grid would have to be generated, which further increases the storage requirements for these optimization problems. Function approximation therefore results in a significant reduction in computer storage requirements and a more robust implementation of the proposed system. Figure 4.9 presents a flow chart of the RL model algorithm using function approximation. The flow chart indicates the on-line interaction of the RL agent with the model of the environment (GOM). Figure 4.10 displays a schematic representation of the RLROM model algorithm using PWL function approximation.

At each time period, the marginal value of water, MVW_{t,j}(s_t), is calculated as the derivative of the value function f_t(s_t) with respect to storage in reservoir j:

MVW_{t,j}(s_t) = df_t(s_t) / ds_j   (4.21)

The marginal value of water is updated after each iteration, and its units are converted from $/cms-day to $/MWh using a conversion factor, HK_{t,j}, expressed as a function of reservoir storage:

HK_{t,j}(s) = G_{t,j}(s) / QT_{t,j}(s)   (4.22)

where G_{t,j} is the plant generation in MWh and QT_{t,j} is the turbine discharge in cms-day.

Figure 4.9. RLROM model algorithm using function approximation

Figure 4.10. Schematic representation of the RLROM model algorithm using function approximation

4.3.3. Convergence

In stochastic dynamic programming, the optimality condition is reached when the solution is within a specified optimality tolerance. In general, the benefits of attaining the true optimal solution do not outweigh the time and cost of reaching it. The main concern in solving the stochastic reservoir optimization problem is deciding how many iterations are sufficient to assure a good approximate solution. Reinforcement learning algorithms rely on stochastic approximation in the evaluation of the optimal value function f*(s). Stochastic approximation is an iterative procedure that produces a sequence of solutions in such a way that, after a finite number of iterations, the temporal difference of the expected values approaches zero with probability 1 (Gosavi, 2003). Accordingly, as long as the Q-values are changing, the algorithm should continue to run. As the age of the agent increases, the step size parameter α decreases. When the step size becomes smaller than a specified value, the algorithm can be stopped; at this point the Q-values should stabilize, with no change occurring for any state-action pair. After each iteration N of the RLROM runs, the absolute difference between the estimated Q-value functions in iterations N and N-1 is calculated at each time period as:

ΔQ^N = |Q_t^N(s, a) - Q_t^{N-1}(s, a)|   for all t, s, a   (4.23)

The computation terminates when the difference in the Q-values between successive iterations (ΔQ^N) remains negligible for several iterations; the Q-values are then said to have converged to the optimal solution.
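One way to implement this stopping test can be sketched as follows. The Q-tables are assumed here to be stored as dictionaries keyed by (t, s, a), and the tolerance and patience values are illustrative choices, since the exact threshold is left to the modeler.

```python
def max_q_change(q_new, q_old):
    """Largest |Q_t^N(s,a) - Q_t^(N-1)(s,a)| over all (t, s, a) triples
    (Equation 4.23)."""
    return max(abs(q_new[key] - q_old[key]) for key in q_new)

def converged(deltas, tol=1e-3, patience=5):
    """Terminate once the successive-iteration change has stayed below a
    tolerance for several iterations in a row."""
    return len(deltas) >= patience and all(d < tol for d in deltas[-patience:])

# Toy Q-tables keyed by (t, s, a): only one entry changed, by 0.0005.
q_prev = {(1, 10, 0): 4.0, (1, 10, 1): 5.0}
q_curr = {(1, 10, 0): 4.0005, (1, 10, 1): 5.0}
delta = max_q_change(q_curr, q_prev)
```

Recording `delta` after every iteration and calling `converged` on the history mirrors the criterion above: stop only after ΔQ^N stays small over a run of successive iterations, not after a single small change.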
Gosavi (2003) also suggests another criterion: stopping the algorithm when the policy does not change after a number of iterations. This criterion could be explored in the future.

4.4. Summary

The proposed RL approach outlined in the previous chapter led to a practical model of the complex multireservoir system. Instead of modeling a single reservoir, the RLROM model was developed for BC Hydro's two main river systems, the Peace and the Columbia. The model was formulated to establish an optimal control policy for these multiyear storage reservoirs and to derive the marginal value of water in storage. The RLROM model is presented with two solution algorithms: the first relies on lookup tables to store the Q-values, and the second, which allows the extension to a larger scale multireservoir problem, relies on function approximation. The RLROM model considers several stochastic variables: the domestic load, market prices, and the natural inflows. The use of the moment matching technique for generating scenarios of the load, inflow, and price variables has the advantage of using a limited number of scenarios to represent the random variables' statistical properties, in particular the moments and correlations of extensive historical time series records. A large-scale hydroelectric reservoir optimization model (GOM), based on linear programming, was integrated with the RLROM model. In this way, the optimization process was extended to include the other reservoirs on the Peace and Columbia Rivers: the Dinosaur, Revelstoke, and Keenleyside reservoirs. This integration allowed more efficient on-line interaction between the agent and the environment, and it made it possible to capture the diurnal variation of price and load over shorter time periods.

5. MODEL TESTING AND IMPLEMENTATION

5.1. Introduction

This chapter is structured as follows. First, a single reservoir optimization problem, representing a test bed for this research work, was used to investigate the capability of the RL approach to solve the reservoir operations planning problem and to gain experience with the RL technique. Three cases were considered to test the performance of the RL algorithm on problems of increasing size. Second, a two reservoir problem was tested using the multireservoir model formulation presented in the previous chapter. The objective of this test case was to investigate the potential use of the GOM model off-line as a model of the real system. Lookup tables were used to store the feedback information from GOM and the estimated Q-values from the RL algorithm for a subset of the full state space of the two reservoirs. Finally, the RLROM model was used to model the full state space, and the function approximation RL algorithm was implemented for BC Hydro's two main reservoir systems. The GOM model was linked to run on-line within the RL algorithm. A case study is presented to demonstrate the capability of the model to solve the large-scale multireservoir operations planning problem. As both the GMS and Mica dams have multiyear storage capability, the model was run for a planning horizon of 36 months. The optimized storage value function and the marginal value of energy for both the GMS and Mica dams will be presented and discussed, along with examples of the optimal control policies proposed by the RLROM model. The model output includes the optimized control policies for market transactions, plant generation, and turbine releases for the five main plants in the Peace and Columbia River systems. Initially, the model was run in training mode, where the RL agent learns the action-value function and the optimal control policies.
Once the RL agent has learned the optimal control policies, the model can be used to control the monthly operations planning decisions. The use of the RL model to control the operations planning of reservoir systems will be presented and discussed.

5.2. Single Reservoir Model - Test Case

As a first step in investigating the capability of the RL approach to solve the reservoir operations planning problem, a single reservoir case was used as a test bed. The problem was formulated and solved using the Q-Learning technique. The problem was also solved using the value iteration method of the stochastic dynamic programming (SDP) technique. The optimized solution established by the SDP model was used as a baseline to evaluate the RL model results. Working on this problem provided useful insights into the RL approach. In addition, it was possible to gain experience on the sensitivity of the model results to the various parameters used in the formulation. The following sections present a description of the case study, the SDP and RL test model formulations, the establishment of the RL model parameters, and the model results.

5.2.1. Problem Description

The system considered in this case consists of one reservoir. The supply from the reservoir is mainly used for hydropower generation. The hydropower producer operating the system generates electricity to satisfy the local demand and for sale in the open market. The energy yield, which depends on the inflows, varies from period to period throughout the year and is governed by the following probability distribution:

Table 5.1. Probability of inflow for the different periods

         Inflow volume (energy units)
Period   10     11     12     13     14
1        0.10   0.20   0.30   0.25   0.15
2        0.15   0.25   0.30   0.20   0.10
3        0.30   0.20   0.10   0.15   0.25
4        0.25   0.20   0.15   0.25   0.15
5        0.10   0.20   0.30   0.25   0.15
6        0.15   0.25   0.30   0.20   0.10
7        0.30   0.20   0.10   0.15   0.25
8        0.25   0.20   0.15   0.25   0.15
9        0.30   0.20   0.10   0.15   0.25
10       0.25   0.20   0.15   0.25   0.15
11       0.30   0.20   0.10   0.15   0.25
12       0.10   0.20   0.25   0.15   0.30

The power producer requires 10 units of energy in each period to satisfy its demand and can store as much as 12 units. Any energy stored but not used can be stored or sold in the following period. A marketer is willing to pay a premium price for the producer's energy according to the scale below, if the producer guarantees delivery at a certain time of the year:

Table 5.2. Forward market price

Units of energy   Price ($)
1                 1200
2                 1000
3                 800
4                 800

If the producer contracts too much of its production, leaving less than 10 units for its own needs, the shortfall must be made up by purchasing energy on the spot market (spot buy) at $1500 per unit. Any energy held at the end of the month that the storage facilities cannot store is sold on the spot market (spot sell) at $500 per unit. The producer limits transactions on the spot market to those that are absolutely necessary. The objective is to maximize the expected discounted profit over the foreseeable future, with an effective annual interest rate of 7%.

5.2.2. SDP Mathematical Formulation

The following notation was used in the mathematical formulation of the single reservoir problem:

t - time period, t ∈ 1,...,T, in months
a - forward sale (contract), a ∈ A, in units of energy
s - reservoir storage, s ∈ S, in units of energy
i - inflow volume, i ∈ I, in units of energy
L - domestic load = 10 units of energy
U - upper storage limit, in units of energy
Spot_Buy / Spot_Sell - spot market transactions (buy/sell respectively), in units of energy
Spot_Buy_Cost / Spot_Sell_Rev - cost of buying / revenue from selling, in $
Exp_Spot_Buy_Cost / Exp_Spot_Sell_Rev - expected value of the buy cost / sell revenue, in $
Cont_Rev - forward sale (contract) revenue, in $
P_spot_buy / P_spot_sell - price of a spot buy / sell, in $ per unit of energy
P_f(a) - price of a forward (contract) sale, in $ per unit of energy
R - rewards, in $
γ - discount rate
TP(s_t, i_t, a, s_{t+1}) - transition probability of moving to state s_{t+1} for a given state s_t, action a, and inflow i_t
TPS(s_t, a, s_{t+1}) - transition probability of moving to state s_{t+1} for a given state s_t and action a, where

TPS(s_t, a, s_{t+1}) =
  Σ_{i_t ≥ s_{t+1} - s_t + L + a} TP(i_t)     if s_{t+1} = U
  TP(i_t = s_{t+1} - s_t + L + a)             if U > s_{t+1} > 10
  Σ_{i_t ≤ s_{t+1} - s_t + L + a} TP(i_t)     if s_{t+1} = 10

and f_t(s_t) is the value function for state s_t in time period t.

The mathematical formulation of the SDP problem is as follows:

Policy rewards calculation:

R(s_t, a) = -Exp_Spot_Buy_Cost(s_t, a) + Exp_Spot_Sell_Rev(s_t, a) + Cont_Rev(s_t, a)   (5.1)

where:

Exp_Spot_Buy_Cost(s_t, a) = Σ_{i_t} Spot_Buy(s_t, i_t, a) * P_spot_buy * TP(s_t, i_t, a, s_{t+1})   (5.2)

Exp_Spot_Sell_Rev(s_t, a) = Σ_{i_t} Spot_Sell(s_t, i_t, a) * P_spot_sell * TP(s_t, i_t, a, s_{t+1})   (5.3)

Cont_Rev(s_t, a) = P_f(a) * a   (5.4)

Constraints:

Reservoir storage (state variable): U ≥ s_t ≥ 10   (5.5)

Forward sales (decision variable): 0 ≤ a ≤ 4   (5.6)

Spot transactions (sell): if s_{t+1} > U, then Spot_Sell = s_{t+1} - U   (5.7)

Spot transactions (buy): if s_{t+1} < 10, then Spot_Buy = 10 - s_{t+1}   (5.8)

Hydraulic continuity: s_{t+1} = s_t + i_t - L_t - a_t, with U ≥ s_{t+1} ≥ 10   (5.9)

Objective function:

f_t(s_t) = max_a { R(s_t, a) + γ * Σ_{s_{t+1}} f_{t+1}(s_{t+1}) * TPS(s_t, a, s_{t+1}) }   (5.10)

The optimal policy π*_t is:

π*_t(s_t) ∈ argmax_a { R(s_t, a) + γ * Σ_{s_{t+1}} f_{t+1}(s_{t+1}) * TPS(s_t, a, s_{t+1}) }   (5.11)

5.2.3. RL Mathematical Formulation

The same set of constraints as described above for the SDP formulation was used in the RL model. However, the rewards and the objective function were calculated differently. As described earlier in Chapter 3, the RL Q-Learning algorithm relies on sampling the state, action, and random variables to calculate the rewards, instead of calculating the expected values of the rewards as in the SDP algorithm. Accordingly, there was no need to calculate the transition probabilities.
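The SDP recursion above can be made concrete with a short backward-recursion sketch for this toy problem. It is illustrative only, under stated simplifications: the period-1 inflow distribution of Table 5.1 is reused for every stage for brevity, a zero-revenue "no contract" decision a = 0 is assumed alongside the priced quantities of Table 5.2, terminal values are set to zero for a finite horizon, and the monthly discount factor is derived from the 7% annual rate.

```python
GAMMA = (1 / 1.07) ** (1 / 12)          # monthly discount from the 7%/yr rate
LOAD, S_MIN, S_MAX = 10, 10, 12         # demand and storage bounds (energy units)
P_BUY, P_SELL = 1500, 500               # spot buy / spot sell prices ($/unit)
P_FWD = {0: 0, 1: 1200, 2: 1000, 3: 800, 4: 800}  # forward price by units sold
INFLOW_P = {10: 0.10, 11: 0.20, 12: 0.30, 13: 0.25, 14: 0.15}  # period-1 pmf

def step(s, i, a):
    """Hydraulic continuity (5.9) plus the spot-market corrections (5.7, 5.8).
    Returns the next state and the reward for one inflow realization."""
    s_next = s + i - LOAD - a
    reward = P_FWD[a] * a               # contract revenue (5.4)
    if s_next > S_MAX:                  # surplus sold on the spot market (5.7)
        reward += (s_next - S_MAX) * P_SELL
        s_next = S_MAX
    elif s_next < S_MIN:                # shortfall bought on the spot market (5.8)
        reward -= (S_MIN - s_next) * P_BUY
        s_next = S_MIN
    return s_next, reward

def backward_recursion(T=4):
    """Expected-value recursion (5.10): f_t(s) = max_a E[r + GAMMA*f_{t+1}(s')]."""
    f = {s: 0.0 for s in range(S_MIN, S_MAX + 1)}   # terminal values set to zero
    policy = {}
    for t in reversed(range(T)):
        f_new = {}
        for s in f:
            best_q, best_a = None, None
            for a in P_FWD:
                q = 0.0
                for i, p in INFLOW_P.items():
                    s2, r = step(s, i, a)
                    q += p * (r + GAMMA * f[s2])
                if best_q is None or q > best_q:
                    best_q, best_a = q, a
            f_new[s], policy[t, s] = best_q, best_a
        f = f_new
    return f, policy
```

The expectation over inflows inside the action loop is exactly where the SDP approach pays for the transition probabilities; the RL formulation described next replaces that expectation with sampled transitions.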
Rewards in the RL formulation were calculated as follows:

Rewards calculation:

r_t(s_t, a) = -Spot_Buy_Cost(s_t, i_t, a) + Spot_Sell_Rev(s_t, i_t, a) + Cont_Rev(s_t, a)   (5.12)

where:

Spot_Buy_Cost(s_t, i_t, a) = Spot_Buy(s_t, i_t, a) * P_spot_buy   (5.13)

Spot_Sell_Rev(s_t, i_t, a) = Spot_Sell(s_t, i_t, a) * P_spot_sell   (5.14)

Cont_Rev(s_t, a) = P_f(a) * a   (5.15)

Objective function:

The Q-Learning update rule was used to calculate the state-action value function (Q-value):

Q_t^N(s_t, a) = Q_t^{N-1}(s_t, a) + α_t {[r_t(s_t, a) + γ * max_a Q_{t+1}^{N-1}(s_{t+1}, a)] - Q_t^{N-1}(s_t, a)}   (5.16)

From this we get:

f_t(s_t) = max_a Q_t^N(s_t, a)   (5.17)

π*_t(s_t) ∈ argmax_a Q_t^N(s_t, a)   (5.18)

where Q_t^N(s, a) is the state-action value function at time period t and iteration N, and r_t(s, a) is the sampled reward for state s and action a, as calculated in Equation 5.12. The problem was formulated in AMPL and was solved using the Q-Learning algorithm described earlier in Section 3.4.3.2. For comparative and evaluation purposes, the SDP problem was also solved using the value iteration algorithm. Three cases were modeled for this test problem:

Case 1: three states (10-12) and four stages.
Case 2: three states (10-12) and twelve stages.
Case 3: ten states (10-19) and twelve stages.

5.2.4. Solution Steps of the Q-Learning Algorithm

The following is a step-by-step presentation of the Q-Learning solution algorithm applied to the single reservoir problem:

Step 1. Set the model parameters, including the annual discount rate γ = 0.07, the exploration rate exponent ξ = 0.000045, and the learning rate exponent ψ = 0.61. The ξ and ψ constants are used to calculate the exploitation rate and the learning rate respectively (Equations 4.6 and 4.7). Set the number of iterations N_max = 10,000.

Step 2. Initialize the state-action value function (Q-tables) to zero for all state-action pairs, Q_t(s, a) = 0.
Step 3. Starting at the first stage, initialize the algorithm by randomly sampling the state space, i.e., randomly choose a point from the discretized reservoir storage (state space), for example s_1 ∈ {10, 11, ..., 19}.

Step 4. Randomly sample the inflow variable according to the probability distribution in time period one, where i_1 ∈ {10, 11, ..., 14}.

Step 5. The agent chooses action a_1 (forward sales or contracts) randomly, where a_1 ∈ {1, ..., 4}. In the first iteration, for all stages t ∈ {1, ..., 12}, the agent chooses the action at random, as it has not yet learned any information about the Q-values.

Step 6. The agent interacts with the model of the environment. The chosen action and the sampled inflow scenario are simulated. The agent receives a signal in the form of the transition to the next stage state s_2, based on the hydraulic balance equation (5.9), and a numerical reward r_1(s, a), as calculated in Equation (5.12). The rewards are calculated as the sum of the spot and contract sales.

Step 7. The learning rate α is calculated using Equation (4.7).

Step 8. Apply the Q-Learning update rule (Equation 5.16) to calculate a new estimate of the Q-value:

Q_1^N(s_1, a) = Q_1^{N-1}(s_1, a) + α_1 {[r_1(s_1, a) + γ * max_a Q_2^{N-1}(s_2, a)] - Q_1^{N-1}(s_1, a)}

The first term of the equation represents the initial action-value function (Q-value) in this iteration (N=1) and at this time period (t=1), Q_1^1(s_1, a). The second term represents the reinforcement applied to the initial estimate of the Q-value. This is calculated as the difference between the new Q-value estimate (r_1(s_1, a) + γ * max_a Q_2^1(s_2, a)) and the initial estimate of the Q-value (Q_1^1(s_1, a)), multiplied by a step size (the learning rate α_1). This equation can
This equation can x be represented in general form as follows: New Estimate <— Old Estimate + Step Size [Target (New Estimate) - Old Estimate] The New estimate explained above is the sum o f the rewards achieved i n this time period t and the discounted value function at the transition state in time period t+1. Store the New Estimate o f the Q-values as Q ' ( i „ a ) . A t this point, for stage 1 and time period 1 we have a new state-action estimate (Q-Value) for the visited state-action pair. The other state-actions remain unchanged (zero). Next visit to this state-action pair, may be i n iteration 50, the agent uses the stored Q-value Q\{s ,a) as the Old Estimate and x apply the same update rule to calculate the New Estimate(s ,a). x The best policy (corresponding to the maximum Q-value) is always updated after each visit to a state and calculating the Q-values for the sampled state-action. Step 9 The agent moves to the selected state i n the next time period, s\. The procedure o f choosing random action a and determining the reward signal and the t transition state is repeated until the agent reaches time period T. 100 Step 10 The agent starts a new iteration (N=2). A t the first time period (t=\), sample the state space randomly, sf. Calculate the exploitation rate £ applying equation 4.6. The agent chooses the action ai using the e-greedy policy. Accordingly, with probability £, the agent is choosing the best action estimated so far, max Q? ( j , , a). In the beginning, a the exploitation rate is small and the agent tends to explore more frequently, with a probability o f (1-e), to gain more knowledge about how good or bad it is at taking the actions. Later on, and as the age o f the agent increases, the exploitation rate increases and the agent becomes more greedy and it chooses the best action with probability £. Steps 3 to 9 are repeated until convergence is achieved. 
However, starting from iteration 2 and in subsequent iterations, until the termination of the algorithm, the agent always uses the information it has learned so far to update the Q-value function estimates (i.e., to reinforce its learning of the Q-values), applying the ε-greedy policy rather than selecting the actions randomly as in the first iteration. The optimal value function f(s) and the optimal generated policies π*(s) ∈ argmax_a Q_t(s, a) are stored for all elements of the state space. Prior to illustrating and discussing the results, the following section presents the process followed in establishing the parameters used in the RL algorithm.

5.2.5. Establishing the Parameters of the RL Algorithm

In the initial tests of using the reinforcement learning technique to solve the single reservoir problem, a series of runs was performed to establish the appropriate settings of the RL algorithm parameters. The following sections present examples of the runs performed to establish the exploitation rate and the learning rate parameters. Case 2 (three states and twelve stages) was selected as an example to present the results, with a fixed number of iterations N = 8,000 in all runs.

5.2.5.1. Exploitation Rate

The effect of changes in the exploitation rate parameter was investigated first, using a constant exploitation rate ε ranging from 0.2 to 1.0. For the rest of the cases a variable exploitation rate was used. The variable rate was expressed as a function of the age of the agent, as outlined in Equation 4.6, with the value of the parameter ξ ranging from 0.00001 to 0.0001. For both cases (constant and variable exploitation rate) a learning rate parameter ψ = 0.5 (Equation 4.7) was used. Figure 5.1 presents the results of this set of runs for both constant and variable exploitation rates. The maximum error in the value function is presented as a percentage of the maximum difference from the solution derived by the SDP model. The results of these runs indicate that varying the exploitation rate with time appears to provide the best performance.

Figure 5.1. Effect of varying the exploitation rate

Examining the convergence behavior of the solution (as an example, at time period t=5 and state s=11), Figure 5.2 shows that the expected rewards increase with the increasing experience of the learning agent. At the beginning of the iterations, the estimated value function increases significantly. However, the rate of increase gradually decreases until it diminishes as the algorithm converges to the optimal solution. It is clear that when a greedy policy is consistently followed, it results in a poor quality solution compared with the ε-greedy policies (using either a constant or a variable exploitation rate with ε smaller than 1). Following the greedy policy, the value function starts to increase but levels off earlier and at a lower level than the other policies (at about 0.95 of the optimal solution). The ε-greedy method, on the other hand, is capable of reaching close to the exact solution. It performs better because the agent continues to choose the action randomly for a small fraction of the time and therefore has a better chance of finding a better approximation to the optimal solution. Figure 5.3 illustrates the robustness of the Q-Learning method over a wide range of values for the exploitation rate parameter ξ.

Figure 5.2. Convergence of Q-Learning solution applying different exploration-exploitation policies

Figure 5.3. Performance of different ε-greedy policies

5.2.5.2. Learning Rate

To establish the best value for the learning rate parameter, and to determine the sensitivity of the solution to variations in the learning rate, sets of runs were performed using constant learning rates α of 0.15 and 1.0. A polynomial learning rate was also used (Equation 4.7), with the exponent ψ ranging in value from 0.5 to 1.0. The same exploitation rate parameter, ξ = 0.0001, was used throughout the runs. Figure 5.4 demonstrates that gradually reducing the learning rate as the number of iterations increases allows the model to converge to a better quality solution (lower error in the value function). The best results were obtained by setting ψ = 0.6, and the accuracy of the solution deteriorates if ψ is either increased or decreased. Figure 5.4 shows that the error in the value function can have a V-shaped relationship with the learning rate, which stresses the importance of experimenting with these parameters prior to adopting values for use in the RL algorithms.

Figure 5.4. Performance of different learning rates

5.2.6. Results

To examine the quality of the solution obtained using the RL approach for the single reservoir problem, the results of the three test cases described earlier were evaluated. First, the RL model parameters for each of the three test cases were established using a variable exploitation rate parameter ξ and learning rate parameter ψ. The following table presents the parameters and the number of iterations used for each of the test cases.

Table 5.3. Test case study parameters

Test case   Iterations   Exploitation rate parameter (ξ)   Learning rate parameter (ψ)
Case 1      15,000       0.000065                          0.59
Case 2      15,000       0.000050                          0.58
Case 3      20,000       0.000045                          0.61

The convergence of the value function is presented in Figure 5.5. The results indicate that for all three test cases, the RL model was capable of converging to an optimal solution with very little difference from the solution derived by the SDP model. The error in the value function is defined as:

Error = max_{t,s} |f_t(s) - f_t^SDP(s)| / f_t^SDP(s)   (5.19)

where f_t^SDP(s) is the value function obtained using the SDP model.

Figure 5.5. Convergence of the value function (Case 1: 3 states and 4 stages; Case 2: 3 states and 12 stages; Case 3: 10 states and 12 stages)

The maximum relative error in the value function derived by the RL model for each of the three cases was 0.0009, 0.0005, and 0.0009 respectively. For each stage, the mean relative error (MRE) in the RL solution was calculated and is listed in Table 5.4. The MRE at stage t was calculated as the relative error of Equation 5.19 averaged over the states:

MRE_t = (1/|S|) Σ_s |f_t(s) - f_t^SDP(s)| / f_t^SDP(s)   (5.20)

The results indicate that the mean relative error does not exceed 0.04%.

Table 5.4. Mean relative error (%) for the single reservoir RL model

Period   1     2     3     4     5     6     7     8     9     10    11    12
Case 1   0.03  0.04  0.03  0.04  -     -     -     -     -     -     -     -
Case 2   0.02  0.02  0.03  0.02  0.03  0.03  0.01  0.03  0.02  0.04  0.03  0.02
Case 3   0.03  0.02  0.03  0.03  0.02  0.02  0.03  0.03  0.03  0.03  0.03  0.04

The optimal value functions derived by the RL Q-Learning and SDP models were superimposed and compared. Figure 5.6 presents the study results for stages 6 and 12 of single reservoir test case 3. The results indicate a close match between the value functions obtained by the RL and SDP models.

Figure 5.6. Value function obtained by RL and SDP at time periods 6 and 12

Figure 5.7 displays the optimal forward sale policies derived by both the RL and SDP models for stage 6 and stage 12.
The operating policies suggested by the SDP and RL models were identical. As the results suggest, it is more profitable to contract more energy as the storage in the reservoir increases.

Figure 5.7. Optimal policies derived by the RL and SDP models at time periods 6 and 12

The results obtained indicate that for all three test cases the performance of the RL approach is stable and provides a good approximation to the optimal solution. Interestingly, the results also show that the number of iterations required to reach convergence is not affected by the size of the state space, which provides a good measure of the robustness of the RL approach. Consequently, the single reservoir test case offers some useful insights into the potential use of RL algorithms in handling the larger scale multireservoir problem, which is discussed next.

5.3. Two Reservoir Model - Test Case

This test case was used as a building block for the development and implementation of the approach to the large-scale multireservoir problem. It provided a way to test the performance of the algorithm in handling larger scale problems. The RL Q-Learning model was run for a finite number of iterations until it converged. Convergence was assessed in terms of the difference between the current solution and the previous iteration's solution, as suggested by Gosavi (2003). The derived RL model optimal solutions were then compared with the SDP model results.

5.3.1. System Modeling

The system considered in this test case consisted of the GMS generating plant at the Williston reservoir and the MCA generating plant at the Kinbasket reservoir, on the Peace and Columbia Rivers respectively. The state variables include a subset of the full GMS and MCA storage volumes.
The GMS storage state variable was discretized into 30 increments covering the range from 260,000 to 410,000 cms-day, and the Mica storage into 5 increments ranging from 245,000 to 285,000 cms-day. Accordingly, the state space for each time period consists of 150 storage combinations for the two reservoirs. In addition, a provision was made in the formulation to account for the system electricity demand in the state space. Table 5.5 presents the forecasted system electricity load and the peak load in the four time periods considered in the model runs, each consisting of three months.

Table 5.5. System load and peak demand

Period       1       2       3        4
Load - MWh   4,983   5,362   5,804    5,872
Peak - MW    8,820   9,720   10,458   10,219

To test the performance of the proposed RL multireservoir model, the historical inflow data for water years 1964-1968 were used as five scenarios of the random inflow variable. The monthly inflow data for the reservoirs on the Peace and the Columbia systems and the assumed scenario probability associated with each of the inflow sequences are presented in Table 5.6:

Table 5.6. Peace and Columbia system inflow scenarios (cms)

Scenario   Prob.   Peace, periods 1-4        Columbia, periods 1-4
1          0.10    1180   1100   787   644   354   326   788    1150
2          0.20    551    579    358   275   258   307   499    570
3          0.30    582    308    330   357   272   278   1000   673
4          0.25    694    692    342   208   823   759   437    501
5          0.15    718    492    291   267   288   285   312    264

Monthly price forecasts for the US and Alberta markets were used to represent the market conditions in the northwest. Table 5.7 presents the heavy load hour (HLH) and light load hour (LLH) price forecasts corresponding to the five inflow water years.

Table 5.7. US and Alberta market price forecasts ($/MWh)

           P_US             P_AB
Scenario   HLH     LLH      HLH     LLH
1          32.78   32.93    34.53   34.69
2          41.41   40.61    43.62   42.78
3          36.08   36.35    38.01   38.29
4          39.20   38.97    41.30   41.06
5          34.18   34.53    36.01   36.37

The range of the forward transactions decision variable was discretized into 5 decisions.
Each decision was distributed between the heavy load hours and light load hours.

Table 5.8. Range of forward transactions
            LLH-IMP (GWh)        HLH-EXP (GWh)
Period   Minimum  Maximum     Minimum  Maximum
1           57      287          97      484
2           32      158         107      535
3           48      239          84      419
4           48      239         111      553

where IMP and EXP are the imports and exports respectively.

5.3.2. Results and Analysis

The problem formulation was described in detail in the previous chapter. The RL model was run on a three month time step for four time periods. To simulate the behavior of the real system under different inflow and release conditions, the GOM model was used off-line in batch mode prior to performing the RL model runs. The GOM model was run on a daily time step with three sub-time steps during the weekdays and one time step during the weekend. The purpose of running GOM on a finer time step than the RL model was to capture the variations in load and prices that occur in a typical day. First, a batch of 15,000 runs was carried out to cover all possible combinations of the storage state variables, inflows, and forward sales (the decision variable) for the four time periods (30 x 5 x 5 x 5 x 4 = 15,000). For each run, the GOM model generates the optimal system rewards and the transition state, which are then stored in lookup tables for use in the RL model runs. Values of 0.5 and 0.000025 were adopted for the RL model parameters in these runs. The model was set to run until it converged, which was achieved after 80,000 iterations, as shown in Figure 5.8.

Figure 5.8. Convergence of the RL model

The RL model results were compared to the optimal solution derived by the SDP model. Figure 5.9 presents a sample of the optimal value functions estimated by the RL and the SDP models.
Figure 5.9 plots the storage value function for different GMS storage levels against five MCA storage levels on the abscissa. The graph demonstrates the capability of the RL model to converge to a near-optimal solution, with a mean relative error not exceeding 2.56%. The results also indicate the stability of the model over the different time periods. The MRE for each of the four stages is given in the following table:

Table 5.9. Mean relative error for the two reservoir RL model
Period       1     2     3     4
M.R.E (%)  2.35  2.56  2.44  2.37

Figure 5.9. GMS storage value function as a function of MCA storage

The optimal operating policy obtained by the RL model was also evaluated. Assuming different initial storage levels for GMS and MCA, and for several inflow scenarios, a number of simulation runs were carried out using the optimal policies at each stage for both the RL and the SDP models. For the different inflow scenarios and starting conditions, the simulation results were found to be identical to those derived by the SDP model. Two examples of the simulation run results are illustrated in Figure 5.10.

Figure 5.10. Simulated optimal operation planning policy

Results from the two reservoir test model clearly demonstrated the capability of the RL model in solving the problem and obtaining a good approximation of the optimal value function. Using the GOM model in batch mode has the merit of providing the feedback information needed by the RL model in an off-line setting, which speeds up the RL model runs. However, some infeasibility difficulties were encountered with this mode of GOM runs: because the storage state space had a coarse grid structure, the model could not always find the optimal solution at the grid points. This problem occurred because the feasible optimal solution may lie at any point in the state space, not necessarily at an exact point of the coarse grid.
Consequently, there was a need to use a finer grid structure. For the state space used in this study, a batch of 15,000 GOM runs was performed. To cover the whole state space, the amount of data management and CPU time would increase significantly, and it would become impractical for use in this study. This problem was dealt with during the implementation phase of the RLROM, as will be described in section 5.4, through the use of function approximation instead of lookup tables.

5.3.3. Efficiency of the RL Algorithm

A series of runs was conducted to assess the efficiency of the RL model in terms of computational speed. The run time (CPU time) of the RL model was compared to that of the SDP model. Assuming s state discretizations, t time periods, a decision variables, and e scenarios of the random variables, the problem size becomes (s.t.a.e). The two models were run for different problem sizes ranging from 900 to 178,200. Figure 5.11 displays the variation in computational time needed for the RL and the SDP models with problem size. The figure clearly demonstrates that the SDP model run time increases quadratically with the size of the problem, whereas the RL model run time grows linearly with problem size. These results suggest that the RL model is more efficient in terms of computational time as problem sizes grow beyond 150,000. If the SDP were to be used to solve the two reservoir problem covering the whole state space, assuming 59,000 states, 36 time periods, 10 scenarios, and 5 actions, the problem size would be on the order of 10^8. Using a polynomial regression relationship established from the results presented in Figure 5.11 indicates that the estimated time for SDP to solve a problem of that size would be impractical.
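The dimensionality argument above is easy to check numerically. A small sketch using the counts quoted in the text; the order-of-magnitude figure is simply the product of the quoted state, period, action, and scenario counts:

```python
def problem_size(states, periods, actions, scenarios):
    """Problem-size metric from the text: s * t * a * e."""
    return states * periods * actions * scenarios

# Two-reservoir test-case batch: 150 storage combinations x 5 inflow
# scenarios x 5 forward-sale decisions x 4 periods = 15,000 GOM runs.
batch_runs = problem_size(150, 4, 5, 5)

# Full two-reservoir problem quoted in the text: 59,000 states,
# 36 time periods, 5 actions, 10 scenarios.
full_size = problem_size(59_000, 36, 5, 10)
print(batch_runs, full_size)   # 15000 and 106,200,000 (order of 1e8)
```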
It can be concluded from this analysis that RL overcomes the dimensionality problem through the potential use of sampling and function approximation techniques, as will be presented next.

Figure 5.11. Comparison of the RL and SDP CPU time

5.4. RLROM Multireservoir Model Implementation - Case Study

The storage and generation facilities on the Peace and Columbia Rivers dominate the system operation planning function at BC Hydro. It is well recognized by the system operation planners that proper planning of the system should consider the coordinated operation of these two systems, and that they should be integrated together in one model. This section presents the results of implementing the RLROM model on BC Hydro's main reservoir systems on the Peace and Columbia Rivers while addressing the uncertainties that are usually encountered by the system operator, namely: the natural inflow, the system electricity load, and the market price. A case study will be presented to illustrate the capability of the RLROM model to provide answers to this complex operation planning problem. The main outcomes of the RLROM model are: the optimal control planning policy for the system, the estimated value of water in storage, and the marginal value of water in the main reservoirs of the BC Hydro system. The capabilities of the RL model to handle the dimensionality problem are also demonstrated. Several enhancements were made to the RL model and to the approach presented in the previous section. The moment matching scenario generation methodology described in the previous chapter was used to generate a finite set of scenarios that portrays the properties of the random variables. To deal with the larger state space, the RL algorithm with function approximation was used instead of lookup tables.
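A continuous piecewise linear storage value function of the kind used in place of the lookup tables can be sketched with a simple interpolant. The breakpoints and values below are made-up illustrations (a concave value surface, as in the figures later in this chapter), not model output:

```python
import numpy as np

# Breakpoint storages (cms-d) spanning an operating range, with
# illustrative concave value estimates in M$ -- placeholders only.
s_grid = np.linspace(55_000, 475_000, 10)
v_grid = 2_500.0 * (1.0 - np.exp(-s_grid / 200_000.0))

def storage_value(s):
    """Piecewise linear value function f(s), continuous in storage."""
    return float(np.interp(s, s_grid, v_grid))

def marginal_value(s, ds=100.0):
    """Slope of f(s): the marginal value of water, here M$ per cms-d."""
    return (storage_value(s + ds) - storage_value(s - ds)) / (2.0 * ds)

# diminishing marginal value as the reservoir fills
print(marginal_value(100_000) > marginal_value(400_000))   # True
```

Because the interpolant is defined at every storage level, an optimization model querying it is no longer restricted to the grid points, which is the flexibility the text describes.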
The storage value function will be represented by a continuous piecewise linear function, f(s). In addition, the formulation of the RLROM model was modified to allow for on-line interaction with the GOM model at each stage of the iterations. This setup provided the GOM model with the flexibility to find the optimal solution at any point in a continuous state space. In addition, it avoided the infeasibility problems that were encountered in the GOM batch runs using the grid structure of the state space, as described in section 5.3.

5.4.1. Description of the Case Study

This case study considers the operation planning of BC Hydro's hydroelectric generating plants on the Peace River and on the main stem of the Columbia River. These two systems account for about 65% of the total BC Hydro generation capacity of about 11,000 MW. The production schedules for all other hydro projects were fixed at their normal operation. The most significant of these resources include hydro projects on the Bridge River, Campbell River, and other small hydro systems, and thermal generation plants such as Burrard and the Island Cogeneration plant (ICG).

The hydroelectric system on the Peace River is located in the northern interior of British Columbia and consists of: (1) the W.A.C. Bennett dam (Williston Reservoir) and the G.M. Shrum generating station (GMS), and (2) the Peace Canyon dam (Dinosaur Reservoir) and the Peace Canyon generating station (PCN). The Peace system supplies approximately one third of BC Hydro's annual energy production. The powerhouse of the GMS generating station is equipped with 10 generating units with a total capacity of 2,730 megawatts (MW). The average annual inflow to the Williston reservoir is about 1,080 m³/s and its drainage area is about 68,900 km². Williston reservoir is a multiyear storage reservoir and is the largest in the Province, with a live storage capacity of 39,471 million m³. The reservoir reaches its lowest operational level (645 m) in April and May and its maximum water level (672 m) in September and October. The Peace Canyon dam is located 23 kilometers downstream of the Bennett dam and has a generating capacity of 694 MW from 4 generating units. The Dinosaur reservoir inundates 9 km² with a very small local natural inflow and is normally operated between elevations 500 m and 502.9 m. The GMS and PCN generating stations are typically operated in hydraulic balance, and the flow is tightly controlled during the winter season to manage downstream ice conditions on the Peace River.

The Columbia basin is situated in the southern interior of British Columbia. The main hydroelectric facilities on the Columbia River comprise: (1) Mica dam (Kinbasket reservoir) and the Mica generating station (MCA), (2) Revelstoke dam and reservoir and the Revelstoke generating station (REV), and (3) the Hugh Keenleyside dam (Arrow Lakes reservoir) and the Arrow Lakes generating station (KNA). Mica and the Hugh Keenleyside projects were constructed under the Columbia River Treaty and are operated to maximize the mutual benefits to Canada and the US with respect to flood control and power generation. The Mica generating station consists of 4 generating units with a generating capacity of 1,805 MW. The Kinbasket reservoir, which is the second largest multiyear storage reservoir within the BC Hydro system, has a live storage capacity of 14,800 million m³. It receives an average annual inflow of 586 m³/s from a drainage basin of 21,000 km². The Revelstoke dam is located about 130 km south of the Mica dam. The Revelstoke generating station also has 4 generating units, with a total generating capacity of 2,000 MW.
The drainage area upstream of the Revelstoke dam is about 5,500 km², and the average annual local inflow is 221 m³/s, which is predominantly snowmelt driven. The maximum operating elevation is 573.0 m, with a normal daily fluctuation of about 1.0 m. The Arrow Lakes extend about 230 km downstream of the Revelstoke dam, and their drainage basin is about 10,000 km². The inflows to the Arrow Lakes are regulated by the Mica dam. The Arrow Lakes generating station has 2 generating units with a capacity of 185 MW. Throughout this section, the five main Peace and Columbia reservoirs/dams/generating stations will be referred to as: GMS, PCN, MCA, REV, and KNA.

5.4.2. Input Data

This section presents a summary of the historical and forecasted data that were used as inputs in the case study. Also, the physical and operating limits for the five reservoirs considered in this case study are outlined. The case study is set to run on a monthly time step, which was then subdivided into three sub-time steps for weekdays (WK) and one time step for the weekend (WE). Weekends include Sundays and holidays. The weekday time steps were split into peak (WKPK), high (WKHI), and low (WKLO) sub-time steps of 2, 14, and 8 hours respectively. The subdivision into sub-time steps enhances the representation of the hourly variation in system load and market prices. Table 5.10 presents the monthly hours in each sub-time step for typical months in a non-leap year: for each month, the weekday sub-time-step hours follow from the 2, 14, and 8 hours per weekday (WKPK between 48 and 52 hours per month, WKHI between 336 and 364, and WKLO between 192 and 208), the weekend hours range from 96 (February) to 144 per month, and the monthly totals are 744, 720, and 672 hours for 31-day months, 30-day months, and February respectively.

5.4.2.1. Inflows

The characteristics of the local natural inflows to the five reservoirs on the Peace and Columbia Rivers are given in Table 5.11. The historical records considered cover 60 water years, for the period 1940-2000. Figure 5.12 presents the average monthly inflows and the cumulative monthly inflows. The inflows, which mainly result from snowmelt, typically increase in April and May, peak in June and July, and taper off in January. Figure 5.13 presents the normalized cumulative distribution of the average monthly inflows for the five reservoirs in this study.

Table 5.11. Annual average reservoir local inflow characteristics in cms
River     Reservoir   Dam                  Average  Minimum  Maximum  Std. Dev.
Peace     Williston   G.M. Shrum (GMS)      12,940    9,158   17,197      1,750
Peace     Dinosaur    Peace Canyon (PCN)       321        1    1,618        353
Columbia  Kinbasket   Mica (MCA)             6,891    5,578    8,726        678
Columbia  Revelstoke  Revelstoke (REV)       2,821    2,112    4,030        379
Columbia  Arrow       Keenleyside (KNA)      4,256    2,141    5,518        834

Figure 5.12. Average monthly reservoir local natural inflows

Figure 5.13. Normalized cumulative monthly distribution of the average annual reservoir local natural inflows

5.4.2.2. Market Prices

Table 5.12 and Figure 5.14 present the average monthly market price forecast for each sub-time step. It was assumed that the peak and high weekday prices are equal, and that the weekend market prices are equal to those of the low (WKLO) weekday sub-time step. Table 5.12.
Average monthly market price in each sub-time step ($/MWh)

Month      WKPK   WKHI   WKLO    WE
October    54.23  54.23  52.39  52.39
November   58.95  58.95  54.92  54.92
December   63.02  63.02  49.57  49.57
January    66.13  66.13  53.08  53.08
February   56.43  56.43  51.93  51.93
March      54.86  54.86  50.60  50.60
April      54.17  54.17  46.11  46.11
May        50.78  50.78  36.54  36.54
June       50.12  50.12  36.83  36.83
July       54.63  54.63  43.20  43.20
August     75.13  75.13  45.76  45.76
September  56.89  56.89  50.51  50.51
Minimum    50.12  50.12  36.54  36.54
Average    57.94  57.94  47.62  47.62
Maximum    75.13  75.13  54.92  54.92

Figure 5.14. Average monthly market price in each sub-time step

5.4.2.3. The Electricity System Load

The average monthly electricity load forecast considered in the case study is presented in Table 5.13. This electricity load should be met mainly by the generation from the five hydropower plants on the Peace and Columbia Rivers and other system resources. The other system resources were considered fixed; however, they were shaped according to the historical operation pattern. Figure 5.15 displays the average monthly percentage of the annual load, based on 15 years of BC Hydro historical records.

Table 5.13. Average monthly electricity system load in each sub-time step (GWh)
Month      WKPK   WKHI   WKLO     WE  Total
October     339  2,374  1,357    783  4,853
November    348  2,437  1,393  1,044  5,222
December    380  2,659  1,519  1,094  5,652
January     400  2,798  1,599    922  5,719
February    366  2,562  1,464    732  5,124
March       369  2,583  1,476    852  5,280
April       310  2,168  1,239    929  4,645
May         304  2,128  1,216    875  4,523
June        313  2,189  1,251    577  4,330
July        312  2,184  1,248    720  4,464
August      305  2,135  1,220    879  4,539
September   307  2,148  1,227    736  4,418

Figure 5.15.
Historical monthly distribution of the percentage of system load

5.4.2.4. Transmission Limits

The monthly transmission limits used in the study for imports and exports to the US and Alberta markets are given in Tables 5.14 and 5.15. These limits represent the capacity of the transmission lines between BC and the US and Alberta, based on historical records.

Table 5.14. BC-US transmission limits (MWh)
                        Export                          Import
Month      WKPK   WKHI   WKLO     WE      WKPK   WKHI   WKLO     WE
October    1,800  1,800  1,800  1,800     2,600  2,600  2,600  2,600
November   1,500  1,500  1,500  1,500     2,600  2,600  2,600  2,600
December   1,500  1,500  1,500  1,500     2,600  2,600  2,600  2,600
January    1,500  1,500  1,500  1,500     2,600  2,600  2,600  2,600
February   1,500  1,500  1,500  1,500     2,600  2,600  2,600  2,600
March      1,500  1,500  1,500  1,500     2,300  2,300  2,300  2,300
April      1,800  1,800  1,800  1,800     2,300  2,300  2,300  2,300
May        1,800  1,800  1,800  1,800     2,300  2,300  2,300  2,300
June       1,800  1,800  1,800  1,800     2,300  2,300  2,300  2,300
July       1,926  1,949  1,957  1,960     2,600  2,600  2,600  2,600
August     1,926  1,948  1,956  1,931     2,600  2,600  2,600  2,600
September  1,926  1,942  1,948  1,930     2,600  2,600  2,600  2,600

Table 5.15. BC-Alberta transmission limits (MWh)
                     Export                     Import
Month      WKPK  WKHI  WKLO    WE     WKPK  WKHI  WKLO    WE
October     560   560   560   560      700   732   744   700
November    640   640   640   640      725   741   747   725
December    560   560   560   560      725   757   769   725
January     560   560   560   560      750   764   769   767
February    560   560   560   560      725   757   769   725
March       640   640   640   640      725   741   747   725
April       560   560   560   560      560   586   595   569
May         560   560   560   560      560   573   578   562
June        560   560   560   560      560   586   595   560
July        435   435   435   435      560   582   590   587
August      435   435   435   435      580   606   615   584
September   435   435   435   435      560   586   595   565

5.4.2.5. Forward Sales

Based on historical records for the period January 2000 to June 2005, a range of pre-sales in the US and Alberta markets was used in the case study.
Table 5.16 lists the minimum and maximum monthly forward sales for each sub-time step (WKPK, WKHI, WKLO, and WE). A negative sign denotes pre-export and a positive sign denotes pre-import.

Table 5.16. Forward transactions (pre-export and pre-import): minimum and maximum monthly values (GWh) by sub-time step

5.4.2.6. Operation Constraints

The operating and physical limits on reservoir storage, turbine discharge, plant discharge, and plant generation are given in Table 5.17.

Table 5.17. Reservoir operating and physical limits
                                GMS     PCN      MCA     REV      KNA
Storage (cms-d)         Max  475,000   2,695  285,000  60,500  104,825
                        Min   55,000   2,440  125,000  58,500    7,850
Turbine discharge (cms) Max    1,858   2,025    1,140   1,743      845
                        Min        0       0        0       0        0
Plant discharge (cms)   Max    1,959   1,982        -       -        -
                        Min       43    283*        -       -        -
Generation (MW)         Max    2,730     700    1,800   2,000      190
                        Min       58      50        0       0        0

* PCN minimum ice flow releases are 1,100 cms in December and 1,500 cms in January.

To represent the Columbia River Treaty operation, the monthly average releases from KNA listed in Table 5.18 were used in this case study.

Table 5.18. Average outflow releases from Keenleyside (cms)
Month        Oct    Nov    Dec    Jan    Feb   Mar   Apr   May   Jun    Jul    Aug    Sep
KNA Outflow  1101   1410   1664   1812   1226   761   712   718   830   1678   2159   1742

5.4.3. Scenario Generation

5.4.3.1. Inflow, Price, and Load Scenarios

The moment matching scenario generation technique was used to develop a number of scenarios that reproduce the statistical properties of the historical data records for the inflow, price, and load variables.
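The statistical targets that the moment matching algorithm must reproduce, namely the first four moments of each variable plus the cross-correlations, can be estimated as in the sketch below. The kurtosis convention used in the thesis tables is not stated, so the non-excess definition is assumed here:

```python
import numpy as np

def four_moments(x):
    """Mean, standard deviation, skewness, and (non-excess) kurtosis."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()
    z = (x - mu) / sigma
    return mu, sigma, (z ** 3).mean(), (z ** 4).mean()

def correlation_matrix(history):
    """Correlation targets; history has shape (n_years, n_variables)."""
    return np.corrcoef(history, rowvar=False)

# e.g. moments of a small symmetric sample: zero skewness by construction
print(four_moments([1.0, 2.0, 3.0, 4.0, 5.0]))
```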
Based on the 60 years of records, the first four moments (mean, standard deviation, skewness, and kurtosis) and the correlations of the monthly inflows for the Peace and Columbia Rivers were estimated. The Peace inflows represent the sum of the GMS and PCN local inflows; the Columbia inflows represent the aggregated local inflows of MCA, REV, and KNA. The statistics presented in Tables 5.19 and 5.20 define the properties of the 24 Peace and Columbia monthly inflow random variables.

Table 5.19. Monthly Peace system inflow statistics
Statistic        Oct    Nov    Dec    Jan    Feb   Mar   Apr   May    Jun    Jul    Aug   Sep
Mean (cms) μ     817    570    375    324    279   270   545  2512   3751   2002   1012   804
SD (cms) σ       243    179     95     93     71    72   235   685    906    613    337   234
Skew γ          0.24   0.54  -0.02  0.822  0.342  1.08  0.22  0.76   0.36   1.25   0.30  1.55
Kurtosis κ      1.56   1.80   1.50   2.18   1.62  0.67  1.94  2.67   2.07   1.63   3.06  1.59

Table 5.20. Monthly Columbia system inflow statistics
Statistic        Oct    Nov    Dec    Jan    Feb   Mar   Apr   May    Jun    Jul    Aug   Sep
Mean (cms) μ     659    474    337    285    264   283   637  2044   3298   2845   1811  1030
SD (cms) σ       172    144     80     69     69    80   190   490    645    606    340   234
Skew γ          1.20   0.74   0.71   0.28   0.14  2.05  2.01  1.58   2.03   1.52   1.19  2.90
Kurtosis κ      1.65   2.93   0.39   1.66   0.73  1.23  3.00  0.72   0.39   1.65   2.00  4.23

A correlation analysis between the different variables was also carried out for the moment matching method. The results indicate a weak correlation between the Peace and Columbia inflows; this weak correlation protects against drought conditions at the system level, since low inflows on one river system are unlikely to coincide with low inflows on the other. Operations planners at BC Hydro carried out a regression analysis relating the heavy load hour (HLH) and light load hour (LLH) electricity prices to the annual inflow volume at the Dalles dam (located on the Columbia River in the US). This relationship was then expressed in the form of price multipliers used to inflate or deflate the average electricity market price forecast.
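The price-multiplier mechanism described above can be illustrated as follows. The regression itself belongs to BC Hydro's planners and is not reproduced in the text, so the multiplier values below are hypothetical placeholders; only the average HLH price is taken from Table 5.12:

```python
# Hypothetical HLH price multipliers keyed by a Dalles inflow-volume
# scenario; a dry year inflates the average price, a wet year deflates it.
avg_hlh_price = 57.94                                            # $/MWh, Table 5.12
hlh_multipliers = {"dry": 1.15, "average": 1.00, "wet": 0.85}    # assumed values

hlh_price_scenarios = {
    scenario: round(avg_hlh_price * m, 2)
    for scenario, m in hlh_multipliers.items()
}
print(hlh_price_scenarios["dry"])   # 66.63
```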
Therefore, an additional random variable was added to the moment matching algorithm to generate scenarios representing the Dalles inflow volumes, and a set of HLH and LLH price multipliers was then used to generate the market price scenarios. Table 5.21 lists the statistics of the Dalles inflow volumes.

Table 5.21. Inflow volume statistics at the Dalles Dam
Mean μ (bm³)    171
SD σ (bm³)       32
Skew γ        -0.03
Kurtosis κ     1.50

As shown in Figure 5.16, a low correlation exists between the Peace inflow, the Columbia inflow at KNA, and the inflow volume at the Dalles during the fall and winter months. On the other hand, a higher correlation exists between the Columbia inflow at KNA and the Dalles inflow volume during the summer months, as compared with the other months of the year.

Figure 5.16. Correlation between the Dalles inflow volume and the monthly Peace and Columbia inflows

To express the monthly variability of the load forecast, the 5th, 50th, and 95th percentiles of the annual electricity load forecast presented in Table 5.13 were considered. Table 5.22 presents the results of using the three-point approximation method for a continuous variable, as described by Pearson and Tukey, 1965, to estimate the mean and standard deviation of the electric system load.

Table 5.22. Results of the three-point approximation of the mean and the variance
Statistic           Percentile   Load (GWh)
5th percentile      P5               56,046
Median              P50              58,769
95th percentile     P95              61,480
Mean                μ                58,767
Standard deviation  σ                 1,652

A Monte Carlo simulation was then used to generate 10,000 load scenarios. Figure 5.17 presents the cumulative probability distribution function of the system electric load for a log-normal distribution. The probability distribution was divided into 60 increments to match the number of historical inflow data sets for the Peace and Columbia Rivers and for the Columbia River at the Dalles.
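Table 5.22 can be reproduced with the standard Pearson-Tukey three-point formulas. The exact formulas used in the thesis are not shown, so the common forms below, mean ≈ P50 + 0.185(P5 + P95 − 2·P50) and σ ≈ (P95 − P5)/3.29, are assumptions; they do, however, match the tabulated values to the nearest GWh:

```python
def pearson_tukey(p5, p50, p95):
    """Three-point approximation of the mean and standard deviation of a
    continuous variable from its 5th, 50th, and 95th percentiles.
    (Assumed forms; the 3.29 divisor equals 2 x 1.645, which is exact
    for a normally distributed variable.)"""
    mean = p50 + 0.185 * (p5 + p95 - 2.0 * p50)
    sd = (p95 - p5) / 3.29
    return mean, sd

mean, sd = pearson_tukey(56_046, 58_769, 61_480)
print(round(mean), round(sd))   # 58767 1652, matching Table 5.22
```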
These sixty load values were then used to estimate the first four moments and the load correlation with the other random variables.

Figure 5.17. Probability distribution function of the forecasted system load

5.4.3.2. Results of the Moment Matching Algorithm

The moment matching algorithm developed by Kaut et al. (2003) was used to generate the set of scenarios used in this study. The input data included the first four moments and the correlation matrix of the 26 variables described above (the twelve monthly Peace and twelve monthly Columbia inflows, the Dalles inflow volume, and the electricity load forecast). The algorithm was run to generate a number of scenario sets, ranging from 10 to 100 scenarios for each of the 26 variables. To evaluate the quality of the generated scenarios, a comparison between the statistics of the generated scenarios and the historical data was carried out, using the absolute mean relative error (MRE) metric. Figure 5.18 presents the MRE for the four estimated moments. Overall, the graph indicates that the generated scenarios preserve the statistical properties of the original data for the 26 variables with a high degree of accuracy. The means of the generated scenarios are a 100% match to the means of the historical data, even for the case of as few as 10 scenarios. For the other statistics (standard deviation, skewness, and kurtosis), Figure 5.18 demonstrates that the MRE decreases as the number of generated scenarios increases. The MRE falls below 5% as the number of scenarios reaches 30, and it reduces further, though at a decreasing rate, as the number of generated scenarios grows.

Figure 5.18.
Testing the quality of the different numbers of generated scenarios

As an example, the statistics of the 30 generated scenarios of the Peace River inflows are compared with the historical records in Figure 5.19, with panels for the mean, standard deviation, skewness, and kurtosis. The graph clearly demonstrates the high degree of accuracy of the estimated four moments of the generated scenarios.

Figure 5.19. Comparison between the Peace inflow historical data and the generated scenario statistics

5.4.4. Results of the RLROM Model Runs

The input data for the case study presented in section 5.4.2 and the set of scenarios of the random variables described in section 5.4.3 were used to run the RLROM model. The model was set to run on a monthly time step for thirty-six months, starting from the beginning of the water year in October. The results of using the model for the case study are presented and discussed in the following sections. To assess the quality of the operation planning policies derived by the model, the results are presented and analyzed. Finally, the model was run in control mode and the results of a simulation run are presented and discussed. The following parameter settings were adopted in this case study: an exploitation exponent of 0.0125, a learning rate exponent of 0.845, and an annual discount rate of 8%.

5.4.4.1.
Value of Water in Storage

As stated in the previous chapter, the objective function maximizes the expected net revenue from the spot and forward transactions. Figures 5.20 to 5.22 present three-dimensional (3-D) views of the value function, in Canadian dollars, as a function of the GMS and MCA storage volumes. The 3-D plots present the shape of the value-of-water-in-storage surface for different months of the year, namely December, May, and August, chosen to represent the winter, the spring (freshet), and the summer respectively. The graphs clearly demonstrate the dependence of the value function on the storage levels in both the GMS and MCA reservoirs. It can also be seen that this dependence on the storage level varies from month to month; for example, the value function in May is less sensitive to the storage level in the other reservoir. The minimum value of water in storage occurs during the freshet, when the storage level is rapidly increasing and the market price is low. The slope of the value function surface represents the value of each unit of storage in dollars per cubic meter second-day ($/cms-d). The derivative of the value function with respect to the storage volume in MCA represents the marginal value of water in MCA; similarly, the derivative with respect to GMS storage represents the marginal value of water in GMS. The marginal value of water is also called the shadow price, or the Lagrange multiplier of the reservoir mass balance equation (Equation 4.9). The steeper slopes of the storage value surfaces for December and August, shown in Figures 5.20 and 5.22, indicate a higher marginal value of water ($/cms-d) in those months compared with the corresponding values in May.
This is mainly attributed to the high market prices and high demand during the winter months, as shown in Figures 5.14 and 5.15. In summer, although the demand is low and the inflow hydrograph is in a receding trend, the high marginal values result from the greater market opportunities during the heavy load hours (HLH), as shown in Table 5.12. On the contrary, the high inflows, low demands, and low market prices during the freshet result in flatter slopes of the storage value function, as shown in Figure 5.21.

Figure 5.20. Storage value function in December

Figure 5.21. Storage value function in May

Figure 5.22. Storage value function in August

To examine the value function, the December storage value is plotted, as an example, in a 2-D view in Figure 5.23 for different increments of GMS storage. The results show the decreasing slope of the value function as storage in Mica increases. This indicates the effect of the available storage volume on the value of water in Mica reservoir: as the Mica reservoir level rises, the marginal value of the water decreases. The graph also demonstrates the decreasing marginal value of water in Mica as more storage becomes available in GMS.
The decreasing gap between the curves of the different GMS storage increments indicates a reduction in the marginal value of water in GMS for a given storage level in Mica, as shown in Figure 5.23.

Figure 5.23. Mica storage value function in December

Figure 5.24 presents the monthly value of water in storage for different GMS reservoir storage increments, with a storage level of 50% in Mica reservoir (205,000 cms-d). The results indicate that the highest storage values occur in the period of November to January and the lowest storage values in the period of April to June. With low storage in GMS in December and January, the graph demonstrates that the marginal water values (the slope of the first segment of the storage value function) tend to be higher because the electricity demand is high, inflows are low, and the Peace River ice flow constraint is imposed in December and January.

Figure 5.24. Monthly GMS storage value function with Mica reservoir at 50% storage level

5.4.4.2. Marginal value of energy

One of the most significant pieces of information obtained from the model results is the marginal value of energy, expressed in dollars per megawatt-hour ($/MWh). As stated earlier, the significance of this information is that the operations planning dispatch decisions are based on a comparison of the energy value with market prices. The marginal value of energy is obtained by converting the marginal value of water in $/cms-d to $/MWh using the HK factor, as expressed in Equation 4.22. Tables 5.23 and 5.24 present the monthly range of the marginal value of energy for MCA and GMS in $/MWh.
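Equation 4.22 is not reproduced in this chapter, but the conversion it describes can be sketched as follows, assuming the HK factor maps turbine discharge in cms to power in MW, so that one cms-d of storage corresponds to 24·HK MWh of energy. The function name and the example numbers are illustrative:

```python
def marginal_value_energy(mv_water, hk):
    """Convert a marginal value of water in $/cms-d into a marginal
    value of energy in $/MWh, given an HK factor in MW per cms.
    Under this assumption, one cms-d of storage releases 24 * hk MWh
    through the turbines."""
    return mv_water / (24.0 * hk)

# e.g. a water value of 2,400 $/cms-d at HK = 2.0 MW/cms
# corresponds to 2400 / 48 = 50 $/MWh
```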
For each month, the marginal value of energy is given for the combinations of empty and full MCA storage (125,000 and 285,000 cms-d) and empty and full storage in GMS (55,000 and 475,000 cms-d). For the case of an empty MCA reservoir, the summary results in Table 5.23 clearly demonstrate the impact of the storage level in GMS on the marginal value in MCA in almost all months, with the exception of the period from April to June, where the MCA marginal value of energy is not sensitive to the storage level in GMS. The distinctly high marginal values of energy in certain months can be attributed to the cost of satisfying the high KNA outflow releases required in those months by the Columbia Treaty operation (average KNA outflow values were presented in Table 5.18). When the MCA reservoir is full, it can be seen that the marginal value of energy is less sensitive to the available storage level in GMS. The same insensitivity of the MCA marginal values of energy is noticeable during the freshet, even for empty GMS storage. However, the influence of GMS storage on the MCA marginal value of energy during the freshet is most noticeable in the middle range of MCA storage, as shown in Figure 5.28.

Table 5.23. MCA monthly range of marginal value of energy
             GMS Storage    MCA Marginal Value ($/MWh)
Month        (cms-d)        Empty (125,000 cms-d)   Full (285,000 cms-d)
October      55,000         295.87                  47.14
             475,000        173.32                  43.66
November     55,000         338.65                  48.60
             475,000        224.70                  44.93
December     55,000         264.90                  49.49
             475,000        141.20                  43.85
January      55,000         335.18                  44.67
             475,000        126.65                  37.36
February     55,000         186.78                  36.69
             475,000         86.61                  27.03
March        55,000          83.50                  25.88
             475,000         53.79                  14.96
April        55,000          48.72                   7.62
             475,000         42.70                   4.68
May          55,000          50.04                   3.33
             475,000         49.72                   1.72
June         55,000          60.56                  14.35
             475,000         53.62                  10.18
July         55,000         106.71                  36.47
             475,000        103.53                  35.02
August       55,000         161.64                  45.29
             475,000        138.72                  44.19
September    55,000         199.16                  46.50
             475,000        141.63                  44.51

The summary results presented in Table 5.24 show the higher GMS marginal value of energy when the reservoir is at low storage levels during the period September to January. This is mainly due to the high cost associated with meeting the Peace ice flow constraint in December and January (1,100 and 1,500 cms respectively). The lowest GMS marginal values occur in April and May, where the marginal value drops to nearly zero $/MWh when both GMS and MCA are full.

Table 5.24. GMS monthly range of marginal value of energy
             MCA Storage    GMS Marginal Value ($/MWh)
Month        (cms-d)        Empty (55,000 cms-d)    Full (475,000 cms-d)
October      125,000        231.78                  41.87
             285,000         96.70                  41.74
November     125,000        440.05                  42.12
             285,000        267.82                  41.78
December     125,000        461.40                  41.21
             285,000        297.25                  41.49
January      125,000        349.27                  37.84
             285,000         54.81                  37.00
February     125,000        200.78                  26.96
             285,000         64.85                  24.34
March        125,000         94.50                  11.77
             285,000         50.91                   8.66
April        125,000         47.53                   1.77
             285,000         45.16                   0.20
May          125,000         48.89                   0.86
             285,000         48.42                   0.40
June         125,000         53.64                  23.42
             285,000         47.93                  11.58
July         125,000         65.76                  42.06
             285,000         51.83                  40.63
August       125,000         64.14                  42.45
             285,000         55.46                  40.88
September    125,000        128.93                  42.50
             285,000         61.15                  42.19

Figures 5.25 to 5.28 present typical examples of the marginal value of energy for MCA and GMS in winter, spring, and summer months: December, May, and August. Figure 5.25 demonstrates the sensitivity of the marginal value of energy in MCA to the available storage in the reservoir, particularly when the reservoir drops below 205,000 cms-d. This sensitivity is more prominent when the GMS reservoir storage is at or below 150,000 cms-d. When the MCA and GMS reservoirs are at their lowest levels (i.e. at 125,000 and 55,000 cms-d respectively), the MCA marginal value of energy increases rapidly. The high marginal value is due to the penalty for violating the minimum storage constraint to meet the system electricity load or the Columbia Treaty operation outflow constraint at KNA. In general, the results suggest that at the low range of GMS and MCA storage levels, the marginal value of energy is higher than the average LLH market price in December. In such a case it would be more valuable to store water than to release it.

Figure 5.25. MCA marginal value of energy in December
On the other hand, Figures 5.26 and 5.27 show that the marginal value of energy in GMS is more sensitive to the variation in MCA storage below a GMS storage volume of 250,000 cms-d. The dependence of the GMS marginal value of energy on MCA storage is apparent when the storage in MCA is at or below 185,000 cms-d. The higher value of GMS energy in December is mainly attributed to the higher electricity load and the high market prices, accompanied by the lower inflow conditions in winter as compared with the freshet period. The high outflow releases downstream of Peace Canyon dam (PCN) in December and January, required to maintain the winter ice flow constraint, also contribute to the high marginal values, particularly when MCA is at or below 145,000 cms-d. Figure 5.27 demonstrates that the variation of the GMS marginal value for different MCA storage decreases as the GMS reservoir rises above 50% of its full storage volume (250,000 cms-d). This is apparent in the decreasing gap between the marginal value of energy curves for different values of MCA storage as GMS storage increases.

Figure 5.26. GMS marginal value of energy in December

Figure 5.27. GMS marginal value of energy in December

Figures 5.28 and 5.29 display the marginal value of energy for MCA and GMS in the month of May. In general, the marginal value of water is low during this time of the year for the full range of storage in both the MCA and GMS reservoirs. The highest MCA and GMS marginal values of energy occur when the MCA and GMS reservoirs are empty, where the value of energy is about 50.0 $/MWh. On the other hand, the value of energy drops to almost zero when the reservoirs are at full pool level.
The low marginal value of water is attributed to the fact that additional storage in these reservoirs would have a high probability of being spilled during this period. Figure 5.28 demonstrates the significance of the change in GMS storage levels on the MCA marginal values of energy in the range from 165,000 to 265,000 cms-d. Figure 5.29 portrays the influence of variation in MCA storage on the GMS marginal value of energy, in particular with GMS storage in the range from 250,000 to 400,000 cms-d.

Figure 5.28. MCA marginal value of energy in May

Figure 5.29. GMS marginal value of energy in May

Figure 5.30 demonstrates the marginal value of energy for GMS in August. The higher inflows and lower electricity demand in August, as compared with December, result in a noticeably lower marginal value when both MCA and GMS are at high storage levels. Despite the lower load in August compared to the winter months, the high marginal values when the reservoirs are low can be attributed to the favourable California market opportunities in August, particularly during heavy load hours (see Table 5.12 for the market price structure). Figure 5.30 illustrates the significant influence of the MCA storage level on the GMS marginal value of energy, particularly when the GMS storage drops below 300,000 cms-d. Above 300,000 cms-d, the marginal value is less dependent on the GMS storage level.

Figure 5.30. GMS marginal value of energy in August
Figures 5.31 and 5.32 display the monthly MCA marginal value of energy for different GMS storage increments, with MCA storage at 125,000 and 285,000 cms-d respectively. Figure 5.31 displays, on the secondary y-axis, the average Keenleyside (KNA) outflow releases stipulated by the Columbia River Treaty. The impact of these high release requirements in the winter months, combined with high electricity load, high market prices, and low inflows, is reflected in the high marginal values during those months. This effect is less noticeable in the summer, as the load is not high and higher inflows are available compared with the winter months. The high marginal value of energy in the summer is mainly due to better market opportunities, particularly in August and September, as shown earlier in Table 5.12. Figures 5.31 and 5.32 demonstrate that, whatever the available storage in the GMS and MCA reservoirs, the lowest marginal values occur in April and May. Obviously, high natural inflows and lower electricity demand are the main reasons. Both figures also point to the increased sensitivity of the MCA marginal values to incremental changes in GMS storage in the fall and winter months.

Figure 5.31. MCA monthly marginal value of energy for a low storage level at MCA (MCA storage = 125,000 cms-d)

Figure 5.32. MCA monthly marginal value of energy for a high storage level at MCA (MCA storage = 285,000 cms-d)
Figure 5.33 demonstrates the effect of the high Columbia Treaty release requirements on the KNA marginal value of water, in $/m³, when the MCA reservoir is at low and at high storage levels. The marginal values are extracted from the shadow price of the hydraulic continuity constraint (storage mass balance) of the GOM model. The results indicate that when the MCA storage level is low, the high flow releases result in significantly high marginal values of water at KNA. These marginal values propagate and cause an increase in the MCA marginal value of water/energy, as presented in Figure 5.31, in particular when the storage level at MCA is low.

Figure 5.33. Marginal value of KNA storage constraint

Figure 5.34 displays the monthly results of the GMS marginal values of energy for a GMS storage level of 55,000 cms-d. Similar to MCA's marginal values, the results indicate higher marginal values during the winter months. In addition, the results indicate the dependence of the marginal values in GMS on the available storage in MCA during the winter and fall months. The highest GMS marginal values occur in November, December, and January, in response to the high releases required to satisfy the Peace River ice flow constraint and to high market prices. The graph also indicates that the GMS marginal value is independent of MCA storage in the spring and early summer months, as the storage level in the reservoir increases and the demand is low.

Figure 5.34. GMS monthly marginal value of energy for an empty GMS reservoir storage (GMS storage = 55,000 cms-d)

Figure 5.35 displays the monthly results of the GMS marginal values of energy for a full GMS storage of 475,000 cms-d. Similar to MCA, the GMS lowest marginal values occur in April and May.
However, when the GMS reservoir is full, the marginal values are not sensitive to changes in MCA storage levels almost all year round. The graph also demonstrates that the marginal values during July and September are of the same order of magnitude as those in October and December.

Figure 5.35. GMS monthly marginal value of energy for a full GMS reservoir storage (GMS storage = 475,000 cms-d)

To demonstrate the variation in the GMS marginal value of energy in the different months, the cases of empty and full MCA storage are presented. For these two cases, Figures 5.36 and 5.37 portray the details of the monthly variation in GMS marginal values of energy for the range of storage increments at GMS. The figures demonstrate the significant difference in the GMS marginal value of energy for GMS storage below a level of 150,000 cms-d, for low and full MCA storage. This difference is apparent for most of the year, with the exception of the period from April to June. Figures 5.38 and 5.39 illustrate a magnified region of the results for GMS storage values above 150,000 cms-d. Comparing the results presented in the two figures, it can be seen that higher marginal values are noticeable in GMS when the MCA storage is low. This effect on the GMS marginal value varies from month to month and decreases as the GMS storage level increases. It also shows that the effect of MCA storage on the GMS marginal value is minimal when the GMS storage level is high. In both cases, the highest GMS marginal values occur in the period from October to January. The lowest marginal values occur during the months of April and May.
Figure 5.36. GMS monthly marginal value of energy for a low storage level at MCA (MCA storage = 125,000 cms-d)

Figure 5.37. GMS monthly marginal value of energy for a high storage level at MCA (MCA storage = 285,000 cms-d)

Figure 5.38. GMS monthly marginal value of energy for a low storage level at MCA (MCA storage = 125,000 cms-d)

Figure 5.39. GMS monthly marginal value of energy for a high storage level at MCA (MCA storage = 285,000 cms-d)

5.4.4.3. Optimum Control Policy Analysis

In this section, the optimized control policies for operations planning derived by the RLROM model are presented and discussed. The optimized control policy includes the net market transactions and the operations planning of the five main reservoirs on the Peace and Columbia Rivers, including the optimized plant generation and turbine releases. To illustrate the optimal control policies proposed by the RLROM model, the model was run using the market price, electricity system load, and inflow scenario data presented in Table 5.25. Figures 5.40 and 5.41 present 3-D views of the net market transactions in January and in August. It can be seen in Figure 5.40 that the model recommends a net import policy when the storage volumes in MCA and GMS are low. For an MCA storage of 125,000 cms-d, the results recommend a net import transaction regardless of the storage level at GMS. As the storage levels at MCA and GMS exceed 200,000 cms-d, the model suggests a net export policy, since the marginal value of energy drops below the market price and it would be more profitable to export.
Figure 5.41 shows the high export policies recommended by the model in August, as the market price in California is high in the heavy load hours. The GMS and MCA marginal values of energy for the case of full reservoirs are $44.51 and $42.19 respectively. When both reservoirs are at the 50% storage level (i.e. GMS and MCA storage of 205,000 cms-d and 200,000 cms-d), the marginal values of energy are $45.93 and $50.70 respectively. Given that the August HLH and LLH market prices are $91 and $53, it can be observed that the market prices are higher than the marginal value of energy even during LLH. This explains why the export policies suggested by the RLROM model are almost at the limit of the inter-tie transfer capability to both the US and Alberta markets, which is usually set at 2,500 GWh in August. When the storage levels in GMS and MCA are low, the model recommends a mix of imports and exports depending on the hour of the day. Nevertheless, the system is still a net exporter even with the low storage levels in GMS and MCA, as the electricity demand at this time of the year is relatively low.

Table 5.25. Market price, system load, and natural inflow scenario data
           Market Price ($)          System Load                       Local Inflow
Month      WKPK WKHI WKLO WEHI   WKPK  WKHI  WKLO  WEHI     GMS   PCN   MCA   REV   KNA
October      65   65   60   60   7,682 7,116 5,474 6,215     810    23   279   127   179
November     71   71   63   63   8,693 7,901 6,152 6,819     372    16   244   125   223
December     76   76   57   57   9,121 8,253 6,426 7,377     236    17   118    59   125
January      80   80   61   61   8,940 8,217 6,500 7,743     367    30    86    38   103
February     68   68   60   60   8,906 8,238 6,371 7,014     197    18    90    39   103
March        66   66   58   58   8,209 7,692 6,048 6,634     189    19    79    36    99
April        65   65   53   53   7,382 7,005 5,549 6,031     371    20   320   183   429
May          61   61   42   42   6,949 6,659 5,296 5,616   3,421    51   582   323   548
June         60   60   42   42   6,710 6,428 5,070 5,519   5,001    61 2,064   897 1,264
July         66   66   50   50   6,747 6,418 5,030 6,038   1,236    18 1,259   436   544
August       91   91   53   53   6,977 6,675 5,175 5,556     859    19   994   305   331
September    69   69   58   58   6,986 6,683 5,077 5,611     497    11   538   186   231

Figure 5.41. August net market transactions

For the cases when the storage levels in GMS are either low or full, Figures 5.42 and 5.43 portray the monthly net market transactions for different MCA storage volumes. The results show that the net market transactions are largely dependent on the storage volume available in each reservoir. Figure 5.42 demonstrates that, for a low GMS storage level, the system will switch to a net importer of energy for the whole range of MCA storage in March, and up to a storage level of 265,000 cms-d in February. For the other months, the net transactions are largely dependent on the available storage in MCA. When the GMS reservoir is full, Figure 5.43 shows that the system would be a net importer of energy in October, January, March, and April when the storage level in MCA is low. Figure 5.43 also illustrates that when the MCA reservoir is full, the highest export occurs in the period from August to February, whereas the lowest export transactions occur in the period from March to June.
This is attributed to the high HLH and LLH price margins in the summer and winter periods, as compared with those in the spring. The high level of net export transactions occurs in the winter months for this scenario run, as the load is lower than the available resources and there is room to export.

Figure 5.42. Monthly net market transactions for GMS storage of 55,000 cms-d

Figure 5.43. Monthly net market transactions for GMS storage of 475,000 cms-d

Figures 5.44 and 5.45 display the monthly variation in net market transactions with different storage levels in GMS and MCA. Figure 5.44 demonstrates that for low MCA and GMS storage levels, the system would be a net importer of energy for most of the year. When the GMS reservoir is full, the system would be a net exporter of energy in all months except January and March. Considering the case of a full MCA reservoir, Figure 5.45 shows that the system would be a net exporter of energy in all months, with the exception of March when storage in GMS is low. In addition, Figure 5.45 illustrates that when the MCA reservoir is full, the net transactions are sensitive to the amount of GMS storage in all months except August and January.

Figure 5.44. Monthly net market transactions for MCA storage of 125,000 cms-d

Figure 5.45. Monthly net market transactions for MCA storage of 285,000 cms-d
As indicated in the previous chapter, the RLROM optimizes the operation of the five main plants on the Peace and Columbia Rivers. The results for these plants for the month of January are presented next. Figures 5.46 and 5.47 display the generation of the five plants in January, for different MCA storage, when the GMS reservoir is at low and at high storage levels respectively. Figure 5.46 shows that when the storage levels at the MCA and GMS reservoirs are low, GMS generation was maintained at about 1,300 GWh in order to release the water needed to meet the Peace River ice flow constraint in January (1,500 cms), as illustrated in Figure 5.48. Figure 5.47 illustrates that when GMS storage is full, MCA generation increases gradually as the available storage in MCA increases. Revelstoke generation increases as MCA generation increases, to maintain the hydraulic balance with the MCA reservoir, as shown in Figures 5.48 and 5.49. Keenleyside generates up to its maximum limit to minimize the spill flow needed to meet the Columbia Treaty release requirements; the spill flow from Keenleyside is the difference between the required treaty releases and the maximum turbine discharge.

Figure 5.46. Plant generation in January at different MCA storage increments with GMS storage at 55,000 cms-d

Figure 5.47. Plant generation in January at different MCA storage increments with GMS storage at 475,000 cms-d
Figure 5.48. Plants' average turbine releases in January at different MCA storage increments with GMS storage at 55,000 cms-d

Figure 5.49. Plants' average turbine releases in January at different MCA storage increments with GMS storage at 475,000 cms-d

5.4.4.4. Using the RLROM Model in Control Mode

Once the RL agent has learned the optimal control policy, it can be used to control the operations planning of the different reservoirs, given the inflow, market prices, and electricity load at any given time period. The RLROM model is also capable of predicting the operation policy for an extended period of time (for example, several years) based on a given forecast. To achieve this, the RLROM model was modified to run in control mode, rather than in learning mode, using the learned optimal control policies. At any time period, and for a given starting storage level, the target storage is linearly interpolated between the target points of the state space. The model uses the learned target storage volumes, optimal forward sales, and marginal values of water to control the system operation. Given this information and the forecasted inflow, market price, and electricity load, the model solves the optimization problem at each time period and returns the optimal hydro generation schedules, turbine releases, and market transactions.
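The interpolation step above can be sketched as follows; the grid and target values are hypothetical placeholders, and the thesis model stores such targets per time period and per reservoir:

```python
import numpy as np

# Hypothetical state-space grid for GMS storage (cms-d) and the learned
# target storage for the next period at each grid point.
grid = np.array([55_000.0, 150_000.0, 250_000.0, 350_000.0, 475_000.0])
learned_target = np.array([60_000.0, 140_000.0, 230_000.0, 330_000.0, 450_000.0])

def target_storage(start_storage):
    """Linearly interpolate the learned target storage for a starting
    storage that falls between the state-space grid points."""
    return float(np.interp(start_storage, grid, learned_target))
```

A starting storage midway between two grid points then maps to the midpoint of the corresponding learned targets.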
To illustrate the results of implementing the RLROM model in control mode, a data set of inflow, price, and load forecasts and initial storage (or forebay) level conditions was considered in the analysis. Figure 5.50 presents the monthly generation of the five main plants in the study. The graph demonstrates that, despite the lower electricity load in the months of August and September, the plants' generation is high to take advantage of the export opportunities available in these months. On the other hand, in June the plants' generation drops to low levels: the model recommends that the system store more water during the freshet and early summer months, when there is little market opportunity, and generate heavily during the winter and late summer months to take advantage of the high market prices during these periods.

Figure 5.50. Monthly plants generation

Figure 5.51 displays the forecasted heavy load hour (HLH) and light load hour (LLH) market prices and the marginal value of energy for GMS and MCA in $/MWh. For the periods from October to February and from August to September, the marginal value of energy for GMS is lower than the LLH market price. Consequently, it is more profitable to release water from GMS than to store it. This is reflected in the high generation policy adopted in these months. In the period from March to July, it is more profitable to generate less at LLH, as the marginal value of energy is higher than the market price, and to generate more during HLH, when the market price is higher than the marginal value of energy. The lower load and market prices explain the low generation during those months, as it is more profitable to import and store water for the periods of good market opportunities in the summer and in the winter months.
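The dispatch logic described above amounts to comparing the marginal value of energy with the market price in each sub-time step. A minimal sketch of that decision rule follows; the function name and return labels are illustrative, not the thesis implementation:

```python
def dispatch_signal(marginal_value, market_price):
    """Return the storage decision implied by the comparison above:
    release (generate/export) when the market price exceeds the
    marginal value of energy in $/MWh, otherwise store (import)."""
    return "release" if market_price > marginal_value else "store"

# August example from the results: an HLH price of $91 against a
# marginal value near $46 favours releasing water for export.
```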
The results presented in Figure 5.51 also indicate that the marginal value of energy for MCA is slightly higher than that for GMS for most of the year. The marginal values of energy for MCA in the period from March to July are even higher than the HLH market prices. This explains the low MCA generation pattern during this period, as shown in Figure 5.50.

Figure 5.51. Market price and marginal value of energy

Figure 5.52 illustrates the load resource balance, where it can be seen that the system is in a net import mode for the period extending from February to July, while it is in a net export mode for the rest of the year.

Figure 5.52. Load resource balance

Figure 5.53 shows that about 51% of the total energy is generated by the Peace River system (GMS and PCN) and 49% by the Columbia River system (REV, MCA, and KNA). About 80% of the Peace generation was produced by GMS and the remaining 20% by PCN. The generation percentages from the MCA, REV, and KNA plants on the Columbia system were 42%, 52%, and 6% respectively. A detailed analysis of the results shows that the five main Peace and Columbia plants together generate about 65% of the energy needed to meet the system load. These percentages closely match the percentages of actual generation of the BC Hydro system (Making the Connection, 2000), which gives confidence in the model results.

Figure 5.53. Annual distribution of the five main plants' generation
Annual distribution of the five main plants generation To study the effect o f changes in the market price forecast on the optimized trading policies, three price scenarios (average, low, and high) were considered with an average annual price of $46, $41, and $51 respectively. Using the moment matching algorithm, three samples were generated for the stochastic variables and were used in this analysis. The results are displayed in Figure 5.54, and it suggests that, for the low price scenario, the system would be importing and exporting in different months o f the year, but it would be in a net importing mode (4695 G W h ) . Futher analysis o f the results shows that the system was a net exporter in January and in August, and a net importer for the rest o f the months. This is attributed to the fact that the L L H market prices are lower than the marginal value o f energy in all months and the H L H was also lower than the marginal value during the spring and fall periods where the system load is low and the market price is low. O n the other hand, for the case o f the high price scenario, the system would be in net exporter mode. The system is a net exporter in all months with the exception o f March to June. A s discussed earlier, when the marginal value o f energy is higher than the market price, the model tend to release more water for export purposes. 162 1500 -2500 J Month • Scenario 1-Average Price Figure 5.54. • Scenario 2-Lovv Price • Scenario 3-High Price Net market transactions for three price scenarios 5.5. Summary The R L approach methodology, as discussed in chapter 4 was first tested for a small scale single reservoir operation problem, and then on a two reservoir operation problem. The test and convergence results were encouraging and demonstrated the potential capability o f the R L approach to handle a larger scale multireservoir operation planning problem. 
The RL model was then expanded and linked to interact with the generalized optimization model (GOM) as the environment that represents (simulates) the real system. A piecewise linear function approximation technique was adopted to approximate the present value functions. Using this function approximation technique allowed the handling of a larger state space and gave the flexibility to solve the optimization problem at hand. The target storage levels were approximated as a continuous function, rather than as grid points in the state space as is usually done in the conventional SDP formulation. Three random variables were considered: natural river inflows, electricity load, and market prices. A scenario generation-moment matching technique was adopted to generate the inflow, load, and price scenarios. In this way, the serial and cross correlations of the inflow, load, and price random variables were considered in the scenario generation process.

The RLROM model was used to optimize the operation of the BC Hydro main reservoir systems on the Peace and the Columbia Rivers. Two state variables, for the GMS and MCA storage, were considered. The objective of the model was to estimate the value of water in storage, the marginal value of water/energy, and the optimal operation planning decisions, or control policies, including market transactions. In addition, the model optimizes the operation of the five main plants: GMS, PCN, MCA, REV, and KNA. The model was run on a monthly time step with a number of sub-time steps to capture the heavy load hour (HLH) and light load hour (LLH) variation in load and in market prices. The RLROM model results were presented and discussed, including the storage value function and the marginal value of energy/water. The optimized control operation policy established was also presented and discussed.
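The piecewise linear value function approximation mentioned above can be illustrated with a minimal sketch. The breakpoints and values below are hypothetical numbers chosen for illustration (in the model they are learned during training); the point is that a handful of breakpoints can stand in for the dense storage grid of a conventional SDP formulation.

```python
# Minimal sketch of a piecewise linear approximation of the storage value
# function. Breakpoints and values are hypothetical illustrations, not the
# thesis case-study data; the real model learns them during training.
import bisect

def pwl_value(breakpoints, values, storage):
    """Linearly interpolate the value of water between storage breakpoints."""
    if storage <= breakpoints[0]:
        return values[0]
    if storage >= breakpoints[-1]:
        return values[-1]
    i = bisect.bisect_right(breakpoints, storage)
    x0, x1 = breakpoints[i - 1], breakpoints[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (storage - x0) / (x1 - x0)

# A concave value of water in storage (diminishing marginal value):
bps = [0.0, 1000.0, 2000.0, 3000.0]   # storage (cms-day)
vals = [0.0, 60.0, 100.0, 125.0]      # value ($ million)
print(pwl_value(bps, vals, 1500.0))   # midpoint of the second segment -> 80.0
```

The slope of each segment is the marginal value of water over that storage range, which is why a concave piecewise linear function captures diminishing marginal value with very few parameters.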
A modified version of the model was developed that allows the results from the learning phase to be used in control mode. Examples of the results of using the RLROM model in control mode were also presented. The case study results indicate the dependence of the marginal value of energy in each reservoir on the available storage in other reservoirs. However, there are periods of time and ranges of storage levels where this dependence is not significant. Also, it was shown that the marginal value of energy is largely affected by the constraints imposed on the system operation. The impact of the ice flow constraint on the Peace River on the GMS marginal value of energy was clearly demonstrated. Similarly, the influence of the Columbia Treaty operation on the marginal value of energy for MCA was presented and discussed.

The results of using the RLROM model in system control mode indicate that the GMS and MCA reservoirs are typically drawn down by April to May, just prior to the spring freshet. The drawdown process generally begins in September and October, when inflows are low. The reservoir levels begin to increase in May, when the turbine discharges are reduced due to lower system demands and increased local inflow, and they reach their highest annual levels in the August to September period.

6. CONCLUSIONS AND RECOMMENDATIONS

6.1. Summary

Strategic planning of multireservoir operation involves taking decisions with uncertain knowledge of future supply (natural inflows), demand (domestic electricity load), and market conditions (market prices). In this research, different modeling approaches and optimization techniques for the operation of multireservoir systems have been reviewed. An explicit stochastic modeling approach was found to be the most appropriate way to address the different sources of uncertainty.
However, a major obstacle to this approach, which needs to be addressed and resolved, is the high dimensionality of the problem. The research carried out for this thesis concluded that stochastic dynamic programming (SDP) is still a very powerful technique for capturing the essence of sequential, nonlinear, and stochastic reservoir optimization problems. However, SDP requires that the values of the transition probabilities and the transition rewards (the model of the system) be calculated. For large-scale systems that involve several stochastic variables, constructing the system model can be a very difficult task. It is for this reason that DP is said to be plagued by the curse of dimensionality. One possibility for tackling these problems is through the use of machine learning techniques from the field of artificial intelligence (AI), particularly the Reinforcement Learning (RL) technique.

This thesis has explored the use of the RL artificial intelligence approach to solve the large-scale operations planning problem of a hydroelectric power multireservoir system. RL is a computational approach for learning from interaction with an environment and from the consequences of actions, in order to derive optimal control strategies. RL has adapted key ideas from various disciplines, namely: machine learning, operations research, control theory, psychology, and neuroscience (Sutton and Barto, 1998). The application of the RL technique in the water resources systems domain is relatively new. However, the advantages that RL offers in dealing with large-scale problems make it a promising area of research in that field. Modern reinforcement learning techniques can be applied both to trial and error learning, without a formal model of the environment, and to planning activities with a formal model of the environment, where an estimate of the state-transition probabilities and immediate expected rewards can easily be evaluated.
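The model-free learning referred to above can be made concrete with the standard Q-learning update (Watkins, 1989; Sutton and Barto, 1998), which uses only a single sampled transition (s, a, r, s') and therefore never requires the transition probabilities explicitly:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

Here α is the learning rate and γ the discount factor. The expectation over next states that SDP would compute from the transition probability matrix is replaced by sampling, which is precisely what allows the method to sidestep the curse of modeling.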
A detailed review of the RL approach and its main algorithms was conducted and presented in this thesis. An RL-based approach was adopted in this research work to develop a stochastic large-scale medium/long term multireservoir operation planning model (RLROM). The objective of the model is to develop operating policies that maximize the expected revenues from energy transactions in a competitive market, while considering the uncertainties, such as future inflows to the reservoirs, and the integrated system operation and physical constraints. In this research, the dispatch of the hydropower generation from the different available resources was based on the incremental production costs, which in turn were based on the concept of the marginal value of water in storage. Accordingly, the value of water ($/cms-day) / energy ($/MWh) was derived as a function of the storage volume of the two major multiyear storage reservoirs on the Peace and Columbia Rivers (GMS and MCA).

The uncertainties in the main random variables were expressed in the form of scenarios. A scenario generation-moment matching approach was adapted to generate a set of scenarios for the inflows, market prices, and electricity loads that preserved the statistical properties, in particular the moments and the correlations, of these multiple stochastic variables.

A hydroelectric reservoir optimization model (GOM) based on a linear programming algorithm, which had been previously developed and was in use by BC Hydro, was integrated with the RLROM model. Instead of interacting with the real system, the RL agent used the GOM as a model of the environment to provide feedback on its actions, and it used this information dynamically to learn the optimum control policy.

6.2. Conclusions

The developed RLROM model was successfully used to handle a large-scale multireservoir operation planning problem.
The developed model was applied to BC Hydro's main reservoirs on the Peace and the Columbia Rivers. The RLROM model was designed to run for a multiyear planning horizon (36 months for the case study considered). This allowed the development of realistic planning policies, given the large amount of multiyear storage in the Peace and Columbia River reservoir systems. The results demonstrated that the model was capable of providing realistic answers to the strategic planning questions, including: What is the value of water in storage in each of the multiyear storage reservoirs? What is the marginal value of water/energy in each of the multiyear storage reservoirs? What are the optimal system control decisions, including when to buy or sell energy, how much to buy or sell, and how much energy to generate from each of the five plants in the system?

The RLROM model presented in this thesis considers the stochastic variables: domestic load, market prices, and natural inflows. The moment matching technique, which was used for generating scenarios of the inflow, price, and electricity load variables, has the great advantage of using only a limited number of scenarios to represent the properties, correlations, and cross correlations of an extensive time series based on historical records.

The marginal values of water/energy for MCA and GMS obtained from the RLROM model represent key information that controls the strategic store/release decisions. In addition, this information feeds into other BC Hydro medium and short term optimization models (for example, GOM and STOM). The shorter term models use this information, together with the target storage, in the trade-off decisions between shorter term and longer term benefits.

At the beginning of the runs, the model operates in a training mode and the agent progressively learns the optimal value function and the optimal control policies for the system.
Once the model results converge, the agent knows the value function and the optimal control policies. At this point, the RLROM model can be used on a continuous time basis to maintain optimum control over any time period. Another useful and practical application of the model is to examine the system behavior under certain forecasts of the random variables and to estimate the corresponding marginal value of energy. This sensitivity analysis is useful for understanding the effects of changes in any of the random variables on the operation policy and on the marginal value of water/energy. For example, being able to estimate the effects of multiyear droughts on system operation is of particular significance to strategic planning for operation of the system.

Reinforcement Learning offered two key advantages in handling the large-scale multireservoir control problem: (1) RL provides a way to avoid the curse of modeling. By using sampling (simulation), RL can be applied to large-scale problems without an explicit evaluation and enumeration of all the transition probabilities and their expected immediate rewards. (2) RL overcomes the curse of dimensionality through the use of function approximation methods. These methods require a limited number of elements to specify a value function, instead of possibly millions of states.

A hydroelectric reservoir optimization model (GOM) based on linear programming was integrated with the RLROM model. In this way, the optimization process was extended to include the other reservoirs on the Peace and the Columbia, namely the Dinosaur, Revelstoke, and Keenleyside reservoirs, in addition to the main ones at GMS and Mica. This link has allowed an efficient on-line interaction between the agent and the environment (the operator and the model) during the iterations. Also, it made it possible to capture the effects of the diurnal variation of price and load on shorter time periods.
In addition, the optimal plant generation schedules, which are based on the optimal unit commitment and loading (represented in GOM by piecewise linear plant generation functions), are captured in the RLROM model.

The results of the case study implemented using the BC Hydro system demonstrated the following:

- The available storage in the GMS and MCA storage reservoirs has a major influence on the storage value, the marginal value, and the optimum control policy. However, the study also demonstrated that there are periods of time and ranges of storage levels in GMS and MCA where this influence is not significant.

- The Peace Canyon River minimum ice flow constraint has a major effect on the release policies and on the marginal values in GMS. Similarly, for the Columbia River system, the required Columbia Treaty operation influences the marginal value of energy in MCA, due to the high releases required from the Keenleyside dam during certain periods of the year.

- The model backs off plant generation levels to store water during light load hours (LLH), to take advantage of the difference between heavy load hour (HLH) and light load hour (LLH) price margins, and it exports more when there is a market opportunity with high market prices. During the summer, however, when the domestic loads are low, the model exports heavily to take advantage of the good market opportunity, particularly in California.

- The moment matching scenario generation results are very encouraging. A sensitivity analysis on the effect of the number of generated scenarios on the mean relative error (MRE) in the moments and correlations determined the appropriate number of scenarios that should be used in the RLROM model runs. The number of generated scenarios demonstrated that this is a very practical approach to obtaining good results without excessive amounts of computing.
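The moment matching and MRE ideas in the last point can be illustrated with a toy example. The full method used in this work (Hoyland, Kaut and Wallace, 2003) matches four moments and the correlation matrix; the sketch below matches only the mean and standard deviation, by an affine rescaling, and then reports the mean relative error of the sample moments against their targets. All numbers are hypothetical illustrations, not the thesis case-study data.

```python
# Toy illustration of moment matching and the mean relative error (MRE)
# check described above. The full method matches four moments and the
# correlation matrix; this sketch matches mean and standard deviation only.
import random
import statistics

def match_two_moments(sample, target_mean, target_std):
    """Affinely rescale a sample so its first two moments match exactly."""
    m, s = statistics.mean(sample), statistics.pstdev(sample)
    return [target_mean + (x - m) * target_std / s for x in sample]

def mre(estimates, targets):
    """Mean relative error of estimated moments against their targets."""
    return sum(abs(e - t) / abs(t) for e, t in zip(estimates, targets)) / len(targets)

random.seed(1)
target_mean, target_std = 46.0, 5.0   # e.g. a market price target ($/MWh)
scenarios = match_two_moments([random.gauss(0, 1) for _ in range(20)],
                              target_mean, target_std)
est = [statistics.mean(scenarios), statistics.pstdev(scenarios)]
print(round(mre(est, [target_mean, target_std]), 9))
```

With only two moments the match is exact for any sample size, so the MRE is zero up to rounding; matching higher moments and cross correlations, as the thesis method does, turns this into a nonlinear optimization in which the MRE shrinks as the number of scenarios grows.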
In summary, it can be concluded that the RL approach is a very effective approach to strategic planning for the operation of large-scale hydroelectric power generation systems. The developed methodology has the potential to significantly improve operational planning for the integrated BC Hydro system.

6.3. Future Work

There are a number of ways in which the techniques developed in this thesis could be enhanced, and areas into which they could be expanded. The following is a proposed list of future related research work that could be carried out to extend and enhance this work:

- Investigate the use of multi-agent (independent/cooperative agents) approaches and the use of parallel processing for running the model, to speed up the learning process and decrease CPU time.

- Investigate other RL approaches, including SARSA, Dyna-Q, and other RL techniques that could potentially integrate the planning and learning processes.

- Generalize the use of the function approximation approach outlined in this research. Neural networks are a possible alternative to the currently adopted piecewise linear function approximation technique.

- Automate the process of selecting the values of the RL parameters (learning rate and exploration/exploitation rate).

- Further develop the RL stochastic optimization model for use as an on-line real time control model.

- Extend the model to include other state variables, such as additional plants, and check their effects on the marginal values of energy.

- Model the Columbia River Treaty operation constraints in more detail.

- Include regional transmission limits when modeling regional loads and resources in the BC Hydro system.

REFERENCES

Abdulhai, B., R. Pringle, and G. J. Karakoulas, "Reinforcement Learning for True Adaptive Traffic Signal Control," Journal of Transportation Engineering, ASCE, 129(3), 278-285, 2003.

Arnold, E., P. Tatjewski, and P.
Wolochowicz, "Two Methods for Large-Scale Nonlinear Optimization and Their Comparison on a Case Study of Hydropower Optimization," Journal of Optimization Theory and Applications, 81(2), 221-248, 1994.

Barros, M., F. Tsai, S.-L. Yang, J. Lopes, and W. Yeh, "Optimization of Large-Scale Hydropower System Operations," Journal of Water Resources Planning and Management, ASCE, 129(3), 178-188, 2003.

BC Hydro, "Making the Connection: the BC Hydro Electric System and How It Is Operated," 2000.

Bellman, R. E., Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.

Benders, J. F., "Partitioning Procedures for Solving Mixed Variables Programming Problems," Numerische Mathematik, 4, 238-252, 1962.

Bertsekas, D. and J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, Mass., 1996.

Bertsekas, D. P., "Neuro-Dynamic Programming," Encyclopedia of Optimization, Kluwer, 2001.

Bhattachaya, B., A. H. Lobbrecht, and D. P. Solomatine, "Neural Networks and Reinforcement Learning in Control of Water Systems," Journal of Water Resources Planning and Management, ASCE, 129(6), 2003.

Birge, J. R., "Decomposition and Partitioning Methods for Multistage Stochastic Programs," Operations Research, 33, 989-1007, 1985.

Birge, J. R. and J. M. Mulvey, "Stochastic Programming," in Mathematical Programming for Industrial Engineers, M. Avriel and B. Golany (eds.), Dekker, New York, 543-574, 1996.

Can, E. K. and M. H. Houck, "Real Time Reservoir Operations by Goal Programming," Journal of Water Resources Planning and Management, ASCE, 110(3), 297-309, 1984.

Cario, M. and B. Nelson, "Modeling and Generating Random Vectors with Arbitrary Marginal Distributions and Correlation Matrix," Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois, 1997.

Changchit, C. and M. P.
Terrell, "CCGP Model for Multiobjective Reservoir System," Journal of Water Resources Planning and Management, ASCE, 115(5), 1989.

Crawely, P. and G. Dandy, "Optimal Operation of Multiple Reservoir Systems," Journal of Water Resources Planning and Management, ASCE, 119(1), 1-17, 1993.

Coloroni, A. and G. Fronza, "Reservoir Management Via Reliability Programming," Water Resources Research, 12(1), 1976.

Crites, R. H. and A. G. Barto, "Improving Elevator Performance Using Reinforcement Learning," in D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, 1017-1023, MIT Press, Cambridge, MA, 1996.

Dantzig, G. B. and G. Infanger, "Intelligent Control and Optimization under Uncertainty with Application to Hydro Power," European Journal of Operational Research, 97, 396-407, 1997.

Davis, R. E. and R. Pronovost, "Two Stochastic Dynamic Programming Procedures for Long Term Reservoir Management," paper presented at the IEEE Power Engineering Society Summer Meeting, IEEE, San Francisco, California, July 9-14, 1972.

Devroye, L., Non-Uniform Random Variate Generation, Springer-Verlag, New York, 1986.

Dorigo, M., E. Bonabeau, and G. Theraulaz, "Ant Algorithms and Stigmergy," Future Generation Computer Systems, 16, 851-871, 2000.

Dupacova, J., G. Consigli, and S. W. Wallace, "Scenarios for Multistage Stochastic Programs," Annals of Operations Research, 100, 25-53, 2000.

Druce, D. J., "Decision Support for Short Term Export Sales from a Hydroelectric System," in Computerized Decision Support Systems for Water Managers, J. Labadie, L. E. Brazil, I. Corbue, and L. E. Johanson (eds.), ASCE, 490-497, 1989.

Druce, D. J., "Incorporating Daily Flood Control Objectives into a Monthly Stochastic Dynamic Programming Model for a Hydroelectric Complex," Water Resources Research, 26(1), 5-11, 1990.
Ernst, D., M. Glavic, and L. Wehenkel, "Power Systems Stability Control: Reinforcement Learning Framework," accepted for publication in IEEE Transactions on Power Systems, http://www.montefiore.ulg.ac.be/~ernst/IEEEtrans-2003.pdf, 2003.

Even-Dar, E. and Y. Mansour, "Learning Rates for Q-Learning," Journal of Machine Learning Research, 5, 1-25, 2003.

Gasser, M., "Reinforcement Learning: Exploitation and Exploration," http://www.indiana.edu/~q320/Notes/rl4.html, 2004.

Giles, J. and W. Wunderlich, "Weekly Multipurpose Planning Model for TVA Reservoir Systems," Journal of Water Resources Planning and Management, ASCE, 107(2), 1-20, 1981.

Goldberg, D., Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Publishing Company, Inc., Reading, Massachusetts, 1989.

Gosavi, A., Simulation-Based Optimization, Kluwer Academic Publishers, 2003.

Grygier, J. and J. Stedinger, "Algorithms for Optimizing Hydropower System Operation," Water Resources Research, 21(1), 1-10, 1985.

Foufoula-Georgiou, E. and P. K. Kitanidis, "Gradient Dynamic Programming for Stochastic Optimal Control of Multidimensional Water Resources Systems," Water Resources Research, 24(8), 1345-1359, 1988.

Fourer, R., D. M. Gay, and B. W. Kernighan, AMPL: A Modeling Language for Mathematical Programming, The Scientific Press Series, 2003.

Halliburton, T., "Development of a Medium Term Planning Model for BPA," NSR Information, 1997.

Hiew, K., J. Labadie, and J. Scott, "Optimal Operational Analyses of the Colorado-Big Thompson Project," in Computerized Decision Support Systems for Water Managers, J. Labadie et al. (eds.), ASCE, New York, 632-646, 1989.

Higle, J. L. and S. Sen, "Stochastic Decomposition: An Algorithm for Two-Stage Linear Programs with Recourse," Mathematics of Operations Research, 16, 650-669, 1991.

Hogan, A. J., J. G. Morris, and H. E.
Thompson, "Decision Problems Under Risk and Chance Constrained Programming: Dilemmas in the Transition," Management Science, 27(6), 698-716, 1981.

Howson, H. R. and N. G. F. Sancho, "A New Algorithm for the Solution of Multistate Dynamic Programming Problems," Mathematical Programming, 8, 104-116, 1975.

Hoyland, K., M. Kaut, and S. W. Wallace, "A Heuristic for Moment-Matching Scenario Generation," Computational Optimization and Applications, 24(2-3), 169-185, 2003.

Hoyland, K. and S. W. Wallace, "Generating Scenario Trees for Multistage Decision Problems," Management Science, 47(2), 295-307, 2001.

Huang, W-C., R. Harboe, and J. J. Bogardi, "Testing Stochastic Dynamic Programming Models Conditioned on Observed or Forecasted Inflows," Journal of Water Resources Planning and Management, ASCE, 117(1), 28-36, 1991.

Infanger, G., Planning Under Uncertainty: Solving Large-Scale Stochastic Linear Programs, Boyd and Fraser, Danvers, 1994.

Jacobson, H. and Q. Mayne, Differential Dynamic Programming, Elsevier, New York, 1970.

Jacobs, J. G., J. Grygier, D. Morton, G. Schulz, K. Staschus, and J. Stedinger, "SOCRATES: A System for Scheduling Hydroelectric Generation Under Uncertainty," in Models for Planning under Uncertainty, Vladimirou et al. (eds.), Baltzer Publishers, Bussum, The Netherlands, 1995.

Jalali, M. R., A. Ashfar, and M. A. Marino, "Improved Ant Colony Optimization," Scientia Iranica, 13(3), 295-302, 2006.

Johnson, S. A., J. R. Stedinger, C. A. Shoemaker, Y. Li, and J. A. Tejada-Guibert, "Numerical Spline Stochastic Dynamic Programming," in Proceedings of the 16th Annual Conference on Water Resources Planning and Management, ASCE, 137-140, 1989.

Johnson, S. A., J. R. Stedinger, C. A. Shoemaker, Y. Li, and J. A. Tejada-Guibert, "Numerical Solution of Continuous-State Dynamic Programs Using Linear and Spline Interpolation," Operations Research, 41(3), 484-500, 1993.

Kaelbling, L. P., M. L. Littman, and A. W.
Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, 4, 237-285, 1996.

Kaut, M. and S. W. Wallace, "Evaluation of Scenario-Generation Methods for Stochastic Programming," http://work.michalkaut.net/CV and studv/SG evaluation.pdf, 2003.

Kelman, J., J. R. Stedinger, L. A. Cooper, E. Hsu, and S. Yuan, "Sampling Stochastic Dynamic Programming Applied to Reservoir Operation," Water Resources Research, 26(3), 1990.

Ko, S-K., D. Fontane, and J. Labadie, "Multiobjective Optimization of Reservoir Systems Operations," Water Resources Bulletin, 28(1), 111-127, 1992.

Kumar, D. N. and M. J. Reddy, "Ant Colony Optimization for Multi-Purpose Reservoir Operation," Water Resources Management, 20(6), 879-898, 2006.

Kracman, D. R., D. C. McKinney, D. W. Watkins, and L. S. Lasdon, "Stochastic Optimization of the Highland Lakes System in Texas," Journal of Water Resources Planning and Management, ASCE, 132(2), 62-70, 2006.

Labadie, J. W., "Dynamic Programming with the Micro-Computer," in Encyclopedia of Microcomputers, A. Kent and J. Williams (eds.), Vol. 5, 275-337, Marcel Dekker, Inc., New York, 1990.

Labadie, J. W., "Reservoir System Optimization Models," Water Resources Update Journal, 107, Universities Council on Water Resources, 1998.

Labadie, J. W., "Optimal Operation of Multireservoir Systems: State-of-the-Art Review," Journal of Water Resources Planning and Management, ASCE, 130(2), 93-111, 2004.

Lamond, B. F. and A. Boukhtouta, "Optimizing Long-Term Hydropower Production Using Markov Decision Processes," International Transactions in Operational Research, 3, 223-241, 1996.

Lamond, B. F. and A. Boukhtouta, "Water Reservoir Applications of Markov Decision Processes," in Handbook of Markov Decision Processes: Methods and Applications, E. Feinberg and A. Schwartz (eds.), Kluwer, 537-558, 2001.

Lamond, B.
F., "Stochastic Optimization of a Hydroelectric Reservoir Using Piecewise Polynomial Approximations," INFOR, 41(1), 51-69, 2003.

Lamond, B. F. and A. Boukhtouta, "A Neural Approximation for the Optimal Control of a Hydroplant with Random Inflows and Concave Revenues," Journal of Energy Engineering, ASCE, 131(1), 72-95, 2005.

Larson, R., State Increment Dynamic Programming, Elsevier, New York, 1968.

Loganathan, G. and D. Bhattacharya, "Goal-Programming Techniques for Optimal Reservoir Operations," Journal of Water Resources Planning and Management, ASCE, 116(6), 820-838, 1990.

Loucks, D. and P. Dorfman, "An Evaluation of Some Linear Decision Rules in Chance-Constrained Models for Reservoir Planning and Operation," Water Resources Research, 11(6), 1975.

Loucks, D. P., J. R. Stedinger, and D. A. Haith, Water Resources System Planning and Analysis, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1981.

Loretan, M., "Generating Market Risk Scenarios Using Principal Component Analysis: Methodological and Practical Considerations," in The Measurement of Aggregate Market Risk, CGFS Publications No. 7, 23-60, Bank for International Settlements, http://www.bis.org/publ.ecsc07.htm, 1997.

Lund, J. and I. Ferreira, "Operating Rule Optimization for Missouri River Reservoir System," Journal of Water Resources Planning and Management, ASCE, 122(4), 287-295, 1996.

Luri, P. M. and M. S. Goldberg, "An Approximate Method for Sampling Correlated Variables from Partially Specified Distributions," Management Science, 44(2), 203-218, 1998.

Martin, Q. W., "Optimal Daily Operation of Surface Water Systems," Journal of Water Resources Planning and Management, ASCE, 113(4), 453-470, 1986.

Murray, D. and S. Yakowitz, "Constrained Differential Dynamic Programming and its Application to Multireservoir Control," Water Resources Research, 15(5), 1017-1027, 1979.

Mousavi, S., M. Karamouz, and M. B.
Menhadj, "Fuzzy-State Stochastic Dynamic Programming for Reservoir Operation," Journal of Water Resources Planning and Management, ASCE, 130(6), 460-470, 2004.

Nash, G., "Optimization of the Operation of a Two-Reservoir Hydropower System," Ph.D. Thesis, Department of Civil Engineering, University of British Columbia, 2003.

Parish, R. S., "Generating Random Deviates from Multivariate Pearson Distributions," Computational Statistics and Data Analysis, 283-295, 1990.

Pearl, J., Heuristics: Intelligent Search Strategies for Computer Problem Solving, Addison-Wesley, 1984.

Pearson, E. S. and J. W. Tukey, "Approximate Means and Standard Deviations Based on Distances between Percentage Points of Frequency Curves," Biometrika, 52(3/4), 533-546, 1965.

Pennanen, T. and M. Koivu, "Integration Quadratures in Discretization of Stochastic Programming," E-Print Series, http://www.speps.info, 2002.

Pereira, M. V. F., "Stochastic Optimization of a Multireservoir Hydroelectric System: A Decomposition Approach," Water Resources Research, 21(6), 779-792, 1985.

Pereira, M. V. F., "Optimal Stochastic Scheduling of Large Hydroelectric Systems," Electrical Power & Energy Systems, 11(3), 161-169, 1989.

Pereira, M. V. F. and L. M. V. G. Pinto, "Multi-stage Stochastic Optimization Applied to Energy Planning," Mathematical Programming, 52(3), 359-375, 1991.

Pereira, M. V. F., N. Campodonico, and R. Kelman, "Application of Stochastic Dual DP and Extensions to Hydrothermal Scheduling," PSRI Technical Report, http://www.psrinc.com.br/reports.asp, 1999.

Pflug, G. C., "Scenario Tree Generation for Multiperiod Financial Optimization by Optimal Discretization," Mathematical Programming, 89(2), 251-271, 2001.

Piekutowski, M. R., T. Litwonowicz, and R. J. Frowd, "Optimal Short-Term Scheduling for a Large-Scale Cascaded Hydro System," IEEE Transactions on Power Systems, 9(2), 292-298, 1993.

Picacardi, C. and R.
Soncini, "Stochastic Analysis Program (SAP) for SDP Optimization," Dept. of Environmental Engineering, Cornell University, Ithaca, N.Y., 1991.

Pronovost, R. and J. Boulva, "Long-Range Operation Planning of a Hydro-Thermal System: Modeling and Optimization," Meeting of the Canadian Electrical Association, Toronto, Ont., March 13-17, 1978.

Raman, H. and V. Chandramouli, "Deriving a General Operating Policy for Reservoirs Using Neural Networks," Journal of Water Resources Planning and Management, 122(5), 342-347, 1996.

Reis, L. F. R., G. A. Walters, D. Savic, and F. H. Chaudary, "Multi-Reservoir Operation Planning Using Hybrid Genetic Algorithm and Linear Programming (GA-LP): An Alternative Stochastic Approach," Water Resources Management, 19(6), 831-848, 2005.

Reznicek, K. and T. C. E. Cheng, "Stochastic Modeling of Reservoir Operations," European Journal of Operational Research, 50, 235-248, 1991.

Robbins, H. and S. Monro, "A Stochastic Approximation Method," Annals of Mathematical Statistics, 22, 400-407, 1951.

Romisch, W. and H. Heitsch, "Scenario Reduction Algorithms in Stochastic Programming," Computational Optimization and Applications, 24(2-3), 187-206, 2003.

Rotting, T. A. and A. Gjelsvik, "Stochastic Dual Dynamic Programming for Seasonal Scheduling in the Norwegian Power System," IEEE Transactions on Power Systems, 7(1), 273-279, 1992.

Russell, S. O. and P. E. Campbell, "Reservoir Operating Rules with Fuzzy Programming," Journal of Water Resources Planning and Management, ASCE, 122(3), 165-170, 1996.

Saad, M., A. Turgeon, P. Bigras, and R. Duquette, "Learning Disaggregation Technique for the Operation of Long-Term Hydrotechnical Power Systems," Water Resources Research, 30(11), 3195-3202, 1994.

Singh, S. and D. Bertsekas, "Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems," in Advances in Neural Information Processing Systems: Conference Proceedings, MIT Press, Cambridge, Mass., 1996.

Shawwash, Z. K., T.
Siu, and S. Russel, "The BC Hydro Short Term Hydro Scheduling Optimization Model," IEEE Transactions on Power Systems, 2000.

Shawwash, Z. K., "A Decision Support System for Real-Time Hydropower Scheduling in a Competitive Power Market Environment," Ph.D. Thesis, Dept. of Civil Engineering, UBC, 2000.

Sutton, R. S., "Reinforcement Learning," in R. Wilson and F. Keil (eds.), The MIT Encyclopedia of the Cognitive Sciences, MIT Press, 1999, ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/Sutton-99-MITECS.pdf.

Sutton, R. S. and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Massachusetts, 1998.

Sutton, R. S., "Learning to Predict by the Methods of Temporal Differences," Machine Learning, 3, 9-44, 1988.

Takeuchi, K., "Chance-Constrained Model for Real-Time Reservoir Operation Using Drought Duration Curve," Water Resources Research, 22(4), 1986.

Tejada-Guibert, J., J. Stedinger, and K. Staschus, "Optimization of the Value of CVP's Hydropower Scheduling," Journal of Water Resources Planning and Management, ASCE, 116(1), 53-71, 1990.

Tejada-Guibert, J. A., S. Johnson, and J. R. Stedinger, "Comparison of Two Approaches for Implementing Multireservoir Operating Policies Derived Using Stochastic Dynamic Programming," Water Resources Research, 29(12), 3969-3980, 1993.

Tejada-Guibert, J. A., S. Johnson, and J. R. Stedinger, "The Value of Hydrologic Information in Stochastic Dynamic Programming Models of a Multireservoir System," Water Resources Research, 31(10), 2571-2579, 1995.

Tesauro, G. J., "Temporal Difference Learning and TD-Gammon," Communications of the ACM, 38, 58-68, 1995.

Tilmant, A., E. Persoons, and M. Vanclooster, "Deriving Efficient Reservoir Operating Rules Using Flexible Stochastic Dynamic Programming," Proc. 1st Int. Conf. on Water Resources Management, WIT Press, U.K., 353-364, 2001.

Tsitsiklis, J.
N , "Asynchronous Stochastic Approximation and Q-Learning," Machine Learning (16), 185-202, 1994. Turgeon A . , " Optimal operation o f Multireservoir Power Systems with Stochastic Inflows," Water Resources Research, 16(2), 275-283, 1980. Turgeon, A . , " Optimal Short-Term Hydro Scheduling F r o m the Principle o f Progressive Optimality," Water Resources Research, 17(3), 481-486, 1981. Turgeon, A . , " A n Application o f Parametric Mixed-Integer Linear Programming to Hydropower Development," Water Resources Research, 23(3), 1987. 180 Turgeon A . , "Solving a Stochastic reservoir management problem with multilag autocorrelated inflows," Water Resources Research, 41(12), W12414, 2005. Unver, O. and L . Mays, " M o d e l for Real-Time Optimal Flood Operation o f a Reservoir System," Water Resources Management, 4, 21-46, 1990. Valdes, J. B . , Montburn-Di Filippo, J., Strzepek, K . M . , and Restepo, P. J., "AggregationDisaggregation Approach to Multireservoir Operation," Journal of Water Resources Planning and Management, 118(4), 423-444, 1992. V a n Slyke, R. and R . J . - B . Wets, " L-Shaped Linear Programs with Applications to Optimal Control and Stochastic Programming, SIAM J. Appl. Math., 17, 638-663, 1969. Vogel, R. M . and J. R. Stedinger, "The value o f streamflow models i n overyear reservoir design applications," Water Resources Research, 24(9), 1483-1490, 1988. Wardlaw R. and M . Sharif, " Evaluation o f Genetic Algorithms for Optimal Reservoir System Operation," Journal of Water Resources Planning and Management, 125(1), 25-33, 1999. Watkins, C . J. C . H . Learning from Delayed Rewards. Ph.D. Thesis. K i n g ' s College, University o f Cambridge, Cambridge, U K , 1989. Wen-Ceng Huang, Y u a n Lun-Chin, and Lee C h i - M i n g , " L i n k i n g Genetic Algorithms with Stochastic Dynamic Programming to Long-Term Operation o f a Multireservoir System," Water Resources Research, 38(12), 1304, 2002. Wilson, G . 
, "Reinforcement Learning: A N e w Technique for the Real-Time Optimal Control o f Hydraulic Networks," Proc. Hydroinformatics Conf, Balkema, The Netherlands, 1996. Yakowitz S., "Dynamic Programming Applications in Water Resources," Water Resources Research, 18(4), 673-696, 1982. 181 Y e h , W . W - G . , " Reservoir management and operation models: A State-of-the-art Review," Water Resources Research, 21(12), 1797-1818, 1985. Young, G . K . , "Finding Reservoir Operation Rules," J. Hydraul. Div., A S C E , 93(HY6), 297-321, 1967. Zhang, W . and Dietterich, T. G . , "High-performance job-shop scheduling with a t i m e delay T D (1) network," Advances Proceedings in Neural of the 1995 Conference, Information Processing Systems: pages 1024-1030, Cambridge, M A . M I T Press, 1996. 182 APPENDIX A ^ L E A R N I N G NUMERICAL EXAMPLE Let us define a Markov chain with two states sj, s and at each state two actions aj, a 2 2 are allowed: States = {sj, s } 2 Actions: A(s,) - {a , a} n n , A(s ) = {a 2 a) 2h 22 Rewards: R(s,a) = r (si, a , , ) , r (si, ai ), r (s , a i), r (s , a ) t t 2 t 2 2 t 2 22 1 2 The transition probability matrix ( T P M ) associated with action 1 and 2 are P and P : t 0.7 0.3 0.9 0.1 0.4 0.6 0.2 0.8 t 1 2 The transition reward matrix ( T R M ) associated with action 1 and 2 are R and R : t R 6 -5 7 12 Rf 10 17 -14 13 t The objective is to maximize the expected rewards. A discount factor o f 0.8 is assumed in this example. The learning rate oris defined as: B/n(s,a), where B is a constant assumed 0.1 and n is the number o f state-action pair visits. The exploration rate £ is assumed C/t, where C is a constant assumed 0.9 and t is the number o f time periods. In the beginning, the learning agent tends to explore more and select non-greedy actions. A s such, it randomly select with a probability £ a different action from the action suggested by the policy learned so far Ms), where tt(s) = argmax Q*(s, a). 
As the number of episodes grows, the learning agent gradually turns greedy and selects the best estimated action at state s most of the time, i.e., with probability 1 - ε. Calculations of the first ten steps are presented in Table A.1. The following is a detailed description of the calculation procedure in this Q-Learning example:

Period 1:
- initialize all state-action values (the Q-table) to 0
- randomly select an initial state; the selected state is s1
- calculate the exploration rate ε = C/t = 0.9/1.0 = 0.9
- randomly sample an action with probability 0.9; the selected action is a1
- update the number of visits to state s1 and action a1: n(s1, a1) = 1
- simulate action 1 and observe the transition to the next state and the associated reward. The next state is s2 and the reward r(s1, a1) is -5
- the learning rate α = B/n(s1, a1) = 0.1/1.0 = 0.10
- update the Q-table using the equation:

Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]

Q(s1, a1) ← 0 + 0.1[-5 + 0.8 × max{0, 0} - 0] = -0.50

Period 2:
- the current state is s2; calculate the exploration rate ε = C/t = 0.9/2.0 = 0.45
- sample the actions to select the best (greedy) action with probability 0.55 and a random action with probability 0.45; the selected action is a2
- update the number of visits to state s2 and action a2: n(s2, a2) = 1
- simulate the action and observe the transition to the next state and the associated reward. The next state is s2 and the immediate reward r(s2, a2) is 13
- the learning rate α = B/n(s2, a2) = 0.1/1.0 = 0.10
- update the Q-table: Q(s2, a2) ← 0 + 0.1[13 + 0.8 × max{0, 0} - 0] = 1.30

...

Period 10:
- the current state is s2; calculate the exploration rate ε = C/t = 0.9/10.0 = 0.09
- sample the actions to select the best (greedy) action with probability 0.91 and a random action with probability 0.09. The best action at state s2 is a1, since Q(s2, a1) = 0.7 > Q(s2, a2) = 0.607; select action a1
- update the number of visits to state s2 and action a1: n(s2, a1) = 2
- simulate the action and observe the transition to the next state and the associated reward. The next state is s1 and the immediate reward r(s2, a1) is 7
- the learning rate α = B/n(s2, a1) = 0.1/2.0 = 0.05
- update the Q-table: Q(s2, a1) ← 0.7 + 0.05[7 + 0.8 × max{-0.103, 3.016} - 0.7] = 1.136

Table A.1. Sample calculations for the Q-Learning example

| Period | State | ε      | Action | Next state | Reward | α     | Q(s1,a1) | Q(s1,a2) | Q(s2,a1) | Q(s2,a2) |
|--------|-------|--------|--------|------------|--------|-------|----------|----------|----------|----------|
| 1      | s1    | 0.9000 | a1     | s2         | -5     | 0.100 | -0.500   | 0        | 0        | 0        |
| 2      | s2    | 0.4500 | a2     | s2         | 13     | 0.100 | -0.500   | 0        | 0        | 1.300    |
| 3      | s2    | 0.3000 | a1     | s1         | 7      | 0.100 | -0.500   | 0        | 0.700    | 1.300    |
| 4      | s1    | 0.2250 | a2     | s2         | 17     | 0.100 | -0.500   | 1.804    | 0.700    | 1.300    |
| 5      | s2    | 0.1800 | a2     | s1         | -14    | 0.050 | -0.500   | 1.804    | 0.700    | 0.607    |
| 6      | s1    | 0.1500 | a1     | s1         | 6      | 0.050 | -0.103   | 1.804    | 0.700    | 0.607    |
| 7      | s1    | 0.1286 | a2     | s1         | 10     | 0.050 | -0.103   | 2.286    | 0.700    | 0.607    |
| 8      | s1    | 0.1125 | a2     | s1         | 10     | 0.033 | -0.103   | 2.604    | 0.700    | 0.607    |
| 9      | s1    | 0.1000 | a2     | s2         | 17     | 0.025 | -0.103   | 3.016    | 0.700    | 0.607    |
| 10     | s2    | 0.0900 | a1     | s1         | 7      | 0.050 | -0.103   | 3.016    | 1.136    | 0.607    |
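The ten hand-calculated periods can be checked mechanically by replaying the logged transitions through the Q-update rule. The sketch below is my own code, not from the thesis: states and actions are 0-indexed, and rather than simulating the stochastic transitions it replays the (state, action, next state, reward) tuples read off Table A.1. The resulting Q-table matches the table's values, with small discrepancies in the last two periods that appear to come from rounding of intermediate values in the thesis.

```python
GAMMA = 0.8   # discount factor
B = 0.1       # learning-rate constant, alpha = B / n(s, a)

# (state, action, next_state, reward) for periods 1..10, from Table A.1
trace = [
    (0, 0, 1,  -5),   # period 1
    (1, 1, 1,  13),   # period 2
    (1, 0, 0,   7),   # period 3
    (0, 1, 1,  17),   # period 4
    (1, 1, 0, -14),   # period 5
    (0, 0, 0,   6),   # period 6
    (0, 1, 0,  10),   # period 7
    (0, 1, 0,  10),   # period 8
    (0, 1, 1,  17),   # period 9
    (1, 0, 0,   7),   # period 10
]

Q = [[0.0, 0.0], [0.0, 0.0]]   # Q-table, initialized to zero
n = [[0, 0], [0, 0]]           # visit counts n(s, a)

for s, a, s_next, r in trace:
    n[s][a] += 1
    alpha = B / n[s][a]
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    td_target = r + GAMMA * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

print(Q)
```

For example, the first iteration reproduces Period 1 exactly: Q(s1, a1) = 0 + 0.1 × (-5 + 0.8 × 0 - 0) = -0.50.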