A REINFORCEMENT LEARNING ALGORITHM FOR OPERATIONS PLANNING OF A HYDROELECTRIC POWER MULTIRESERVOIR SYSTEM

by

ALAA EATZAZ ABDALLA

B.Sc., Ain Shams University, 1984
M.A.Sc., Katholieke Universiteit Leuven, 1990

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (CIVIL ENGINEERING)

THE UNIVERSITY OF BRITISH COLUMBIA

April 2007

© Alaa Eatzaz Abdalla, 2007

ABSTRACT

The main objective of reservoir operations planning is to determine the optimum operation policies that maximize the expected value of the system resources over the planning horizon. This control problem is challenged by the different sources of uncertainty that a reservoir system planner has to deal with. In the reservoir operations planning problem, there is a trade-off between the marginal value of water in storage and the electricity market price. The marginal value of water is itself uncertain and depends largely on the storage in the reservoir and on storage in other reservoirs as well. The challenge is how to deal with this large-scale multireservoir problem under these uncertainties.

This thesis presents a novel methodology for establishing a good approximation of the optimal control of a large-scale hydroelectric power system using Reinforcement Learning (RL). RL is an artificial intelligence approach to machine learning that offers key advantages in handling problems that are too large to be solved by conventional dynamic programming methods. In this approach, a control agent progressively learns the optimal strategies that maximize rewards through interaction with a dynamic environment. The thesis introduces the main concepts and computational aspects of using RL for the multireservoir operations planning problem. A scenario generation-moment matching technique was adopted to generate a set of scenarios for the natural river inflow, electricity load, and market price random variables. In this way, the statistical properties of the original distributions are preserved.

The developed reinforcement learning reservoir optimization model (RLROM) was successfully applied to the BC Hydro main reservoirs on the Peace and Columbia Rivers. The model was used to derive optimal control policies for this multireservoir system, to estimate the value of water in storage, and to establish the marginal value of water/energy. The RLROM outputs were compared to the classical method of optimizing reservoir operations, namely stochastic dynamic programming (SDP), and the results for one- and two-reservoir systems were identical. The results suggest that the RL model is much more efficient at handling large-scale reservoir operations problems and can give a very good approximate solution to this complex problem.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
GLOSSARY
ACKNOWLEDGEMENTS
1. INTRODUCTION
1.1. BACKGROUND
1.2. PROBLEM STATEMENT
1.3. GOALS AND OBJECTIVES
1.4. ORGANIZATION OF THE THESIS
2. LITERATURE REVIEW
2.1. MODELING APPROACHES
2.2. OPTIMIZATION TECHNIQUES
2.2.1. Deterministic Optimization Models
2.2.2. Stochastic Optimization Models
2.2.3. Heuristic Models
2.3. SAMPLING TECHNIQUES
2.3.1. Time Series Models
2.3.2. Conditional Sampling
2.3.3. Sampling from Specified Marginals and Correlations
2.3.4. Moment Matching
2.3.5. Path Based Methods
2.3.6. Optimal Discretization
2.3.7. Scenario Reduction
2.3.8. Interior Sampling Methods
2.4. RL APPROACH
2.5. CONCLUSIONS
3. THE REINFORCEMENT LEARNING APPROACH
3.1. INTRODUCTION
3.2. REINFORCEMENT LEARNING APPROACH VERSUS DYNAMIC PROGRAMMING
3.3. REINFORCEMENT LEARNING PROBLEM FORMULATION
3.3.1. RL Basic Elements
3.3.2. Exploration/Exploitation
3.3.3. Return Functions
3.3.4. Markovian Decision Process (MDP)
3.4. REINFORCEMENT LEARNING ALGORITHMS
3.4.1. Dynamic Programming Methods (DP)
3.4.2. Monte Carlo Methods (MC)
3.4.3. Temporal Difference (TD)
3.5. FUNCTION APPROXIMATION
3.6. ON-LINE AND OFF-LINE
3.7. SUMMARY
4. REINFORCEMENT LEARNING MULTIRESERVOIR OPTIMIZATION MODEL (RLROM) DEVELOPMENT
4.1. INTRODUCTION
4.2. MATHEMATICAL FORMULATION
4.2.1. SDP Formulation
4.2.2. RL Formulation
4.2.3. State Space
4.2.4. Action (Decision Variable)
4.2.5. Exploration/Exploitation Rate
4.2.6. Learning Rate
4.2.7. Rewards
4.2.8. Stochastic Modeling of Random Variables
4.2.9. Lookup Tables and Function Approximation
4.2.10. Model of the Environment
4.3. SOLUTION ALGORITHM
4.3.1. RLROM Model Algorithm Using Lookup Tables
4.3.2. The RL Model Algorithm Using Function Approximation
4.3.3. Convergence
4.4. SUMMARY
5. MODEL TESTING AND IMPLEMENTATION
5.1. INTRODUCTION
5.2. SINGLE RESERVOIR MODEL - TEST CASE
5.2.1. Problem Description
5.2.2. SDP Mathematical Formulation
5.2.3. RL Mathematical Formulation
5.2.4. Solution Steps of Q-Learning Algorithm
5.2.5. Establishing the Parameters of the RL Algorithm
5.2.6. Results
5.3. TWO RESERVOIR MODEL - TEST CASE
5.3.1. System Modeling
5.3.2. Results and Analysis
5.3.3. Efficiency of the RL Algorithm
5.4. RLROM MULTIRESERVOIR MODEL IMPLEMENTATION - CASE STUDY
5.4.1. Description of the Case Study
5.4.2. Input Data
5.4.3. Scenario Generation
5.4.4. Results of the RLROM Model Runs
5.5. SUMMARY
6. CONCLUSIONS AND RECOMMENDATIONS
6.1. SUMMARY
6.2. CONCLUSIONS
6.3. FUTURE WORK
REFERENCES
APPENDIX A: Q-LEARNING NUMERICAL EXAMPLE

LIST OF FIGURES

Figure 3.1. Reinforcement learning and dynamic programming comparison (adopted from Gosavi, 2003)
Figure 3.2. Agent-environment interaction in RL
Figure 3.3. Comparison of different RL approaches (adopted from Sutton and Barto, 1998)
Figure 4.1. RL agent-environment interface in learning mode
Figure 4.2. Learning rate and exploitation rate
Figure 4.3. PWL approximation of the storage value function
Figure 4.4. 3-D view of the PWL value of storage function as a function of GMS and Mica storage
Figure 4.5. 2-D view of the PWL value of storage function as a function of GMS and Mica storage
Figure 4.6. PWL value of storage function as a function of GMS storage at Mica target storage
Figure 4.7. Piecewise linear hydropower production function
Figure 4.8. RLROM model algorithm using lookup tables
Figure 4.9. RLROM model algorithm using function approximation
Figure 4.10. Schematic representation of RLROM model algorithm using function approximation
Figure 5.1. Effect of varying the exploitation rate
Figure 5.2. Convergence of Q-learning solution applying different exploration-exploitation policies
Figure 5.3. Performance of different ε-greedy policies
Figure 5.4. Performance of different learning rates
Figure 5.5. Convergence of the value function
Figure 5.6. Value function obtained by RL and SDP at time periods 6 and 12
Figure 5.7. Optimal policies derived by the RL and SDP models at time periods 6 and 12
Figure 5.8. Convergence of the RL model
Figure 5.9. GMS storage value function as a function of MCA storage
Figure 5.10. Simulated optimal operation planning policy
Figure 5.11. Comparison of the RL and SDP CPU time
Figure 5.12. Average monthly reservoirs local natural inflows
Figure 5.13. Normalized cumulative monthly distribution of the average annual reservoirs local natural inflows
Figure 5.14. Average monthly market price in each sub-time step
Figure 5.15. Historical monthly distribution of % of system load
Figure 5.16. Correlation between The Dalles inflow volume and the monthly Peace and Columbia inflows
Figure 5.17. Probability distribution function of forecasted system load
Figure 5.18. Testing the quality of the different numbers of generated scenarios
Figure 5.19. Comparison between the Peace inflow historical data and the generated scenarios statistics
Figure 5.20. Storage value function in December
Figure 5.21. Storage value function in May
Figure 5.22. Storage value function in August
Figure 5.23. Mica storage value function in December
Figure 5.24. Monthly GMS storage value function with Mica reservoir storage level at 50%
Figure 5.25. MCA marginal value of energy in December
Figure 5.26. GMS marginal value of energy in December
Figure 5.27. GMS marginal value of energy in December
Figure 5.28. MCA marginal value of energy in May
Figure 5.29. GMS marginal value of energy in May
Figure 5.30. GMS marginal value of energy in August
Figure 5.31. MCA monthly marginal value of energy for a low storage level at MCA (MCA storage = 25,000 cms-d)
Figure 5.32. MCA monthly marginal value of energy for a high storage level at MCA (MCA storage = 285,000 cms-d)
Figure 5.33. Marginal value of KNA storage constraint
Figure 5.34. GMS monthly marginal value of energy for an empty GMS reservoir storage (GMS storage = 55,000 cms-d)
Figure 5.35. GMS monthly marginal value of energy for a full GMS reservoir storage (GMS storage = 475,000 cms-d)
Figure 5.36. GMS monthly marginal value of energy for a low storage level at MCA (MCA storage = 125,000 cms-d)
Figure 5.37. GMS monthly marginal value of energy for a high storage level at MCA (MCA storage = 285,000 cms-d)
Figure 5.38. GMS monthly marginal value of energy for a low storage level at MCA (MCA storage = 125,000 cms-d)
Figure 5.39. GMS monthly marginal value of energy for a high storage level at MCA (MCA storage = 285,000 cms-d)
Figure 5.40. January net market transactions
Figure 5.41. August net market transactions
Figure 5.42. Monthly net market transactions for GMS storage of 55,000 cms-d
Figure 5.43. Monthly net market transactions for GMS storage of 475,000 cms-d
Figure 5.44. Monthly net market transactions for MCA storage of 125,000 cms-d
Figure 5.45. Monthly net market transactions for MCA storage of 285,000 cms-d
Figure 5.46. Plants generation in January at different MCA storage increments with GMS storage at 55,000 cms-d
Figure 5.47. Plants generation in January at different MCA storage increments with GMS storage at 475,000 cms-d
Figure 5.48. Plants average turbine releases in January at different MCA storage increments with GMS storage at 55,000 cms-d
Figure 5.49. Plants average turbine releases in January at different MCA storage increments with GMS storage at 475,000 cms-d
Figure 5.50. Monthly plants generation
Figure 5.51. Market price and marginal value of energy
Figure 5.52. Load resource balance
Figure 5.53. Annual distribution of the five main plants generation
Figure 5.54. Net market transactions for three price scenarios

LIST OF ABBREVIATIONS

MDP: Markovian decision process
MM: Moment matching
MRE: Mean relative error
MW: Megawatt
MVW: Marginal value of water
NDP: Neuro-dynamic programming
NFM: Network flow model
NLP: Non-linear programming
pdf: Probability density function
PWL: Piecewise linear function
RL: Reinforcement learning
RLROM: Reinforcement learning reservoir optimization model
RP: Reliability programming
RTC: Real time control
SDP: Stochastic dynamic programming
SDDP: Stochastic dual dynamic programming
SLP: Stochastic linear programming
SLPR: Stochastic linear programming with recourse
STOM: Short term optimization model
TD: Temporal difference
TPM: Transition probability matrix
TRM: Transition reward matrix
WK: Weekday
WKPK: Weekday peak load hours
WKLO: Weekday light load hours
WKHI: Weekday high load hours
WE: Weekend

GLOSSARY

Agent: A learner, controller, or decision maker.
Age of the agent: The number of iterations performed by the agent.
Capacity: The rated power output of a power plant, normally measured in MW.
Demand: The rate at which electric energy is delivered to or by a system, generally expressed in MW, or averaged over a period of time (e.g., MWh).
Energy: The amount of electricity produced or used over a period of time, usually in MWh.
Environment: The system with which the agent interacts by providing a control signal; it could be a real system or a model of the real system.
Episode: Trial, iteration, or cycle; represents the planning period. Each episode ends at a terminal time period T.
Greedy policy: A policy that always follows the best action.
ε-greedy policy: A policy that follows the best action with probability ε while exploring new actions with probability 1 - ε.
Planning horizon: The number of stages/time periods.
Operating reserve: A specific level of reserve power that should be available at all times to ensure reliable electricity grid operation.
Step size: Learning rate parameter α.
Sub-time step: A subdivision of the time step used to capture the variation in certain parameters or variables at a shorter time increment.
Time step: The planning period T is subdivided into a number of periods, which are called the time steps or stages t.
Target: A desirable direction to move in the next step.
Terminal stage: The last period in an iteration (the end of the planning horizon).

ACKNOWLEDGEMENTS

I greatly acknowledge the support and invaluable insights of my thesis advisor, Dr. Ziad Shawwash. Without his guidance this thesis would probably never have been completed. My understanding of topics relating to this research has been shaped by his guidance throughout the course of this research. I wish to extend my special thanks to Dr. S.O. Denis Russell for his wise advice and suggestions. I would also like to express my sincere thanks to Dr. A.D. Russell, Dr. J.A. Meech, Dr. R.G. Millar, and Dr. T.A. Sayed for their helpful comments and feedback throughout the course of this research.

Special appreciation is due to Dr. Thomas Siu for his continuous help and support. This thesis has benefited from his expertise, deep understanding, and explanations of operations planning of BC Hydro reservoir systems. I am also thankful to my colleagues at BC Hydro: D. Smith, D. Druce, Dr. G. Nash, H. Louie, W. Cheng, and Dr. G. Sreckovic for their valuable assistance and support.
Interactions with UBC students have also been fruitful. I would like to acknowledge J. Li, Y. Tang, and R. Mazariegos for their help.

Special thanks are due to Dr. B. Abdulhai. My understanding of Reinforcement Learning benefited from the discussions I had with him and from his knowledge of artificial intelligence.

I am very grateful to my wife, Dalai, for her love, patience, and support during the entire period of my study. Thanks to my son Aly for bringing me great joy in stressful times. Lastly, and most importantly, I wish to thank my parents, Zeinab and Eatzaz, for their great care and encouragement during my studies and throughout my life.

1. INTRODUCTION

1.1. Background

In British Columbia, there are a large number of dams distributed throughout the different basins and watersheds of the province. Most of the large dams in B.C. serve hydroelectric power purposes. The British Columbia Hydro and Power Authority (BC Hydro) operates a hydro-dominated power generation system that includes 30 hydroelectric generating stations providing approximately 90 per cent of the system installed capacity (11,100 MW). These plants can be placed into four main groups: the Peace system (2 plants), producing about 34% of the energy requirements; the Columbia system (3 plants), producing 31%; the Kootenay Canal and Seven Mile generating stations, producing 13%; and the remaining 23 plants, which supply about 16% of the energy production. The balance of energy requirements is supplied from two thermal generating facilities and from energy purchases. The BC Hydro integrated system produced about 54,000 gigawatt-hours in 2004.

The main hydroelectric storage facilities in the BC Hydro system are the Williston reservoir (39.4 billion m³) on the Peace River and the Kinbasket reservoir (14.8 billion m³) on the Columbia River. The two reservoirs provide multi-year storage and, accordingly, they are used for the strategic and long-term operations planning of the BC Hydro system.

The reservoir systems in B.C. provide many important benefits in addition to hydroelectric power production. These include domestic and industrial water supply, flood control, and recreation. During the operation of these reservoir systems, conflicts may arise between some of these uses, particularly in periods of sustained drought or storage deficits. Also, new pressures have been imposed on the reservoir system due to expanding energy needs, increasing water demands, and growing public awareness of environmental issues. In addition, publicly owned reservoir systems often have to deal with complex legal agreements (e.g., the Columbia River Treaty), fisheries and non-power release requirements, federal/provincial regulations (e.g., water licenses), and pressures from different interest groups. Recently, increased attention has been given to improving the economic and operating efficiency of the existing reservoir systems.

The electrical transmission network in British Columbia is interconnected with the Alberta and Western U.S. systems. These interconnections allow for the purchase and sale of electricity in the wholesale electricity market, and therefore complicate the reservoir operations planning problem, since it must incorporate maximizing the profit from energy transactions in these two markets.
The complexities of multipurpose, multiple reservoir systems generally require that release decisions be determined by an optimization and/or simulation model. Computer simulation models have been applied for many years as powerful tools in reservoir systems management and operation. These models are descriptive of the behavior of the system (e.g., releases, storage) given a specified scenario (e.g., the sequence of flow data, storage, demands, priorities, and constraints), and they are able to illustrate the changes resulting from alternative scenarios. They are also useful in examining the long-term reliability of a proposed operating strategy when combined with Monte Carlo analysis. However, they are not well suited to developing the optimal strategy, particularly for large-scale systems.

Prescriptive optimization models, on the other hand, offer the capability to systematically derive optimal solutions given a specific objective and a set of constraints. The operating policies developed with optimization procedures often need to be checked with a simulation model in order to ensure their feasibility.

In BC Hydro, the operations planning process is carried out using several simulation and optimization models. This modeling system is divided in terms of time horizons. The cascaded operations planning modeling hierarchy in BC Hydro is categorized as follows:
- Long-term operations planning (strategic planning, 1-6 years in monthly time steps).
- Medium-term operations planning (strategic/tactical planning, up to a 12 month period in hourly to weekly time steps).
- Short-term operations planning (tactical planning, for one week in hourly to daily time steps).
- Real-time operations planning (1 hour to 1 week in hourly time steps).

For long-term planning, a marginal cost model (MCM) was developed in-house at BC Hydro by Druce (1989, 1990) and is kept up to date. The model applies stochastic dynamic programming (SDP) to calculate the monthly marginal value of water stored in the Williston reservoir, the largest reservoir in the system, over a planning period of 4 to 6 years. The SDP model takes into account the uncertainties in inflows and market prices. The calculated marginal value of water stored in the Williston reservoir is then used as a proxy for the long-term system marginal cost. As well, the model develops a probabilistic forecast of BC Hydro system price signals, reservoir storage volumes, and expected spills.

Shawwash et al. (2000) developed a short-term optimization model (STOM) that is used by BC Hydro operations engineers to determine the optimal hourly generation and trading schedules while meeting system demands. The model applies a linear programming technique in which the inflows, prices, and system loads are deterministic within the short-term period. The model has been modified and extended to the medium-term generalized optimization model (GOM), which is capable of handling longer planning periods ranging from hourly to monthly time steps. The model also allows for variable time steps that can be specified at daily, weekly, and monthly intervals, and for sub-time steps within a time step. Hence, GOM has the advantage of capturing variations in load, market prices, and generation schedules for shorter periods within the time step during weekdays or weekends.

Nash (2003) developed a stochastic optimization model for the Columbia and Peace reservoir systems (aggregated to two reservoirs).
The model consists mainly of two sub-models. The first is a long-term model, in which the monthly storage value function and the marginal values of water derived by a DP-LP based model are passed to the second model. The second is a shorter-term model that applies a stochastic linear programming with recourse (SLPR) algorithm in which the inflows, prices, and demands are defined by a scenario tree. The storage value curves generated by the long-term model are used as terminal value functions in the shorter-term model. The outputs of the shorter-term model for the two aggregated reservoir systems are the refined marginal values over the shorter periods and the operating decisions for each period.

The present research work builds upon, extends, and enhances the above techniques and develops a new approach to solve the operations planning problem for multireservoir systems.

1.2. Problem Statement

The main objective of reservoir operations planning is to determine the release policies that maximize the expected value of the system resources over the planning period. This objective is generally achieved by maximizing the value of the energy produced while meeting firm domestic demands and taking into consideration the value of stored water at the end of the planning horizon. The reservoir operating strategy should provide answers to these questions at every decision point in time: when and how much water to store or release, and when, where, and what quantity of energy to trade in the competitive wholesale electricity market while meeting the firm domestic demands and non-power requirements.

This control problem is challenged by three main sources of uncertainty with which a reservoir system operator has to deal. The first is the inflows to the reservoir system. The system operation policy has to protect against situations of shortage where the system is unable to meet the demands in a dry year. It also has to be capable of storing any sudden increase in the inflows while avoiding wasteful water spills. In addition, there is uncertainty in forecasting electricity demand, which affects the amount of energy generated, since the firm demand must be met regardless of cost. Finally, there is uncertainty in market prices, which vary seasonally, daily, and between weekdays and weekends.

The marginal value of water represents the value of an extra increment of storage in a given reservoir ($/m³). In the reservoir operations optimization problem, there is a trade-off between the marginal value of water and the present market price. If the market price is higher than the marginal value of water, then it is more profitable to sell energy. On the other hand, when the market price is lower than the marginal value of water, then it is more profitable to purchase energy from the market. The marginal value of water is equal to the derivative of the value of water function with respect to storage. The optimal use of the water in the reservoir corresponds to the point that minimizes the sum of immediate and future costs. This is also where the derivatives of the immediate cost function and the future cost function with respect to storage become equal (Pereira et al., 1989).
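To make this trade-off concrete, the condition described above can be stated for a single reservoir as follows. This is a schematic restatement of the argument in Pereira et al. (1989) in generic notation (storage S, inflow I, release R), not the formulation developed later in this thesis:

$$
\min_{R_t}\ \Big[\, IC_t(R_t) + FC_{t+1}(S_{t+1}) \,\Big],
\qquad S_{t+1} = S_t + I_t - R_t - \mathrm{Spill}_t ,
$$

$$
\text{at the optimum:}\qquad
\frac{\partial IC_t}{\partial S_{t+1}} \;=\; -\,\frac{\partial FC_{t+1}}{\partial S_{t+1}} .
$$

Here $IC_t$ is the immediate cost (which falls as more water is released and less energy has to be bought) and $FC_{t+1}$ is the future cost of ending the period with storage $S_{t+1}$. The quantity $-\partial FC_{t+1}/\partial S_{t+1}$ is the marginal value of water: storing one more unit is worthwhile only while this value exceeds what the same energy would earn at the current market price.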
To derive a set of realistic control policies and better estimate the value of system resources, there is a need to integrate BC Hydro's main reservoir systems into a single model that takes into consideration the uncertainties in the inflows, market prices, and the system load. This integration is essential to account for the effect of interactions between the different reservoir systems on the total amount of energy produced by the system. The value of water in a given reservoir is a function of the storage level in that reservoir and of the storage levels in other reservoirs. Thus, the value of water in storage in any reservoir cannot be established unless assumptions are made about the other storage variables in the system. In this research, the main hydroelectric power generation facilities of the BC Hydro system are included in a long/medium-term stochastic optimization model. These hydroelectric power generation facilities are located on several independent river systems, for example, the Peace River (G.M. Shrum and Peace Canyon) and the Columbia River (Mica, Revelstoke, and Keenleyside).

1.3. Goals and Objectives

The main goal of this research work is to develop and implement a long/medium-term stochastic optimization model that serves as a decision support tool for the operations planning of BC Hydro's reservoir systems. The model should be able to:

• Handle the dimensionality problem of this large-scale stochastic problem in an efficient manner,
• Provide forecasts of the expected value of revenues, energy generation, and expected market transactions (imports/exports),
• Model several river systems within the BC Hydro system,
• Address the main uncertainties in the operations planning problem (inflow, market price, and demand),
• Provide the marginal value of water for the major reservoirs, and
• Deal with variable time steps and be able to address different objective functions.

To achieve these goals, several objectives were identified:

1. Acquire an in-depth understanding and knowledge of the reservoir operations planning problem in general and of BC Hydro's reservoir systems in particular. This was achieved by thoroughly investigating the modeling environment at BC Hydro and other hydroelectric power generation entities. Special attention was given to integrating the generalized optimization model (GOM) with the stochastic optimization model.

2. Carry out an extensive review of the literature on reservoir optimization techniques, with a particular emphasis on stochastic optimization techniques. This literature review was extended to include state-of-the-art techniques developed and applied in the fields of machine learning and artificial intelligence. The aim of the literature review was to assess the merits and drawbacks of the different optimization techniques and their potential to handle the complexity and dimensionality problems of the large-scale, stochastic, multiperiod, multireservoir operations planning problem.

3. Formulate the stochastic optimization model that addresses the uncertainties in inflow, system load, and market prices.

4. Investigate and develop the Reinforcement Learning (RL) technique with function approximation and sampling techniques, an approach that is becoming popular and offers the promise of handling large-scale stochastic problems.
5. Test the performance of the RL model and develop an algorithm to implement it for the optimization of the operation of BC Hydro's main reservoir systems.

1.4. Organization of the Thesis

This chapter presented the motivation, goals, and focus of the thesis. Chapter 2 reviews the different techniques and approaches for handling the reservoir optimization problem cited in the literature. Chapter 3 introduces the main concepts and computational aspects of using reinforcement learning (RL). Chapter 4 describes the methodology and the mathematical formulation adopted in the development of the RLROM model. Chapter 5 presents the results of applying the RL solution methodology to a single reservoir and of testing the extended two reservoir problem; the model was then implemented on BC Hydro's main reservoir systems on the Peace and Columbia Rivers. Chapter 6 provides conclusions and recommendations for future research work.

2. LITERATURE REVIEW

The literature review carried out in this research is presented herein from two perspectives: modeling approaches and optimization techniques. The modeling approaches commonly applied to the reservoir optimization problem can be grouped into two main categories: implicit stochastic optimization and explicit stochastic optimization. In terms of the optimization techniques, reservoir optimization models can be classified as:

• Deterministic models, including linear programming, network flow, multiobjective, non-linear programming, and dynamic optimization models;
• Stochastic optimization models, including stochastic dynamic, stochastic linear, and stochastic dual dynamic programming, chance-constrained programming, and reliability programming;
• Heuristic models, including genetic algorithms, artificial neural networks, fuzzy programming, etc.

2.1. Modeling Approaches
Multiple regression analysis o f the results o f the deterministic optimization model can be applied to generate operational rules. However, Labadie (1998) claims that regression analysis can result i n poor correlations that may invalidate the operating rules.  Explicit stochastic optimization models deal with the probabilistic nature o f random variables directly, rather than dealing with deterministic sequences. Accordingly, the optimization is performed without assuming perfect knowledge o f future events. The benefit o f this approach is to better quantify the impact o f the uncertainty i n the random variables and consequently come up with better reservoir management  decisions.  However, this approach is more computationally expensive than the implicit optimization approach.  Most o f the explicit stochastic optimization models assume that unregulated inflows are the dominant source o f uncertainty in the system and can be represented  by  appropriate probability distributions. These may be parametric or nonparametric based on frequency analysis. Other random variables that may be defined include: market prices and demands. Unregulated natural inflows may be highly correlated spatially and/or temporally. Explicit stochastic models use probability distributions o f stream flow. This requires two main simplifications to keep the dimensionality o f the problem manageable. First, discretization o f the probability data, and second, relatively simple stochastic models are usually used (e.g., lag-1 Markov model). Most inflow sequences show serial correlation (Pereira et al., 1999), and are represented in modeling inflows by a lag-1 autoregressive or multivariate model.  9  2.2. Optimization Techniques A broad array o f mathematical models has been applied for the optimization o f reservoir systems operations and management. The choice o f the modeling technique relies largely on the characteristics o f each application. The following sections briefly review optimization methods that are widely used to solve the reservoir  system  optimization problem, with a focus on the techniques that are applied i n multireservoir systems. Y e h (1985) and Labadie (1998, 2004) presented a comprehensive in-depth state o f the art review o f the optimization techniques used i n reservoir operation and management.  2.2.1.  Deterministic Optimization Models  2.2.1.1.  Linear Programming Models  One o f the most favored optimization techniques i n reservoir system models is linear programming (LP). L P requires that all the constraints and objective function be linear or be "linearizable" b y applying one o f the available linearization techniques, such as piecewise linearization or Taylor series expansion. L P models guarantee convergence to global optimal solutions. In addition, for large-scale reservoir optimization problems where the number o f variables and constraints are large, decomposition techniques such as Dantzig-Wolf or Bender decomposition technique can be used to accelerate the solution process (Yeh, 1985). L P problem formulation is easy and L P problems can be readily solved b y applying commercially available L P solvers.  Turgeon (1987) applied a monthly mixed integer L P model for site selection o f hydropower plants to be built. H i e w et al. (1989) applied L P technique to an eight reservoir system i n northern Colorado.  Shawwash et al. 
2.2. Optimization Techniques

A broad array of mathematical models has been applied to the optimization of reservoir systems operations and management. The choice of the modeling technique relies largely on the characteristics of each application. The following sections briefly review optimization methods that are widely used to solve the reservoir system optimization problem, with a focus on the techniques that are applied to multireservoir systems. Yeh (1985) and Labadie (1998, 2004) presented comprehensive, in-depth state-of-the-art reviews of the optimization techniques used in reservoir operation and management.

2.2.1. Deterministic Optimization Models

2.2.1.1. Linear Programming Models

One of the most favored optimization techniques in reservoir system models is linear programming (LP). LP requires that all the constraints and the objective function be linear or be "linearizable" by applying one of the available linearization techniques, such as piecewise linearization or Taylor series expansion. LP models guarantee convergence to global optimal solutions. In addition, for large-scale reservoir optimization problems where the number of variables and constraints is large, decomposition techniques such as the Dantzig-Wolfe or Benders decomposition technique can be used to accelerate the solution process (Yeh, 1985). LP problem formulation is easy, and LP problems can be readily solved by applying commercially available LP solvers.

Turgeon (1987) applied a monthly mixed integer LP model for site selection of hydropower plants to be built. Hiew et al. (1989) applied the LP technique to an eight reservoir system in northern Colorado.

Shawwash et al. (2000) presented an LP short-term optimization model (STOM), which has subsequently been developed to determine the optimal short-term schedules that meet the hourly load and maximize the return to BC Hydro resources from spot transactions in the Western U.S. and Alberta energy markets. Currently, the model is used in BC Hydro by the generation operations engineers to optimize the scheduling of the main reservoir systems. The authors state that using the model to develop the optimized schedule contributed between 0.25-1.0% in the form of additional revenue from sales and in the value of additional stored water. The authors indicate that one of the major benefits of using LP is the sensitivity analysis data that can be obtained from the simplex dual variables. As an example, the dual variable of the load-resource balance equation provides the system incremental cost at each time step. This information can be used in planning spot trading schedules and in real-time operation of the system.
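To illustrate how a dual variable yields an incremental cost, the toy single-period LP below meets a fixed load from hydro generation plus market purchases and reads the dual (shadow price) of the load-resource balance using SciPy's linprog (HiGHS). The numbers, bounds, and variable names are hypothetical and bear no relation to STOM's actual formulation.

```python
from scipy.optimize import linprog

# Decision variables: x = [hydro_gen (MWh), market_purchase (MWh)]
# Objective: minimize 5 * hydro_gen + 60 * market_purchase
#   (5 $/MWh is a made-up opportunity cost of water, 60 $/MWh a market price)
c = [5.0, 60.0]

# Load-resource balance (equality): hydro_gen + market_purchase = 1000 MWh
A_eq = [[1.0, 1.0]]
b_eq = [1000.0]

# Hydro generation limited by plant capacity / available water
bounds = [(0.0, 800.0), (0.0, None)]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")

# The marginal of the load balance is the system incremental cost: the cost
# of serving one more MWh of load. Here it is set by the market price,
# because hydro is already at its 800 MWh limit.
print("dispatch:", res.x)                              # [800., 200.]
print("incremental cost:", res.eqlin.marginals[0])     # ~60 $/MWh (sign follows the solver's convention)
```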
A significant advantage o f P G P is that it does not require any penalty-benefit  function, reducing the need for a detailed economic analysis.  However, one drawback o f P G P is that it does not allow trading a small degradation i n a high priority objective for a large improvement i n a lower priority objective (Loganathan and Bhattachatya, 1990). A s goal programming (GP) relies on achieving predetermined target values, the global optima for objectives may not be explored.  2.2.1.4.  Nonlinear Programming Models  Non-linear programming ( N L P ) is not as popular as L P and dynamic programming (DP) i n solving reservoir systems optimization problems. The reason is basically because the optimization process is slow and can return inferior and non-optimal solutions. However, i n cases where a problem cannot be realistically linearized, it may be solved as a N L P problem particularly with inclusion o f hydropower generation i n the objective function and/or the constraints. Labadie (1997) indicates that the most powerful and  12  robust N L P algorithms available to solve reservoir system optimization problems include: Sequential Linear Programming (SLP), Sequential Quadratic Programming (SQP), Method o f Multipliers ( M O M ) , and the Generalized Reduced Gradient Method ( G R G ) .  Recent applications o f N L P to hydropower reservoir operations include: TejadaGuibert et al. (1990), A r n o l d et al. (1994), and Barros et al. (2003). Barros et al. (2003) applied N L P model to a large scale multireservoir system in Brazil. This multiobjective optimization problem was solved applying L P and S L P using a Taylor series expansion. The authors concluded that the N L P model is the most accurate and suitable for real-time operations than the L P model.  2.2.1.5.  Dynamic Programming Models  Dynamic programming (DP) is another powerful optimization technique that has been used extensively to solve reservoir system optimization problems. Unlike L P and N L P techniques that simultaneously solve the problem for all time periods, D P algorithms decompose the original problem into sub-problems that are solved sequentially over each stage (time period). D P formulation requires the definition o f a set o f state variables to describe the system state at the beginning o f each stage and a set o f decisions that transform the current stage state to the next one. D P has the advantage o f handling nonlinear, nonconvex, and discontinuous objective and constraint functions. However, a major problem that limits the application o f D P to large-scale multireservoir systems is the exponential growth in computation time as the number o f discretized state variables increases. This is widely known as the curse o f dimensionality (Bellman, 1957).  One o f the earliest applications o f deterministic D P to reservoir operation was by Y o u n g (1967), who studied a finite horizon, single reservoir operation problem.  Various extensions have been developed to overcome the curse o f dimensionality in applying dynamic programming application to reservoir operation. Larson (1968) introduced incremental dynamic programming (IDP). The IDP procedure starts with a trial solution and the recursive D P equation examines adjacent states around the current  13  solution. If better values are obtained, then the current solution is updated. Jacobson and Mayne (1970) developed a Differential Dynamic Programming ( D D P ) technique that uses analytical solution rather than discretization o f the state space. 
One of the earliest applications of deterministic DP to reservoir operation was by Young (1967), who studied a finite horizon, single reservoir operation problem.

Various extensions have been developed to overcome the curse of dimensionality in applying dynamic programming to reservoir operation. Larson (1968) introduced incremental dynamic programming (IDP). The IDP procedure starts with a trial solution, and the recursive DP equation examines adjacent states around the current solution; if better values are obtained, the current solution is updated. Jacobson and Mayne (1970) developed a differential dynamic programming (DDP) technique that uses an analytical solution rather than discretization of the state space. Murray and Yakowitz (1979) extended this approach to constrained problems. Johnson et al. (1993) introduced the coarse grid interpolation technique. This technique relies on using larger discretization intervals; solution accuracy is retained by interpolating within the coarser grid structure.

2.2.2. Stochastic Optimization Models

2.2.2.1. Stochastic Dynamic Programming Models

Stochastic dynamic programming (SDP) is a powerful tool for studying multireservoir system operation because the stochastic nature of inflows and the nonlinear energy generation functions can be modeled explicitly. Interestingly, Yakowitz (1982) found that the first application of SDP preceded the application of deterministic DP by more than a decade. Lamond and Boukhtouta (1996, 2001) and Lamond (2003) presented surveys of stochastic reservoir optimization methods and models.

A multireservoir, multiperiod SDP model is formulated by considering the multiperiod optimization in stages, with each stage corresponding to one period. Release decisions are made to maximize current benefits plus the expected benefits from future operation, which are represented by a recursively calculated cost-to-go function. Solution of the SDP model for a multireservoir system yields the cost-to-go function and a release policy decision rule for each time period as a function of the system state variables.

Since the optimization is performed conditionally on all discrete combinations of the state vector, the specter of the curse of dimensionality arises. For a multireservoir model with m discretization levels for n reservoirs, computational time and storage requirements are proportional to m^n.
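Written out explicitly, the recursion sketched above takes the standard Bellman form; the notation here is generic (storage vector $S_t$, hydrologic state $I_t$, release decision $R_t$) and is meant only to restate the idea, not to anticipate the formulation developed later in the thesis:

$$
V_t(S_t, I_t) \;=\; \max_{R_t}\Big\{\, B_t(S_t, R_t) \;+\; \mathbb{E}_{I_{t+1}\mid I_t}\big[\, V_{t+1}(S_{t+1}, I_{t+1}) \,\big] \Big\},
\qquad
S_{t+1} = S_t + I_t - R_t - \mathrm{Spill}_t .
$$

Because $V_t$ must be tabulated for every discrete combination of the state vector, a system with n reservoirs and m storage levels each requires $m^n$ evaluations per stage; for example, m = 20 and n = 5 already gives $20^5$ = 3.2 million state combinations.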
The state variable typically includes the volume of water in reservoirs and sometimes a description of current or forecasted hydrological conditions (Kelman et al., 1990). A periodic Markovian process typically describes reservoir inflows in SDP models. The choice of the inflow state variable in an SDP model depends on the system's characteristics as well as on the information available for decision-making. In addition, computational considerations often influence how hydrologic information is represented in SDP. Huang et al. (1991) applied four types of representations of the inflow state variable to the Feitsui reservoir SDP optimization model in Taiwan. The authors found that using previous period inflows resulted in superior performance compared to the use of present period inflows. Piccardi and Soncini (1991) found that policies derived from an SDP model without a hydrologic state variable resulted in simulated performance similar to that of policies derived using the previous period's inflow, although the SDP and simulation agreed more closely when the previous period's inflows were employed.

Tejada-Guibert et al. (1995) examined the value of hydrologic information in SDP multireservoir models by using different hydrological state variables for the Shasta-Trinity subsystem of the Central Valley Project in California. The SDP policies were then compared using a simulation model assuming that the current inflows were known. The authors applied four types of models with different inflow state variables and concluded that the value of using sophisticated inflow forecasting depends on several system characteristics, including the relative magnitude of water and power demands and the severity of the penalties on shortages.

Turgeon (2005) applied a parametric approach in which the inflows are represented by a linear autoregression (AR) model used to solve the SDP reservoir management problem. Instead of using the traditional lag-1 models, the author stated that there are many advantages in considering multilag autocorrelation of inflows. To avoid an increase in the state space, the multilag autocorrelation of inflows was represented by the conditional mean of the daily inflow.

The use of SDP to optimize multireservoir systems is usually accompanied by the assumption that the various natural inflows are not cross-correlated. This results in solutions that provide only a rough estimate of the optimal design or operation policy. To handle this problem, Yeh (1985) suggested separating the DP optimization and the stream flow generation, or using aggregation/decomposition methods similar to those proposed by Turgeon (1980).

a) Dynamic Programming with Successive Approximation

The dynamic programming with successive approximation (DPSA) method consists of breaking up the original multi-state variable problem into a series of one-state variable sub-problems in such a manner that the sequence of optimizations over the sub-problems converges to the solution of the original problem. Davis et al. (1972) used DPSA to determine a local feedback policy for each reservoir in a network of reservoir-hydroplant complexes in parallel. Pronovost and Boulva (1978) used Davis' method to obtain an open-loop policy, which gives near-optimal results rather than local feedback, to eliminate the drawback of this method. Turgeon (1980) concluded that to obtain an open-loop policy solution, the successive approximation method must be solved repetitively at the beginning of each period, which may be computationally costly.

b) Aggregation and Decomposition SDP

Turgeon (1980) introduced the aggregation and decomposition method, which consists of breaking up the original n-state variable stochastic problem into n stochastic sub-problems of two state variables that are solved by SDP. The final result of this method is a suboptimal global feedback operating policy for the system of n reservoirs. Furthermore, Turgeon (1980) assumed that the electrical energy generated by any plant is a constant times the turbine releases. Accordingly, instead of utilizing the reservoir storage as a state variable, a potential energy term is created to treat the nonlinearity of the power generation function.

Turgeon (1980) applied the DPSA and the aggregation/decomposition methods to a network of six reservoir-hydroplant complexes. In his comparative study he concluded that the latter gives a better operating policy with the same time and computer memory requirements. In addition, the computational effort of the aggregation/decomposition method increases linearly with the number of reservoirs, since for each additional reservoir only one additional two-state DP problem has to be solved.

Valdes et al. (1992) applied this technique to the four reservoir lower Caroni hydropower system in Venezuela. Disaggregation was performed both spatially and temporally, resulting in daily operational policies from the monthly equivalent reservoir policies. Saad et al. (1994) incorporated neural networks to improve the disaggregation process and to account for nonlinear dependencies between the system elements.
The method was successfully applied to finding long-term operational policies for Hydro-Quebec's five reservoir hydropower system on the La Grande River. Labadie (1997) indicated that the main problem with the use of state aggregation/decomposition methods is the loss of information that occurs during the aggregation process.

c) The Principle of Progressive Optimality

Turgeon (1981) presented an algorithm based on the principle of progressive optimality of Howson and Sancho (1975), for which the state variables do not have to be discretized. He applied the technique to determine the optimal short-term scheduling of a multireservoir system consisting of four hydropower plants in series. The algorithm does not have the recursive equation in terms of the optimal value function and might be considered a multidimensional continuous version of the IDP procedure.

This approach has the advantage of dealing with discontinuous return functions and with hydropower production functions that do not have to be linearized or approximated by convex linear functions. Also, there is no problem of dimensionality, since only one trajectory of the reservoir storage must be stored in computer memory. As this iterative procedure depends on the selected initial solution, Turgeon (1981) proposed the use of DPSA with a very coarse grid of the state variables to obtain a good trial trajectory before using this approach, which can then be solved by a direct search method.

d) SDP with Function Approximation

The discretized cost-to-go function can be approximated by a continuous function. Since an approximate value for the cost-to-go function is only calculated at the discretized state values, the value of the function at other points can be estimated by interpolating nearby grid points. Several authors have explored the reduction in computational effort possible when multivariate polynomials or piecewise polynomial functions are employed in SDP algorithms (Foufoula-Georgiou and Kitanidis, 1988; Johnson et al., 1989; Johnson et al., 1993; Tejada-Guibert et al., 1993; Lamond, 2003). Tejada-Guibert et al. (1993) concluded that computational savings are possible, mainly because: (1) the improved accuracy of higher order functions results in good solutions even with a coarse state space discretization, and (2) efficient gradient-based optimization algorithms can be used to compute better approximations to the optimal solutions.

Johnson et al. (1993) applied a high order spline function to approximate the cost-to-go function so that a coarse discretization of the state space could be used. The spline is constructed of individual multivariate cubic polynomials, each defined over a sub-region of the state space domain. The spline coefficients were determined by requiring that the spline interpolate the cost function values at each state space grid point. This approach proved successful in reducing the solution time for systems with two to five reservoirs. Tejada-Guibert et al. (1993, 1995) applied these piecewise cubic functions to approximate the cost-to-go function for the five hydropower plants of the Shasta-Trinity system in Northern California. They also recommended the use of a sampling SDP algorithm suggested by Kelman et al. (1990) as an attractive approach to describing the distributions and temporal correlations of inflows.
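As a minimal illustration of this idea (and not the spline scheme of Johnson et al.), the sketch below fits a cubic spline to cost-to-go values computed on a coarse storage grid and then evaluates the function, and its derivative (a marginal value), at storage levels that lie between grid points; the sample values are hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical cost-to-go values computed by SDP on a coarse storage grid.
storage_grid = np.array([0.0, 25.0, 50.0, 75.0, 100.0])     # % of live storage
cost_to_go = np.array([0.0, 180.0, 310.0, 400.0, 455.0])    # $ million (made up)

# Continuous approximation of the value function between grid points.
value_fn = CubicSpline(storage_grid, cost_to_go)

s = 62.5                                   # a storage level not on the grid
print("approx. value:", value_fn(s))       # interpolated cost-to-go
print("marginal value:", value_fn(s, 1))   # first derivative = value of one more unit of storage
```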
Lamond (2003) applied a piecewise polynomial approximation of the future value function for a single hydroelectric reservoir model. The authors concluded that the adopted method is faster than both discrete DP value iteration and a continuous DP method using splines on a fixed grid. They also suggested that spline approximation is not well suited when the rewards are piecewise linear.

Lamond and Boukhtouta (2005) applied the neuro-dynamic programming (NDP) approach of Bertsekas and Tsitsiklis (1996) to approximate the cost-to-go function by a neural network. They applied NDP to compute an approximate optimal policy for the control of a single hydroelectric reservoir with random inflows and concave, piecewise linear revenues from electricity sales, taking into account head variations and turbine efficiency. Their NDP approach is based on backward induction of a feed-forward neural network with an input layer, a hidden layer, and a single output layer to approximate the future value function.

Lamond and Boukhtouta (2005) concluded that the NDP approximation architecture gives very smooth approximate functions, which allowed the use of a coarse discretization of the state and inflow variables in the training step of the neural functions. Their findings reinforce and confirm Bertsekas' (2001) claim that NDP can be impressively effective in problems where the traditional DP methods would be hardly applicable.

2.2.2.2. Stochastic Linear Programming

Stochastic linear programming (SLP) deals with uncertainty in the model parameters by considering a number of scenarios. Each scenario describes the values of the uncertain parameters and their probability of occurrence. The primary advantage of scenario-based stochastic models is the flexibility they offer in modeling the decision process and in defining scenarios, particularly when the state dimension is high. However, the difficulty with this modeling approach is that an ample number of scenarios results in a large-scale linear programming problem, which in turn requires special solution algorithms that rely mainly on decomposition approaches.

Stochastic linear programming with recourse (SLPR) utilizes scenarios to represent the uncertainty in model parameters in the form of stages. SLPR in its simplest form subdivides the problem into two stages. The first stage decisions are proactive or planning decisions, which are made with certainty, while the second stage decisions are reactive or operating decisions. Accordingly, SLPR models support the "here and now" decision, while providing a number of "wait and see" strategies depending on which scenario unfolds. These models are non-anticipative: in each stage, decisions must be made without knowledge of the realization of random variables in future stages.
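In generic notation, the two-stage recourse problem just described can be written schematically as follows (first stage decision $x$, recourse decision $y_s$ for scenario $s$ with probability $p_s$; the symbols are illustrative, not those used later in this thesis):

$$
\min_{x \,\ge\, 0,\ Ax = b}\ c^{\top}x \;+\; \sum_{s=1}^{S} p_s\, Q(x,\xi_s),
\qquad
Q(x,\xi_s) \;=\; \min_{y_s \,\ge\, 0}\ \big\{\, q_s^{\top} y_s \;:\; W y_s = h_s - T_s x \,\big\}.
$$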
The first applications o f two-stage and multi-stage S L P to reservoir management (Pereira and Pinto, 1985, 1991) used the Benders Decomposition Method (Benders, 1962: V a n Slyke and Wets, 1969). This method is powerful because it allows a large-scale problem to be solved iteratively. Moreover, using this technique in a nested form allows multi-stage problems to be decomposed by both scenario and decision period (Birge, 1985).  Jacobs et al. (1995) applied S L P using Benders decomposition to a three reservoir hydropower system in Northern California. Decomposition o f the linear programming problem into smaller network  flow  optimization problems resulted in significant  computational savings over attempts at direct solution.  Dantzig and Infanger (1997) combined Benders decomposition (Benders, 1962) with the importance sampling technique to reduce the variance o f the sample estimates. The dual o f the multistage formulation measures the impact o f future responses, which is fed back to the model's present time i n the form o f cuts. These cuts are sequentially added at different stages o f the multi-stage dynamic system. Dantzig and Infanger (1997) indicated that thesecuts constitute a set o f generated rules that guide the control problem to balance between present and future costs and drive the system away from future infeasibilities and towards optimality. Kracman et al. (2006) developed a multistage multiobjective S L P  20  reservoir planning model for the Highland Lakes system in Texas applying generated scenarios using a quantile sampling technique. The authors state that this scenario generation technique, which was adopted, preserves the spatial correlation o f the random inflows.  2.2.2.3.  Stochastic Dual Dynamic Programming  Stochastic Dual Dynamic Programming ( S D D P ) developed by Pereira  (1989)  represents an interesting m i x o f stochastic linear and dynamic programming optimization techniques. S D D P solves a multidimensional stochastic dynamic programming problem and it approximates the future cost function ( F C F ) as a piecewise linear function. Unlike conventional S D P , which discretizes the state space and solves the F C F for all points, S D D P samples the state space and solves the D P problem iteratively. The S D D P approach, as presented by Pereira et al. (1999), is described in the following paragraphs.  The first phase starts with a backward recursion calculation o f the F C F . The slope o f the F C F around a given state is calculated by solving a series o f one stage L P s for each inflow scenario. The slopes o f the F C F at the different states are estimated from the dual variable o f the mass balance constraint, as these multipliers represent the change in the objective function value with respect to storage (df /dS). The resulting cost-to-go, which is based on the highest value in each state (convex hull), represent a lower bound for the actual F C F . In the second phase a Monte-Carlo simulation is performed in a forward pass which simulates the system operation using the release policy calculated so far. Similar to the backward recursion calculations, a set o f one stage L P problems has to be solved for each inflow scenario. The upper bound o f the F C F is estimated as the mean o f the Monte Carlo simulation results. To address the uncertainty around the true expected value o f the cost function, Pereira et al. (1999), used the 95% confidence intervals to estimate the confidence limits around the true values.  
The Optimal solution is obtained i f the lower bound o f the F C F lies inside the confidence limits. If not, a new iteration with backward and forward calculations has to  21  be performed adding additional sets o f states (additional cuts or plans to the F C F ) . The states that the simulation passes through are used in the new backward recursion.  It should be noted that the planes obtained in each iteration are retained and the new planes are added to the set generated so far. This is in contrast to the conventional S D P where a new cost-to-go table has to be developed in each stage.  Pereira (1989), and Pereira and Pinto (1991) applied the S D D P to a hydrothermal scheduling problem in Brazil. Rotting and Gjelsvik (1992) applied the S D D P to seasonal planning for system o f 35 reservoirs in a 28 river systems, which represents a part o f the Norwegian hydropower system. The system is operated to minimize the thermal operating costs while considering the terminal value o f water storage. They concluded that the S D D P procedure is successful and convergence o f the algorithm is obtained with a saving in run time over the basic S D P approach by a factor o f 16. Halliburton (1997) however, states that: "convergence is questionable for both the U . S . Bonneville Power Administration ( B P A ) and N e w Zealand systems". He summarized the difficulties with S D D P as: non convergence, long C P U time, difficulties in setting the large number o f interacting penalties, and the inability to handle certain type o f constraints (non convex, applying across a number o f time periods, etc...).  2.2.2.4.  Chance-Constrained Programming and Reliability Programming  Chance-Constrained Programming ( C C P ) considers the probability conditions on constraints. Typically, the probability o f satisfying a constraint is required to be greater than a threshold value. These constraints have the impact o f tightening the restrictions on reservoir releases at the desired risk levels, thereby encouraging more conservative operational strategies. C C P converts a stochastic type problem to a deterministic-type one, and then solves the deterministic equivalent.  Loucks and Dorfman (1975) concluded that the use o f chance constraints leads to overly conservative rules for reservoir operation. Takeuchi (1986) invoked a C C P model to solve a real-time reservoir operation problem. The chance-constraints were set on the  22  probability o f the reservoir becoming empty. Changchit et al. (1989) combined C C P with goal programming to operate a multiple reservoir system.  Y e h (1985) concluded that C C P formulations neither explicitly penalize the constraint violation nor provide recourse action to correct constraint violations as a penalty. Hogan et al. (1981) warned that the practical usefulness o f C C P is seriously limited and it should not be regarded as substitution for stochastic programming. Labadie (1997) indicated that the C C P does not represent the true risk that must be estimated by performing Monte Carlo analyses on the proposed operational policies.  Colorni and Fronza (1976) initiated the application o f reliability programming (RP) to the reservoir management problem that was regarded as an extension to the C C P . In their model, risk was accounted for by choosing different probability values that constrain the degree o f satisfying the contracted release. 
Reznicek and Cheng (1991) expressed the probability o f the constraints as decision variables and were therefore incorporated i n the objective function.  2.2.3.  Heuristic Models  Heuristics methods are criteria, methods, or principles for deciding that among several alternative courses o f action are the most effective in achieving certain goals (Pearl, 1984). Heuristic algorithms cannot guarantee global optimum solutions, however they are well-suited to problems that are difficult to formulate and solve by applying algorithmic methods (e.g. non-linear-nonconvex functions). Genetic algorithms ( G A ) and artificial neural networks are the most commonly used heuristic methods for the reservoir operations planning problem.  Recently, Ant Colony Optimization ( A C O ) algorithms, which are evolutionary methods based on the foraging behavior o f ants, have been successfully applied to a number o f benchmark combinatorial optimization problems (Dorigo et al., 2000). A C O was inspired by the behavior o f ants in finding the shortest route between their nest and a  23  food source. Jalai et al. (2006) applied ant A C O to the Dez reservoir in Iran for a finite horizon and a single time series o f inflows. The authors concluded that the A C O algorithm provided improved release policies as compared with another G A model. The same conclusion o f A C O outperforming the G A algorithm was found by Kumar and Reddy (2006) who applied A C O to a multipurpose reservoir i n India.  2.2.3.1.  Genetic Algorithm  Genetic algorithm ( G A ) is a powerful population oriented search method based on the principle o f Darwinian natural selection and survival o f the  fittest.  G A performs  optimization through a process analogous to "the mechanics o f natural selection and natural genetics" (Goldberg, 1989).  Genetic algorithms (strings/chromosomes),  deal with a population o f individual candidate  solutions  which undergo changes by means o f genetic operations o f  reproduction through selection, crossover, and mutation operations. These solutions are ranked according to their fitness with respect to the objective function. Based on their fitness values, individuals (parents) are selected for reproduction o f the next generation by exchanging genetic information to form children (crossover). The parents are removed and replaced i n the new population by the children to keep a stable population size. The result is a new population (offspring) with normally better fitness. After a number o f generations, the population is expected to evolve artificially, and the (near) optimal solution w i l l be reached. The global optimum solution however cannot be guaranteed since the convexity o f the objective function can't be proven. The G A adjusts populations of release rule structures based on values o f the fitness (objective) function values according to the hydrologic simulation model results.  Wardlaw and Sherif (1999) successfully applied G A to a four-reservoir system in which a global optimum was achieved. The authors concluded in their evaluative study that G A provides robust and acceptable solutions and could be satisfactorily used in realtime operations with stochastically generated inflows. Haung et al (2002) applied a genetic  algorithm  based-stochastic  dynamic  programming  to  cope  with  the  24  dimensionality problem in a parallel multireservoir system in northern Taiwan to derive a joint long-term operation policy. 
Haung et al (2002) concluded that although the use o f GA-based S D P may be time consuming as it proceeds from generation to generation, the model could overcome the "dimensionality curse" in searching solutions. Reis et al. (2005) proposed a hybrid genetic algorithm and linear programming approach for multireservoir operation planning. Their model handled the stochastic future inflows by a three stage tree o f synthetically generated inflows. They applied their approach to a hypothetical hydrothermal four reservoir system and compared the results with a S D D P model. The authors concluded that the hybrid scheme offers some computational advantages over the S D D P model. However, it is computationally more time consuming.  2.2.3.2.  Artificial Neural Networks  A n artificial neural network ( A N N ) is a model inspired by the structure o f the brain that is well suited to complicated tasks such as pattern recognition, data compression and optimization. In neural network terminology, a formal neuron simulates the behavior o f the biological neuron whose dendrites collect the energy from its input signals and whose axon transmits a signal to other neurons. In the formal neuron, the energy from the dendrites  is presented  by a weighted sum o f the input variables, and the axon  transmission is represented by applying a transfer function to the weighted sum o f inputs. The training o f the A N N is usually performed using a gradient-type back propagation procedure, which determines the values o f the weights on all interconnections that best explain the input-output relationship.  A N N has been used within S D P models to approximate the "cost-to-go" function with fewer sampling points. Saad et al. (1994) applied an A N N to the long-term stochastic operation o f the hydroelectric multireservoir system o f Quebec's L a Grande River. The neural network was trained to disaggregate the storage level o f each reservoir o f the system for an aggregated storage levels for the system. The inputs to the network are the aggregated storage levels determined by S D P for the aggregated reservoirs. The neural network is trained by applying a large number o f equally likely stream flow  25  sequences.  Saad et al. (1994) concluded that in comparison with the principal  components approach, A N N is more efficient.  Raman and Chandramouli (1996) used A N N to obtain optimal release rules based on initial storage, inflows, and demands for A l i y a r reservoir in Tamil Nadu, India. The A N N was trained by applying the results  o f a deterministic D P model. Raman and  Chandramouli (1996) concluded that simulation o f operation with rules obtained by the trained A N N outperformed those produced by linear regression analysis, as well as optimal feedback rules obtained from explicit stochastic optimization using S D P .  2.2.3.3.  Fuzzy  Programming  Several researchers have used fuzzy set theory and fuzzy logic to deal with uncertainties associated with the reservoir operation problem. Fuzzy set theory is generally used to describe imprecision and vagueness. In fuzzy logic, variables are partly represented by several categories and the degree o f belongingness to a set or category can be described numerically by a membership number between 0 and 1.0. Russell and Campbell (1996) used fuzzy logic to derive operating rules for a hydroelectric plant, where the inflow and price o f energy can vary. Tilmant et al. 
(2001) developed a fiizzy S D P approach with fiizzy objectives and fuzzy intersections between immediate and future release decisions consequences. Mousavi et al (2004) developed a technique called fuzzy-state  stochastic dynamic programming (FSSDP) for reservoir operation that  considers the uncertainties in the hydrologic variables and the imprecision due to variable discretization as fiizzy variables. The transition probabilities are considered by defining a fuzzy Markov chains.  2.3. Sampling Techniques In the applications o f stochastic programming models for the reservoir optimization problem we are usually faced with the problem o f how to represent the random variables (inflow, demand, prices). The problem becomes rather complex with multivariate random vectors, particularly i f these vectors are correlated. Generation o f data trajectories or  26  scenarios represents the most challenging and time consuming part o f the solution o f stochastic optimization problems. The objective is to generate trajectories or scenarios that best approximate the given distribution o f the random variables in a computationally manageable way in the optimization model. Different sampling-based approaches have been proposed to handle the problem o f generating scenarios. A number o f these methods have been presented by Kaut and Wallace (2003). The following is a brief overview o f the generation o f data trajectories and sampling methods.  2.3.1.  Time Series Models  Time series models are intended to replicate the spatial and temporal structure o f the random variables. Examples o f time series models include: Autoregressive models, M o v i n g Average Models, and Bayesian Vector Autoregression model ( V A R ) . M a n y o f the  reported  applications  o f S D P in reservoir  management  models  use  lag-1  autoregressive or multivariate model. The use o f time series to generate data trajectories involves selecting a model and estimating its parameters. These two steps add to the uncertainty o f the analysis. V o g e l and Stedinger (1988) have documented that errors arising from parameter estimation often overwhelm issues o f model choice.  2.3.2.  Conditional Sampling  Because o f its simplicity, conditional sampling is the most popular method for generating scenario trees in stochastic programming. It is based on approximating probability measures by empirical ones generated by random samples. Because o f computational restrictions, these samples cannot be very large, so the empirical measures can be poor approximations o f the original ones. Pennanen and K o i v u (2002) show that modern integration quadratures provide a simple and attractive alternative to random sampling. These quadratures are designed to give good approximations o f given probability measures by a small number o f quadrature points. Loretan (1997) applied principal component analysis to reduce the dimensionality o f the scenario tree. Sampling from principal components, allows correlated random vectors to be obtained.  27  2.3.3.  Sampling from Specified Marginals and Correlations  Many  techniques  are available to generate random  distributions (Devroye,  1986). These  techniques  samples  from  univariate  are not applicable i n sampling  multivariate vectors particularly i f they are correlated. In the case o f multivariate distributions, some algorithms were developed assuming the correlation matrix and marginal distributions (beta, lognormal, Pearson, e t c . . ) are fully specified (e.g. Cario and Nelson, 1997). 
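As a simple illustration of sampling correlated random vectors with specified marginals, the sketch below draws correlated standard normals through a Cholesky factor of a target correlation matrix and maps each component through the inverse CDF of an assumed marginal distribution (a Gaussian copula-style construction, not any of the specific algorithms cited in this section). The correlation matrix and marginal parameters are hypothetical, and the correlations are only approximately preserved after the marginal transform.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Target correlation between, say, two inflow sites and load (hypothetical).
corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.2],
                 [0.3, 0.2, 1.0]])

# 1) Sample correlated standard normals via the Cholesky factor of the matrix.
z = rng.standard_normal((5000, 3)) @ np.linalg.cholesky(corr).T

# 2) Map each column through the normal CDF to uniforms, then through the
#    inverse CDF of the desired marginal distribution.
u = stats.norm.cdf(z)
inflow_a = stats.lognorm.ppf(u[:, 0], s=0.4, scale=900.0)   # m^3/s
inflow_b = stats.lognorm.ppf(u[:, 1], s=0.5, scale=300.0)
load     = stats.norm.ppf(u[:, 2], loc=7000.0, scale=600.0) # MW

samples = np.column_stack([inflow_a, inflow_b, load])
print(np.corrcoef(samples, rowvar=False).round(2))  # roughly reproduces `corr`
```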
Other algorithms sample correlated random variables applying partially specified multivariate distributions (e.g. L u r i and Goldberg 1998). However the user specifies the marginal moments. The various algorithms also differ i n the degree to which dependencies among variables are specified. Most algorithms require only the correlation matrix, but a few require higher order product moments. Parish (1990) presented a method for generating random variables from multivariate Pearson distribution, with the knowledge o f all product moments to the fourth order.  2.3.4.  Moment Matching  This method relies on describing the marginal distributions by their moments (mean, variance, skewness, kurtosis, etc.) as well as a correlation matrix, and possibly other statistical properties. Hoyland and Wallace (2001) developed a scenario generation algorithm, which constructs multi-dimensional scenario trees with specified moments and correlations, by solving a single, very large, least squares problem. T o improve the speed of the solution procedure, Hoyland et al. (2003) introduced a new algorithm that speeds up the procedure b y decomposing the least squares problem into n univariate random variables, each satisfying a specification for the first four moments. Then, the different marginal distributions were combined so that the joint distribution satisfies the specified correlations  and moments  by applying a Cholesky decomposition and a cubic  transformation i n an iterative procedure. Lurie and Goldberg (1998) applied a similar multivariate decomposition approach but starting with parametric marginal distributions instead o f the marginal moments.  28  Although Hoyland et al. (2003) could not guarantee convergence to their proposed procedure, they concluded that their experience shows that it would converge i f the moment's specifications were possible and there were enough scenarios. They also stated that a potential divergence or convergence to the wrong solution is easy to detect. Accordingly, there is no risk o f ending up using the incorrect tree i n the optimization procedure. In terms o f computer time, they found trees with 1000 scenarios representing 20 random variables took less than one minute.  2.3.5.  Path Based Methods  These methods start by generating several data paths (or fans), which can be done through the use o f parametric or nonparametric methods as suggested by Dupacova et al. (2000). In many application areas there exist advanced continuous and discrete time stochastic models and historical time series that serve to calibrate these models. A global scenario generation can be achieved with the calibrated model, b y simulating many sample paths. These models employ a specified type o f probability distributions. Nonparametric methods can be applied to large families o f probability distributions, which cannot be indexed by a finite dimensional parameter (distribution free methods). The next step is to delineate the initial structure o f the scenario tree, i.e. the number o f stages and the branching scheme. The additional step to build the scenario tree includes applying ad hoc methods, by cutting and pasting the data paths i n an intuitive way. The other possibility, as proposed b y Birge and M u l v e y (1996), is to apply cluster analysis i n a multi-level clustering or bucketing scheme that exploits the whole sequences o f observed/simulated data.  2.3.6.  
Optimal Discretization  Pflug (2001) developed a method for constructing a scenario tree with optimal discretization on the basis o f a simulation model o f the underlying stochastic process b y using a stochastic approximation technique. This method is different from other methods  29  described earlier i n that it constructs the whole scenario tree at one time. However, the method deals only with univariate processes.  2.3.7.  Scenario Reduction  This method involves developing a much smaller number o f scenarios, and it determines a scenario subset o f prescribed cardinality or accuracy and a probability measure based on this set that is the closest to the initial distribution in terms o f a natural probability metric. A l l deleted scenarios have probability zero. Romisch and Heitsch (2003) presented two new algorithms for computing optimally reduced probability measures approximately. One advantage o f the reduction concept is its generality. N o requirements  are imposed on the stochastic data processes (e.g. the dependency or  correlation structure o f the scenarios, the scenario probabilities or the dimension o f the process).  2.3.8.  Interior Sampling Methods  Interior sampling is an another class o f sampling methods i n which several samples are used at different steps o f a particular optimization procedure, for example to estimate function values, gradients, optimality cuts, or bounds, corresponding to the second-stage expected value function. Higle and Sen (1991) suggested stochastic decomposition methods. Infanger (1994) applied importance sampling that generates samples within the L-shaped algorithm for stochastic linear programming. Importance sampling is typically presented as a method for reducing the variance o f the expected estimate o f a stochastic variable by carefully choosing a sampling distribution.  2.4. R L Approach Conventional optimal control methods, dynamic programming for instance, suffer from the 'curse o f dimensionality', wherein the large dimensionality o f the system at hand and the exponential growth o f its possible states prohibit the attainment o f an  30  optimal solution even using the fastest computers available today, and most likely in the future. The literature survey conducted has revealed that this area o f research is still very active, as new solution techniques are being investigated and developed.  One possible angle from which the problem can be tackled is through the use o f machine learning techniques from the field o f artificial intelligence ( A l ) , particularly Reinforcement Learning ( R L ) . R L has two key advantages over conventional control methods: the potential for learning how to control a larger system in a shorter time, and the ability to do so with or without a formal model o f the system. Reinforcement learning ( R L ) has adapted key ideas  from various disciplines namely: machine  learning,  operations research, control theory, psychology, and neuroscience to produce some very successful engineering applications (Sutton and Barto 1998).  R L overcomes the curse o f dimensionality through the use o f function approximation, which allows R L to use much larger state spaces than classical sequential optimization techniques such as dynamic programming. In addition, using sampling, R L can be applied to large-scale problems where it is too complicated to explicitly evaluate and enumerate all the state transition probabilities. 
Modern reinforcement learning could be applied to both trial and error learning without a formal model o f the environment, and to planning activities with a formal model o f the environment, where an estimate o f the state-transition probabilities and immediate expected rewards could easily be evaluated.  Sutton and Barto (1998), Bertsekas and Tsitsiklis (1996) state that: " R L has become popular as an approach to artificial intelligence because o f its simple algorithms and mathematical foundations and also because o f a series o f successful applications". Sutton (1999) concluded that this approach has already proved to be very effective in many applications as it has produced the best o f all known methods for playing backgammon (Tesauro, 1995), dispatching elevators (Crites at al. 1996), job-shop scheduling (Zhang W . and Dietterich 1996), and assigning cellular-radio channels (Singh and Bertsekas 1996).  31  Ernst et al. (2003) applied R L for power systems stability control. Abdulhai et al. (2003) applied R L for true adaptive traffic signal control. In the water resources sector the application o f this approach has been very limited. W i l s o n (1996) applied the R L technique in the real-time optimal control o f hydraulic networks. Bhattacharya et al. (2003) successfully applied the R L technique in real time control ( R T C ) to Delfland water system in the Netherlands, which includes Delft, Hague, and part o f Rotterdam covering an area o f about 367 k m and consisting o f about 60 polders with 12 pumping 2  stations. Bhattacharya et al. (2003) concluded that in all applications involving some sort o f control functions (urban drainage systems, polder water level maintenance, and reservoir operation), R L has substantial potential.  R L is a machine learning approach that can be used to derive an optimal control strategy. R L concerns the problem o f a learning agent interacting with its environment to achieve a goal (Sutton, 1999). The agent continuously maps situations to actions so as to maximize a reward signal. The learner is not told what to do, as i n most forms o f machine learning techniques, but instead must discover which actions yield the most rewards by trying them (Sutton, 1999). These two characteristics, trial and error search and delayed reward, are the two most important distinguishing features o f reinforcement learning.  2.5. Conclusions The literature" review carried out shows that this area o f research is still very active and that different optimization approaches and modeling techniques are being tried for dealing with the reservoir systems optimization problem. The review shows that employing an explicit stochastic optimization approach would be the most advantageous since it provides the best representation o f this complex problem. The main obstacle that needs to be addressed and resolved, however, is the high dimensionality o f the problem.  From this literature review, it can be concluded that D P algorithms remains a very powerful  technique  for  handling the  nonlinear,  stochastic  large-scale  reservoir  optimization problem. A m o n g the numerous efforts attempted to alleviate the curse o f  32  dimensionality problem, which is aggravating the large-scale S D P method, approximation techniques and/or sampling techniques resulted  i n some  function successful  applications in the multireservoir hydropower generation operations planning problem. 
One promising approach addressing the possibility o f combining these two techniques (function approximation and sampling techniques) within an S D P formulation is the Reinforcement  Learning ( R L ) technique.  The following chapter presents the main  concepts and computational aspects o f R L techniques.  33  3.  THE REINFORCEMENT LEARNING APPROACH  3.1. Introduction Reinforcement  learning ( R L ) is a computational  approach  for learning  from  interactions with an environment and from the consequences o f actions to derive optimal control strategies. R L has adapted key ideas from various disciplines namely: machine learning, operations research, control theory, psychology, and neuroscience (Sutton and Barto, 1998). R L has become popular as an approach to artificial intelligence because o f its simple algorithms and mathematical foundations (Bertsekas and Tsitsiklis, 1996) and because o f a number o f successful applications in different domains, e.g. control problems, robot navigation, economics and management, networking, games,  etc...  (Sutton, 1999).  The successful applications o f R L surveyed and the key advantages that R L offers i n handling large-scale problems provided the motivation to research the possibility o f applying this approach to solve the large-scale problem o f operation planning o f multireservoir systems.  The following sections o f this chapter introduce the main concepts and computational aspects o f the R L methods and presents the distinguishing features and elements o f the R L approach including the trial and error learning o f policies, the concept o f delayed rewards, and the exploration and exploitation o f policies. The chapter also focuses on the three main classes  o f methods to solve the R L problem, namely, (1) dynamic  programming algorithms and its relation with Markovian decision process ( M D P ) and the Bellman principle o f optimality, (2) Monte Carlo methods, and (3) the TemporalDifference learning methods. More advanced R L methods that unify the basic ideas o f the above three methods are also described; these include the eligibility traces and function approximation, and generalization. More comprehensive reviews o f R L can be  34  found in Sutton and Barto (1998), Kaelbling et al. (1996), and Bertsekas and Tsitsiklis (1996).  3.2. Reinforcement Learning Approach versus Dynamic Programming Dynamic Programming (DP) is a very powerful technique for handling sequential, nonlinear, and stochastic optimization problems. D P guarantees the attainment o f optimal solutions to M D P s . However, D P requires that values o f the transition probabilities and the transition rewards o f the system be calculated. The transition probabilities and the expected immediate rewards are often known as the theoretical model o f the system. For large scale systems that involve several stochastic variables, constructing the model o f the environment is quit a difficult task. Gosavi (2003) states that: "Evaluating the transition probabilities often involves evaluating multiple integrals that contain the probability density functions (pdfs) o f random variables. It is for this reason that D P is said to be plagued by the curse o f modeling".  Compared with D P methods, linear programming methods (LP) can also be used to solve M D P s . 
Sutton and Barto (1998) indicated that LP becomes impractical at a much smaller state space (by a factor of about 100) and concluded that for the largest problems, DP methods are the only feasible and practical solution methods.

For DP problems, assuming a system with m state discretizations and n reservoirs, the computational time and storage requirement is proportional to m^n. Consider the case of a system with a hundred state discretizations for each of two reservoirs: the number of possible state combinations in one period is 100² = 10⁴. This exponential increase in the state space is often known as the curse of dimensionality (Bellman, 1957). Assuming that there are five possible actions for each state, the transition probability matrix of each action would consist of 5 × 10⁴ × 10⁴ = 5 × 10⁸ elements. As this simple example shows, it is obvious that for such problems with a large state space, storage of the transition matrices will be a difficult task indeed.

Obviously, the two main problems limiting the capabilities of DP are the excessive memory needed to store large tables and the very long computation time required to fill those tables. One possibility to tackle these problems is through the use of machine learning techniques from the field of artificial intelligence (AI), particularly Reinforcement Learning (RL). RL offers two key advantages in handling problems that are too large to be solved by conventional control methods:

1. The ability to solve MDPs with or without the construction of a formal model of the system. By using sampling (simulation), RL can be applied to large-scale problems that are too complicated to explicitly evaluate and enumerate all the transition probabilities and the expected immediate rewards of the system. In this way RL provides a means of avoiding the curse of modeling.

2. The potential for learning how to control a larger system. RL can overcome the curse of dimensionality through the use of function approximation methods. For small-scale problems, RL stores the elements of the value function in lookup tables called Q-tables (tabular methods). However, as the state space increases, RL can use function approximation methods, which require only a limited number of parameters to approximate the value function of a large number of states. Figure 3.1 highlights the difference in methodology between DP and RL.

[Figure 3.1. Reinforcement Learning and Dynamic Programming comparison (adopted from Gosavi, 2003): RL takes the distributions of the random variables as inputs, uses no model (a simulator or the real system), and applies an RL algorithm (Q-learning/SARSA) to obtain an approximate, near-optimal solution; traditional DP builds a model (transition probability and transition reward matrices) and applies a DP algorithm (policy iteration/value iteration or LP) to obtain the optimal solution.]

3.3. Reinforcement Learning Problem Formulation

RL concerns the problem of a learning agent that relies on experience gained from interacting with its environment to improve the performance of a system over time. The learner (or decision maker) is called the agent, and the object it interacts with is called the environment. The environment could be a simulator of the system or the real system. Both the agent and the environment constitute a dynamic system.
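The following toy sketch illustrates the agent-environment loop of Figure 3.2 with a deliberately simplified single-reservoir environment and a random placeholder policy. The reset/step interface, the dynamics, and the reward are invented for illustration and are not the thesis model.

```python
import random

class ReservoirEnv:
    """Toy single-reservoir environment; dynamics and rewards are invented
    purely to illustrate the agent-environment loop of Figure 3.2."""
    def __init__(self, s_max=10, inflow=(0, 1, 2), price=5.0):
        self.s_max, self.inflow, self.price = s_max, inflow, price
        self.storage = s_max // 2

    def reset(self):
        self.storage = self.s_max // 2
        return self.storage                      # state observed by the agent

    def step(self, release):
        release = min(release, self.storage)     # cannot release more than stored
        reward = self.price * release            # revenue from generation
        q = random.choice(self.inflow)           # random natural inflow
        self.storage = min(self.storage - release + q, self.s_max)  # spill above s_max
        return self.storage, reward              # next state and reward signal

env = ReservoirEnv()
state = env.reset()
total = 0.0
for t in range(12):                              # one year of monthly decisions
    action = random.randint(0, 3)                # placeholder policy (random agent)
    state, reward = env.step(action)
    total += reward
print("cumulative reward of the random policy:", total)
```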
Unlike supervised learning techniques, which require examples of input-output pairs of the desired response to be provided explicitly by a teacher, the learning agent is not told what to do. Instead, the agent must continuously learn from the consequences of its actions. The agent perceives how well its actions perform by receiving a reward signal. This reward signal indicates what the agent should do, or how to modify its behavior, without specifying how to do it. The agent uses this signal to determine a policy that leads to achieving a long-term objective. This trial and error interaction with the environment and the delayed rewards are the two main distinguishing features of the reinforcement learning method.

3.3.1. RL Basic Elements

The main components of an RL algorithm are: an agent, an environment, and a reward function. The interaction between the agent and its environment can be modeled within the framework of Markov decision processes. The agent and the environment interact in a sequence of discrete time steps, t = 0, 1, 2, 3, .... At each step the agent receives some indication of the current state of the environment, s_t ∈ S, where S is the set of all states, and then it selects an action a_t ∈ A, where A is a finite set containing all possible actions. The agent interacts with the environment and receives feedback in the form of a stochastic transition to a new state s_{t+1} and a numerical reward r(s_t, a_t) as defined by the reward function. Through this delayed reward and guided search process, the agent learns to take appropriate actions that maximize the cumulative rewards over time (Sutton, 1999). A schematic representation of the agent-environment interaction is presented in Figure 3.2.

[Figure 3.2. Agent-environment interaction in RL: the environment (simulator or real system) returns the state s_t and reward r_t to the agent (decision maker), which responds with action a_t.]

3.3.2. Exploration/Exploitation

One of the key aspects of reinforcement learning is that the learner needs to explicitly explore its environment in its search for good rewards. The feedback that the agent receives from its environment indicates how good the action was, but it does not indicate whether it was the best or the worst action possible (Sutton and Barto, 1998). Accordingly, two conflicting objectives arise during the action selection process. One objective is to achieve high-valued short-term rewards by selecting actions that are already known to be good (exploit). On the other hand, the agent has to explore new actions for better action selections in the future.

Two popular methods for balancing exploration and exploitation in RL are the ε-greedy and softmax action selection rules. In the ε-greedy method, the learner behaves in a greedy way most of the time (by selecting the action with the highest estimated value), while a non-greedy exploratory action is taken every now and then with a small probability ε (by selecting a random action uniformly, regardless of its value). The advantage of the ε-greedy method is that, in the limit and as the number of trials increases, every action is sampled an infinite number of times, and the probability of selecting the optimal action converges to a value greater than 1−ε (Sutton and Barto, 1998). The disadvantage of the ε-greedy method, however, is that in the exploration process it chooses equally among all actions, regardless of the estimated value of the chosen action.
The result is that the learner could choose equally between the worst action and the second best action in the exploration process.

In the softmax action selection method, the action selection probabilities are ranked according to their estimated values. The greedy action is then selected as the one with the highest action selection probability, and all other actions are weighted proportionally to their estimated action values. Frequently, the softmax method uses a Gibbs or Boltzmann distribution to choose the probability of an action a, P(a):

P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b=1}^{n} e^{Q(b)/\tau}}        (3.1)

where Q(a) and Q(b) are the estimated action values and τ is a temperature parameter that controls the probability distribution. Higher temperatures result in actions with nearly equal probability of selection (exploration). On the contrary, low temperatures cause a greater difference in the probability of selecting actions according to their estimated values (exploitation). Gradually, the temperature τ is decreased over time to limit the exploration process.
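A minimal sketch of the two action-selection rules just described, assuming a vector of estimated action values; the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax(q_values, tau=1.0):
    """Boltzmann/Gibbs selection of equation (3.1): actions with higher
    estimated values get proportionally higher selection probabilities;
    tau is the temperature controlling exploration."""
    prefs = np.exp((q_values - np.max(q_values)) / tau)   # shift for numerical stability
    probs = prefs / prefs.sum()
    return int(rng.choice(len(q_values), p=probs))

q = np.array([1.0, 2.5, 0.3, 2.4])        # illustrative action-value estimates
print("epsilon-greedy pick:", epsilon_greedy(q, epsilon=0.1))
print("softmax pick (tau=0.5):", softmax(q, tau=0.5))
```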
3.3.3. Return Functions

One can distinguish two main types of RL tasks: episodic and continuous tasks. In episodic tasks, the horizon represents a finite number of steps in the future, and there is a terminal state where the episode ends. On the other hand, in continuous tasks the horizon represents an infinite sequence of interactions between the environment and the agent. The goal of the agent is to maximize the accumulated future rewards, and the return function R_t is a long-term measure of such rewards. In the case of finite horizon tasks, the return is the sum of the rewards from the beginning to the end of the episode:

R_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T        (3.2)

where T is the number of stages in the episode and R_t is the return received after time step t. For continuous tasks, the infinite horizon discounted model takes the long-term rewards into account by discounting the rewards received in the future by a discount factor γ, where 0 < γ < 1. The return function then becomes:

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}        (3.3)

where k is the number of time steps in the future. The discount rate determines the present value of future rewards. If γ = 0, the agent is said to be myopic and only considers immediate rewards. As γ increases, the returns increase and the agent gives more consideration to future rewards. In the mathematical formulation of the reservoir operation planning model presented in Chapters 4 and 5, it is assumed that the current period rewards are realized at the end of each time period; accordingly, the discount factor γ is applied to both the present period and the future rewards. In the following sections the focus will be on discounted rewards, as this approach is more appropriate to reservoir operation problems.

3.3.4. Markovian Decision Process (MDP)

RL relies on the assumption that the system dynamics can be modeled as a Markovian decision process (MDP). An environment is said to satisfy the Markovian property if the signal from its state completely captures and summarizes the past outcomes in such a way that all relevant information is retained. In general, the response of the environment at time t+1 to an action taken at time t depends on past actions. The system dynamics can then be defined by specifying the complete probability distribution:

\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\}        (3.4)

for all s', r and all possible values of the past events: s_t, a_t, r_t, ..., r_1, s_0, a_0.

The MDP framework has the following elements: the state of the system, actions, transition probabilities, transition rewards, a policy, and a performance metric (return function). An MDP involves a sequence of decisions in which each decision affects what opportunities are available later. The Markov property means that the outcome of taking an action in a state depends only on the current state. If the state and action spaces in an MDP are finite, then it is called a finite Markov decision process.

If the state signal has the Markov property, then the environment's response at t+1 depends only on the state and action representations at t. The environment's dynamics of the MDP can be defined by the transition probability:

P_{ss'}^{a} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}        (3.5)

where P_{ss'}^{a} is the probability of moving to state s' for a given state s and action a, and the expected value of the immediate reward:

R_{ss'}^{a} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}        (3.6)

These two quantities, P_{ss'}^{a} and R_{ss'}^{a}, completely specify the most important aspects of the dynamics of an MDP.

3.4. Reinforcement Learning Algorithms

The goal of RL algorithms is either to evaluate the performance of a given policy (prediction problems) or to find an optimal policy (control problems). In prediction problems, the value function (state-value function) for a given policy π is estimated as follows:

V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}        (3.7)

where the policy π is a mapping from states s ∈ S to the probability of selecting each possible action. This mapping is called the agent's policy π, where π(s,a) is the probability of taking action a when the agent is in state s. In control problems, the value function (action-value function) for policy π is defined as follows:

Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}        (3.8)

where Q^{\pi}(s,a) is defined as the expected return starting from state s, taking action a, and thereafter following policy π. The recursive relationship between the value of state s and the values of its possible successor states is:

V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = \sum_{a} \pi(s,a) \sum_{s'} P_{ss'}^{a} [R_{ss'}^{a} + \gamma V^{\pi}(s')], \quad \forall s \in S        (3.9)

Equation (3.9) is the Bellman equation, which states that the value of state s must be equal to the discounted value of the expected next state plus the expected reward along the way. The value function V^{\pi}(s) is the unique solution to the Bellman equation. The policy π is better than the policy π' if V^{\pi}(s) ≥ V^{\pi'}(s) for all s ∈ S. The optimal policy, which has a better value function than all other policies, is defined by:

V^{*}(s) = \max_{\pi} V^{\pi}(s), \quad \forall s \in S        (3.10)

The optimal action-value function Q*(s,a) in terms of V*(s) is:

Q^{*}(s,a) = E\{r_{t+1} + \gamma V^{*}(s') \mid s_t = s, a_t = a\} = \sum_{s'} P_{ss'}^{a} [R_{ss'}^{a} + \gamma V^{*}(s')]        (3.11)

Once we have the optimal value function V* for each state, the actions that appear best after a one-step search will be optimal actions. Hence, a one-step-ahead search yields the long-term optimal actions. With Q*, the agent does not even have to do a one-step-ahead search: for any state s, it can simply find the action that maximizes Q*(s,a). The action-value function effectively memorizes or stores the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair (Sutton and Barto, 1998).
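To make the one-step-ahead search concrete, the sketch below applies the Bellman optimality backup of equations (3.10)-(3.11) to a tiny invented MDP and then reads the greedy policy off the resulting Q* values; this is, in effect, the value iteration algorithm discussed in Section 3.4.1 below. All transition probabilities and rewards are made up for illustration.

```python
import numpy as np

# A tiny, invented MDP: 3 states x 2 actions, with transition probabilities
# P[a, s, s'] and expected immediate rewards R[a, s, s'] as in (3.5)-(3.6).
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],   # action 0
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.1, 0.9]]])  # action 1
R = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 2.0], [0.0, 0.0, 3.0]],
              [[0.5, 2.0, 0.0], [0.0, 0.5, 4.0], [0.0, 0.0, 5.0]]])
gamma, n_states = 0.9, 3

# Repeatedly apply the Bellman optimality backup until the values converge.
V = np.zeros(n_states)
while True:
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q*(s,a) backup of equation (3.11)
    V_new = Q.max(axis=0)                   # V*(s) = max_a Q*(s,a), equation (3.10)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=0)                   # greedy one-step-ahead action per state
print("V*:", V.round(3))
print("greedy policy (action index per state):", policy)
```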
The following sections describe the three fundamental classes of methods for solving the RL problem: dynamic programming (DP), Monte Carlo techniques (MC), and temporal-difference learning methods (TD). Dynamic programming is a planning method, which requires a model of the environment (model-based), whereas the MC and TD methods are learning methods, which can learn solely from experience without using a model of the environment (model-free).

3.4.1. Dynamic Programming Methods (DP)

Dynamic programming provides the theoretical foundation of RL algorithms. DP methods are used to solve MDPs, assuming a perfect knowledge of the model of the environment. The key idea of DP methods is the recursive use of value functions and the Bellman equation of optimality to guide the search for optimal policies. DP methods are bootstrapping methods, as they update one estimate of the value function based on the estimates of successor states. The two most widely used DP methods for calculating the optimal policies and value functions are policy iteration and value iteration. The following is a brief overview of these two methods.

Policy evaluation refers to the iterative computation of the value function V^π for a given policy π. Initial values V_0 are assumed for all states, and successive approximations of the value function are obtained by applying the Bellman equation as a recursive update rule:

V_{k+1}(s) \leftarrow \sum_{s'} P_{ss'}^{a} [R_{ss'}^{a} + \gamma V_k(s')], \quad \forall s \in S, \; a = \pi(s), \; k = 0, 1, 2, \ldots        (3.12)

In practice, and for practical considerations, a stopping criterion for the iterative process is commonly applied when the term \max_{s \in S} |V_{k+1}(s) - V_k(s)| is sufficiently small (Sutton and Barto 1998).

The estimated action values are used as a basis to find a better policy. If the action value Q^π(s,a) > V^π(s) for some a ≠ π(s), then action a is taken and the policy is changed to π', where π'(s) = a. This process of taking greedy actions with respect to the current policy is called policy improvement. The new greedy policy π' is given by:

\pi'(s) \leftarrow \arg\max_{a} \sum_{s'} P_{ss'}^{a} [R_{ss'}^{a} + \gamma V^{\pi}(s')]        (3.13)

where \arg\max_a denotes the action a at which the value function is maximized. Combining the policy evaluation step with the policy improvement step yields the policy iteration algorithm. Thus we can obtain a sequence of improved policies and value functions: \pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \cdots \xrightarrow{I} \pi^{*} \xrightarrow{E} V^{*}, where E denotes policy evaluation and I denotes policy improvement.

Another way of solving MDPs is the value iteration algorithm. Similar to the policy iteration method, value iteration also combines the policy improvement and policy evaluation steps. However, in the value iteration algorithm the policy evaluation step is truncated after one sweep over the state space and is followed by a policy improvement step. Value iteration estimates the optimal policy directly by taking the maximum over all actions:

V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P_{ss'}^{a} [R_{ss'}^{a} + \gamma V_k(s')], \quad \forall s \in S        (3.14)

The policy iteration and value iteration methods converge in the limit to the optimal value function V*(s) due to the contraction property of the operator in (3.14) (Bertsekas and Tsitsiklis, 1996).

3.4.2.
Monte Carlo Methods (MC)  A s stated earlier, D P methods require that a model o f the environment be available, including transition probabilities and the expected immediate rewards. However, in many cases the exact model o f the system is not known and in other cases, such as large scale systems, constructing the model could be a difficult task indeed. In such cases, learning the value function and the optimal policies directly from experience could be more efficient. M C methods estimate the value function from the experience o f the agent. This  45  experience is gained b y sampling sequences o f states, actions, and rewards from on-line or simulated interaction with an environment.  M C methods use the mean return o f many random samples to estimate the expected value function V and the action-value function Q . A s more samples are observed, their K  K  average return converges to the expected value o f the value function. A sample return is the sum o f the rewards received starting from state s or a state-action pair (s, a), and that follows policy K until the end o f a learning episode. A s complete returns can only be obtained at the end o f such episodes, M C methods can only be defined for finite horizon tasks.  One can design M C control methods b y alternating between policy evaluation and policy improvement for the complete steps o f the episode. Observed returns at the end o f the episode are used for policy evaluation and then for improving the policy o f all visited states i n the episode. There is one complication that arises i n this method however; the experience gained b y interaction with the environment contains samples only for the actions that were only generated b y policy TC but the values o f all other possible actions are not included i n the estimate. Those values are needed for comparing alternatives i n the policy improvement step. Therefore, maintaining sufficient exploration is a key issue in M C control methods. There are two approaches to assure that the agent is selecting all actions often, namely on-policy and off-policy control methods.  In on-policy control method, the agent uses a soft stochastic policy meaning that 7t(s,a) > 0 for all <= 5 and all ae A to evaluate and improve the performance o f the s  same policy. The agent commits to continuous exploration and tries to find the best policy in the process.  The other approach is the off-policy method: the agent uses one policy to interact with the environment and generates a behavior policy. Another policy which is unrelated to the behavior policy is evaluated and improved, and is called the estimation policy. A n advantage o f this approach is that the agent learns a deterministic (e.g., greedy) optimal  46  policy (estimation policy) while following an arbitrary stochastic policy (behavior policy) thereby, ensuring sufficient exploration in the process.  M C methods differ from D P methods in two main ways (Sutton and Barto 1998): First, they operate on sample experience. Therefore, they can be used for direct learning from interaction with the environment without a model. Second, they do not bootstrap; i.e. they do not build their value estimates for one state on the basis o f the estimates o f successor states.  3.4.3.  Temporal Difference (TD)  Temporal difference (TD) learning methods represent a central and key idea to R L . 
TD methods are a class of incremental learning procedures specialized for prediction, in which credit is assigned based on the difference between temporally successive predictions (Sutton 1988).

The TD methods combine the ideas of the DP and MC methods. Similar to MC methods, the TD approach can learn directly from real or simulated experience without a model of the environment. TD methods share with DP the bootstrapping feature in estimating the value function (Sutton and Barto 1998).

However, TD methods have some advantages over the DP and MC methods. TD methods use sample updates instead of the full updates used in DP methods. The agent observes only one successor state while interacting with the environment, rather than using the values of all possible successor states weighted according to their probability distributions. Accordingly, learning the optimal policy from experience does not require constructing a model of the environment's dynamics.

In addition, unlike MC methods, TD methods do not need to wait until the end of the episode to update their estimate of the value function. The simplest versions of such algorithms are usually referred to as one-step TD or TD(0). TD(0) waits only until the next time step to update its estimate of the value function, based on the value of the observed next state V(s_{t+1}) and the immediate reward r_{t+1} received from the environment. The following update is performed on each time step:

V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]        (3.15)

where α is a step-size parameter, 0 < α ≤ 1, which represents the learning rate. The update rule presented above computes a stochastic approximation of V^π, which states that:

New estimate \leftarrow Old estimate + Step size \cdot [Target - Old estimate]        (3.16)

The target for the TD update is r_{t+1} + γV(s_{t+1}). The term [Target − Old estimate] represents the error in the estimate, or the temporal difference between two successive evaluations of the value function. This error is reduced by taking a step toward the target. The step size can be defined as 1/n, where n is the number of samples generated. This stochastic approximation algorithm, which produces a sequence of estimates of V(s_t) such that the error → 0, is based on an old algorithm (Robbins and Monro, 1951).

For any fixed policy π, the TD algorithm is proven to converge to V^π in the limit with probability 1 if the step size decreases according to the stochastic approximation conditions \sum_{n=1}^{\infty} \alpha_n = \infty and \sum_{n=1}^{\infty} \alpha_n^2 < \infty. The first condition guarantees that the steps are large enough to overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence (Sutton and Barto 1998).

In the case of the control problem, i.e. estimation of the action values Q(s,a), TD methods are used for the evaluation or prediction part of the policy iteration algorithm. As with the MC methods, sufficient exploration is required to assure convergence, which again can be achieved by applying either the on-policy or the off-policy approach.
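A minimal sketch of the tabular TD(0) update of equation (3.15), applied to a classic five-state random-walk prediction task under a fixed random policy; the task and parameters are illustrative and not drawn from the thesis.

```python
import random

# TD(0) evaluation of a fixed policy on a small Markov chain: a 5-state
# random walk with terminal ends, reward +1 only on reaching the right end.
n_states, alpha, gamma = 5, 0.1, 1.0
V = [0.0] * (n_states + 2)          # states 0 and 6 are terminal

for episode in range(5000):
    s = 3                            # start in the middle state
    while s not in (0, n_states + 1):
        s_next = s + random.choice((-1, 1))      # the fixed (random) policy
        r = 1.0 if s_next == n_states + 1 else 0.0
        # One-step TD update of equation (3.15):
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V[1:-1]])   # tends toward 1/6, 2/6, ..., 5/6
```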
3.4.3.1. SARSA

The name SARSA is attributed to the update rule applied in this algorithm, which follows the sequence of events: current state (s_t), current action (a_t), resulting reward (r_{t+1}), next state (s_{t+1}), and next action (a_{t+1}). SARSA is an on-policy TD control algorithm that learns and improves Q^π for the policy π that it uses to select its actions; π is ε-greedy with respect to the Q-values estimated so far. The update rule performed at each time step is:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]        (3.17)

This update is done after every transition from a non-terminal state s_t. If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) is defined as zero. Similar to TD(0), SARSA converges with probability 1 to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy.

3.4.3.2. Q-Learning

Q-learning, first introduced by Watkins (1989), is regarded as one of the breakthroughs in RL. The simplest form of the Q-learning algorithm, the one-step tabular Q-learning, is based on the temporal difference method TD(0). In this method, the elements of the estimated action-value function are stored in a so-called Q-table. The agent uses the experience from each state transition to update one element of the table (Sutton 1999).

The Q-table has an entry, Q(s,a), for each state s and action a. After taking action a_t, the system is allowed to transition from state s_t to the next state s_{t+1}. The immediate reward received as feedback from the environment is used to update the Q-table for the selected action. The next action to execute is selected according to the ε-greedy policy: the action with the maximum estimated value is selected most of the time and, with a small probability ε, a random exploratory action is selected.

The following is a procedural list of the Q-learning algorithm:

Initialize Q(s,a) arbitrarily (to any feasible values)
Repeat for each episode:
    Initialize s
    Repeat for each step in the episode:
        Choose a from s using the policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Q(s,a) ← Q(s,a) + α [r + γ max_{a'} Q(s',a') − Q(s,a)]
        s ← s'
    until s is terminal

If every state-action pair is visited infinitely often and the learning rate is decreased over time, the Q-values converge to Q* with probability 1 (Sutton and Barto, 1998). Q-learning is an off-policy algorithm in the sense that the agent learns the value of the optimal policy while following an arbitrary stochastic behavior policy that is independent of the policy being learned. An example including sample numerical calculations using the Q-learning method is presented in Appendix A.
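The sketch below implements the tabular Q-learning procedure listed above for a toy discretized reservoir with ε-greedy action selection; the environment, prices, and parameters are invented. A comment notes how the update would change for the on-policy SARSA algorithm of equation (3.17).

```python
import random
from collections import defaultdict

# A minimal tabular Q-learning sketch following the procedural list above.
# The toy environment (a discretized single reservoir with random inflow)
# and all numbers are invented for illustration.
S_MAX, ACTIONS, PRICE = 10, (0, 1, 2, 3), 5.0
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = defaultdict(float)                       # Q-table, keyed by (state, action)

def step(storage, release):
    release = min(release, storage)          # cannot release more than stored
    reward = PRICE * release
    storage = min(storage - release + random.choice((0, 1, 2)), S_MAX)
    return storage, reward

def choose_action(s):
    if random.random() < epsilon:            # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])   # exploit

for episode in range(2000):
    s = S_MAX // 2
    for t in range(24):                      # finite-horizon episode
        a = choose_action(s)
        s_next, r = step(s, a)
        # Off-policy update: bootstrap on the best next action (Q-learning).
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # For SARSA (on-policy), best_next would instead be Q[(s_next, a_next)],
        # where a_next is the action actually chosen by the epsilon-greedy policy.
        s = s_next

greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(S_MAX + 1)}
print("greedy release per storage level:", greedy)
```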
This type of learning is denoted the TD(λ) algorithm, where λ is an eligibility trace parameter (trace-decay parameter) with 0 ≤ λ ≤ 1. The TD(λ) algorithm averages the n-step backups, each weighted proportionally to λ^{n−1}. The backup toward this λ-return is defined as:

R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^n     (3.20)

It is clear from the above equation that by setting λ = 1 we get the MC updates, whereas by setting λ = 0 we get the one-step TD(0) updates.

V_{t+1}(s_t) ← V_t(s_t) + α [ R_t^λ − V_t(s_t) ]     (3.21)

A simpler and more efficient way of implementing the TD(λ) method is the backward-view TD(λ) learning algorithm. This algorithm introduces an additional memory variable associated with each state at each time step, called the eligibility trace e_t(s). An eligibility trace is a temporary record of the occurrence of an event, such as visiting a state or taking an action. This variable specifies the eligibility of a particular event for updating the value function. At each time step, the eligibility traces of all states decay by γλ, except for the state visited at that time step, whose trace is also incremented by 1:

e_t(s) = γ λ e_{t−1}(s)            if s ≠ s_t
e_t(s) = γ λ e_{t−1}(s) + 1        if s = s_t     (3.22)

In other words, eligibility traces can be thought of as weighted adjustments to predictions occurring n steps in the past; more recent predictions make greater weight changes. The TD(0) error, or temporal difference, at time step t is denoted δ_t and is calculated as:

δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)     (3.23)

On every time step, all visited states are updated in proportion to their eligibility traces:

V_{t+1}(s) ← V_t(s) + α δ_t e_t(s)     (3.24)

Although eligibility traces require more computation than TD(0) methods, they offer significantly faster learning, particularly when rewards are delayed by many steps. TD(λ) is proven to converge under stochastic approximation conditions with probability 1 (Tsitsiklis, 1994). Figure 3.3 presents a schematic comparison of the DP, MC, and TD backup methods and the calculation of the value function.

Figure 3.3. Comparison of the backups used by dynamic programming, simple Monte Carlo, and temporal-difference methods (adapted from Sutton and Barto 1998).

3.5. Function Approximation

When the state space is finite, the most straightforward DP approach is to use a lookup table to store the value function for each state or state-action combination. In reality, the state space can be quite large or even infinite and may include continuous variables. By using model-free RL methods one can avoid the need to construct the transition probability matrix. However, this does not solve the problem completely, as the large memory and the long time needed to fill in the elements of the lookup tables still represent a problem. In such cases the use of lookup tables does not yield practical results.

RL overcomes this problem by applying generalization and function approximation techniques. Here, estimating the Q-values for unvisited state-action pairs requires generalization from the states that have already been visited. Function approximation can be done in a number of ways, such as: (1) function fitting (neural networks and regression), (2) function interpolation (K-nearest-neighbors and kernel methods), and (3) state aggregation (Gosavi 2003). Watkins (1989) used the Cerebellar Model Articulation Controller (CMAC) and Tesauro (1995) used back-propagation for learning the value function in backgammon.
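Before the function-fitting example below, the tabular case of the preceding section is worth making concrete: the following sketch implements the backward-view TD(λ) updates of Equations (3.22)–(3.24) with accumulating traces, and it is exactly this per-state table that function approximation replaces when the state space grows. The environment and policy interfaces are assumptions for illustration.

    from collections import defaultdict

    def td_lambda_evaluate(env, policy, episodes=500, gamma=0.95, lam=0.8, alpha=0.05):
        # Backward-view TD(lambda) prediction with accumulating eligibility traces.
        V = defaultdict(float)
        for _ in range(episodes):
            e = defaultdict(float)                   # eligibility traces, reset each episode
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # Eq. (3.23)
                e[s] += 1.0                                               # Eq. (3.22), visited state
                for state in list(e.keys()):
                    V[state] += alpha * delta * e[state]                  # Eq. (3.24)
                    e[state] *= gamma * lam                               # decay all traces
                s = s_next
        return V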
A s an example, consider applying function fitting techniques for a M D P with A actions i n each state, for state s e S:  (3.25)  Q(s,a) -f {s) a  The idea is to store the g-factors for a given action as a function o f the state index. Assumming s is a scalar; the function f (s) can be approximated by: a  f (s)=A+Bs a  + C.s  2  (3.26)  Thus instead o f storing each Q-value for action a, we only need to store the values for the set o f parameters: A, B, C. The Q-value for state s and action a is represented by the  54  value f (s). Obviously, less storage space is needed and a large state space can thus be a  handled.  3.6. On-Line and Off-Line R L methods can be implemented in two modes: on-line and off-line. The on-line mode consists o f using an R L driven agent directly on the real system. This mode is particularly interesting when it is difficult to model the system or when some phenomena are difficult to reproduce in a simulation environment. Moreover, and as the agent is learning continuously, it can adapt quickly to changing operating conditions. The main drawback o f the on-line mode is that the agent may jeopardize system stability because at the beginning o f the interaction no experience is available to the R L agent to adequately control the system.  One solution to this problem is to first let the agent interact with a simulation environment (off-line mode). The R L agent then can be implemented on the real system where it would benefit from the experience it has acquired in the simulation environment and w i l l be able to improve its behavior from interaction with the real system. Alternatively, one may extract off-line learned policies and implement them in the real system without any further learning.  On-line and off-line implementation o f R L should be differentiated from the on-line and off-line R L algorithms. In on-Line R L algorithms, similar to S A R S A , the agent is learning and improving the same policy that it is following in selecting the actions. O n the other hand, with an off-policy algorithm such as Q-learning, the agent is gaining useful experience even while exploring actions that may later turn out to be non-optimal.  55  3.7. Summary This chapter provided an overview o f R L and its main algorithms. The classification of the methods presented is intended to give the reader an idea about the  different  dimensions o f R L . B y using sampling and function approximation techniques, R L has the potential to be applied to larger systems than any other classical optimization technique. Modern reinforcement learning methods could be applied to both trial and error learning without a formal model o f the environment, and to planning activities with a formal model o f the environment, where an estimate o f the state-transition probabilities and immediate expected rewards can be easily evaluated.  The introduction o f R L into the water resources systems domain is relatively new. However, the advantages that R L offers in dealing with large-scale problems, makes it a promising area o f research i n that field. A R L based approach is adopted in this research work to develop a stochastic optimization model for the solution o f the multireservoir operation planning problem as described i n the following chapter.  56  4.  R E I N F O R C E M E N T L E A R N I N G MULTIRESERVOIR  OPTIMIZATION M O D E L (RLROM) D E V E L O P M E N T 4.1. 
Introduction In general, the reservoir operation planning problem for a hydro dominated power system, such as the B C Hydro system, is to find the optimal operation strategy for each time period during the planning horizon that maximizes the long term value o f resources while serving the domestic load in British Columbia and thereby to minimize the cost o f electricity to the ratepayers in the Province. This objective can be accomplished by coordinating and optimizing the use o f all available generation resources while taking advantage o f market opportunities.  Releasing more water now could result in high immediate benefits, but with less water left in storage for the future there would be less benefit in the future. O n the contrary, releasing less water now could result in gaining more benefits in the future. Accordingly, the decisions taken in any given planning period w i l l affect both the present return and the opportunities that could be available in the future. The challenge then is to link the present decisions with their future consequences. The approach followed in this research work for solving this problem relies on the concept o f the marginal value o f water ( M V W ) . B y definition, the M V W represents the incremental value o f water in storage expressed as dollar per cubic meter second-day ($/cms-day).  Optimal dispatch from hydro plants is established when the trade-offs between the present benefits, expressed as revenues from market transactions, and the potential expected long-term value o f resources, expressed as the marginal value o f water stored i n the reservoirs, are equal. Accordingly, as long as the value o f releasing water is higher than the value o f storing water in the reservoir then the operator's optimal planning decision is to continue generation.  57  Another implication o f applying the water value concept is that hydro plants are dispatched based on their expected water value in storage. W i t h i n the B C Hydro system, the multiyear storage capability o f the G M Shrum D a m on the Peace River and M i c a D a m on the Columbia River projects and their large production capabilities dictate the need for a much higher level o f coordination in planning the operation o f these two particular  basins,  than  with  smaller  projects.  Ignoring  inter-basin  and  system  dependencies on the Peace and Columbia basins could lead to unrealistic operations modeling for the entire system. For this reason, it is important to be able to model the Peace and the Columbia basins and their interaction with the electricity markets in detail to truly reflect an optimal integrated system operation. The two reservoirs operation cannot be optimized separately as the benefits obtained from the operation o f one reservoir cannot be directly specified as a function o f the storage level in that reservoir alone. Rather, it is a function o f both plants ( G M Shrum and Mica). The challenge then is to model this large-scale multireservoir system in an integrated manner while addressing the different sources o f uncertainty in a way that a large-scale stochastic program can handle.  To solve this large-scale problem, a stochastic optimization model for the operations planning o f a multireservoir hydroelectric power generation system is proposed. In this research work, the methodology adopted follows a Reinforcement learning ( R L ) artificial intelligence approach. 
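As a purely illustrative reading of this trade-off, the dispatch test can be written as a comparison between the expected market price and the marginal value of water converted to energy terms. The conversion factor and all numbers below are assumptions made for the sketch only; they are not BC Hydro data and this is not the model's actual dispatch logic.

    def dispatch_decision(market_price, mvw_storage, hk):
        # market_price : expected market price, $/MWh
        # mvw_storage  : marginal value of water in storage, $/cms-day
        # hk           : assumed conversion factor, MWh generated per cms-day released
        mvw_energy = mvw_storage / hk          # $/MWh equivalent of keeping the water stored
        return market_price >= mvw_energy      # True -> releasing/generating is favoured

    # Made-up example: water valued at 600 $/cms-day, 12 MWh per cms-day,
    # market at 55 $/MWh -> 55 >= 50, so continued generation is favoured.
    print(dispatch_decision(55.0, 600.0, 12.0))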
The following sections describe the methodology followed in the development o f the proposed Reinforcement Learning Reservoir Optimization M o d e l ( R L R O M ) . The details o f the mathematical formulation and the solution algorithm o f the reservoir optimization problem are presented in the following sections.  58  4.2. Mathematical Formulation The primary objective o f the proposed R L R O M model is to develop a set o f system operation planning strategies that maximize the overall value o f B C Hydro resources at every time period over a planning time horizon given the uncertainty i n the forecasted domestic load, the forecasted market prices, and the random natural inflows. The main operation planning decisions consist of: the timing and quantities o f import and export, in addition to the timing, location and quantity o f water to store or draft from the reservoirs. Another significant outcome o f the operation planning model is to establish the M V W in storage for the main multiyear storage reservoirs i n the B C Hydro system. The established M V W obtained from the R L R O M  model can potentially be used for  estimating target storage values in the medium term optimization models. Moreover, the calculated marginal values can be used i n making tradeoff decisions using the clearing prices for short term wholesale energy market transactions. Marginal values o f water are determined from the derivative o f the value function with respect to storage ($/cms-d).  The following notation is used in the mathematical formulation o f the R L model: J  Number o f reservoirs included in the R L M o d e l , where {j e l,...,J} ,  Q  Number o f scenarios o f random variables, where {a)€ 1,...,Q},  T  final time period in a cycle (iteration or episode),  N  Iteration number or the age o f the agent,  N  M a x number o f iterations,  {Sj}  A set o f discretized storage volume for reservoir j, where Sj = {s\,s j,...,s" }  max  2  J  in cms-d, rij  Number o f state discretization for reservoir j ,  d  Number o f actions,  {A}  Set o f actions, where A ={a],a?,...,af}, t  59  t  Current time period (stage) in a cycle (iteration or episode), where t e {1,..., T},  Sj, Sj ,  Storage volume in reservoir j at beginning o f the period t in cms-d, +1  s'  Storage volume in reservoir j at end o f the period t in cms-d, State vector  representing  increments  of  the  a possible  different  combination  reservoirs  within  o f different the  storage  state  space  where,s = (s[ ,s'^,..,s'J), with z\ e {\,...,rij}forj=l,..., J i n cms-d, ]  a,  Action (decision variable): forward sales (pre-sales) at time period t, where  a  e { ai,ci2,..., ad} i n M W h ,  Ij  Local natural inflow to reservoir j in period t for scenario ax where  t  a) e {1,.., £2} i n cms, L®  System load at time period t for scenario co in M W h ,  P®  Market price at time period t for scenario coin $-Canadian/MWh,  Q  Turbine releases from reservoir j during period t in cms,  Q  Spill flow from reservoir j during period t in cms,  T  s  r (s\ a) Rewards o f taking action a and transition to state t  at end o f period t i n $-  t  Canadian, Q, (s\a) Q-value function when the system state is s' and action a is selected at the t  t  beginning o f time period t in cms, /, (s') Expected value function when the system state is s' in the beginning o f time t  period (stage) t in $-Canadian, y  Monthly discount factor o f future rewards,  cXn  Learning rate (step size) parameter i n iteration n, where n e {1,..., N},  e  Vector o f random 
events that influence the operation of the multireservoir system, where e_t = (I_t^ω, L_t^ω, P_t^ω),

ε_n      Exploitation rate in iteration n (probability of selecting the greedy action in the ε-greedy policy),

π        Policy, decision-making rule,

π(s^t)   Action taken in state s^t under deterministic policy π, and

π(s^t, a) Probability of taking action a in state s^t under stochastic policy π.

4.2.1. SDP Formulation

One possible way to solve this reservoir optimization Markovian Decision Problem (MDP) is to apply one of the traditional SDP techniques, such as the value iteration technique. The formulation involves a sequence of decisions in which each decision affects what opportunities are available in the future. For any given time period and at any starting state, the outcome of applying an action depends only on the current state; furthermore, the effects of the outcomes are not deterministic.

The basic structure of the SDP algorithm can be characterized by:

t     discrete time,
s^t   state variable,
a_t   control; a decision to be selected from a given set,
r_t   reward signal,
e_t   set of random variables, and
T     horizon.

The mathematical formulation of the reservoir optimization problem can be expressed as follows:

Objective function:

f_t(s^t) = max_{a_t} E{ γ [ r_t(s^t, a_t, e_t) + f_{t+1}(s^{t+1}) ] }     (4.1)

Constraints:

s_{j,t+1} = s_{j,t} + I_{j,t}^ω − Σ_{j′} ( C^T_{jj′} Q^T_{j′,t} + C^S_{jj′} Q^S_{j′,t} )     ∀ j ∈ J     (4.2)

s_{j,t}^{min} − ΔS_{j,t} ≤ s_{j,t} ≤ s_{j,t}^{max}     ∀ j ∈ J     (4.3)

a_t^l ≤ a_t ≤ a_t^u     ∀ a_t ∈ A     (4.4)

where C^T_{jj′} and C^S_{jj′} are the elements of the matrices representing the hydraulic connection between the reservoirs in terms of the turbine outflow and the spill outflow, respectively, from reservoir j′ to reservoir j: C^T_{jj′} = C^S_{jj′} = 0 if there is no physical hydraulic connection between the reservoirs, C^T_{jj′} = C^S_{jj′} = 1 if j = j′, and C^T_{jj′} = C^S_{jj′} = −1 for all j′ ≠ j where reservoir j is physically connected to reservoir j′.

For any given system state, the objective is then to maximize the expected long-term value of resources resulting from the reservoir operating decisions taken in the current period and from decisions that could be made in the future.

Equation (4.2) represents the set of constraints describing the continuity of flow for each reservoir j, considering the physical setting of any reservoir j′ within the river systems. In case the storage exceeds the maximum limit, the reservoir spills. The minimum storage is treated as a soft constraint, and storage below the minimum storage limit is penalized.

Equation (4.3) represents the upper and lower limits on storage. Equation (4.4) represents the upper and lower limits on the decision variable.

4.2.2. RL Formulation

Similar to the conventional SDP methods, the mathematical formulation of the RL model is cast as a Markovian Decision Problem. The theoretical basis of the RL approach, as presented in the previous chapter, formed the foundation for the methodology adopted in the development of the reinforcement learning multireservoir optimization model (RLROM). In the proposed algorithm, the agent (controller) first interacts with a simulation environment that models the real reservoir system to learn the optimal control policies (Figure 4.1). Once the agent learns the optimal value function, it can use its knowledge at any time period and in any given system state to control the system or to estimate the MVW.
Alternatively, the agent can keep on learning continuously as it controls the system; it can then adapt quickly to changing operating conditions and keep updating and enhancing its knowledge of the different operating conditions of the system.

Figure 4.1. RL agent-environment interface in learning mode: the agent (RL algorithm) applies an action a_t to the environment (GOM, driven by scenarios of the random variables) and observes the feedback r_t(s^t, a) and the transition state s^{t+1}.

In this research, a reinforcement learning (Q-Learning) algorithm that relies on the stochastic approximation of value iteration is adopted. The main idea of the proposed RL formulation is to apply a sampling technique while performing the iterations rather than computing the expected rewards and the transition probabilities. The transition state s^{t+1} and the rewards r_t(s^t, a) are generated from the state-action pair (s^t, a_t) by simulation. This RL formulation can be regarded as a combination of value iteration and simulation. Accordingly, rather than estimating the value function for each state (Equation 4.1), one can compute a value for each state-action pair, known as the Q-value or Q-factor, by applying the following Q-Learning formula:

Q_t^N(s^t, a) = Q_t^{N−1}(s^t, a) + α_N { γ [ r_t(s^t, a) + max_{a′} Q_{t+1}^{N−1}(s^{t+1}, a′) ] − Q_t^{N−1}(s^t, a) }     (4.5)

where N is the iteration number and t is the time period (stage). It should be noted that the discounting is applied to both the current and the future rewards, as it is assumed in the model formulation that the rewards of each time period are realized at the end of that period. The details of the RLROM model components and the proposed solution algorithm are described in the following sections.

The proposed RLROM model is linked to a Generalized Optimization Model (GOM) for hydroelectric reservoir optimization, as described in section 4.2.10. At each time period, the agent sends a signal to GOM describing the state of the system s^t and the action it has taken, a_t. GOM returns an estimate of the rewards r_t(s^t, a), which informs the agent how well it did in selecting the action, and it generates the resulting transition state s^{t+1}. The RL agent controls the forward sales (presales) decisions and, based on these decisions, GOM determines the optimal system operation in terms of the release and generation strategies. This agent-environment interaction is performed at each time period over the planning horizon and for a set of generated scenarios that models the uncertainty in electricity load, market prices, and natural inflows. The stochastic modeling of these random variables is described in detail in section 4.2.8.

While the main concern is to establish the optimal control strategies and the MVW for the two main multiyear storage reservoirs, linking the RL model with the GOM model allowed the inclusion of more reservoirs in the optimization process. The model optimizes the operation of the five main plants within the BC Hydro system, namely: GM Shrum, Peace Canyon, Mica, Revelstoke, and Keenleyside. Therefore, the operation planning decisions developed by the RL algorithm encompass the major plants within the BC Hydro system.

4.2.3. State Space

At any time period, the state of the system is typically expressed as the amount of available storage in each reservoir at the beginning of that time period.
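Before detailing the state space, the agent-GOM interaction of Figure 4.1 and the sampled update of Equation (4.5) can be sketched schematically. The gom_step interface (returning a reward and a transition state for a given period, state, presale decision, and scenario), the Q-table keyed by (t, state, action), and the simplified end-of-horizon handling are all assumptions made for this sketch, not the actual GOM or RLROM implementation.

    import random

    def rlrom_sweep(Q, gom_step, scenarios, states, actions, T, alpha, epsilon, gamma=0.995):
        # One forward sweep (one iteration) of the RLROM Q-update of Equation (4.5).
        # Q is a dict, e.g. collections.defaultdict(float), keyed by (t, state, action).
        s = random.choice(states)                    # random starting state for stage 1
        for t in range(1, T + 1):
            scenario = random.choice(scenarios)      # sampled inflow / load / price scenario
            if random.random() < epsilon:            # exploit: greedy presale decision
                a = max(actions, key=lambda b: Q[(t, s, b)])
            else:                                    # explore
                a = random.choice(actions)
            reward, s_next = gom_step(t, s, a, scenario)
            # End-of-horizon handling is simplified here; the thesis carries the value
            # function across the horizon boundary (Equation 4.20).
            future = max(Q[(t + 1, s_next, b)] for b in actions)
            Q[(t, s, a)] += alpha * (gamma * (reward + future) - Q[(t, s, a)])   # Eq. (4.5)
            s = s_next
        return Q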
The reservoir storage variable used to represent the system state is defined by a J dimensional space. The continuous storage o f reservoir j is discretized to nj discrete storage values. The state space can be represented as a cartesian product o f all state combinations as follows:  {s\,...,s:}®...®^\,...y/}  It is assumed that the number o f discretized states for reservoir j is constant during the T periods o f the planning study.  4.2.4.  Action (Decision Variable)  The decision variable considered in developing the R L model is the forward sales (pre-sales) to/ from the U S and Alberta markets. The decision variable (a,) is discretized at each time period to d decisions. For each time period, the forward sales are subdivided t  into three categories namely: peak, high, and low to capture the effects o f price variations during the heavy load hours ( H L H ) and light load hours ( L L H ) .  Traditionally, the turbine release from each reservoir is the common choice o f the decision variable in stochastic reservoir optimization models. However, in this form o f the simulation based optimization o f the reinforcement  learning model, the turbine  releases are calculated during the interaction process between  the agent and  the  environment ( G O M ) . In the proposed R L model formulation, the turbine releases from the different reservoirs are based on the current state o f the agent, the presale decision and the adopted scenario o f the random variables.  One o f the main advantages to the choice o f presales as the decision variable is that it reduces the dimensionality o f the problem. This is mainly due to the fact that presale is a  65  system variable which is independent o f the number o f the reservoirs involved in the R L model. In comparison, i f the turbine releases were used as decision variables in the R L framework that increase the dimensionality o f the decision space by many orders o f magnitude.  4.2.5.  Exploration/Exploitation Rate  One possibility for the agent is to pick the action with the highest Q-value for the current state- this can be defined as exploitation. A s the agent can learn only from the actions it tries, it should try different actions ignoring what it thinks is best, some o f the time- and this can be defined as exploration. A t the initial stages o f the learning process, exploration makes more sense, as the agent does not know much. The adopted equation for estimating the probability that the learning agent should select the best action in the algorithm developed herein is proposed by Michael Gasser, (2004). The equation states that: e = \-e^  (4.6)  where: £ = exploitation rate, N = the number o f iterations or the age o f the agent, and ^ - exploitation parameter.  It is clear that as the number o f iterations increases, the exploitation rate increases. Eventually, when the number o f iterations gets larger, the agent w i l l tend to select the greedy actions, and these actions represent the policy learned by the agent. Figure (4.2) presents the exploitation rate as a function o f the agent's age.  4.2.6.  Learning Rate  The learning rate a controls the step size o f the learning process. In fact, it controls how fast we modify our estimates o f the action-value function. Usually, the iterations  66  start with a high learning rate that allows fast changes in the Q-values and then the rate gradually decreases as time progresses - as shown in Figure (4.2). 
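The two schedules discussed in this and the previous subsection can be sketched numerically as follows. The exponential form used for the exploitation rate is one plausible reading of Equation (4.6), and the polynomial step size anticipates Equation (4.7) given in the next paragraph; the parameter values ξ = 50 and ψ = 0.7 are assumptions chosen only to reproduce the general shape of Figure 4.2.

    import math

    def exploitation_rate(n, xi=50.0):
        # Probability of choosing the greedy action at agent age n; assumed e^(-n/xi) decay.
        return 1.0 - math.exp(-n / xi)

    def learning_rate(n, psi=0.7):
        # Polynomial step size 1/n**psi (Equation 4.7), with psi assumed in (0.50, 1.0).
        return 1.0 / n ** psi

    # Epsilon rises towards 1 while alpha decays towards 0 as the agent ages.
    for n in (1, 20, 60, 120, 200):
        print(n, round(exploitation_rate(n), 3), round(learning_rate(n), 3))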
A gradually decreasing learning rate is a basic condition to assure the convergence of the Q-Learning approach, as identified by Sutton and Barto (1998). The formulation adopted in the developed model is a polynomial learning rate, as given by Even-Dar and Mansour (2003):

α = 1 / N^ψ     (4.7)

where N is the number of iterations (the age of the agent) and ψ is a parameter with ψ ∈ (0.50, 1.0).

Figure 4.2. Learning rate and exploitation rate as functions of the iteration number (agent age).

4.2.7. Rewards

The reward function r_t(s^t, a) is very important in the communication process between the environment and the agent. It helps the agent learn how to perform actions that will achieve its goals. Depending on the reward signal received from the environment, the agent perceives how well it did in selecting the action. Generally, the objective is to maximize the sum of the reinforcements received from the environment over time. The multireservoir problem we are dealing with is an infinite-horizon problem; therefore, a discount factor γ is introduced. In the multireservoir model, the agent should learn the operation policy π(s^t) that maximizes the benefits of taking action a_t when the agent is in state s^t:

R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}     (4.8)

At each time period, and based on the current state of the system and the agent's decision regarding forward sales, the optimization model (GOM) determines the optimal control strategy that maximizes the benefits. The reward function, which is selected to maximize the value of resources, is calculated at the end of each time period. The details of the GOM optimization model are presented in section 4.2.10.

4.2.8. Stochastic Modeling of Random Variables

In modeling a multireservoir hydroelectric power generation system, we are interested in establishing the optimal operation planning strategies and the expected value and marginal value of the resources, given the uncertainty in natural inflows, electricity demand, and market prices. The problem is complicated by the multidimensional probability distribution of the random variables and their interdependencies. One approach to fitting the marginal distributions of the random variables is to use time series autoregressive models, in which the autocorrelation of the random variables is modeled by a continuous Markov decision process.

The methodology adopted in this research instead approximates the continuous distributions and stochastic processes by discretization into a number of scenarios, rather than by a Markov description of the stochastic variables.

To represent the inflow, electricity load, and market price random variables, a moment matching (MM) scenario generation approach developed by Hoyland et al. (2003) has been adopted. In this approach, the marginal distributions are described by their first four moments (mean, variance, skewness, and kurtosis). In addition, the correlation matrix is also specified. The method generates a discrete joint distribution that is consistent with the specified values of the marginal distributions and the correlation matrix.

The moment matching algorithm first decomposes the multivariate problem into univariate random variables that satisfy the first four moments for each random variable. Then it applies an iterative procedure involving simulation, decomposition, and various transformations to preserve the original marginal moments and the correlation matrix.
The details o f the algorithm are given by Hoyland et al. (2003).  Historical stream flow records for the Peace and the Columbia rivers are used to estimate the distribution properties o f the inflow random variable. Considering the case o f the Peace and the Columbia River systems and assuming a monthly time step in the R L R O M model, the number o f random inflow variables for one year w i l l be twenty four, as inflow in each month is considered as a random variable. First, the first four moments and the correlation matrix for the twenty four variables are calculated. Then, this information is fed as an input to the M M model. The generated inflow scenarios,  If,  represents the cross and serial correlation between the inflows at the different time periods for the two river systems for one year. The outcome o f this process is a reduced (manageable) number o f inflow scenarios that preserve the properties o f the original distributions.  In the Pacific Northwest, electricity price variations are correlated to runoff volumes in the northwest. In dry years, power production falls and prices increase accordingly, while in wet years, there is more power available and prices decrease proportionally. A l s o , prices vary to a large degree across the day. Therefore, the average forecasted market prices are adjusted to reflect the relationship between runoff volumes and market prices. This relationship is represented by a regression relationship between the Dalles monthly runoff volume in the U S and the M i d - C o l u m b i a market price. The following  69  approach was adopted to represent the price variability in the R L R O M model. First, a stochastic variable for the annual Dalles runoff volume was introduced i n the moment matching scenario generation model in addition to the twenty four inflow variables o f the Peace and Columbia inflows. Then, the scenarios generated for the Dalles runoff were correlated to the heavy load hour ( H L H ) and light load hour ( L L H ) price factors (multipliers) using polynomial regression relationships. Finally, the average price forecast for the M i d - C o l u m b i a was multiplied by the H L H and L L H price multipliers to generate the H L H and L L H prices for the scenarios.  B C Hydro's system load forecast was estimated using a Monte Carlo energy model. The load forecast is mainly impacted by: (1) the economic growth measured by the gross domestic product ( G D P ) , (2) electricity prices billed to B C Hydro's customers, (3) elasticity o f the load with respect to electricity prices and economic growth, and (4) energy reduction due to demand side management ( D S M ) . The continuous distribution o f the B C Hydro system load forecast is represented by the values o f the median (P50) and the two quantiles (P5 and P95). Starting from these three points, the expected value and the variance o f the forecasted load are then estimated using the method o f Pearson and Tukey (1965). Then, a Monte Carlo simulation was carried out, using a Log-Normal probability distribution, to generate several thousand scenarios. The generated scenarios were then aggregated in a histogram with a specified number o f intervals that represent the required probably distribution. Based on the generated data, the first four moments and the correlation o f the electricity load with the inflow and price variables were estimated and used in the moment matching algorithm. 
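The Monte Carlo step just described for the load forecast can be sketched as follows. The mean and standard deviation are assumed here to have already been derived from the P5/P50/P95 forecast, and the numbers in the example are made up; the sketch only illustrates drawing log-normal samples, aggregating them into a histogram, and computing the first four moments that feed the moment matching model.

    import numpy as np

    def load_scenarios(mean, std, n_samples=10000, n_bins=20, seed=1):
        # Draw log-normal load samples, aggregate into a histogram, and return the
        # bin mid-points, their probabilities, and the first four sample moments.
        rng = np.random.default_rng(seed)
        sigma2 = np.log(1.0 + (std / mean) ** 2)      # convert arithmetic mean/std
        mu = np.log(mean) - 0.5 * sigma2              # to log-normal parameters
        samples = rng.lognormal(mu, np.sqrt(sigma2), n_samples)
        counts, edges = np.histogram(samples, bins=n_bins)
        probs = counts / counts.sum()
        mids = 0.5 * (edges[:-1] + edges[1:])
        centered = samples - samples.mean()
        moments = (samples.mean(), samples.var(),
                   (centered ** 3).mean() / samples.std() ** 3,    # skewness
                   (centered ** 4).mean() / samples.std() ** 4)    # kurtosis
        return mids, probs, moments

    # Example with made-up annual energy numbers (GWh): mean 58,000, std 2,500.
    mids, probs, moments = load_scenarios(58000.0, 2500.0)
    print([round(m, 2) for m in moments])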
The details o f the data used, the results, and an analysis o f the scenario generation process is presented i n chapter 5.  4.2.9.  Lookup Tables and Function Approximation  During the experimental and testing phase o f the R L R O M model development, the state space was discretized into a small number o f points. The state space in this case is the reservoir storage levels for the different reservoirs. For this limited number o f state space variables, the learning process for the Q-values was implemented using lookup  70  tables. Lookup tables are used to store and update the function representing the value o f water in storage at each time period. A t each time period, and for each discrete point on the state space, there exist a row o f entries in the lookup table. The elements o f each row are: reservoirs storage s', action a , reward value r (s \ a) and Q-value Q (s', a). t  t  t  In real applications o f multireservoir optimization, the number o f grid points grows exponentially as the number o f reservoirs increases. This results in a much larger state space and the use o f lookup tables becomes impractical. Accordingly, some sort o f function approximation is required, which w i l l enable the calculation o f the future value function ft+i{s ) at any point within the state space without the need to store every value. r  T w o properties o f this function need to be considered in the development process. First, the storage value function is typically a concave nonlinear function. Second, the target storage (end o f period storage) o f each reservoir is a function o f the storage levels in other reservoirs.  In this research work, two alternative techniques for function approximation were investigated. The first approach is a function fitting approach using polynomial regression. A n alternative function fitting approach was tested using a linear interpolation technique. However, instead o f using one function to approximate the entire state space, it was divided into a finite number o f segments or pieces using a piecewise linear ( P W L ) interpolation technique. The following figure illustrates an example o f approximating the concave nonlinear value o f water i n storage function with four linear pieces.  71  Figure 4.3.  P W L approximation of storage value function  A s an example, sij  it  represents the third break point o f reservoir 1 storage at time t for  a given storage level o f the other reservoirs. Similarly, ntjjj is the slope o f the third segment o f the P W L function o f reservoir 1 at time t, and so,3,t represents the intercept o f the curve with the storage axis. During the testing and development phase, the advantages of using P W L formulation over nonlinear formulation were noticeable in terms o f more stability in the results and also faster implementation in the R L R O M model. The faster implementation is mainly attributed to the way that A M P L mathematical programming language handles P W L functions. In A M P L , P W L functions are defined by: a set o f breakpoints (grid points o f the state space), the slope o f the different segments, and the intercept. In the course o f the iterations, for any target storage point on the state space, the P W L function approximation is implemented to calculate the storage value function for the multidimensional state space as presented in Figure 4.4 in a 3D view. The P W L function deals with one state variable at a time. 
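A one-dimensional sketch of this PWL evaluation is given below: the storage value function is stored only at the breakpoints of the state-space grid and interpolated linearly in between, so any target storage can be valued without a finer grid. The breakpoints and values are made-up, concave numbers used only for illustration.

    def pwl_value(storage, breakpoints, values):
        # Evaluate a concave piecewise linear storage value function at an arbitrary
        # target storage (cms-day), given the grid breakpoints and the value ($) at each.
        if storage <= breakpoints[0]:
            return values[0]
        for i in range(1, len(breakpoints)):
            if storage <= breakpoints[i]:
                slope = (values[i] - values[i - 1]) / (breakpoints[i] - breakpoints[i - 1])
                return values[i - 1] + slope * (storage - breakpoints[i - 1])
        return values[-1]

    # Four linear pieces, as in Figure 4.3 (illustrative values only).
    grid = [0.0, 25000.0, 50000.0, 75000.0, 100000.0]      # storage, cms-day
    val = [0.0, 30.0e6, 52.0e6, 66.0e6, 74.0e6]            # value of water in storage, $
    print(pwl_value(62500.0, grid, val))                   # interpolated value at the target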
A s an example, consider the case o f the value function being a function o f the storage value in two reservoirs (for example: G M S and Mica). First, P W L functions are constructed for the different storage grid points o f M i c a as a function o f different storage levels in G M S as shown in Figure 4.5. A t the target storage o f M i c a , a P W L curve is constructed as a function o f the different G M S  72  storage grid points. This P W L function was then used to evaluate the value o f water i n storage as shown i n Figure 4.6.  Figure 4.4.  3-D view of the P W L value of storage function as a function of G M S and Mica storage  —• m  - - ~ ~ r.  — j ! — » 1- " •  ---*•""."--•— „  ,  >  •  "  * - - - * - ...---A" •'  _ X -x- — ..---A ..... . - —•  - - " " " "..-•A A"""" .. A " ' ' ' -»-- - " »-' x  . -" ...  —  Mica Target Storage  a"'  <<-—  Mica Storage (cms-d)  Figure 4.5.  2-D view of the P W L value of storage function as a function of G M S and Mica storage  73  GMS Storage (cms-d)  Figure 4.6.  P W L value of storage function as a function of G M S storage at Mica target storage  4.2.10. Model of the Environment The R L agent needs to interact with an environment while learning to control the system. This environment could either be the real system or a model o f the system. In this research work, the B C Hydro generalized optimization model, G O M , was used as the simulation environment o f the real system. The G O M model was adapted from the short term optimization model ( S T O M ) developed by Shawwash et al. (2000). G O M , which incorporates the basic optimization formulation o f S T O M , was developed to give its user the flexibility for a more generalized form so that it can be used over longer time horizons and at various time resolutions. The model is currently used at B C Hydro as an analytical tool that assists the operations planning engineers to simulate and optimize the operation planning o f the integrated B C Hydro system. The primary objective o f the model was to develop optimal system operation schedule given a deterministic forecast o f system inflows, domestic load and market prices while maximizing the value o f B C Hydro resources.  74  G O M is a variable time step model with the capability to model sub-time steps in detail. The time steps may be hourly, daily, weekly, sub-monthly and/or monthly. The G O M system includes a detailed hydraulic simulation that calculates the hydraulic balance and calculates generation and turbine limits for each time step. Sub-time steps may be used to further divide the time step into shorter time periods to reflect different load conditions within a time step, as derived from the load-duration curves. For example, for a time step that is greater than or equal to a week, the load-duration curves are used to represent both weekday and weekend load shapes. The sub-time step thus provides a more detailed view o f the load and resource balance, and the market trade-offs under different load conditions (i.e. super peak load, peak load, heavy load, shoulder load and light load hours). The load-resource balance and trade-off optimization is performed for each sub-time step.  The non-linear power generation function is represented in G O M as a piecewise linear surface function where the generation is calculated as a function o f the forebay level FBj , t  turbine discharge Q p, T  and unit availability Up Gp = /(FBp,  Qrp, Up). This  function was developed with an optimal unit commitment and loading assumption. 
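The nested evaluation just described for the GMS-Mica value function (and, analogously, the PWL generation surface introduced above) can be sketched as follows: interpolate along the Mica axis at every GMS breakpoint to obtain the curve at the Mica target storage, then interpolate that curve at the GMS target storage. The grids and the value table are illustrative assumptions, not BC Hydro data.

    import numpy as np

    def value_two_reservoirs(gms_target, mica_target, gms_grid, mica_grid, value_table):
        # Value at the Mica target storage for every GMS breakpoint (Figure 4.5),
        # then PWL evaluation of that curve at the GMS target storage (Figure 4.6).
        along_gms = [np.interp(mica_target, mica_grid, value_table[i])
                     for i in range(len(gms_grid))]
        return float(np.interp(gms_target, gms_grid, along_gms))

    # Made-up 3 x 3 value table ($), rows indexed by GMS storage, columns by Mica storage.
    gms_grid = [20000.0, 60000.0, 100000.0]                # cms-day
    mica_grid = [10000.0, 40000.0, 70000.0]                # cms-day
    value_table = [[40e6, 55e6, 63e6],
                   [58e6, 71e6, 78e6],
                   [69e6, 80e6, 86e6]]
    print(value_two_reservoirs(75000.0, 30000.0, gms_grid, mica_grid, value_table))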
Accordingly, each point on the piecewise linear surface function represents the maximum generation attainable given the set o f turbine discharge, forebay, and the number o f units available for commitment. The procedure, which was followed to prepare these plant production functions for the B C Hydro plants is described in detail in Shawwash, (2000). The following figure illustrates an example o f the P W L function for a typical hydroelectric power generating plant.  75  Discharge F o r e b a y L e v e l (m)  Figure 4.7.  (m /s) 3  Piecewise linear hydropower production function  The formulation o f the G O M model was modified to run interactively with the R L model as detailed below. A t each time step G O M receives the information from the R L model on the initial state o f the system, it then updates the system constraints and solves the one stage reservoir optimization problem and then it passes the optimized results back to the R L model. During the learning phase o f the R L agent, the information passed to the R L R O M model constitutes the transition state and the reward signal. In the final iteration, the optimal operation  polices are derived, including the plant generation,  market  transactions (import/export), and turbine releases.  Hence, linking the R L R O M with the G O M has the advantage o f capturing the diurnal variation in load, market prices, and generation schedules for the shorter periods within the time step, either during weekdays/ weekends or Heavy load hours ( H L H ) / Light load hours ( L L H ) . The following is a description o f the G O M model formulation including the decision variables, constraints, and the objective function:  76  Decision Variables Q  Turbine release from reservoir k at time period t and sub-time step h, in  T  cms, Q  S p i l l (non-power) release from reservoir k at time period t and sub-time  s  step h, in cms, Gk,t,h  Generation from plant k at time period t and sub-time step h, in M W h ,  Sp°tus  Spot transaction (import/export) to U S market at time period t and subtime step h, in M W h , and  Sp AB ,  Spot transaction (import/export) to Alberta market at time period t and  ot  sub-time step h, in M W h . Constraints  Hydraulic continuity equation  ^=Sj,,-\zlX;M,,,*ci+a 'cjH,*H,\2A  V*,* (4.9)  M  and at t=T: S  jl+i  =5 ,  (4.10)  y>  where:  C  T kj  and C^. are the elements o f the matrices representing the hydraulic connection  between the reservoirs in terms o f the turbine outflow and spill outflow respectively from reservoir k to reservoir j. C  kJ  between the reservoirs, C  and C ^ = l i f j=k, and C  T  T  and C|.=0 i f there is no physical hydraulic connection T  kj  and C --\ for Vj*k s  kj  kj  and  reservoir j is physically connected to reservoir k, H ,h~ number o f hours in sub-time step h at time step t and he(l,2, ...,h„), t  77  H = number o f hours in time step t. t  Storage bounds constraint  S^-A5V,<oV,<6tr  V*,/  (4.11)  where the storage is expressed as a P W L function o f the reservoir forebay Fbk, Sk,t - f  {Fbk,t).  This function, which is not part o f the optimization model, is used to relate  the storage volume to the reservoir elevation (Forebay) within the G O M model. AS*,, is a variable representing the deviation from the minimum storage limit which is penalized in the objective function. In case the storage exceeds the maximum limit the reservoir spills. 
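A minimal sketch of the storage bookkeeping behind the continuity constraint (Equation 4.9) is shown below for a two-plant cascade. It ignores the sub-time-step detail handled inside GOM, and the hours-to-days conversion and connectivity matrices shown are assumptions made only for the sketch.

    def end_of_period_storage(S, I, QT, QS, CT, CS, hours=730.0):
        # Mass balance per reservoir j: storage (cms-day) at end of period, given local
        # inflows I (cms), turbine releases QT and spill releases QS (cms), and the
        # connectivity matrices CT/CS (+1 on the diagonal, -1 for a connected upstream plant).
        days = hours / 24.0
        J = len(S)
        S_next = []
        for j in range(J):
            routed = sum(CT[j][k] * QT[k] + CS[j][k] * QS[k] for k in range(J))
            S_next.append(S[j] + (I[j] - routed) * days)
        return S_next

    # Illustrative upstream/downstream pair with made-up numbers.
    CT = [[1.0, 0.0],             # reservoir 0 loses its own turbine release
          [-1.0, 1.0]]            # reservoir 1 receives reservoir 0's release, loses its own
    CS = [[1.0, 0.0],
          [-1.0, 1.0]]
    print(end_of_period_storage(S=[50000.0, 20000.0], I=[400.0, 150.0],
                                QT=[500.0, 600.0], QS=[0.0, 0.0], CT=CT, CS=CS))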
The minimum storage is treated as a soft constraint and storage below the minimum storage constraint AS , is penalized in the objective function. k  Power generation constraint  Gk, ,h t  =  f( k,t,h > Qr ,' k,,,h) FB  V*,/,A  U  kJJ  (4.12)  Total plant generation constraint  G^^G^+G^OR,  Vk,t,h  (4.13)  where: ORk = the percentage o f operating reserve from plant k. The operating reserve is a specific level o f reserve power should be available at all times to insure reliable electricity grid operation. G  T  - the total o f plant generation and operating reserve from plant k at time period t, in M W h , and h sub-time step.  78  Load resources balance (LRB) constraint  IL „. U-Gf, ,-Spot G  -Spot  +G  4  J  USiU  ABut  =L  tM  Vt,h  (4.14)  where:  G'  th  = the fixed and shaped generation from other small hydro and thermal plants, not  included as a decision variables i n the optimization problem.  G  f  h  = the forward sales; this information is passed at each time period from the R L  model.  L  lh  = the load at time period t, and at subtime step h in M W h .  Spot US Transactions constraint  T ^ S p o t ^ T ^  Vt,h  (4.15)  where Ty = the inter-tie transmission limit from B C to the U S and T^ is the inter-tie m  m  S  transmission limit from the U S to B C .  Spot Alberta Transactions constraint  T?->S ot >T l M  P  where T  ABih  A  Vt,h  = the inter-tie transmission limit from B C to Alberta and T  (4.16)  ax AB  m AB  is the inter-tie  transmission limit from Alberta to B C .  79  Generation limits constraint  G <G  Max  T  *./  Vk,t,h  <G  Min  'k.i.1,  (4.17)  kj  '  v  Objective Function (MaxRev):  Maximize:  {(Spot  + Spot  USj  ABj  )*H  lh  *P + G^ *H ttl  +z^fc-«.)*^rKst K<**i c  where  lh  v  *P } fih  '  '  (4 18)  ^ is the forward market price. The MaxRev objective function maximizes the  value o f power generation at each time period t given a target reservoir level s^' and the 1  marginal value o f water MVW ~ N  k  l  estimated b y the R L R O M model. The objective  function consists o f four terms: The first and second terms represent the sum o f the revenues from the spot transactions to both the U S and the Alberta markets and the forward sales, where P ,h and P t  f  are the spot and forward market prices at time step t  and sub-time step h. The third term accounts for the trade-off between the short and long term values o f water i n storage as the difference between the target (end o f period) storage calculated i n iteration A M and the target storage calculated in the current iteration multiplied by the marginal value o f water calculated i n iteration N-l. The fourth term penalizes the violation o f the minimum and maximum storage limits (ASkj), where c is k  the penalty for violating the storage limits specified in the R L model for reservoir k.  4.3. Solution Algorithm This section presents a detailed description o f the solution algorithm o f the R L R O M model outlined i n the previous sections. First, a description o f the R L R O M solution algorithm using the tabular form for storing the Q-values is presented. This is followed b y a description o f the function approximation algorithm that can be used for larger state space problems. The R L R O M  model is implemented i n the A M P L  mathematical  80  programming language (Fourer et al., 2003). A C P L E X solver ( I L O G Inc.) implementing the simplex algorithm is used to solve the linear optimization problem o f the G O M model.  4.3.1. 
RLROM Model Algorithm Using Lookup Tables A flow chart illustrating the different steps o f the model algorithm is presented in Figure 4.8. A detailed description o f the implementation o f the R L algorithm in a procedural form is described hereafter:  - Divide the state space {S} to a finite number o f discretized states that covers the range between the minimum and maximum reservoir storage.  - Define the number o f stages T and the set o f actions {A,}.  - Use a graphical user interface (GUI) to process the data sets required in the model runs in the specified time steps and sub-time steps. These data include the load, price, and the transmission limits.  - R u n a batch o f G O M model jobs for each point on the state space grid for each stage in the planning period. Store the results o f the model in a tabular form. These lookup tables inform the agent at each starting state and at each action what would be the transition state and the corresponding rewards.  - Run the Moment Matching technique and the Monte Carlo simulation described earlier to generate a specified number o f scenarios i 2 o f the random variables: the natural inflow 7^, forecasted l o a d / / " , and forecasted market priceP . w  t  - A t the start o f the process, initialize the value function and the Q-values arbitrarily, or alternatively, set the values to zero for all states and state-action pairs. The R L agent moves forward in each iteration from stage 1 to the last stage T. To estimate the  81  Q-values, Q,(s',a), applying the Q-Learning update rule (Equation 4.5); the value function f  (s') = max Q"~ (s', a) at the end o f period state s' , the rewards r , and the ]  l+]  t+]  t  estimate o f the Q-values Q"~ (s, a) are known from previous iteration. x  - Set the model parameters including: the discount factor % the initial values o f the learning rate a and the exploration rate £ according to the adopted formulas described earlier in this chapter.  - Set the number o f cycles to N . Where N max  max  is chosen to be a large number that  satisfies the convergence criteria, as described in detail in the following section.  - Starting at the first stage, initialize the algorithm by randomly sampling the state space (i.e. randomly choose a point in the state space grid s').  - Randomly sample the stochastic variables from the scenarios generated from their probability distribution.  - In the first iteration, the agent chooses action a randomly, as it has not learned yet t  any information about the Q-values. In subsequent iterations, the agent chooses the action a using the e-greedy policy ft {s) derived from the learned Q-values Q^~ (s',a) where: x  t  t  ft is)G t  argmaxQ ~ (s',a) N  a€A  i  '  - The agent interacts with the environment ( G O M model results stored i n lookup tables) and it receives a signal depending on the chosen action and on the sampled scenarios  i n the form o f the next  stage state transition  s'  jl+l  and a numerical  reward r., ( / , a).  - A p p l y the Q-Learning update rule (Equation 4.5) to estimate a new value for the Q values, which can be presented in a general form as:  82  New Estimate <— Old Estimate + Step Size [Target - Old Estimate]  (4.19)  Store the new estimate o f these Q-values as Q (s',a). t  - The agent moves to the selected state in the next stage, sets the target state s'  +1  the first time period to be the initial state s'  t  in  in the second time period. 
Repeat the  procedure o f sampling the action a using the e-greedy policy and determine the reward t  signal and the transition state until the agent reaches the final stage.  - A t the final stage T, the agent is at state s' . The agent repeats the same procedure as T  in the previous stages and receives the reward signal r (s, a) and moves to the transition T  state s[. The Q-learning update equation for the estimate o f the action-value function applies the following equation for estimating the future value function o f the next stage:  7™  =  (4-20)  - The agent starts a new iteration. In this new iteration and in subsequent iterations, until the termination o f the algorithm, the agent is always using the information it learned so far to update the future value function estimates (i.e. reinforce its learning o f the  re-  values). In the beginning, the agent tends to explore more frequently, with a probability of (1-f), to gain more knowledge about how good or bad it is at taking the actions. Later on, and as the age o f the agent increases, the agent becomes more greedy and it chooses the best action, max Q (s', a), with a probability o f e. However, the agent also explores t+l  a  actions other than the greedy ones with a probability o f (l-£).  A s part o f the convergence  requirements (Sutton and Barto, 1998), the exploitation rate increases with the age o f the agent and as the step size is decreasing (Figure 4.2).  83  - The above steps are repeated until convergence is achieved. The optimal value function J{s') and the optimal generated policies ?r*(s')e arg max Q (s', a) are stored for t  '  a&A  all elements o f the state space.  84  (start)  Obtain Environment Data  Generate Scenarios for Random Variables  Initialize Agent  I Initialize Q-Table Values Arbitrarily  I  Set N=\  Randomly Sample State Space  T  Sample Random Variables According to their Prob. Distribution  Choose -J-Greedy Policy  I  Move to Next Stage State  Lookup Environment Model Results  T Apply Q-Learning Algorithm to Update Q-Estimates  t=t + \  No  N=N + \ No  Generate Optimal Solution  T (  Figure 4.8.  R L R O M  Stop )  model algorithm using lookup tables 85  4.3.2. The RL Model Algorithm Using Function Approximation The algorithm presented above has the advantage o f avoiding the calculation and storage o f the transition probability matrix. However, as the state space and the decision variables increase, the use o f the lookup tables to store the action-value function becomes impractical. This is mainly due to the fact that the memory required to store the Q-values becomes very large. Accordingly, the algorithm is modified to allow for the use o f a larger number o f state and decision variables. Function approximation using a piecewise linear function ( P W L ) approximation technique is used to overcome problems with the storage o f the value functions in the R L R O M model. In this case, the state space is a continuous functional surface rather than a J dimensional grid. One advantage o f this method is that the target storage at every time period is not restricted to the grid points o f the state space. Rather, it can be any value within the state space surface. The other advantage o f this method is that it allows linking the G O M model to interact with the R L model on an on-line basis at each time step. A t each time step, the agent passes the sampled scenarios o f the random variables and the sampled state-action pair to the G O M model. 
The G O M model optimizes the operation o f the reservoirs and sends back to the agent a signal i n the form o f a next stage transition and the rewards corresponding to the selected action. This flexibility, o f having the target storage take any value, increases the chances o f finding a feasible solution in the G O M model runs. This is unlike the case requiring that an optimal solution at the grid points be found, which in some cases results in infeasible runs for the G O M model. To overcome this problem, a much finer state space grid needs to be generated, which further increases storage requirements for these optimization problems.  Therefore,  function approximation results in a significant  reduction in computer storage requirements and a more robust algorithm implementation of the proposed system.  Figure 4.9 presents a flow chart o f the R L model algorithm using  function  approximation. The flow chart indicates the interaction o f the R L agent with the model o f the environment ( G O M ) online. Figure 4.10 displays a schematic representation o f the R L R O M model algorithm using P W L function approximation.  86  A t each time period, the marginal value o f water, MVW  (s'), is calculated as the  tj  derivative o f the value function j\s') with respect to storage in reservoir j:  MVW  tj  (s) = df 0') / dsj  (4.21)  t  The marginal value o f water is updated after each iteration, and its units is converted from $/cms-day to $ / M W h using a conversion factor, HK  Jyh  storage: HK,, (s) = G (s) I Q Jt  Tj  (s)  (4.22)  where G is the plant generation in M W h and Q jtt  as function o f reservoir  T  is the turbine discharge in cms-day.  87  (start)  Construct Environment Model [GOM]  I Generate Scenarios for Random Variables Initialize Agent  I Initialize Q-Table Values Arbitrarily  Set  N=\ »  —  ^  t=i Randomly Sample State Space  T  Sample Random Variables According to their Prob. Distribution  Choose -*-Greedy Policy  Generate Optimal Solution  ^ (  Figure 4.9.  RLROM  model  algorithm  Stop  using  )  Function  Approximation  88  Tim e Period (t+n  Tim e Period  (0  Notation: /  s' a L ' P'  /'  e [*-« 1 •  r  *  MVW  Time Period State Variable- (Storage) Action (Pre-sale) Load scenario Price Scenario Inflow Scenario Q Value Function State Value Function Rate of Learning Iteration No. Rewards Discount Rate Marginal Value of Water  Generated Scenarios  O - Learning Update Equation:  Updated Q-value = O/rf value+ 4 [New value - Old value] Temporal Difference (Reinforcement)  Figure 4.10. Schematic representation of R L R O M model algorithm using function approximation oc  Convergence  4.3.3.  In stochastic dynamic programming, the optimality condition is reached when the solution is within a specified optimality tolerance. In general, the benefits o f attaining the true optimal solution do not outweigh the time and cost o f reaching the true optimal. The main concern in solving the stochastic reservoir optimization problem is to decide on how many iterations are sufficient to assure a good approximate solution.  Reinforcement learning algorithms rely on stochastic approximation in the evaluation process o f the optimal value function f*(s'). Stochastic approximation is an iterative procedure that produces a sequence o f solutions in such a way that after a finite number of iterations the temporal difference  o f the expected values approach zero with  probability 1 (Gosavi, 2003).  
Accordingly, and as long as the Q-values are changing, we should continue to run the algorithm. A s the age o f the agent increases, the step size parameter a decreases. When the step size value becomes smaller than a specified value, the algorithm could be stopped. A t this point the Q-values should stabilize and no change should occur for each state-action pair. After each iteration, o f the R L R O M runs, the absolute (A^)  difference  between the estimated Q-value function in iteration N and iteration N-l  is  calculated at each time period as follows:  = \Q? (S, a) - Qf- (s\ a)\ X  \ft,s\a  (4.23)  The computation terminates when the difference in the Q-values between successive iterations ( A g ^ ) remains constant for several iterations and consequently the Q-values are said to converge to the optimal solution. Gosavi (2003) also suggests other criteria to stop the algorithm when the policy does not change after a number o f iterations, and this could be explored in the future.  90  4.4. Summary The proposed R L approach outlined in the previous chapter, led a practical model, o f the complex multireservoir system. Instead o f modeling a single reservoir , the R L R O M model was developed for the B C Hydro's two main river systems, the Peace and the Columbia Rivers. The model was formulated to establish an optimal control policy for these multiyear storage reservoirs and to derive the marginal value o f water in storage. The R L R O M model is presented with two solution algorithms: the first algorithm relies on the use o f lookup tables to store the Q-values and the second algorithm, which allowed the extension to handle a larger scale multireservoir problem, relies on the use o f the function approximation technique.  The R L R O M model considers several stochastic variables: the domestic load, market prices, and the natural inflows. The use o f the moment matching technique for generating scenarios o f the load, inflow and price variables has the advantage o f using a limited number o f scenarios to represent the random variables' statistical properties, i n particular the moments and the correlation o f extensive historical time series records.  A large-scale hydroelectric reservoir optimization model ( G O M ) based on linear programming was integrated with the R L R O M model. In this way, the optimization process was extended to include the other reservoirs on the Peace and on the Columbia Rivers: the Dinosaur, the Revelstoke, and the Keenleyside reservoirs. This integration allowed more efficient on-line interaction between the agent and the environment to be carried out. It also made it possible to capture the diurnal variation o f the price and load on shorter time periods.  91  5. MODEL TESTING AND IMPLEMENTATION 5.1. Introduction This chapter is structured as follows: F i r s t , a single reservoir optimization problem, representing a test bed for this research work was used to investigate capability o f the R L approach to solve the reservoir operation planning problem and to gain experience with the R L technique. Three cases were considered to test the performance  o f the R L  algorithm on problems o f increasing size.  Second, a two reservoir  problem was  tested using the  multireservoir  model  formulation presented i n the previous chapter. The objective o f this test case was to investigate the potential use o f the G O M model off-line as a model o f the real system. 
4.4. Summary

The proposed RL approach outlined in the previous chapter led to a practical model of the complex multireservoir system. Instead of modeling a single reservoir, the RLROM model was developed for BC Hydro's two main river systems, the Peace and the Columbia Rivers. The model was formulated to establish an optimal control policy for these multiyear storage reservoirs and to derive the marginal value of water in storage. The RLROM model is presented with two solution algorithms: the first relies on the use of lookup tables to store the Q-values, while the second, which allows the extension to a larger scale multireservoir problem, relies on a function approximation technique.

The RLROM model considers several stochastic variables: the domestic load, market prices, and the natural inflows. The use of the moment matching technique for generating scenarios of the load, inflow, and price variables has the advantage of using a limited number of scenarios to represent the random variables' statistical properties, in particular the moments and the correlations of extensive historical time series records.

A large-scale hydroelectric reservoir optimization model (GOM), based on linear programming, was integrated with the RLROM model. In this way, the optimization process was extended to include the other reservoirs on the Peace and on the Columbia Rivers: the Dinosaur, the Revelstoke, and the Keenleyside reservoirs. This integration allowed more efficient on-line interaction between the agent and the environment. It also made it possible to capture the diurnal variation of price and load on shorter time periods.

5. MODEL TESTING AND IMPLEMENTATION

5.1. Introduction

This chapter is structured as follows. First, a single reservoir optimization problem, representing a test bed for this research work, was used to investigate the capability of the RL approach to solve the reservoir operation planning problem and to gain experience with the RL technique. Three cases were considered to test the performance of the RL algorithm on problems of increasing size.

Second, a two reservoir problem was tested using the multireservoir model formulation presented in the previous chapter. The objective of this test case was to investigate the potential use of the GOM model off-line as a model of the real system. Lookup tables were used to store the feedback information from GOM and the estimated Q-values from the RL algorithm for a subset of the full state space of the two reservoirs.

Finally, the RLROM model was used to model the full state space, and the function approximation RL algorithm was implemented for BC Hydro's two main reservoir systems. The GOM model was linked to run on-line within the RL algorithm. A case study is presented to demonstrate the capability of the model to solve the large-scale multireservoir operation planning problem. As both the GMS and the Mica dams have multiyear storage capabilities, the model was run for a planning horizon of 36 months.

The optimized storage value function and the marginal value of energy for both the GMS and the Mica dams will be presented and discussed. In addition, examples of the optimal control policies proposed by the RLROM model will be presented. The model output includes the optimized control policies for market transactions, plant generation, and turbine releases for the five main plants in the Peace and the Columbia River systems.

Initially, the model was run in training mode, where the RL agent learns the action-value function and the optimal control policies. Once the RL agent has learned the optimal control policies, the model can then be used to control the monthly operations planning decisions. The use of the RL model to control the operations planning for reservoir systems will be presented and discussed.

5.2. Single Reservoir Model - Test Case

As a first step in investigating the capability of the RL approach to solve the reservoir operation planning problem, a single reservoir case was used as a test bed. The problem was formulated and solved using the Q-Learning technique. The problem was also solved using the value iteration method of the stochastic dynamic programming (SDP) technique. The optimized solution established with the SDP model was used as a baseline to evaluate the RL model results. Working on this problem provided useful insights about the RL approach. In addition, it was possible to gain experience on the sensitivity of the model results to the various parameters used in the formulation. The following sections present a description of the case study, the SDP and RL test model formulations, the process of establishing the RL model parameters, and the model results.

5.2.1. Problem Description

The system considered in this case consists of one reservoir. The supply from the reservoir is mainly used for hydropower generation. The hydropower producer operating the system generates electricity to satisfy the local demand and for sale in the open market. The energy yield, which depends on the inflows, varies from period to period throughout the year and is governed by the following probability distribution:

Table 5.1. Probability of inflow for the different periods

Inflow volume                              Probability by period
(energy units)    1     2     3     4     5     6     7     8     9     10    11    12
10               0.10  0.15  0.30  0.25  0.10  0.15  0.30  0.25  0.30  0.25  0.30  0.10
11               0.20  0.25  0.20  0.20  0.20  0.25  0.20  0.20  0.20  0.20  0.20  0.20
12               0.30  0.30  0.10  0.15  0.30  0.30  0.10  0.15  0.10  0.15  0.10  0.25
13               0.25  0.20  0.15  0.25  0.25  0.20  0.15  0.25  0.15  0.25  0.15  0.15
14               0.15  0.10  0.25  0.15  0.15  0.10  0.25  0.15  0.25  0.15  0.25  0.30

The power producer requires 10 units of energy in each period to satisfy its demand and can store as much as 12 units. Any energy stored but not used can be carried over or sold in the following period. A marketer is willing to pay a premium price for the producer's energy according to the scale below, if the producer guarantees delivery at a certain time of the year:

Table 5.2. Forward market price

Units of energy    1      2      3     4
Price ($/unit)    1200   1000   800   800

If the producer contracts too much of its production, leaving less than 10 units for its own needs, the shortfall must be made up by purchasing energy on the spot market (spot buy) at $1,500 per unit. Any energy held at the end of the month that the storage facilities cannot store will be sold on the spot market (spot sell) for $500 per unit. The producer limits transactions on the spot market to those that are absolutely necessary. The objective is to maximize the expected discounted profit over the foreseeable future, with an effective annual interest rate of 7%.
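For readers who wish to experiment with this test problem, the snippet below encodes the data of Tables 5.1 and 5.2 and the spot prices as plain Python structures. The container names and layout are arbitrary choices made for illustration; the later sketches in this section reuse these names.

```python
# Inflow probability distribution (Table 5.1): INFLOW_PROB[period][k] is the
# probability of inflow INFLOWS[k] in that period.
INFLOWS = [10, 11, 12, 13, 14]                    # energy units
INFLOW_PROB = {
    1:  [0.10, 0.20, 0.30, 0.25, 0.15],
    2:  [0.15, 0.25, 0.30, 0.20, 0.10],
    3:  [0.30, 0.20, 0.10, 0.15, 0.25],
    4:  [0.25, 0.20, 0.15, 0.25, 0.15],
    5:  [0.10, 0.20, 0.30, 0.25, 0.15],
    6:  [0.15, 0.25, 0.30, 0.20, 0.10],
    7:  [0.30, 0.20, 0.10, 0.15, 0.25],
    8:  [0.25, 0.20, 0.15, 0.25, 0.15],
    9:  [0.30, 0.20, 0.10, 0.15, 0.25],
    10: [0.25, 0.20, 0.15, 0.25, 0.15],
    11: [0.30, 0.20, 0.10, 0.15, 0.25],
    12: [0.10, 0.20, 0.25, 0.15, 0.30],
}

# Forward (contract) price per unit when 'a' units are pre-sold (Table 5.2)
FORWARD_PRICE = {0: 0, 1: 1200, 2: 1000, 3: 800, 4: 800}

P_SPOT_BUY, P_SPOT_SELL = 1500, 500   # spot prices, $ per unit of energy
LOAD = 10                             # domestic demand per period, units of energy
S_MIN, S_MAX = 10, 12                 # storage bounds (test case 3 extends the grid to 19)
```

Each column of probabilities sums to one, so the dictionary can be used directly for sampling or for forming expectations.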
5.2.2. SDP mathematical formulation

The following notation is used in the mathematical formulation of the single reservoir problem:

t        time period, t ∈ {1, ..., T}, in months
a        forward sale (contract), a ∈ A, in units of energy
s_t      reservoir storage, s_t ∈ S, in units of energy
i_t      inflow volume, i_t ∈ I, in units of energy
L        domestic load = 10 units of energy
U        upper storage limit, in units of energy
Spot_Buy / Spot_Sell                     spot market transactions (buy / sell), in units of energy
Spot_Buy_Cost / Spot_Sell_Rev            cost of buying / revenue from selling, in $
Exp_Spot_Buy_Cost / Exp_Spot_Sell_Rev    expected value of the buy cost / sell revenue, in $
Cont_Rev                                 forward sale (contract) revenue, in $
P_spot_buy / P_spot_sell                 price of a spot buy / spot sell, in $/unit of energy
P_f(a)                                   price of the forward (contract) sale, in $/unit of energy
R        rewards, in $
γ        discount rate
TP(s_t, i_t, a, s_{t+1})   transition probability of moving to state s_{t+1} for a given state s_t, action a, and inflow i_t
TPS(s_t, a, s_{t+1})       transition probability of moving to state s_{t+1} for a given state s_t and action a
f_t(s_t)                   value function for state s_t in time period t

The expected spot sell revenue and the aggregated transition probability are given by:

    Exp_Spot_Sell_Rev(s_t, a) = Σ_{i_t} Spot_Sell(s_t, i_t, a) · P_spot_sell · TP(s_t, i_t, a, s_{t+1})

    TPS(s_t, a, s_{t+1}) =  Σ_{i_t : s_t + i_t − L − a ≥ U} TP(s_t, i_t, a, s_{t+1})      if s_{t+1} = U
                            TP(s_t, i_t = s_{t+1} − s_t + L + a, a, s_{t+1})              if 10 < s_{t+1} < U
                            Σ_{i_t : s_t + i_t − L − a ≤ 10} TP(s_t, i_t, a, s_{t+1})     if s_{t+1} = 10

The mathematical formulation of the SDP problem is as follows:

Policy rewards calculation:

    R(s_t, a) = − Exp_Spot_Buy_Cost(s_t, a) + Exp_Spot_Sell_Rev(s_t, a) + Cont_Rev(s_t, a)        (5.1)

where:

    Exp_Spot_Buy_Cost(s_t, a) = Σ_{i_t} Spot_Buy(s_t, i_t, a) · P_spot_buy · TP(s_t, i_t, a, s_{t+1})        (5.2)

    Exp_Spot_Sell_Rev(s_t, a) = Σ_{i_t} Spot_Sell(s_t, i_t, a) · P_spot_sell · TP(s_t, i_t, a, s_{t+1})      (5.3)

    Cont_Rev(s_t, a) = P_f(a) · a        (5.4)

Constraints:

Constraint on reservoir storage (state variable):        U ≥ s_t ≥ 10        (5.5)

Constraint on forward sales (decision variable):         0 ≤ a ≤ 4           (5.6)

Spot transactions (sell) constraint:    if s_{t+1} > U,    Spot_Sell = s_{t+1} − U        (5.7)

Spot transactions (buy) constraint:     if s_{t+1} < 10,   Spot_Buy = 10 − s_{t+1}        (5.8)

Hydraulic continuity constraint:        s_{t+1} = s_t + i_t − L_t − a_t,   with U ≥ s_{t+1} ≥ 10        (5.9)

Objective function:

    f_t(s_t) = max_a { R(s_t, a) + γ · Σ_{s_{t+1}} f_{t+1}(s_{t+1}) · TPS(s_t, a, s_{t+1}) }        (5.10)

The optimal policy π*_t is:

    π*_t(s_t) = argmax_a { R(s_t, a) + γ · Σ_{s_{t+1}} f_{t+1}(s_{t+1}) · TPS(s_t, a, s_{t+1}) }        (5.11)
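The backward recursion of Equations 5.10 and 5.11 can be sketched in a few lines of code. The version below is only an illustration of the recursion for the test problem: it reuses the data structures from the earlier snippet, treats the horizon as finite with a zero terminal value (whereas the thesis iterates the periodic problem to convergence), and charges the spot buy/sell terms inside the transition loop rather than as pre-computed expectations.

```python
def sdp_policy(T=12, s_min=10, s_max=12, gamma=1 / 1.07 ** (1 / 12)):
    """Backward recursion for Equations 5.10-5.11 on the single reservoir test problem."""
    states = range(s_min, s_max + 1)
    f = {T + 1: {s: 0.0 for s in states}}             # assumed terminal value of zero
    policy = {}
    for t in range(T, 0, -1):
        f[t], policy[t] = {}, {}
        for s in states:
            best_a, best_v = 0, float("-inf")
            for a in range(0, 5):                      # forward sales, Eq. 5.6
                v = FORWARD_PRICE[a] * a               # contract revenue, Eq. 5.4
                for i, p in zip(INFLOWS, INFLOW_PROB[t]):
                    s_next = s + i - LOAD - a          # Eq. 5.9 before capping
                    if s_next > s_max:                 # surplus sold on the spot market
                        v += p * P_SPOT_SELL * (s_next - s_max)
                        s_next = s_max
                    elif s_next < s_min:               # shortfall bought on the spot market
                        v -= p * P_SPOT_BUY * (s_min - s_next)
                        s_next = s_min
                    v += p * gamma * f[t + 1][s_next]  # expected discounted future value
                if v > best_v:
                    best_v, best_a = v, a
            f[t][s], policy[t][s] = best_v, best_a     # Eqs. 5.10 and 5.11
    return f, policy
```

The monthly discount factor shown is one possible translation of the 7% annual rate; any consistent choice serves the illustration equally well.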
5.2.3. RL mathematical formulation

The same set of constraints described above for the SDP formulation was used in the RL model. However, the rewards and the objective function were calculated differently. As described earlier in chapter 3, the RL Q-Learning algorithm relies on sampling the state, action, and random variables to calculate the rewards, instead of calculating the expected values of the rewards as in the SDP algorithm. Accordingly, there was no need to calculate the transition probabilities. Rewards in the RL formulation were calculated as follows:

Rewards calculation:

    r_t(s, a) = − Spot_Buy_Cost(s_t, i_t, a) + Spot_Sell_Rev(s_t, i_t, a) + Cont_Rev(s_t, a)        (5.12)

where:

    Spot_Buy_Cost(s_t, i_t, a) = Spot_Buy(s_t, i_t, a) · P_spot_buy        (5.13)

    Spot_Sell_Rev(s_t, i_t, a) = Spot_Sell(s_t, i_t, a) · P_spot_sell      (5.14)

    Cont_Rev(s_t, a) = P_f(a) · a        (5.15)

Objective function:

The Q-Learning update rule was used to calculate the state-action value function (Q-value):

    Q_t^N(s_t, a) = Q_t^{N−1}(s_t, a) + α_t { [ r_t(s_t, a) + γ · max_a Q_{t+1}^{N−1}(s_{t+1}, a) ] − Q_t^{N−1}(s_t, a) }        (5.16)

From this we get:

    f_t(s_t) = max_a Q_t(s_t, a)        (5.17)

    π*_t(s_t) ∈ argmax_a Q_t^N(s_t, a)        (5.18)

where Q_t^N(s, a) is the state-action value function at time period t and iteration N, and r_t(s, a) is the sampled reward for state s and action a as calculated in Equation 5.12.

The problem was formulated in AMPL and was solved using the Q-Learning algorithm described earlier in section 3.4.3.2. For comparative and evaluation purposes, the SDP problem was also solved using the value iteration algorithm.

Three cases were modeled for this test problem, as follows:
Case 1: three states (10-12) and four stages.
Case 2: three states (10-12) and twelve stages.
Case 3: ten states (10-19) and twelve stages.

5.2.4. Solution Steps of the Q-Learning Algorithm

The following is a step by step description of the Q-Learning solution algorithm applied to the single reservoir problem:

Step 1. Set the model parameters, including the annual discount rate γ = 0.07, the exploitation rate exponent ξ = 0.000045, and the learning rate exponent ψ = 0.61. The ξ and ψ constants are used to calculate the exploitation rate and the learning rate, respectively (Equations 4.6 and 4.7). Set the number of iterations N_max = 10,000.

Step 2. Initialize the state-action value function (Q-tables) to zero for all state-action pairs, Q_t(s, a) = 0.

Step 3. Starting at the first stage, initialize the algorithm by randomly sampling the state space, i.e. randomly choose a point from the discretized reservoir storage (state space), for example s_1 ∈ {10, 11, ..., 19}.

Step 4. Randomly sample the inflow variable according to the probability distribution in time period one, where i_1 ∈ {10, 11, ..., 14}.

Step 5. The agent chooses action a_1 (forward sales or contracts) randomly, where a_1 ∈ {1, ..., 4}. In the first iteration, for all stages t ∈ {1, ..., 12}, the agent chooses the action at random, as it has not yet learned any information about the Q-values.

Step 6. The agent interacts with the model of the environment. The chosen action and the sampled inflow scenario are simulated. The agent receives a signal in the form of the transition to the next stage state s_2, based on the hydraulic balance equation (5.9), and a numerical reward r_1(s, a) as calculated in Equation (5.12). The rewards are calculated as the sum of the spot and contract sales.
The rewards are calculated as  the sum o f the spot and contract sales.  99  Step 7  The learning rate a is calculated using equation (4.7).  Step 8  A p p l y the Q-Learning update rule (equation 5.16) to calculate a new estimate  for the Q-value as follows:  2," (s,, a) = Q?~ {s ,a) + a, \r x  t  t  {s„a) + y* max  (s , a)] - Q?~ (s , a)} l  M  t  a  The first term o f the equation represents the initial action-value function (Q-value) i n this iteration (N=l) and at this time period (t=l), Q*(s ,a). The second term represents x  the reinforcement achieved to the initial estimate o f the Q-value. This is calculated as the difference between the new Q-value{y*r {s ,a) + maxQl(s ,a)) and the initial estimate x  x  2  a  o f the Q-value (Q\{s ,a)) multiplied by a step size (learning rate, a,). This equation can x  be represented in general form as follows:  New Estimate <— Old Estimate + Step Size [Target (New Estimate) - Old Estimate]  The New estimate explained above is the sum o f the rewards achieved i n this time period t and the discounted value function at the transition state in time period t+1.  Store the New Estimate o f the Q-values as Q ' ( i „ a ) . A t this point, for stage 1 and time period 1 we have a new state-action estimate (Q-Value) for the visited state-action pair. The other state-actions remain unchanged (zero). Next visit to this state-action pair, may be i n iteration 50, the agent uses the stored Q-value Q\{s ,a) as the Old Estimate and x  apply the same update rule to calculate the New Estimate(s ,a). x  The best policy  (corresponding to the maximum Q-value) is always updated after each visit to a state and calculating the Q-values for the sampled state-action.  Step 9  The agent moves to the selected state i n the next time period, s\. The  procedure o f choosing random action a and determining the reward signal and the t  transition state is repeated until the agent reaches time period T.  100  Step 10  The agent starts a new iteration (N=2). A t the first time period (t=\), sample  the state space randomly, sf. Calculate the exploitation rate £ applying equation 4.6. The agent chooses the action ai using the e-greedy policy. Accordingly, with probability £, the agent is choosing the best action estimated so far, max Q? ( j , , a). In the beginning, a  the exploitation rate is small and the agent tends to explore more frequently, with a probability o f (1-e), to gain more knowledge about how good or bad it is at taking the actions.  Later on, and as the age o f the agent increases, the exploitation rate increases  and the agent becomes more greedy and it chooses the best action with probability £. Steps 3 to 9 are repeated until convergence is achieved. However, starting from iteration 2 and in subsequent iterations, until the termination o f the algorithm, the agent is always using the information it has learned so far to update the Q-value function estimates (i.e. reinforce its learning o f the Q-values) applying the e-greedy policy rather than randomly selecting the actions as in the first iteration. The optimal value function J[s) and the optimal generated policies 7T*(s)e a r g m a x £ ) , ( . s , a ) are stored for all elements o f the state space. Prior to illustration and discussion o f the results, the following section presents the process followed in establishing the parameters that were used in the R L algorithm.  5.2.5. 
5.2.5. Establishing the parameters of the RL Algorithm

In the initial tests of using the Reinforcement Learning technique to solve the single reservoir problem, a series of runs was performed to establish appropriate settings of the RL algorithm parameters. The following sections present examples of the runs performed to establish the exploitation rate and the learning rate parameters. Case 2 (three states and twelve stages) was selected as an example to present the results, with a fixed number of iterations N = 8,000 in all runs.

5.2.5.1. Exploitation rate

The effect of changes in the exploitation rate parameter was investigated first, using a constant exploitation rate ε ranging from 0.2 to 1.0. For the rest of the cases a variable exploitation rate was used. The variable rate was expressed as a function of the age of the agent, as outlined in Equation 4.6, with the value of the parameter ξ ranging from 0.00001 to 0.0001. For both cases (constant and variable exploitation rate) a learning rate parameter ψ = 0.5 (Equation 4.7) was used. Figure 5.1 presents the results of this set of runs for both constant and variable exploitation rates. The maximum error in the value function is presented as a percentage of the maximum difference from the solution derived by the SDP model. The results of these runs indicate that varying the exploitation rate with time appears to provide the best performance.

Figure 5.1. Effect of varying the exploitation rate (maximum error in the value function versus run number).

Examining the convergence behaviour of the solution (as an example, at time period t = 5 and state s = 11), Figure 5.2 shows that the expected rewards increase with increasing experience of the learning agent. At the beginning of the iterations, the estimated value function increases significantly; however, the rate of increase gradually diminishes as the algorithm converges to the optimal solution. It is clear that when a greedy policy is consistently followed, it results in a poor quality solution compared with the ε-greedy policies (using either a constant or variable exploitation rate with ε smaller than 1). Following the greedy policy, the value function starts to increase but levels off earlier and at a lower level than the other policies (at about 0.95 of the optimal solution). On the other hand, the ε-greedy method is capable of reaching close to the exact solution. The ε-greedy method performs better because the agent continues to choose the action randomly for a small fraction of the time and therefore has a better chance of finding a better approximation to the optimal solution. Figure 5.3 illustrates the robustness of the Q-Learning method over a wide range of values of the exploitation rate parameter ξ.

Figure 5.2. Convergence of the Q-Learning solution (value versus iteration) applying different exploration-exploitation policies.

Figure 5.3. Performance of different ε-greedy policies (value versus iteration).

5.2.5.2. Learning rate

To establish the best value for the learning rate parameter, and to determine the sensitivity of the solution to variations in the learning rate, sets of runs were performed using constant learning rates α of 0.15 and 1.0. Also, a polynomial learning rate was used (Equation 4.7), with the exponent ψ ranging in value from 0.5 to 1.0. The same exploitation rate, with ξ = 0.0001, was used throughout these runs. A sketch of the two decay schedules used in these experiments is given below.
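The exact functional forms of Equations 4.6 and 4.7 are given in chapter 4 and are not repeated here, so the snippet below uses assumed stand-ins — an exponential approach to full exploitation controlled by ξ and a polynomial step-size decay controlled by ψ — purely to illustrate how the two schedules behave over the 8,000 iterations used in these runs.

```python
def exploitation_rate(n, xi=0.0001):
    """Assumed ε schedule: starts near zero (mostly exploration) and rises with agent age n."""
    return 1.0 - (1.0 - xi) ** n

def learning_rate(n, psi=0.6):
    """Polynomial step size decay, alpha_n = 1 / n**psi (cf. Equation 4.7)."""
    return 1.0 / n ** psi

for n in (1, 100, 1000, 8000):
    print(n, round(exploitation_rate(n), 4), round(learning_rate(n), 4))
```

With these stand-ins, exploration dominates early and the step size shrinks steadily, which is the qualitative behaviour the experiments below rely on; the actual Equations 4.6 and 4.7 should be substituted when reproducing the thesis results.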
Figure 5.4 demonstrates that, by gradually reducing the learning rate as the number of iterations increases, the model converges to a better quality solution (lower error in the value function). The best results were obtained by setting ψ = 0.6, and the accuracy of the solution deteriorates if ψ is either increased or decreased. Figure 5.4 also shows that the error in the value function can have a V-shaped relationship with the learning rate, which stresses the importance of experimenting with these parameters prior to adopting values for use in the RL algorithms.

Figure 5.4. Performance of different learning rates (error in the value function versus iteration).

5.2.6. Results

To examine the quality of the solution obtained using the RL approach for the single reservoir problem, the results of the three test cases described earlier were evaluated. First, the RL model parameters for each of the three test cases were established using a variable exploitation rate parameter ξ and learning rate parameter ψ. The following table presents the parameters and the number of iterations used for each of the test cases.

Table 5.3. Test case study parameters

Test Case   Iterations   Exploitation rate parameter (ξ)   Learning rate parameter (ψ)
Case 1      15,000       0.000065                          0.59
Case 2      15,000       0.000050                          0.58
Case 3      20,000       0.000045                          0.61

The convergence of the value function is presented in Figure 5.5. The results indicate that, for all three test cases, the RL model was capable of converging to an optimal solution with very little difference from the solution derived by the SDP model. The error in the value function is defined as the maximum relative difference from the SDP solution:

    Error_t = max_s | f_t(s) − f_t^SDP(s) | / f_t^SDP(s)        (5.19)

where f_t^SDP(s) is the value function obtained using the SDP model.

Figure 5.5. Convergence of the value function (error in the value function versus iteration) for Case 1 (3 states and 4 stages), Case 2 (3 states and 12 stages), and Case 3 (10 states and 12 stages).

The maximum relative error in the value function derived by the RL model for each of the three cases was 0.0009, 0.0005, and 0.0009, respectively. For each stage, the mean relative error (MRE) in the RL solution was calculated and is listed in Table 5.4. The MRE was calculated as the average of the relative errors over the state space:

    MRE_t = (1 / |S|) Σ_s | f_t(s) − f_t^SDP(s) | / f_t^SDP(s)        (5.20)

The results indicate that the mean relative error does not exceed 0.04%.

Table 5.4. Mean relative error (%) for the single reservoir RL model

Period    1     2     3     4     5     6     7     8     9     10    11    12
Case 1   0.03  0.04  0.03  0.04   -     -     -     -     -     -     -     -
Case 2   0.02  0.02  0.03  0.02  0.03  0.03  0.01  0.03  0.02  0.04  0.03  0.02
Case 3   0.03  0.02  0.03  0.03  0.02  0.02  0.03  0.03  0.03  0.03  0.03  0.04

The optimal value functions derived by the RL Q-Learning and SDP models were superimposed and compared. Figure 5.6 presents the study results for stages 6 and 12 of single reservoir test case 3. The results indicate a close match between the value functions obtained by the RL and SDP models.

Figure 5.6. Value function versus storage obtained by RL and SDP at time periods 6 and 12.

Figure 5.7 displays the optimal forward sale policies derived by both the RL and SDP models for stage 6 and stage 12.
The operating policies suggested by the S D P and R L model were identical. A s the results presented suggest, it is more profitable to contract more energy, as the storage in the reservoir increases.  107  Stage 6 4 n  Storage  Stage 12  10  11  12  13  14  15  16  17  18  18  19  Storage  Figure 5.7.  Optimal policies derived by the R L and SDP models at time periods 6 and 12  The results obtained indicate that for all o f the three test cases the performance o f the R L is stable and provides a good approximation to the optimal solution. The results also, interestingly, show that the number o f iterations required to reach convergence are not affected by the size o f the state space. This also provides a good measure o f the robustness o f the R L approach. Consequently, the single reservoir test case offers some useful insights on the potential use o f the R L algorithms in handling the larger scale multireservoir problem, which w i l l be discussed next.  108  5.3. Two Reservoir Model - Test Case This test case was used as a building block for the  development  and  the  implementation o f approach to the large-scale multireservoir problem. This case study provided a way to test the performance o f the algorithm in handling larger scale problems. The R L Q-Learning model was run for a finite number o f iterations until it converged. Convergence was assessed in terms o f the difference between the current solution and the previous iteration solution, as suggested by Gosavi, 2003. The derived R L model optimal solutions were then compared with the S D P model results.  5.3.1.  System Modeling  The system considered in this test case consisted o f the G M S generating plant at the Williston reservoir and the M C A generating plant at the Kinbasket reservoir on the Peace and Columbia Rivers respectively. The state variables include a subset o f the full G M S and M C A storage volumes. The G M S storage state variable was discretized to 30 increments covering the range from 260,000 to 410,000 cms-day, and 5 increments for M i c a ranging from 245,000 to 285,000 cms-day. Accordingly, the state space for each time period consists o f 150 storage combinations for both reservoirs. In addition, a provision was made in the formulation to account for the system electricity demand in the state space. Table 5.5 presents the forecasted system electricity load and the peak load in the four time periods considered in the model runs, each consisting o f three months.  Table 5.5.  System load and peak demand  1  2  3  4  Load - M W h  4,983  5,362  5,804  5,872  Peak - M W  8,820  9,720  10,458  10,219  Period  109  To test the performance o f the proposed R L multireservoir model, the historical inflow data for water years 1964 -1968 were used in five scenarios o f the random inflow variable. The monthly inflow data to the reservoirs on the Peace and the Columbia system and the assumed scenario probability associated with each o f the inflow sequences are presented in Table 5.6:  Table 5.6.  Scenario  Peace system inflow scenarios (cms) Columbia  Peace  Prob.  Period  CO 1 2 3 4 5  0.10 0.20 0.30 0.25 0.15  1  2  3  4  1  2  3  4  1180 1100  787 644  354  326  788 1150  551 579  358 275  258 307 499  570 582  308 330 357  272  278  1000 673 694  692  342  208  823  759  437 501  718  492  291  267 288 285  312  264  Monthly price forecasts for the U S and Alberta markets were used to represent the market conditions i n the northwest. 
Table 5.7 presents the heavy load hours ( H L H ) and light load hours ( L L H ) price forecast corresponding to the five inflow water years.  Table 5.7.  US and Alberta market price forecast ($/MWh) CO  PUS  CO  1 2 3 4 5  CO  P AB  HLH  LLH  HLH  LLH  32.78 41.41 36.08 39.20 34.18  32.93 40.61 36.35 38.97 34.53  34.53 43.62 38.01 41.30 36.01  34.69 42.78 38.29 41.06 36.37  The range o f the forward transactions decision variable was discretized to 5 decisions. Each decision was distributed between the heavy load hours and light load hours.  110  Table 5.8.  Range of forward transactions  Period LLH-EVIP GWh HLH-EXP GWh  Minimum Maximum Minimum Maximum  1  2  3  4  57 287 97 484  32 158 107 535  48 239 84 419  48 239 111 553  where I M P and E X P are the imports and exports respectively. 5.3.2.  Results and Analysis  The problem formulation was described in detail in the previous chapter. The R L model was run on a three month time step for four time periods. To simulate the behavior o f the real system under different inflow and release conditions, the G O M model was used off-line in batch mode prior to performing the R L model runs. The G O M model was run on a daily time step with three sub-time steps during the weekdays and one time step during the weekend. The purpose o f running G O M on a finer time step than the R L model was to capture variations in the load and prices that occur in a typical day. First, a batch o f 15,000 runs was carried out to cover all possible combinations of: the storage state variable, inflow, and forward sale (decision variable) for the four time periods (30*5*5*5*4=15000). For each run, the G O M model generates the optimal system rewards and the transition state, which are then stored in lookup tables for use in the R L model runs.  Values for the R L model parameters o f y/- 0.5 and ^- 0.000025 were adopted for the model runs. The model was set to run until it converged, which was achieved after 80,000 iterations as shown in Figure 5.8.  Ill  250,000,000  200,000,000  UJ  150,000,000  • | 100,000,000 re 50,000,000  0  10000 20000 30000 40000 50000 60000 70000 80000 Iteration  Figure 5.8.  Convergence of the R L model  The R L model results were compared to the optimal solution derived by the S D P model. Figures 5.9 presents a sample o f the optimal value functions estimated by the R L and the S D P models. The storage value function for different G M S storage levels is given on the abscissa for five M C A storage levels. The graph demonstrates the capability o f the R L model to converge to near optimal solution with a mean relative error not exceeding 0.0256. The results also indicate the stability o f the model for the different time periods. The M R E for each o f the four stages is given in the following table:  Table 5.9. Period M.R.E  (%)  Percentage of Mean relative error for the two reservoirs R L model 1 2 2.35 2.56  3 4 2.44 2.37  112  Figure 5.9.  G M S storage value function as a function of M C A storage  The optimal operating policy obtained by the R L model was also  evaluated.  Assuming different initial storage levels for G M S and M C A and for several inflow scenarios, a number o f simulation runs were carried out using the optimal policies at each stage for both the R L and the S D P model. For the different inflow scenarios and starting conditions, the simulation results were found to be identical to those derived by the S D P model. T w o examples o f the simulation run results are illustrated in Figure 5.10.  
Stage - Month  Figure 5.10.  Stage - Month  Simulated optimal operation planning policy  113  Results from the two reservoir test model clearly demonstrated the capability o f the R L model in solving the problem and obtaining a good approximation o f the optimal value function. Using the G O M model, in batch mode has the merit o f providing the feedback information needed for the R L model in an off line setting. This resulted in a speed up o f the  R L model runs.  However, some  infeasibility difficulties were  encountered with this mode o f G O M runs, as the model could not always find the optimal solution at the grid points for the storage state space as it had a coarse grid structure. This problem has occurred since the feasible optimal solution may exist at any point i n the state space that is not necessarily at an exact point o f the coarse grid. Consequently, there was a need to use a finer grid structure. For the state space used in this study, a batch o f 15,000 G O M runs was performed. T o cover the whole state space, the amount o f data management and C P U time would increase significantly and it would become impractical for use in this study. This problem was dealt with during the implementation phase o f the R L R O M as w i l l be described in section 5.4, through the use o f function approximation instead o f lookup tables.  5.3.3.  Efficiency of the R L Algorithm  A series o f runs were conducted to assess the efficiency o f the R L model i n terms o f the computational speed. The run time ( C P U time) o f the R L model was compared to that of the S D P model. Assuming s state discretization, t time periods, a decision variables, e 1  scenarios o f random variables, then the problem size becomes (s.t.a.e). The two models were run for different problem sizes ranging from 900 to 178200. Figure 5.11 displays the variation in computational time needed for the R L and the S D P models with problem size. The figure clearly demonstrates that the S D P model run time appears to be increasing quadratically as the size o f the problem increases whereas the R L model has a linear run time relationship with the problem size. These results suggest that the R L model is more efficient in terms o f computational time as the size o f the problem increases larger than 150000.  114  If the S D P were to be used to solve the two reservoir problem covering: the whole state space assuming 59000 states, 36 time periods, 10 scenarios, and 5 actions, the problem size is in the order o f 10 . Using a polynomial regression relationship established from the results presented i n Figure 5.11 indicates that the estimated time for S D P to solve a problem o f 10 i n size would be impractical. It can be concluded from this 6  analysis that the R L overcomes the dimensionality problem through the potential use o f sampling and function approximation techniques as w i l l be presented next.  500 450 400 350 300 250 200 150 100 -| 50 0  --•--SDP —.  RL  200000  400000  600000  800000  1000000  Problem Size  Figure 5.11.  Comparison of the R L and SDP C P U time  115  5.4. R L R O M Multireservoir Model Implementation - Case Study The storage and generation facilities on the Peace and Columbia Rivers dominate the system operation planning function at B C Hydro. It is w e l l recognized by the system operation planners that proper planning o f the system should consider the coordinated operation o f these two systems, and that they should be integrated together in one model. 
This section presents the results o f implementation o f the R L R O M model on B C Hydro's main reservoir systems on the Peace and the Columbia Rivers while addressing the uncertainties that are usually encountered by the system operator, namely: the natural inflow, the system electricity load, and the market price. A case study w i l l be presented to illustrate the capability o f the R L R O M model to provide answers to this complex operation planning problem. The main outcomes o f the R L R O M model are: the optimal control planning policy for the system, the estimated value o f water in storage, and the marginal value o f water in the main reservoirs o f the B C Hydro system. The capabilities of the R L model to handle the dimensionality problem are also demonstrated.  Several enhancements were made to the R L model and to the approach that was presented  in  the  previous  section.  The  moment  matching  scenario  generation  methodology described in the previous chapter was used to generate a finite set o f scenarios that portrays the properties o f the random variables. T o deal with the larger state space, the R L algorithm with function approximation was used instead o f lookup tables. The storage value function w i l l be represented by a continuous piecewise linear function, fls'). In addition, the formulation o f the R L R O M model was modified to allow for on-line interaction with the G O M model at each stage in the iterations. This setup has provided the G O M model with the flexibility to find the optimal solution at any point on a continuous state space. In addition, this setup has avoided the infeasibility problems that were encountered in the G O M batch runs using the grid structure o f the state space as described in section 5.3.  116  5.4.1. Description of the Case Study This case study considers the operation planning o f B C Hydro's hydroelectric generating plants on the Peace River and on the main stem o f the Columbia River. These two systems account for about 65% o f the total B C Hydro generation capacity o f about 11,000 M W . The production schedules for all other hydro projects were fixed at their normal operation. The most significant o f these resources include hydro projects on the Bridge River, Campbell River, and other small hydro systems and thermal generation plants such as Burrard and Island Cogeneration plant (ICG).  The hydroelectric system on the Peace River is located in the northern interior o f British Columbia and it consists of: (1) The W . A . C . Bennett dam (Williston Reservoir) and the G . M . Shrum generating station ( G M S ) and (2) The Peace Canyon dam (Dinosaur Reservoir) and the Peace Canyon generating station ( P C N ) . The Peace system supplies approximately one third o f B C Hydro's annual energy production.  The powerhouse o f the G M S generating station is equipped with 10 generating units with a total capacity o f 2,730 megawatts ( M W ) . The average annual inflow to the Williston reservoir is about 1,080 m /s and its drainage area is about 68,900 k m . 3  2  Williston reservoir is a multiyear storage reservoir and is the largest in the Province, with a live storage capacity o f 39,471 million m . The reservoir reaches its lowest operational 3  level (645 m) in A p r i l and M a y and its maximum water level (672 m) i n September and October.  The Peace Canyon dam is located 23 kilometers downstream o f the Bennett dam with a generating capacity o f 694 M W with 4 generating units. 
The Dinosaur reservoir inundates 9 k m with a very small local natural inflow and is normally operated between 2  elevations 500 m and 502.9 m. The G M S and P C N generating stations are typically operated in a hydraulic balance and the flow is tightly controlled during the winter season to manage downstream ice conditions on the Peace River.  117  The Columbia basin is situated in the southern interior o f British Columbia. The main hydroelectric facilities on the Columbia River comprise: (1) M i c a dam (Kinbasket reservoir) and M i c a generating station ( M C A ) , (2) Revelstoke dam and reservoir and the Revelstoke generating station ( R E V ) , and the Hugh Keenleyside dam (Arrow lakes reservoir) and A r r o w lakes generating station ( K N A ) .  M i c a and the Hugh Keenleyside projects were constructed under the Columbia River Treaty and are operated to maximize the mutual benefits to Canada and the U S , with respect to flood control and power generation. The M i c a generating station consists o f 4 generating units with a generating capacity o f 1805 M W . The Kinbasket reservoir, which is the second largest multiyear storage reservoir within B C Hydro system, has a live storage capacity o f 14,800 million m . It receives an average annual inflow o f 586 m /s 3  3  from a drainage basin o f 21,000 k m . The Revelstoke dam is located about 130 k m south 2  of the M i c a D a m . The Revelstoke generating station also has 4 generating units with a total generating capacity o f 2000 M W . The drainage area upstream o f the Revelstoke dam is about 5,500 k m , and the average annual local inflow is 221 m /s, which is 2  3  predominantly snowmelt driven. The maximum operating elevation is 573.0 m with a normal daily fluctuation o f about 1.0 m. The A r r o w lakes are about 230 k m long downstream o f the Revelstoke D a m . Their drainage basin is about 10,000 k m . The 2  inflows to the A r r o w lakes are regulated by the M i c a Dam. A r r o w lakes generating station has 2 generating units with a capacity o f 185 M W .  Throughout this section, the Peace and the Columbia main five reservoirs/dams/ generating stations w i l l be referred to as: G M S , P C N , M C A , R E V , and K N A .  5.4.2.  Input Data  This section presents a summary o f the historical/ forecasted data that were used as inputs in the case study. A l s o , the physical and the operating limits for the five reservoirs considered in this case study are outlined. The case study is set to run on a monthly time step, which was then subdivided to three sub-time steps for weekdays ( W K ) and one time  118  step for the weekend ( W E ) . Weekends include Sundays and holidays. The weekday time steps were split to: peak ( W K P K ) , high ( W K H I ) , and the low ( W K L O ) sub-time steps with 2, 14, and 8 hours respectively. The subdivision into sub-time steps enhances representation o f the hourly variation in system load and market prices. Table 5.10 presents the monthly hours i n each sub-time step for typical months in a non-leap year.  Table 5.10.  Monthly Hours in each sub-time step Total  Sub-time step  Month WKPK  WKHI  WKLO  364  208  48 50 52 48 52 48 50 52  336 350 364 336 364  192  336 350 364  200 208 192 208 192 200 208  August  52 50  364 350  208 200  September  50  350  200  52  October November December January February March April May June July  5.4.2.1.  
WE  120 144 144  744  120 96  744 672 744  120 144 144  '  720 744  720 744  96 120 144  720 744 744  120  720  Inflows  The characteristics o f the local natural inflows to the five reservoirs on the Peace and Columbia Rivers are given i n Table 5.11. The historical records considered covers 60 water years for the period o f 1940-2000. Figure 5.12 presents the average monthly inflow and the cumulative monthly inflows. The inflows, which mainly result from snowmelt, typically increase in A p r i l and M a y , peak in June and July, and taper off in January. Figure 5.13 presents the normalized cumulative distribution o f the average monthly inflows for the five reservoirs in this study.  119  Table 5.11. River Peace Columbia  Annual average reservoir local inflow characteristics in cms Reservoir  Dam  Average  Minimum  Maximum  Std.Dev.  Williston  G.M. Shrum  GMS  12,940  9,158  17,197  1,750  Dinosaur  Peace Canyon  PCN  321  1  1,618  353  Kinbasket  Mica  MCA  6,891  5,578  8,726  678  Revelstoke  Revelstoke  REV  2,821  2,112  4,030  379  Arrow  Keenleyside  KNA  4,256  2,141  5,518  834  4000  Oct  Nov D e c Jan  Feb Mar Apr M a y Jun  Jul  Aug Sep  M onth  Figure 5.12.  Average monthly reservoirs local natural inflows  Figure 5.13. Normalized cumulative monthly distribution of the average annual reservoirs local natural inflows  120  Market Prices  5.4.2.2.  Table 5.12 and Figure 5.14 present the average monthly market price forecast for each sub-time step. It was assumed that the peak and the high weekday prices are equal and that the weekend market prices were equal to the ( W K L O ) weekday sub-time step.  Table 5.12.  Average monthly market price in each sub-time steps ($/MWh)  Month October November December January February March April May June July August September Minimum Average Maximum  WKPK  WKHI  WKLO  WE  54.23  54.23  52.39  52.39  58.95  58.95  54.92  54.92  63.02  63.02  49.57  49.57  66.13  66.13  53.08  53.08  56.43  56.43  51.93  51.93  54.86  54.86  50.60  50.60  54.17  54.17  46.11  46.11  50.78  50.78  36.54  36.54  50.12  50.12  36.83  36.83  54.63  54.63  43.20  43.20  75.13  75.13  45.76  45.76  56.89  56.89  50.51  50.51  50.12  50.12  36.54  36.54  57.94  57.94  47.62  47.62  75.13  75.13  54.92  54.92  80  20 1  1  Oct  1  1  1  1  1  1  1  Nov D e c Jan F e b Mar Apr May Jun  1  1  Jul  1  Aug  Sep  Month -•-WKPK-WKHI  Figure 5.14.  -•—WKLO-WE  PRICE  Average monthly market price in each sub-time step  121  5.4.2.3.  The  Electricity System Load  average monthly electricity load forecast considered in the case study is  presented in Table 5.13. This electricity load should be met mainly by the generation from the five hydropower plants on the Peace and Columbia Rivers and other system resources. The other system resources were considered fixed; however, they were shaped according to the historical operation pattern. Figure 5.15 displays the average monthly percentage o f the annual load, based on 15 years o f B C Hydro historical records.  Table 5.13.  
Average monthly electricity system load in each sub-time step  WKHI  WKPK  Month  2,374 2,437 2,659 2,798 2,562 2,583 2,168 2,128 2,189 2,184 2,135 2,148  339 348 380 400 366 369 310 304 313 312 305 307  October November December January February March April May June July August September  10  9.69  Total  WE  WKLO  1,357 1,393 1,519 1,599 1,464 1,476 1,239 1,216 1,251 1,248 1,220 1,227  (GWh) 4,853 5,222 5,652 5,719 5,124 5,280 4,645 4,523 4,330 4,464 4,539 4,418  783 1,044 1,094 922 732 852 929 875 577 720 879 736  9.77  9.5  c c  < o 0)  8.5  8.99  8.91  9  8.67  -H 7.90  8  7.59 7.5  - •  Oct  Nov  Dec  Jan  Feb  Mar  Apr  7.70  7r28  4X,  1  Jun  Jul  Aug  7.45  I n . i l h II  May  Sep  Month  Figure 5.15.  Historical monthly distribution of % of system load  122  5.4.2.4.  Transmission limits  The monthly transmission limits used in the study for the imports and exports to the U S and Alberta markets are given in Tables 5.14 and 5.15. These limits represent the capacity o f the transmission lines between B C and the U S and Alberta based on historical records.  Table 5.14.  B C - U S transmission limits  WE  WKPK  WKHI  WKLO  1,800  2,600  2,600  2,600  2,600  1,500  1,500  2,600  2,600  2,600  2,600  1,500  1,500  2,600  2,600  2,600  2,600  2,600  2,600  2,600  WKPK  WKHI  WKLO  October  1,800  1,800  1,800  November  1,500  1,500  December  1,500  1,500  Month  Export ( M W h )  Import ( M W h )  Month  WE  January  1,500  1,500  1,500  1,500  2,600  February  1,500  1,500  1,500  1,500  2,600  2,600  2,600  2,600  March  1,500  1,500  1,500  1,500  2,300  2,300  2,300  2,300  April  1,800  1,800  1,800  1,800  2,300  2,300  2,300  2,300  May  1,800  1,800  1,800  1,800  2,300  2,300  2,300  2,300  June  1,800  1,800  1,800  1,800  2,300  2,300  2,300  2,300  July  1,926  1,949  1,957  1,960  2,600  2,600  2,600  2,600  2,600  2,600  2,600  2,600  August  1,926  1,948  1,956  1,931  2,600  2,600  September  1,926  1,942  1,948  1,930  2,600  2,600  Table 5.15.  BC-Alberta transmission limits Export ( M W h )  Import ( M W h )  Month WKPK  WKHI  WKLO  W E HI  WKPK  WKHI  WKLO  WKHI  October  560  560  560  560  700  732  744  700  November  640  640  640  640  725  741  747  725  December  560  560  560  560  725  757  769  725  769  767  January  560  560  560  560  750  764  February  560  560  560  560  725  757  769  725  March  640  640  640  640  725  741  747  725  April  560  560  560  560  560  586  595  569  May  560  560  560  560  560  573  578  562  June  560  560  560  560  560  586  595  560  July  435  435  435  435  560  582  590  587  August  435  435  435  435  580  606  615  584  September  435  435  435  435  560  586  595  565  123  5.4.2.5.  Forward Sales  Based on historical records for the period o f January 2000 to June 2005, a range o f the pre-sales in the U S and Alberta markets was used in the case study. Table 5.16 lists the minimum and maximum monthly forward sales for each sub-time step. The negative sign means pre-export and the positive sign means pre-import.  Table 5.16.  
Forward transactions (Pre-export and Pre-import)  WKPK  WKHI  WKLO  October November December January February March April May  71 42  498 293  169 134  36 40  249 281  51 86 80 82  June  95 31 16  359 601 560 574 663 214 115  68 97 118 179 133 176 199 143 148  82  571  205  July August September  Maximum (GWh)  Minimum (GWh)  Month  WKPK  WKHI  WKLO  105 84 42 61 74 112 83  -50 -57 -41 -8 -1  -353 -396  -27 -73 -98 10 2  110 124  16 12  90 93 128  WE  63 20  -285 -59 -10 441 141  116 73 95 42  WE -17 -46 -61 6 1 73 46  -45 -66  109 83 -315 -460  -118 -36  60 26 -74 -22  -77  -539  -67  -42  124  5.4.2.6.  Operation Constraints  The operation and the physical limits related to reservoirs storage, turbine discharge, plant discharge, and plants generation are given in the Table 5.17.  Table 5.17.  Reservoirs operation and physical limits GMS  Storage - cms-d Turbine Discharge - cms  PCN  KNA  REV  MCA  Max  475,000  2,695  285,000  60,500  104,825  Min  55,000  2,440  125,000  58,500  7,850  Max  1,858 0  2,025  1,140 0  1,743 0  845  Min  0  0  Plant Discharge- cms  Max Min  1,959 43  1,982 283*  Generation - M W  Max  2,730  700  1,800  2,000  190  Min  58  50  0  0  0  * P C N minimum ice flow releases are HOOcms in December and 1500 cms in January.  To represent the Columbia River Treaty operation, the monthly average releases from K N A were used in this case study as listed in Table 5.18.  Table 5.18.  Average outflow releases from Keenleyside  Month  Oct  Nov  Dec  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  KNA Outflow -cms  1101  1410  1664  1812  1226  761  712  718  830  1678  2159  1742  125  5.4.3.  Scenario Generation  5.4.3.1.  Inflow, Price, and Load Scenarios  The moment matching scenario generation technique was used to develop a number of scenarios that represent the same statistical properties o f the historical data records for the inflow, price, and load variables. Based on 60 years o f records, the first four moments (mean, standard deviation, skewness, and kurtosis) and the correlation o f the monthly inflow for the Peace and Columbia Rivers were estimated. The Peace inflows represent the sum o f the G M S and the P C N local inflows. The Columbia inflows represent the aggregated local inflows o f M C A , R E V , and K N A . The statistics presented in Tables 5.19 and 5.20 define the properties o f the Peace and the Columbia inflows 24 random variables.  Table 5.19.  Monthly Peace system inflow statistics  Peace Inflow  Statistics  Mean - cms  Oct  Nov  Dec  Jan  817  570  375  324  Mar  Apr  May  Jun  Jul  Aug  Sep  279  270  545  2512  3751  2002  1012  804  72  235  685  906  613  337  234  1.08  0.22  0.76  0.36  1.25  0.30  1.55  2.07  1.63  3.06  1.59  Feb  SD -cms  a  243  179  95  93  71  Skew  Y  0.24  0.54  -0.02  0.822  0.342  Kurtosis  K  Table 5.20.  1.56  1.80  1.50  2.18  1.62  0.67 1.94  2.67  Monthly Columbia system inflow statistics  Columbia Inflow Oct  Nov  Dec  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Mean -cms  H  659  474  337  285  264  283  637  2044  3298  2845  1811  1030  SD - cms  a  172  144  80  69  69  80  190  490  645  606  340  234  Skew  Y  1.20  0.74  0.71  0.28  0.14  2.05  2.01  1.58  2.03  1.52  1.19 2.90  1.65  2.93  0.39 1.66  0.73  K  1.23 3.00  0.72  Kurtosis  0.39 1.65  2.00  4.23  126  A correlation analysis between the different variables was also carried out for the moment matching method. 
However, the results indicate a weak correlation between the Peace and Columbia inflows, and this weak correlation protects  against  drought  conditions at the system level.  Operations planners at B C Hydro carried out a regression analysis  to establish a  relationship to relate the electricity prices in the heavy load hour ( H L H ) and light load hour ( L L H ) prices with the annual inflow volume at the Dalles dam (located on the Columbia River in the U S ) . This relationship was then expressed in the form o f price multipliers to inflate or deflate the average electricity market price forecast. Therefore, an additional random variable was added to the moment matching algorithm to generate scenarios representing the Dalles inflow volumes, and a set o f the H L H and L L H price multipliers were then used to generate the market price scenarios. Table 5.21 lists the statistics o f the Dalles inflow volumes.  Table 5.21.  Dalles Inflow  Statistic Mean SD  (bm) 3  (bm ) 3  Skew Kurtosis  Inflow volume statistics at the Dalles Dam  11  a  171 32  Y  -0.03  K  1.50  A s shown in Figure 5.16 a low correlation exists between the Peace and the Columbia inflow at K N A and the inflow volume at the Dalles during the fall and winter months. O n the other hand, a higher correlation exists between the Columbia inflow at K N A and the Dalles inflow volume during the summer months as compared with other months o f the year.  127  Figure 5.16.  Correlation between the Dalles inflow volume and the monthly Peace and Columbia inflows  To express the monthly variability o f the load forecast, the 5 , 50 , and 9 5 th  th  th  percentiles o f the annual electricity load forecast as presented in Table 5.13 were considered. Table 5.22 presents the results o f using the three-point approximation method of a continuous variable as described by Pearson and Tukey, 1965, to estimate the mean and standard deviation o f the electric system load.  Table 5.22.  Results of the three point approximation of the mean and  the  variance  Statistic  Percentile  Load (GWh)  5th Percentile  Ps  56,046  Median  Pso  58,769  95th Percentile  P95  61,480  Mean  in  58,767  Standard Deviation  a  1,652  A  Monte Carlo simulation was then used to generate 10,000 load  scenarios.  Figure 5.17 presents the system electric load cumulative probability distribution function for a log-normal distribution. The probability distribution was divided into 60 increments  128  to match the number o f historical inflow data sets for the Peace, the Columbia Rivers, and at the Columbia River at Dalles. These sixty load values were then used to estimate the first four moments and the load correlation with the other random variables.  S Q.  0.40 0.30 0.20 0.10 0.00 — 54,000  56,000  58,000  60,000  62,000  64,000  System Load - GWh  Figure 5.17.  5.4.3.2.  Probability distribution function of forecasted system load  Results of the Moment Matching Algorithm  The moment matching algorithm developed by Kaut et al. (2003) was used to generate the set o f scenarios used in this study. The input data included the first four moments and the correlation matrix o f the 26 variables as described above (twelve month Peace and Columbia inflows, the Dalles inflow volume, and the electricity load forecast). The algorithm was run to generate a number o f scenario sets: ranging from 10 to 100 scenarios for each o f the 26 variables. 
To evaluate the quality o f the generated scenarios, a comparison between the statistics o f the generated scenarios and the historical data was carried out. The absolute mean relative error ( M R E ) metric was used in this analysis. Figure 5.18 presents the mean relative error ( M R E ) for the estimated four moments. Overall, the graph indicates that the generated scenarios preserve the statistical properties of the original data for the 26 variables with high degree o f accuracy. The results o f the mean are a 100% match for the mean o f the historical data, even for the low number o f 10 scenarios case. For the other statistics (standard deviation, skewness, and kurtosis),  129  Figure 5.18 demonstrates that the M R E decreases with the increase in the number o f generated scenarios. The M R E decreases to less than 5% as the number o f scenarios reaches 30 and it further reduces to lower values as the number o f generated scenarios increases, but with a lower rate o f decrease.  20 i  Figure 5.18. Testing the quality of the different numbers of generated scenarios  A s an example, a comparison o f the statistics o f the 30 generated scenarios o f the Peace River inflows with the historical records are displayed in Figure 5.19. The graph clearly demonstrates the high degree o f accuracy o f the estimated four moments o f the generated scenarios.  130  Mean 4000  •  3500  Scenario  Generation  Historical  Data  3000 2 500 2000 1500 1000 500 0  — *  OCT  NOV  JAN  DEC  FEB  M  Standard 1000  I  I  900 700  JUN  JUL  A U G S E P  JUN  JUL  A U G  SEP  JUN  JUL  A U G  SEP  Deviation I  vjitsi l e i O L  Hist Drical  800  MAY  onth  I  I  idi l u  A P R  MAR  IUII  Data  600 50 0 400 300 200 100 0  •  NOV  OCT  DEC  —  J A N  FEB  MAR  A P R  M A Y  M o nth  Skewness 1.4  •  1.2  Scenario  Generation  •  h t s t o r i c a l D E) t a  1.0  o.a 0.6 0.4 0.2 0.0 -0.2  OCT  JAN  NOV  DEC  I  I I scenario  FEB  M A R  A P R  M A Y  Kurtosis 3.5  Historical  3.0  I I Generation Data  •  2.5 2.0 1.5 1.0  OCT  N O V  D E C J A N  F E B M A R A P R M A Y J U N  J U L A U G  S E P  M o n t h  Figure 5.19. Comparison between the Peace inflow historical data and the generated scenarios statistics  131  5.4.4.  Results of the R L R O M M o d e l Runs  The input data for the case study presented in section 5.4.2 and the set o f scenarios o f the random variables described in section 5.4.3 were used to run the R L R O M model. The model was set to run on a monthly time step for thirty six months starting from the beginning o f the water year in October. The results o f using the model for the case study are presented and discussed in the following sections. To assess the quality o f the operation planning policies derived b y the model, the results are presented and analyzed. Finally, the model was run in control mode and the results o f a simulation run are presented and discussed.  The following parameter setting was adopted in this case study: the exploitation exponent parameter  0.0125, learning rate exponent yr- 0.845, and annual discount  rate o f 8%.  5.4.4.1.  Value of water in storage  A s stated in the previous chapter, the objective function maximizes the expected net revenue  from  the  spot  and  forward  transactions.  Figures  5.20  to  5.22  present  three dimensional (3-D) views o f the value function in Canadian dollars, as a function o f G M S and M C A storage volumes. 
The 3-D plots present the shape o f the value o f water in storage surface function for different months o f the year, namely: December, M a y , and August. The choice o f the three months was meant to represent: the winter, spring (freshet), and summer respectively. The graphs clearly demonstrate the dependence o f the value function on the storage level in both o f G M S and M C A reservoirs. It can also be seen that this dependence on the storage level varies from one month to the other. For example, the value function i n M a y is less sensitive to the storage level i n the other reservoir. It can also be seen that the minimum value o f water in storage occurs during the freshet when the storage level is rapidly increasing, and the market price is low.  The slope o f the value function surface represents the dollar value o f each unit o f storage in dollars per cubic meter second-day ($/cms-d). The derivative o f the value  132  function with respect to the storage volume in M C A represents the marginal value o f water in M C A . Similarly, the derivative o f the value function with respect to G M S storage represents the marginal value o f water in G M S . The marginal value o f water is also called the shadow price or the Lagrange multiplier o f the mass balance equation for a reservoir (Equation 4.9).  The steeper slopes o f the storage surface value functions for December and August, as shown in Figures 5.20 and 5.22, indicate a high marginal value o f water ($/cms-d) in those months compared with the corresponding values in M a y . This is mainly attributed to the high market prices and high demand during the winter months as shown in Figure 5.14 and Figure 5.15. Whereas in summer, the demand is low and the inflow hydrograph is i n a receding trend, and the high marginal values is a result o f the higher market opportunities during the heavy load hours ( H L H ) as shown in Table 5.12. O n the contrary, high inflows, low demands, and low market prices during the freshet results in flatter slopes o f the storage value function as shown in Figure 5.21.  133  Value ofWater in Storage (M$) 285,000 205,000 125,000  Kinbasket (MCA)storage (cms-d)  Williston (GMS)-Storage (cms-d) I 500-1,000  Figure 5.20.  • 1,000-1,500  • 1,500-2,000  •  2,000-2,500  Storage value function in December  2,5002,000 . .... x 1,500 Value ofWater in Storage (M$) 1  0  0  Q  500  285,000 205,000  0 Williston (GMS)-Storage (cms-d) I 0-500  Figure 5.21.  ^  1500-1,000  >,000  S'  n 1,000-1,500  • 1,500-2,000  Kinbasket (MCA)s t o r a g e (cms-d)  I 2,000-2,500  Storage value function in May  V a l u e of W a t e r In S t o r a g e (M$>  n b a s k e t (MCA)t o r a g e (cms-d)  Williston (GMS)-Storage (cms-d)  0 0-500  Figure 5.22.  •500-1,000  0 1,000-1,500  • 1.500-2,000  Storage value function in August  •2,000-2,500  To examine the value function, the December storage value is plotted, as an example, in a 2-D view as shown in Figure 5.23 for different increments o f G M S storage. The results show the decreasing slope o f the value function with the increase o f storage in M i c a . This indicates the effect of the available storage volume on the value o f water in M i c a reservoir: as the M i c a reservoir level rises the marginal value o f the  water  decreases. A l s o , the graph demonstrates the decreasing marginal value o f water in M i c a as more storage is available in G M S . 
The steeper slopes of the storage value surfaces for December and August, shown in Figures 5.20 and 5.22, indicate a high marginal value of water ($/cms-d) in those months compared with the corresponding values in May. This is mainly attributed to the high market prices and high demand during the winter months, as shown in Figure 5.14 and Figure 5.15. In summer, although the demand is low and the inflow hydrograph is receding, the high marginal values are a result of the better market opportunities during the heavy load hours (HLH), as shown in Table 5.12. On the contrary, the high inflows, low demands, and low market prices during the freshet result in flatter slopes of the storage value function, as shown in Figure 5.21.

Figure 5.20. Storage value function in December

Figure 5.21. Storage value function in May

Figure 5.22. Storage value function in August

To examine the value function, the December storage value is plotted as an example in a 2-D view in Figure 5.23 for different increments of GMS storage. The results show the decreasing slope of the value function with increasing storage in Mica, which indicates the effect of the available storage volume on the value of water in the Mica reservoir: as the Mica reservoir level rises, the marginal value of the water decreases. The graph also demonstrates the decreasing marginal value of water in Mica as more storage is available in GMS, while the decreasing gap between the curves for the different GMS storage increments indicates a reduction in the marginal value of water in GMS for a given storage level in Mica.

Figure 5.23. Mica storage value function in December

Figure 5.24 presents the monthly value of water in storage for different GMS reservoir storage increments, with the storage level at 50% in the Mica reservoir (205,000 cms-d). The results indicate that the highest storage values occur in the period from November to January and the lowest in the period from April to June. With low storage in GMS in December and January, the graph demonstrates that the marginal water values (the slope of the first segment of the storage value function) tend to be higher because the electricity demand is high, inflows are low, and the Peace River ice flow constraint is imposed in December and January.

Figure 5.24. Monthly GMS storage value function with Mica reservoir at 50% storage level

5.4.4.2. Marginal value of energy

One of the most significant pieces of information obtained from the model results is the marginal value of energy, expressed in dollars per megawatt-hour ($/MWh). As stated earlier, the significance of this information is that the operations planning dispatch decisions are based on a comparison of the energy value with market prices. The marginal value of energy is obtained by converting the marginal value of water in $/cms-d to $/MWh using the HK factor, as expressed in Equation 4.22.
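A sketch of this unit conversion is shown below, assuming the HK factor expresses the plant's water-to-power conversion rate in MW per cms of turbine release; the exact definition is given with Equation 4.22 in Chapter 4, and the numbers used here are hypothetical.

```python
# Sketch of the $/cms-d to $/MWh conversion described above (assumed HK units).

def marginal_value_of_energy(mv_water, hk):
    """
    mv_water : marginal value of water in $/cms-d
    hk       : assumed conversion factor in MW per cms of turbine release
    One cms-d of water yields hk * 24 MWh of energy, so dividing the water
    value by that energy yield gives the marginal value of energy in $/MWh.
    """
    return mv_water / (hk * 24.0)

# e.g. a hypothetical marginal value of 2,000 $/cms-d at a plant with HK = 2 MW/cms
print(marginal_value_of_energy(2_000.0, 2.0))   # about 41.7 $/MWh
```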
Tables 5.23 and 5.24 present the monthly range of the marginal value of energy for MCA and GMS in $/MWh. For each month, the marginal value of energy is given for the combinations of empty or full MCA storage (125,000-285,000 cms-d) and empty or full storage in GMS (55,000-475,000 cms-d). For the case of an empty MCA reservoir, the summary results in Table 5.23 clearly demonstrate the impact of the storage level in GMS on the marginal value in MCA in almost all months, with the exception of the period from April to June, where the MCA marginal value of energy is not sensitive to the storage level in GMS. The distinctly high marginal values of energy in certain months can be attributed to the cost of satisfying the high KNA outflow releases required in those months by the Columbia Treaty operation (average KNA outflow values were presented in Table 5.18).

When the MCA reservoir is full, it can be seen that the marginal value of energy is less sensitive to the available storage level in GMS. The same insensitivity in the MCA marginal values of energy is noticeable during the freshet, even for empty GMS storage. However, the influence of GMS storage on the MCA marginal value of energy during the freshet is most noticeable in the middle range of MCA storage, as shown in Figure 5.28.

Table 5.23. MCA monthly range of marginal value of energy ($/MWh)

             GMS storage = 55,000 cms-d     GMS storage = 475,000 cms-d
Month        Empty MCA      Full MCA        Empty MCA      Full MCA
             (125,000)      (285,000)       (125,000)      (285,000)
October        295.87         47.14           173.32         43.66
November       338.65         48.60           224.70         44.93
December       264.90         49.49           141.20         43.85
January        335.18         44.67           126.65         37.36
February       186.78         36.69            86.61         27.03
March           83.50         25.88            53.79         14.96
April           48.72          7.62            42.70          4.68
May             50.04          3.33            49.72          1.72
June            60.56         14.35            53.62         10.18
July           106.71         36.47           103.53         35.02
August         161.64         45.29           138.72         44.19
September      199.16         46.50           141.63         44.51

The summary results presented in Table 5.24 show the higher GMS marginal value of energy when the reservoir is at low storage levels during the period from September to January. This is mainly due to the high cost associated with meeting the Peace ice flow constraint in December and January (1100 and 1500 cms respectively). The lowest GMS marginal values occur in April and May, when the marginal value drops close to zero $/MWh with both GMS and MCA full.

Table 5.24. GMS monthly range of marginal value of energy ($/MWh)

             MCA storage = 125,000 cms-d    MCA storage = 285,000 cms-d
Month        Empty GMS      Full GMS        Empty GMS      Full GMS
             (55,000)       (475,000)       (55,000)       (475,000)
October        231.78         41.87            96.70         41.74
November       440.05         42.12           267.82         41.78
December       461.40         41.21           297.25         41.49
January        349.27         37.84            54.81         37.00
February       200.78         26.96            64.85         24.34
March           94.50         11.77            50.91          8.66
April           47.53          1.77            45.16          0.20
May             48.89          0.86            48.42          0.40
June            53.64         23.42            47.93         11.58
July            65.76         42.06            51.83         40.63
August          64.14         42.45            55.46         40.88
September      128.93         42.50            61.15         42.19

Figures 5.25 to 5.28 present typical examples of the marginal value of energy for MCA and GMS in winter, spring, and summer months - December, May, and August. Figure 5.25 demonstrates the sensitivity of the marginal value of energy in MCA to the available storage in the reservoir, particularly when the reservoir drops below 205,000 cms-d. This sensitivity is more prominent when the GMS reservoir storage is at or below 150,000 cms-d.

When the MCA and GMS reservoirs are at their lowest levels (i.e., at 125,000 and 55,000 cms-d respectively), the MCA marginal value of energy increases rapidly. The high marginal value is due to the penalty for violating the minimum storage constraint to meet the system electricity load or the Columbia Treaty operation outflow constraint at KNA. In general, the results suggest that in the low range of GMS and MCA storage levels, the marginal value of energy is higher than the average LLH market price in December. In such a case it would be more valuable to store water than to release it.

Figure 5.25. MCA marginal value of energy in December
On the other hand, Figures 5.26 and 5.27 show that the marginal value of energy in GMS is more sensitive to the variation in MCA storage below a GMS storage volume of 250,000 cms-d. The dependence of the GMS marginal value of energy on MCA storage is apparent when the storage in MCA is at or below 185,000 cms-d. The higher value of GMS energy in December is mainly attributed to the higher electricity load and high market prices, combined with the lower inflow conditions in winter as compared with the freshet period. The high outflow releases downstream of Peace Canyon dam (PCN) in December and January, required to maintain the winter ice flow constraint, also contribute to the high marginal values, particularly when MCA is at or below 145,000 cms-d. Figure 5.27 demonstrates that the variation of the GMS marginal value for different MCA storage decreases once the GMS reservoir is above 50% of its full storage volume (250,000 cms-d). This is apparent in the decreasing gap between the marginal value of energy curves for different values of MCA storage as GMS storage increases.

Figure 5.26. GMS marginal value of energy in December

Figure 5.27. GMS marginal value of energy in December

Figures 5.28 and 5.29 display the marginal value of energy for MCA and GMS in the month of May. In general, the marginal value of water is low during this time of the year for the full range of storage in both the MCA and GMS reservoirs. The highest MCA and GMS marginal values of energy occur when the MCA and GMS reservoirs are empty, where the value of energy is about 50.0 $/MWh. On the other hand, the value of energy drops to almost zero when the reservoirs are at full pool level. The low marginal value of water is attributed to the fact that additional storage in these reservoirs would have a high probability of being spilled during this period. Figure 5.28 demonstrates the significance of changes in GMS storage levels for the MCA marginal values of energy in the range from 165,000 to 265,000 cms-d. Figure 5.29 portrays the influence of variation in MCA storage on the GMS marginal value of energy, in particular with GMS storage in the range from 250,000 to 400,000 cms-d.

Figure 5.28. MCA marginal value of energy in May

Figure 5.29. GMS marginal value of energy in May

Figure 5.30 demonstrates the marginal value of energy for GMS in August. The higher inflows and lower electricity demand in August, as compared with December, result in a noticeably lower marginal value when both MCA and GMS are at high storage levels. Despite the lower load in August compared to the winter months, the high marginal values when the reservoirs are low can be attributed to the strong California market opportunities in August, particularly during heavy load hours (see Table 5.12 for the market price structure).
Figure 5.30 illustrates the significant influence o f M C A storage level on the G M S marginal value o f energy particularly when the G M S storage drops below 300,000 cms-d. Above 300,000 cms-d, the marginal value is less dependant on the G M S storage level.  50,000  100,000  150.000  200,000  250,000  300,000  350,000  400,000  450,000  GMS Storage (cms-d) — • — M C A _V=t25,000 _V=145,000 — • — M C A _V=225,000 — —M M CC A A V=245.000 — aI —  Figure 5.30.  M C A _V=«5,000 M C A _V=185,000 — « — MCA_V=205.000 M C A _V=265,000 — • — M C A _V=285.000  G M S marginal value of energy in August  Figures 5.31 and 5.32 display the monthly M C A marginal value o f energy for different G M S storage increments with M C A storage at 125,000 and 285,000 cms-d respectively. Figure 5.31 displays on the secondary y-axis the average Keenleyside ( K N A ) outflow releases, as stipulated by the Columbia River Treaty. The impact o f these high release requirements in the winter months combined with high electricity load, high market prices, and low inflows are reflected in the high marginal values during those months. This effect is less noticeable in the summer as the load is not high and higher inflows are available as compared with winter months.  The high marginal value of energy in the summer is mainly due to better market opportunities, particularly in August and September as shown earlier i n Table 5.12. Figures 5.31 and 5.32 demonstrate that, whatever the available storage in G M S and M C A  143  reservoirs, the lowest marginal values occur in A p r i l and M a y . Obviously, high natural inflows and lower electricity demand are the main reasons. A l s o both figures point to the increased sensitivity o f the M C A marginal values to the incremental change in G M S storage in the fall and winter months.  600 500  CD c LU  400  "5 cu 3  5 7000 8000 Oct  B  —  —•  N o v D e c J a n F e b Mar Apr May J u n Jul A u g S e p Month G M S V=100000 — • — G M S V=55000 K N A Outflow -~K— G M S "V=200000 —5*— G M S "~V=250000 G M S V=150000 — - — G M S "" V = 4 0 0 0 0 0 G M S V=300000 — 1 — G M S " "V=350000 G M S " "V=475000 GMS~V=450000  Figure 5.31.  M C A monthly marginal value of energy for a low storage level at M C A ( M C A storage=125,000cms-d)  Figure 5.32.  M C A monthly marginal value of energy for a high storage level at M C A ( M C A storage=285,000cms-d)  144  Figure 5.33 demonstrates the effect o f the high Columbia Treaty release requirements on the K N A marginal value o f water in $/m when the M C A reservoir is at low and at 3  high storage levels. The marginal values are extracted from the shadow price o f the hydraulic continuity constraint (storage mass balance) o f the G O M model. The results indicate that when M C A storage level is low, the high flow releases result in significantly high marginal values o f water at K N A . These marginal values propagate and cause an increase in the M C A marginal o f water/energy as presented in Figure 5.31, in particular when the storage level at M C A is low.  Figure 5.33.  Marginal value of K N A storage constraint  Figure 5.34 displays the monthly results o f G M S marginal values o f energy for a G M S storage level o f 55,000 cms-d. Similar to M C A ' s marginal values, the results indicate higher marginal values during the winter months. In addition, the results indicate the dependence o f the marginal values in G M S on the available storage i n M C A during the winter and fall months. 
The highest GMS marginal values occur in November, December, and January, in response to the high releases required to satisfy the Peace River ice flow constraint and to high market prices. The graph also indicates that the GMS marginal value is independent of MCA storage in the spring and early summer months, as the storage level in the reservoir increases and the demand is low.

Figure 5.34. GMS monthly marginal value of energy for an empty GMS reservoir storage (GMS storage = 55,000 cms-d)

Figure 5.35 displays the monthly results of the GMS marginal values of energy for a full GMS storage of 475,000 cms-d. Similar to MCA, the lowest GMS marginal values occur in April and May. However, when the GMS reservoir is full, the marginal values are not sensitive to changes in MCA storage levels for almost the entire year. The graph also demonstrates that the marginal values during July and September are of the same order of magnitude as those in October and December.

Figure 5.35. GMS monthly marginal value of energy for a full GMS reservoir storage (GMS storage = 475,000 cms-d)

To demonstrate the variation in the GMS marginal value of energy in the different months, the cases of empty and full MCA storage are presented. For these two cases, Figures 5.36 and 5.37 portray the details of the monthly variation in GMS marginal values of energy over the range of storage increments at GMS. The figures demonstrate the significant difference in the GMS marginal value of energy between low and full MCA storage when GMS storage is below a level of 150,000 cms-d. This difference is apparent for most of the year, with the exception of the period from April to June. Figures 5.38 and 5.39 illustrate a magnified region of the results for GMS storage values above 150,000 cms-d. Comparing the results presented in the two figures, it can be seen that higher marginal values are noticeable in GMS when the MCA storage is low. This effect on the GMS marginal value varies from month to month and decreases as the GMS storage level increases. It also shows that the effect of MCA storage on the GMS marginal value is minimal when the GMS storage level is high. In both cases, the highest GMS marginal values occur in the period from October to January, and the lowest marginal values occur during April and May.

Figure 5.36. GMS monthly marginal value of energy for a low storage level at MCA (MCA storage = 125,000 cms-d)

Figure 5.37. GMS monthly marginal value of energy for a high storage level at MCA (MCA storage = 285,000 cms-d)

Figure 5.38. GMS monthly marginal value of energy for a low storage level at MCA (MCA storage = 125,000 cms-d)

Figure 5.39. GMS monthly marginal value of energy for a high storage level at MCA (MCA storage = 285,000 cms-d)
5.4.4.3. Optimum Control Policy Analysis

In this section, the optimized control policies for operations planning derived by the RLROM model are presented and discussed. The optimized control policy includes the net market transactions and the operations planning of the five main reservoirs on the Peace and the Columbia Rivers, including the optimized plant generation and turbine releases. To illustrate the optimal control policies proposed by the RLROM model, the model was run using the market price, electricity system load, and inflow scenario data presented in Table 5.25.

Figures 5.40 and 5.41 present 3-D views of the net market transactions in January and in August. It can be seen in Figure 5.40 that the model recommends a net import policy when the storage volumes in MCA and GMS are low. For an MCA storage of 125,000 cms-d, the results recommend a net import transaction regardless of the storage level at GMS. As the storage levels at MCA and GMS exceed 200,000 cms-d, the model suggests a net export policy, since the marginal value of energy drops below the market price and it becomes more profitable to export.

Figure 5.41 shows the high export volumes recommended by the model in August, when the market price in California is high during the heavy load hours. The GMS and MCA marginal values of energy for the case of full reservoirs are $44.51 and $42.19. When both reservoirs are at the 50% storage level (i.e., GMS and MCA storage of 205,000 cms-d and 200,000 cms-d), the marginal values of energy are $45.93 and $50.70 respectively. Given that the August HLH and LLH market prices are $91 and $53, the market prices are higher than the marginal value of energy even during the LLH. This explains why the export policies suggested by the RLROM model are almost at the limit of the inter-tie transfer capability to both the US and Alberta markets, which is usually set at 2500 GWh in August. When the storage levels in GMS and MCA are low, the model recommends a mix of imports and exports, depending on the hour of the day. Nevertheless, the system is still a net exporter even with low storage levels in GMS and MCA, as the electricity demand at this time of the year is relatively low.

Table 5.25. Market price, system load, and natural inflow scenario data

            Market Price ($)         System Load                       Local Inflow
Month       WKPK WKHI WKLO WEHI      WKPK   WKHI   WKLO   WEHI         GMS    PCN   MCA    REV   KNA
October      65   65   60   60       7,682  7,116  5,474  6,215         810    23    279    127   179
November     71   71   63   63       8,693  7,901  6,152  6,819         372    16    244    125   223
December     76   76   57   57       9,121  8,253  6,426  7,377         236    17    118     59   125
January      80   80   61   61       8,940  8,217  6,500  7,743         367    30     86     38   103
February     68   68   60   60       8,906  8,238  6,371  7,014         197    18     90     39   103
March        66   66   58   58       8,209  7,692  6,048  6,634         189    19     79     36    99
April        65   65   53   53       7,382  7,005  5,549  6,031         371    20    320    183   429
May          61   61   42   42       6,949  6,659  5,296  5,616       3,421    51    582    323   548
June         60   60   42   42       6,710  6,428  5,070  5,519       5,001    61  2,064    897  1,264
July         66   66   50   50       6,747  6,418  5,030  6,038       1,236    18  1,259    436   544
August       91   91   53   53       6,977  6,675  5,175  5,556         859    19    994    305   331
September    69   69   58   58       6,986  6,683  5,077  5,611         497    11    538    186   231

Figure 5.40. January net market transactions

Figure 5.41. August net market transactions
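The import/export behaviour in Figures 5.40 and 5.41 follows from comparing the marginal value of energy with the market price in each sub-time step. The following is a deliberately simplified sketch of that comparison; the actual transaction volumes come out of the LP solved by GOM, and the function, limits, and numbers below are illustrative only.

```python
# Caricature of the store/release trade-off driving the net market transactions.

def net_transaction(market_price, marginal_value, transfer_limit_gwh):
    """
    Signed net transaction for one sub-time step (GWh):
    positive = export, negative = import.
    """
    if market_price > marginal_value:
        return transfer_limit_gwh        # sell up to the inter-tie capability
    elif market_price < marginal_value:
        return -transfer_limit_gwh       # buy and keep water in storage instead
    return 0.0

# August example from the text: HLH price 91 $/MWh, LLH price 53 $/MWh, and
# marginal values of roughly 45.93 and 50.70 $/MWh for GMS and MCA at mid storage.
for label, price in [("HLH", 91.0), ("LLH", 53.0)]:
    print(label, net_transaction(price, max(45.93, 50.70), transfer_limit_gwh=2500.0))
```

With the August figures quoted above, both the HLH and LLH prices exceed the marginal values, which is why the recommended exports sit near the inter-tie limit.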
For the cases when the storage level in GMS is either low or full, Figures 5.42 and 5.43 portray the monthly net market transactions for different MCA storage volumes. The results show that the net market transactions are largely dependent on the storage volume available in each reservoir. Figure 5.42 demonstrates that, for a low GMS storage level, the system switches to a net importer of energy over the whole range of MCA storage in March, and up to a storage level of 265,000 cms-d in February. For the other months, the net transactions are largely dependent on the available storage in MCA. When the GMS reservoir is full, Figure 5.43 shows that the system would be a net importer of energy in October, January, March, and April when the storage level in MCA is low. Figure 5.43 also illustrates that when the MCA reservoir is full, the highest exports occur in the period from August to February, whereas the lowest export transactions occur in the period from March to June. This is attributed to the high HLH and LLH price margins in the summer and winter periods, as compared with those in the spring. The high level of net export transactions in the winter months for this scenario run arises because the load is lower than the available resources and there is room to export.

Figure 5.42. Monthly net market transactions for GMS storage of 55,000 cms-d

Figure 5.43. Monthly net market transactions for GMS storage of 475,000 cms-d

Figures 5.44 and 5.45 display the monthly variation in net market transactions for different storage levels in GMS and MCA. Figure 5.44 demonstrates that for low MCA and GMS storage levels, the system would be a net importer of energy for most of the year. When the GMS reservoir is full, the system would be a net exporter of energy in all months except January and March. Considering the case of a full MCA reservoir, Figure 5.45 shows that the system would be a net exporter of energy in all months with the exception of March, when storage in GMS is low. In addition, Figure 5.45 illustrates that when the MCA reservoir is full, the net transactions are sensitive to the amount of GMS storage in all months except August and January.

Figure 5.44. Monthly net market transactions for MCA storage of 125,000 cms-d

Figure 5.45. Monthly net market transactions for MCA storage of 285,000 cms-d

As indicated in the previous chapter, the RLROM optimizes the operation of the five main plants on the Peace and the Columbia Rivers. The results for these plants for the month of January are presented next. Figures 5.46 and 5.47 display the generation of the five plants in January for different MCA storage levels when the GMS reservoir is at low and high storage levels respectively.
Figure 5.46 shows that when the storage levels at the MCA and GMS reservoirs are low, GMS generation is maintained at about 1,300 GWh in order to release the water needed to meet the Peace River ice flow constraint in January (1,500 cms), as illustrated in Figure 5.48.

Figure 5.47 illustrates that when GMS storage is full, MCA generation increases gradually as the available storage in MCA increases. Revelstoke generation increases as MCA generation increases, to maintain the hydraulic balance with the MCA reservoir, as shown in Figures 5.48 and 5.49. Keenleyside generates up to its maximum limit to minimize the spill flow needed to meet the Columbia Treaty release requirements; the spill flow from Keenleyside is the difference between the required treaty releases and the maximum turbine discharge.

Figure 5.46. Plants generation in January at different MCA storage increments with GMS storage at 55,000 cms-d

Figure 5.47. Plants generation in January at different MCA storage increments with GMS storage at 475,000 cms-d

Figure 5.48. Plants average turbine releases in January at different MCA storage increments with GMS storage at 55,000 cms-d

Figure 5.49. Plants average turbine releases in January at different MCA storage increments with GMS storage at 475,000 cms-d

5.4.4.4. Using the RLROM Model in Control Mode

Once the RL agent has learned the optimal control policy, it can be used to control the operations planning of the different reservoirs given the inflow, market prices, and electricity load at any given time period. The RLROM model is also capable of predicting the operation policy for an extended period of time (for example, several years) based on a given forecast. To achieve this, the RLROM model was modified to run in control mode, rather than in learning mode, using the learned optimal control policies. At any time period and for a given starting storage level, the target storage is linearly interpolated between the target points of the state space. The model uses the learned target storage volumes, optimal forward sales, and the marginal values of water to control the system operation. Given this information and the forecasted inflow, market price, and electricity load, the model solves the optimization problem at each time period and returns the optimal hydro generation schedules, turbine releases, and market transactions. To illustrate the results of implementing the RLROM model in control mode, a data set of inflow, price, and load forecasts and initial storage (forebay) conditions was considered in the analysis.
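The interpolation step mentioned above amounts to a table lookup over the storage state space. A minimal sketch is shown below; the grid and the learned targets are hypothetical placeholders, not model output.

```python
# Sketch: linear interpolation of the learned target storage between the
# (GMS, MCA) state-space grid points, for an arbitrary starting storage.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

gms_grid = np.linspace(55_000, 475_000, 9)      # cms-d
mca_grid = np.linspace(125_000, 285_000, 9)     # cms-d

# target_gms[i, j] = learned end-of-period target storage for GMS when the period
# starts with GMS at gms_grid[i] and MCA at mca_grid[j] (toy values here).
G, M = np.meshgrid(gms_grid, mca_grid, indexing="ij")
target_gms = 0.9 * G + 0.02 * M

lookup = RegularGridInterpolator((gms_grid, mca_grid), target_gms)   # linear by default

start = np.array([[210_000.0, 190_000.0]])      # current GMS, MCA storage (cms-d)
print(lookup(start))                             # interpolated GMS target storage
```

The interpolated targets, together with the marginal values of water, then enter the period-by-period optimization as described above.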
Figure 5.50 presents the monthly generation of the five main plants in the study. The graph demonstrates that, despite the lower electricity load in August and September, the plants' generation is high in those months in order to take advantage of the available export opportunities. On the other hand, in June the plants' generation is reduced to low levels: the model recommends that the system store more water during the freshet and early summer months, when there is little market opportunity, and generate heavily during the winter and late summer months to take advantage of the high market prices in those periods.

Figure 5.50. Monthly plants generation

Figure 5.51 displays the forecasted heavy load hour (HLH) and light load hour (LLH) market prices and the marginal value of energy for GMS and MCA in $/MWh. For the periods from October to February and from August to September, the marginal value of energy for GMS is lower than the LLH market price; consequently, it is more profitable to release water from GMS than to store it, which is reflected in the high generation policy adopted in these months. In the period from March to July, it is more profitable to generate less at LLH, as the marginal value of energy is higher than the market price, and to generate more during HLH, when the market price is higher than the marginal value of energy. The lower load and market prices explain the low generation during those months, as it is more profitable to import and store water for the periods of good market opportunities in the summer and winter. The results presented in Figure 5.51 also indicate that the marginal value of energy for MCA is slightly higher than that for GMS for most of the year. The marginal values of energy for MCA in the period from March to July are even higher than the HLH market prices, which explains the low MCA generation pattern during this period, as shown in Figure 5.50.

Figure 5.51. Market price and marginal value of energy

Figure 5.52 illustrates the load resource balance, where it can be seen that the system is in a net import mode for the period extending from February to July, while it is in a net export mode for the rest of the year.

Figure 5.52. Load resource balance

Figure 5.53 shows that about 51% of the total energy is generated by the Peace River system (GMS and PCN) and 49% by the Columbia River system (REV, MCA, and KNA). About 80% of the Peace generation is produced by GMS and the remaining 20% by PCN. The generation percentages from the MCA, REV, and KNA plants on the Columbia system are 42%, 52%, and 6% respectively. A detailed analysis of the results shows that the five main Peace and Columbia plants together generate about 65% of the energy needed to meet the system load. These percentages closely match the percentages of actual generation of the BC Hydro system (Making the Connection, 2000), which gives confidence in the model results.
Figure 5.53. Annual distribution of the five main plants generation

To study the effect of changes in the market price forecast on the optimized trading policies, three price scenarios (average, low, and high) were considered, with average annual prices of $46, $41, and $51 respectively. Using the moment matching algorithm, three samples were generated for the stochastic variables and used in this analysis. The results are displayed in Figure 5.54 and suggest that, for the low price scenario, the system would be importing and exporting in different months of the year but would be in a net importing mode overall (4,695 GWh). Further analysis of the results shows that the system was a net exporter in January and August, and a net importer for the rest of the months. This is attributed to the fact that the LLH market prices are lower than the marginal value of energy in all months, and the HLH prices are also lower than the marginal value during the spring and fall periods, when the system load and the market prices are low. On the other hand, for the high price scenario, the system would be in a net exporting mode, being a net exporter in all months with the exception of March to June. As discussed earlier, when the market price is higher than the marginal value of energy, the model tends to release more water for export purposes.

Figure 5.54. Net market transactions for three price scenarios

5.5. Summary

The RL approach methodology, as discussed in Chapter 4, was first tested on a small-scale single reservoir operation problem, and then on a two-reservoir operation problem. The test and convergence results were encouraging and demonstrated the potential capability of the RL approach to handle a larger scale multireservoir operations planning problem. The RL model was then expanded and linked to interact with the generalized optimization model (GOM) as the environment that represents (simulates) the real system. A piecewise linear function approximation technique was adapted to approximate the present value functions. Using this function approximation technique allowed the handling of a larger state space and gave the flexibility to solve the optimization problem at hand. The target storage levels were approximated as a continuous function, rather than as grid points in the state space as is usually done in the conventional SDP formulation. Three random variables were considered: natural river inflows, electricity load, and market prices. A scenario generation-moment matching technique was adopted to generate the inflow, load, and price scenarios. In this way, the serial and cross correlations of the inflow, load, and price random variables were considered in the scenario generation process.

The RLROM model was used to optimize the operation of the BC Hydro main reservoir systems on the Peace and the Columbia Rivers. Two state variables, the GMS and MCA storage volumes, were considered. The objective of the model was to estimate the value of water in storage, the marginal value of water/energy, and the optimal operations planning decisions, or control policies, including market transactions. In addition, the model optimizes the operation of the five main plants: GMS, PCN, MCA, REV, and KNA.
The model was run on a monthly time step with a number of sub-time steps to capture the heavy load hour (HLH) and light load hour (LLH) variations in load and in market prices. The RLROM model results were presented and discussed, including the storage value function and the marginal value of energy/water. The optimized control operation policy established by the model was also presented and discussed. A modified version of the model was developed which allows the results from the learning phase to be used in control mode, and examples of the results of using the RLROM model in control mode were presented.

The case study results indicate the dependence of the marginal value of energy in each reservoir on the available storage in the other reservoirs, although there are periods of time and ranges of storage levels where this dependence is not significant. It was also shown that the marginal value of energy is strongly affected by the constraints imposed on the system operation. The impact of the Peace River ice flow constraint on the GMS marginal value of energy was clearly demonstrated. Similarly, the influence of the Columbia Treaty operation on the marginal value of energy for MCA was presented and discussed. The results of using the RLROM model in system control mode indicate that the GMS and MCA reservoirs are typically drawn down by April to May, just prior to the spring freshet. The drawdown process generally begins in September and October when inflows are low. The reservoir levels begin to increase in May, when the turbine discharges are reduced due to lower system demands and increased local inflow, and they reach their highest annual levels in the August to September period.

6. CONCLUSIONS AND RECOMMENDATIONS

6.1. Summary

Strategic planning of multireservoir operation involves taking decisions with uncertain knowledge of future supply (natural inflows), demand (domestic electricity load), and market conditions (market prices). In this research, different modeling approaches and optimization techniques for the operation of multireservoir systems have been reviewed. An explicit stochastic modeling approach was found to be the most appropriate way to address the different sources of uncertainty. However, a major obstacle to this approach, which needs to be addressed and resolved, is the high dimensionality of the problem.

The research carried out for this thesis concluded that stochastic dynamic programming (SDP) is still a very powerful technique for capturing the essence of sequential, nonlinear, and stochastic reservoir optimization problems. However, SDP requires that the transition probabilities and the transition rewards (the model of the system) be calculated. For large-scale systems that involve several stochastic variables, constructing the system model can be a very difficult task; it is for this reason that DP is said to be plagued by the curse of dimensionality. One possibility for tackling these problems is the use of machine learning techniques from the field of artificial intelligence (AI), particularly the Reinforcement Learning (RL) technique.

This thesis has explored the use of the RL artificial intelligence approach to solve the large-scale operations planning problem of a hydroelectric power multireservoir system.
RL is a computational approach for learning from interaction with an environment and from the consequences of actions in order to derive optimal control strategies. RL has adapted key ideas from various disciplines, namely machine learning, operations research, control theory, psychology, and neuroscience (Sutton and Barto, 1998). The application of the RL technique in the water resources systems domain is relatively new. However, the advantages that RL offers in dealing with large-scale problems make it a promising area of research in that field. Modern reinforcement learning techniques can be applied both to trial and error learning, without a formal model of the environment, and to planning activities with a formal model of the environment, where estimates of the state-transition probabilities and immediate expected rewards can easily be evaluated. A detailed review of the RL approach and its main algorithms was conducted and presented in this thesis.

An RL-based approach was adopted in this research to develop a stochastic large-scale medium/long term multireservoir operations planning model (RLROM). The objective of the model is to develop operating policies that maximize the expected revenues from energy transactions in a competitive market, while considering the uncertainties, such as future inflows to the reservoirs, and the integrated system operation and physical constraints. In this research, the dispatch of the hydropower generation from the different available resources was based on incremental production costs, which in turn were based on the concept of the marginal value of water in storage. Accordingly, the value of water ($/cms-day) and the value of energy ($/MWh) were derived as functions of the storage volumes of the two major multiyear storage reservoirs on the Peace and Columbia Rivers (GMS and MCA). The uncertainties in the main random variables were expressed in the form of scenarios. A scenario generation-moment matching approach was adapted to generate a set of scenarios for the inflows, market prices, and electricity loads that preserved the statistical properties, in particular the moments and the correlations, of these multiple stochastic variables.

A hydroelectric reservoir optimization model (GOM), based on a linear programming algorithm, which had been previously developed and was in use by BC Hydro, was integrated with the RLROM model. Instead of interacting with the real system, the RL agent used GOM as a model of the environment to provide feedback on its actions, and it used this information dynamically to learn the optimum control policy.

6.2. Conclusions

The developed RLROM model was successfully used to handle a large-scale multireservoir operations planning problem. The model was applied to BC Hydro's main reservoirs on the Peace and the Columbia Rivers and was designed to run for a multiyear planning horizon (36 months for the case study considered). This allowed the development of realistic planning policies, given the large amount of multiyear storage in the Peace and Columbia River reservoir systems. The results demonstrated that the model is capable of providing realistic answers to the strategic planning questions, including: What is the value of water in storage in each of the multiyear storage reservoirs? What is the marginal value of water/energy in each of the multiyear storage reservoirs?
What are the optimal system control decisions, including when to buy or sell energy, how much to buy or sell, and how much energy to generate from each of the five plants in the system? The RLROM model presented in this thesis considers the stochastic variables: domestic load, market prices, and natural inflows. The moment matching technique used for generating scenarios of the inflow, price, and electricity load variables has the great advantage of using only a limited number of scenarios to represent the properties, correlations, and cross correlations of an extensive time series based on historical records.

The marginal values of water/energy for MCA and GMS obtained from the RLROM model represent key information that controls the strategic store/release decisions. In addition, this information feeds into other BC Hydro medium and short term optimization models (for example, GOM and STOM). The shorter term models use this information, together with the target storage, in the trade-off decisions between shorter term and longer term benefits.

At the beginning of the runs, the model operates in a training mode and the agent progressively learns the optimal value function and the optimal control policies for the system. Once the model results converge, the agent knows the value function and the optimal control policies. At this point, the RLROM model can be used on a continuous basis to maintain optimum control over any time period. Another useful and practical application of the model is to examine the system behaviour under particular forecasts of the random variables and to estimate the corresponding marginal values of energy. This sensitivity analysis is useful for understanding the effects of changes in any of the random variables on the operation policy and on the marginal value of water/energy. For example, being able to estimate the effects of multiyear droughts on system operation is of particular significance to strategic planning for the operation of the system.

Reinforcement Learning offered two key advantages in handling the large-scale multireservoir control problem:

(1) RL provides a way to avoid the curse of modeling. By using sampling (simulation), RL can be applied to large-scale problems without an explicit evaluation and enumeration of all the transition probabilities and their expected immediate rewards.

(2) RL overcomes the curse of dimensionality through the use of function approximation methods. Those methods require a limited number of elements to specify a value function, instead of possibly millions of states.

A hydroelectric reservoir optimization model (GOM) based on linear programming was integrated with the RLROM model. In this way, the optimization process was extended to include the other reservoirs on the Peace and the Columbia, namely the Dinosaur, Revelstoke, and Keenleyside reservoirs, in addition to the main ones at GMS and Mica. This link allowed an efficient on-line interaction between the agent and the environment (the operator and the model) during the iterations. It also made it possible to capture the effects of the diurnal variation of price and load over shorter time periods. In addition, the optimal plant generation schedules, which are based on the optimal unit commitment and loading (represented in GOM by piecewise linear plant generation functions), are captured in the RLROM model.
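To make this agent/environment interaction concrete, the toy sketch below shows the structure of such a learning loop on a deliberately small single-reservoir example. It is not the RLROM code: the environment function stands in for GOM, the state and action grids are tiny, the exogenous variables are drawn from arbitrary sets, and the decaying learning rate (whose exponent echoes the 0.845 setting reported in Section 5.4.4) is an assumed functional form.

```python
# Toy learn-by-interaction loop: a TD-style value update with one-step lookahead
# through a stand-in "environment" that returns revenue and the next storage.
import random

STORAGE_LEVELS = list(range(0, 11))   # discretized storage state
RELEASES = [0, 1, 2]                  # discretized release decision
GAMMA = 0.99                          # per-period discount factor (placeholder)

def environment_step(storage, release, inflow, price):
    """Stand-in for GOM: revenue from generation plus the resulting storage."""
    release = min(release, storage + inflow)                 # cannot release more than available
    next_storage = min(storage + inflow - release, max(STORAGE_LEVELS))
    return next_storage, price * release

value = {s: 0.0 for s in STORAGE_LEVELS}                      # tabular value function

for episode in range(1, 5001):
    state = random.choice(STORAGE_LEVELS)
    alpha = 1.0 / episode ** 0.845                            # decaying learning rate (assumed form)
    for t in range(12):
        inflow = random.choice([0, 1, 2])                     # sampled "scenario" realizations
        price = random.choice([40, 60, 90])
        if random.random() < 0.1:                             # occasional exploration
            release = random.choice(RELEASES)
        else:                                                 # greedy one-step lookahead
            release = max(
                RELEASES,
                key=lambda a: environment_step(state, a, inflow, price)[1]
                + GAMMA * value[environment_step(state, a, inflow, price)[0]],
            )
        next_state, reward = environment_step(state, release, inflow, price)
        # temporal-difference update of the value of the starting storage state
        value[state] += alpha * (reward + GAMMA * value[next_state] - value[state])
        state = next_state

print({s: round(value[s], 1) for s in sorted(value)})
```

The learned values increase with storage, mirroring (in caricature) the storage value surfaces of Section 5.4.4.1; the RLROM replaces the tabular values with a piecewise linear approximation and the toy environment with GOM's LP.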
The results of the case study implemented using the BC Hydro system demonstrated the following:

- The available storage in the GMS and MCA storage reservoirs has a major influence on the storage value, the marginal value, and the optimum control policy. However, the study also demonstrated that there are periods of time and ranges of storage levels in GMS and MCA where this influence is not significant.

- The minimum ice flow constraint on the Peace River downstream of Peace Canyon has a major effect on the release policies and on the marginal values in GMS. Similarly, for the Columbia River system, the required Columbia Treaty operation influences the marginal value of energy in MCA due to the high releases required from the Keenleyside dam during certain periods of the year.

- The model backs off plant generation levels to store water during light load hours (LLH), to take advantage of the difference between heavy load hour (HLH) and light load hour (LLH) price margins, and it exports more when there is a market opportunity with high market prices. During the summer, when the domestic loads are low, the model exports heavily to take advantage of the good market opportunities, particularly in California.

- The moment matching scenario generation results are very encouraging. A sensitivity analysis of the effect of the number of generated scenarios on the mean relative error (MRE) in the moments and correlations determined the appropriate number of scenarios to be used in the RLROM model runs. The number of scenarios derived in this way demonstrated that this is a very practical approach for obtaining good results without excessive amounts of computing.

In summary, it can be concluded that the RL approach is a very effective approach to strategic planning for the operation of large-scale hydroelectric power generation systems. The developed methodology has the potential to significantly improve the operations planning of the integrated BC Hydro system.

6.3. Future Work

There are a number of ways in which the techniques developed in this thesis could be enhanced, and areas into which they could be expanded. The following is a proposed list of future research that could be carried out to extend and enhance this work:

- Investigate the use of multi-agent (independent/cooperative agents) approaches and the use of parallel processing for running the model, to speed up the learning process and decrease CPU time.

- Investigate other RL approaches, including SARSA, Dyna-Q, and other RL techniques that could potentially integrate the planning and learning processes.

- Generalize the use of the function approximation approach outlined in this research. Neural networks are a possible alternative to the currently adopted piecewise linear function approximation technique.

- Automate the process of selecting the values of the RL parameters (learning rate and exploration/exploitation rate).

- Further develop the RL stochastic optimization model for use as an on-line real time control model.

- Extend the model to include other state variables, such as additional plants, and check their effects on the marginal values of energy.

- Model the Columbia River Treaty operation constraints in more detail.

- Include regional transmission limits when modeling regional loads and resources in the BC Hydro system.

REFERENCES

Abdulhai B., R. Pringle, and G. J.
Karakoulas, "Reinforcement learning for True Adaptive Traffic Signal Control," Journal of Transportation  Engineering,  ASCE,  129 (3), 278-285, 2003. Arnold, E . , P. Tatjewski, and P. Wolochowicz, " T w o Methods for Large-Scale Nonlinear Optimization  and  Their  Comparison  Optimization," Journal of Optimization  on  a  Case  Study  of  Theory and Applications,  Hydropower  81(2), 221-248,  1994. Barros, M . , Tsai F., Y a n g , S.-L., Lopes, J., and Y e h , W . "Optimization o f Large-Scale Hydropower System Operations," Journal  of Water Resources  Planning  and  Management, A S C E , 129(3), 178-188, 2003. B C Hydro, "Making  the Connection:  the BC Hydro  electric  system and how it is  operated," 2000. Bellman, R . E . , Dynamic  Programming,  Princeton University Press, Princeton, N e w  Jersey, 1957. Benders, J. F., "Partitioning Procedures for Solving M i x e d Variables Programming Problems," Numerishe Mathematik, 4, 238-252, 1962. Bertsekas,  D . and J. Tsitsiklis, Neuro-Dynamic  Programming,  Athena  Scientific,  Belmont, Mass., 1996. Bertsekas D . P., "Neuro-Dynamic Programming." Encyclopedia  of Optimization,  Kluwer,  2001. Bhattachaya B . , A . H . Lobbrecht, and D . P. Solomatine, "Neural Networks and Reinforcement Learning in Control o f Water Systems," Journal of Water Resources Planning and Management, A S C E , 129(6), 2003. Birge, J. R., "Decomposition and Partitioning Methods for Multistage Stochastic Programs," Oper. Res., 33, 989-1007, 1985.  171  Birge J. R. and M u l v e y J. M . , "Stochastic Programming," In Mathematical  Programming  for Industrial Engineers, eds. M . A v r i e l and B . Golany (Dekker, N e w Y o r k , 1996), 543-574, 1996. Can, E . K . and M . H . Houck, "Real Time Reservoir Operations by Goal Programming," Journal of Water Resources Planning and Management,  A S C E , 110(3), 297-309,  1984. Cario M . , B . Nelson, "Modeling and Generating Random Vectors with Arbitrary Marginal Distributions and Correlation Matrix," Technical Report, Department o f Industrial  Engineering  and  Management  Sciences,  Northwestern University,  Evanston, Illinois, 1997. Changchit, C , and M . P. Terrell, " C C G P M o d e l for Multiobjective Reservoir System," Journal of Water Resources Planning and Management, A S C E , 115(5), 1989. Crawely, P. and G . Dandy, "Optimal Operation o f Multiple Reservoir Systems," Journal of Water Resources Planning and Management, A S C E , 119(1), 1-17, 1993. Coloroni, A . , and G . Fronza, "Reservoir Management V i a Reliability Programming," Water Resources Research, 12(1), 1976. Crites, R. H . and Barto, A . G . , "Improving elevator performance using reinforcement learning," In D . S. Touretzky, M . C . Mozer, M . E . H . , editor, Advances Information Processing Systems: Proceedings  in Neural  of the 1995 Conference, pages 1017-  1023, Cambridge, M A . M I T Press, (1996). Dantzig, G . B . and G . Infanger, "Intelligent Control and Optimization under Uncertainty with Application to Hydro Power," European Journal of Operational Research, 97, 396-407, 1997. Davis, R. E . and R . Pronovost, " T w o Stochastic Dynamic Programming Procedures for L o n g Term Reservoir Management," paper presented at I E E E Power Engineering Society summer meeting, I E E E , San Francesco, California, July 9-14, 1972. Devroye, L u c , Non-Uniform  Random  Variate Generation,  Springer-Verlag, N e w Y o r k ,  1986.  172  Dorigo, M . , E . Bonabeau., and G . Theraulaz, " A n t Algorithms and Stigmergy," Future Generation Comput. 
APPENDIX A
Q-LEARNING NUMERICAL EXAMPLE

Let us define a Markov chain with two states s1 and s2; in each state, two actions a1 and a2 are allowed:

States: S = {s1, s2}
Actions: A(s1) = {a1, a2}, A(s2) = {a1, a2}
Rewards: r(s1, a1), r(s1, a2), r(s2, a1), r(s2, a2)

The transition probability matrices (TPM) associated with actions a1 and a2 are P1 and P2, and the transition reward matrices (TRM) are R1 and R2 (rows correspond to the current state s1, s2; columns to the next state s1, s2):

P1 = | 0.7  0.3 |     P2 = | 0.9  0.1 |
     | 0.4  0.6 |          | 0.2  0.8 |

R1 = |  6   -5  |     R2 = |  10   17 |
     |  7   12  |          | -14   13 |

The objective is to maximize the expected rewards. A discount factor of 0.8 is assumed in this example. The learning rate is defined as α = B/n(s, a), where B is a constant assumed to be 0.1 and n(s, a) is the number of visits to the state-action pair. The exploration rate is ε = C/t, where C is a constant assumed to be 0.9 and t is the number of time periods.

In the beginning, the learning agent tends to explore and selects non-greedy actions: with probability ε it randomly selects an action other than the one suggested by the policy learned so far, π(s) = argmax_a Q(s, a). As the number of episodes grows, the agent gradually becomes greedy and selects the best estimated action at state s with probability 1 - ε.

Calculations for the first ten steps are presented in Table A.1. The following is a detailed description of the calculation procedure in this Q-Learning example.

Period 1:
- Initialize all state-action values (the Q-table) to 0.
- Randomly select an initial state s; the selected state is s1.
- Calculate the exploration rate ε = C/t = 0.9/1.0 = 0.9.
- Randomly sample an action with probability 0.9; the selected action is a1.
- Update the number of visits to the pair (s1, a1): n(s1, a1) = 1.
- Simulate action a1 and observe the transition to the next state and the associated reward. The next state is s2 and the reward r(s1, a1) is -5.
- The learning rate is α = B/n(s1, a1) = 0.1/1.0 = 0.10.
- Update the Q-table using the equation

  Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]

  Q(s1, a1) ← 0 + 0.1 [ -5 + 0.8 · max{0, 0} - 0 ] = -0.50

Period 2:
- The current state is s2. Calculate the exploration rate ε = C/t = 0.9/2.0 = 0.45.
- Sample the actions: select the best (greedy) action with probability 0.55 and a random action with probability 0.45. The selected action is a2.
- Update the number of visits to the pair (s2, a2): n(s2, a2) = 1.
- Simulate the action and observe the transition to the next state and the associated reward. The next state is s2 and the immediate reward r(s2, a2) is 13.
- The learning rate is α = B/n(s2, a2) = 0.1/1.0 = 0.10.
- Update the Q-table: Q(s2, a2) ← 0 + 0.1 [ 13 + 0.8 · max{0, 0} - 0 ] = 1.30

...

Period 10:
- The current state is s2. Calculate the exploration rate ε = C/t = 0.9/10.0 = 0.09.
- Sample the actions: select the best (greedy) action with probability 0.91 and a random action with probability 0.09. The best action at state s2 is a1, since Q(s2, a1) = 0.7 > Q(s2, a2) = 0.607; select action a1.
- Update the number of visits to the pair (s2, a1): n(s2, a1) = 2.
- Simulate the action and observe the transition to the next state and the associated reward. The next state is s1 and the immediate reward r(s2, a1) is 7.
- The learning rate is α = B/n(s2, a1) = 0.1/2.0 = 0.05.
- Update the Q-table: Q(s2, a1) ← 0.7 + 0.05 [ 7 + 0.8 · max{-0.103, 3.016} - 0.7 ] = 1.136
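As a quick check on the arithmetic in the periods above, the following minimal Python/NumPy sketch re-runs the three hand-worked updates. It is not part of the original thesis; the function name q_update and the 0/1 indexing of states and actions are illustrative choices.

```python
import numpy as np

GAMMA = 0.8  # discount factor used in the example

def q_update(Q, s, a, r, s_next, alpha, gamma=GAMMA):
    """One tabular Q-learning update:
    Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q[s, a]

# States s1, s2 and actions a1, a2 are indexed 0 and 1.
Q = np.zeros((2, 2))

# Period 1: in s1 take a1, move to s2, reward -5, alpha = 0.10
print(q_update(Q, 0, 0, -5, 1, 0.10))   # -0.5

# Period 2: in s2 take a2, move to s2, reward 13, alpha = 0.10
print(q_update(Q, 1, 1, 13, 1, 0.10))   # 1.3

# Period 10: starting from the Q-table reached after period 9,
# in s2 take a1, move to s1, reward 7, alpha = 0.05
Q = np.array([[-0.103, 3.016],
              [ 0.700, 0.607]])
print(q_update(Q, 1, 0, 7, 0, 0.05))    # approximately 1.136
```

Table A.1 below lists the corresponding quantities for all ten periods.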
Table A.1. Sample calculations for the Q-Learning example. The transition probabilities and rewards in each period are those of P1, P2, R1, and R2 above; ε = C/t and α = B/n(s, a). The last four columns give the Q-table after the update in each period.

Period  State  ε       Action  n(s,a)  α      Next state  Reward  Q(s1,a1)  Q(s1,a2)  Q(s2,a1)  Q(s2,a2)
  1     s1     0.9000  a1      1       0.100  s2           -5     -0.500     0.000     0.000     0.000
  2     s2     0.4500  a2      1       0.100  s2           13     -0.500     0.000     0.000     1.300
  3     s2     0.3000  a1      1       0.100  s1            7     -0.500     0.000     0.700     1.300
  4     s1     0.2250  a2      1       0.100  s2           17     -0.500     1.804     0.700     1.300
  5     s2     0.1800  a2      2       0.050  s1          -14     -0.500     1.804     0.700     0.607
  6     s1     0.1500  a1      2       0.050  s1            6     -0.103     1.804     0.700     0.607
  7     s1     0.1286  a2      2       0.050  s1           10     -0.103     2.286     0.700     0.607
  8     s1     0.1125  a2      3       0.033  s1           10     -0.103     2.604     0.700     0.607
  9     s1     0.1000  a2      4       0.025  s2           17     -0.103     3.016     0.700     0.607
 10     s2     0.0900  a1      2       0.050  s1            7     -0.103     3.016     1.136     0.607
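For completeness, the whole ε-greedy simulation loop that generates a trajectory like the one tabulated above can be sketched as follows. This is a minimal Python/NumPy sketch under the stated assumptions (discount factor 0.8, α = B/n(s, a) with B = 0.1, ε = C/t with C = 0.9); the variable names and the random seed are arbitrary, so the sampled trajectory will not reproduce Table A.1 entry for entry, although the update rule and the schedules are the same.

```python
import numpy as np

# Transition probabilities P[a][s, s'] and rewards R[a][s, s'] for actions a1, a2
# (rows: current state s1, s2; columns: next state s1, s2), as given above.
P = [np.array([[0.7, 0.3], [0.4, 0.6]]),       # action a1
     np.array([[0.9, 0.1], [0.2, 0.8]])]       # action a2
R = [np.array([[6.0, -5.0], [7.0, 12.0]]),     # action a1
     np.array([[10.0, 17.0], [-14.0, 13.0]])]  # action a2

GAMMA, B, C = 0.8, 0.1, 0.9           # discount factor, learning-rate and exploration constants
rng = np.random.default_rng(0)        # seed chosen arbitrarily

Q = np.zeros((2, 2))                  # Q[s, a]; states and actions indexed 0, 1
visits = np.zeros((2, 2), dtype=int)  # n(s, a), visits to each state-action pair

s = int(rng.integers(2))              # random initial state
for t in range(1, 11):                # the first ten periods
    eps = C / t                       # exploration rate epsilon = C / t
    if rng.random() < eps:
        a = int(rng.integers(2))      # explore: random action
    else:
        a = int(Q[s].argmax())        # exploit: greedy action argmax_a Q(s, a)

    visits[s, a] += 1
    alpha = B / visits[s, a]          # learning rate alpha = B / n(s, a)

    s_next = int(rng.choice(2, p=P[a][s]))   # simulate the transition
    r = R[a][s, s_next]                      # observe the reward

    # Q-learning update
    Q[s, a] += alpha * (r + GAMMA * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)   # estimated Q-table after ten periods
```

Run for many more periods, with ε continuing to decay and every state-action pair still being visited, the greedy policy π(s) = argmax_a Q(s, a) read from the learned table approaches the optimal policy for this two-state problem.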
