Deep Reinforcement Learning Approaches for Process Control

by

Steven Spielberg Pon Kumar

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Chemical and Biological Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

December 2017

© Steven Spielberg Pon Kumar 2017

Abstract

Conventional and optimization-based controllers have been used in the process industries for more than two decades. Applying such controllers to complex systems can be computationally demanding and may require the estimation of hidden states. They also require constant tuning, the development of a mathematical model (first-principles or empirical), and the design of a control law, all of which are tedious. Moreover, they are not adaptive in nature. In recent years, on the other hand, the success of deep learning has driven significant progress in computer vision and natural language processing. Human-level control has been attained in games and physical tasks by combining deep learning with reinforcement learning. These methods have even learned the complex game of Go, which has more states than the number of atoms in the universe. Self-driving cars, machine translation, speech recognition and similar applications have started to take advantage of these powerful models. In every case, the approach was to formulate the task as a learning problem. Inspired by these applications, in this work we pose process control as a learning problem and build controllers that address the limitations of current controllers.

Lay Summary

Conventional controllers, such as Proportional and Proportional-Integral controllers, require a mathematical expression of the process to be controlled and involve controller tuning. Optimization-based controllers such as Model Predictive Control, on the other hand, are slow when controlling complex systems, especially those with a large number of inputs and outputs, because of the optimization step. Our work aims to address some of the drawbacks of conventional and optimization-based process control. We use an artificial intelligence approach to build a self-learning controller that tackles these limitations. The self-learning controller learns to control a process simply by interacting with it. Among the advantages of this approach: it does not require a mathematical expression of the process, it does not involve controller tuning, and it is fast because it has no online optimization step.

Preface

The thesis consists of two parts. The first part, discussed in Chapter 3, presents a new neural network architecture designed to learn the complex behaviour of Model Predictive Control (MPC). It overcomes some of the challenges of MPC.
The contributions of this part have been submitted in:

Refereed Conference Publication: Steven Spielberg Pon Kumar, Bhushan Gopaluni, Philip Loewen, "A deep learning architecture for predictive control," Advanced Control of Chemical Processes, 2018 [submitted].

The contributions of this part have also been presented in:

Conference Presentation: Steven Spielberg Pon Kumar, Bhushan Gopaluni, Philip Loewen, "A deep learning architecture for predictive control," Canadian Chemical Engineering Conference, Alberta, 2017.

The second part of the work, explained in Chapter 4, discusses a learning-based approach to designing a controller that can control the plant. The motivation for this approach is to overcome some of the limitations of conventional and optimization-based controllers. The contributions of this part have been published in:

Refereed Conference Publication: Steven Spielberg Pon Kumar, Bhushan Gopaluni and Philip D. Loewen, "Deep Reinforcement Learning Approaches for Process Control," Advanced Control of Industrial Processes, Taiwan, 2017.

The contributions of this part have also been presented in:

Presentation: Steven Spielberg Pon Kumar, Bhushan Gopaluni, Philip Loewen, "Deep Reinforcement Learning Approaches for Process Control," American Institute of Chemical Engineers Annual Meeting, San Francisco, USA, 2016.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acronyms
Acknowledgements
Dedication
1 Introduction
  1.1 Overview
  1.2 Limitations
  1.3 Motivation
  1.4 Challenges
  1.5 Encouraging progress
  1.6 Contributions and Outline
2 Background
  2.1 Machine Learning
  2.2 Supervised Learning
  2.3 Optimization
    2.3.1 Forward Pass
    2.3.2 Backpropagation
  2.4 Neural Networks
    2.4.1 Simple Neural Networks
    2.4.2 Recurrent Neural Networks
    2.4.3 Long Short-Term Memory
  2.5 Model Predictive Control
    2.5.1 MPC Strategy
    2.5.2 MPC Components
  2.6 Reinforcement Learning
    2.6.1 Introduction
    2.6.2 Markov Decision Processes
    2.6.3 Learning Formulation
    2.6.4 Expected Return
    2.6.5 Value-Based Reinforcement Learning
    2.6.6 Policy-Based Reinforcement Learning
    2.6.7 Policy Gradient Theorem
    2.6.8 Gradient Ascent in Policy Space
    2.6.9 Stochastic Policy Gradient
    2.6.10 Deterministic Policy Gradient
  2.7 Function Approximation
    2.7.1 Temporal Difference Learning
    2.7.2 Least-Squares Temporal Difference Learning
  2.8 Actor-Critic Algorithms
  2.9 Exploration in Deterministic Policy Gradient Algorithms
  2.10 Improving Convergence of Deterministic Policy Gradients
    2.10.1 Momentum-based Gradient Ascent
    2.10.2 Nesterov Accelerated Gradient
  2.11 Summary
3 Offline Policy Learning - A deep learning architecture for predictive control
  3.1 Introduction
  3.2 Related Work
  3.3 Model
    3.3.1 Long Short Term Memory (LSTM)
    3.3.2 Neural Network (NN)
    3.3.3 LSTMSNN
  3.4 Experimental setup
    3.4.1 Physical system
    3.4.2 Implementation
    3.4.3 Data collection
  3.5 Simulation Results
  3.6 Conclusion
4 Online Policy Learning using Deep Reinforcement Learning
  4.1 Related Work
  4.2 Policy Representation
    4.2.1 States
    4.2.2 Actions
    4.2.3 Reward
    4.2.4 Policy and Value Function Representation
  4.3 Learning
  4.4 Implementation and Structure
  4.5 Simulation Results
    4.5.1 Example 1 - SISO Systems
    4.5.2 Example 2 - MIMO System
  4.6 Conclusion
5 Conclusion & Future Work
Bibliography

List of Tables

2.1 Sample input output data
3.1 MPC details
3.2 Details of comparison models
3.3 Performance comparison by testing different data set
3.4 Performance comparison between methods

List of Figures

1.1 Success of deep learning and motivation for a learning-based controller [43], [44], [45], [46]
2.1 Plot of the data given in Table 2.1
2.2 Plot of the data along with the line learnt using an optimization technique
2.3 Machine learning framework
2.4 Left: linear function learnt on data (red dots); right: example of overfitting [47]
2.5 Data flow diagram of the supervised learning framework in machine learning
2.6 Neural network representation of linear regression on a 1-dimensional data set
2.7 Logistic regression on a 1-dimensional input data set
2.8 Loss function vs. parameter
2.9 How training and validation sets are chosen for k-fold cross validation [48]
2.10 A neural network on a 1-dimensional input data set
2.11 A different representation of a neural network as used in the literature
2.12 Compact way of representing neural networks with just an input layer, hidden layer and output layer
2.13 The real neural network in the brain merged and connected with the artificial neural network [49], [50]
2.14 A recurrent neural network
2.15 Block representation of an RNN (red: input layer, green: hidden layer, blue: output layer)
2.16 LSTM internal structure [30]
2.17 Sample RNN architecture for time series prediction
2.18 MPC strategy [52]
2.19 Basic structure of MPC [52]
2.20 Prediction and control horizon
2.21 Reinforcement learning flow diagram
3.1 Computation flow of the training process
3.2 Computation flow using the trained model to control the system
3.3 Architecture of the LSTM-only model
3.4 Architecture of the NN-only model
3.5 LSTMSNN model
3.7 The different types of interaction
3.8 Comparison of the NN-only model and the LSTMSNN model tracking a set point of 5
3.9 Comparison of the NN-only model and the LSTMSNN model tracking a set point of 10
3.10 LSTMSNN's system output under various initial conditions; the green line is the target output, other colours denote system outputs under different initial conditions
3.11 LSTMSNN's system output under a varying target output; green line is the target output, blue line is LSTMSNN's system output
3.12 LSTMSNN's system output under a varying target output; green line is the target output, blue line is LSTMSNN's system output
4.1 Transition of states and actions
4.2 Learning overview
4.3 Learning curve: rewards to the reinforcement learning agent for system (4.6) over time; here 1 episode equals 200 time steps
4.4 Output and input response of the learnt policy for system (4.6)
4.5 Output and input response of the learnt system with output noise of σ² = 0.1 for SISO system (4.6)
4.6 Output and input response of the learnt system with output noise of σ² = 0.3 for SISO system (4.6)
4.7 Output and input response of the learnt system with input noise of σ² = 0.1 for SISO system (4.6)
4.8 Output and input response of the learnt SISO system with set point change and output noise of σ² = 0.1
4.9 Output and input response of the learnt system with set point change and output and input noise of σ² = 0.1 for SISO system (4.6)
4.10 Output and input response of the learnt SISO system (4.6) for set points unseen during training
4.11 Output and input response of the first variable with output noise of σ² = 0.01 of the 2×2 MIMO system (4.8)
4.12 Output and input response of the second variable with output noise of σ² = 0.01 of the 2×2 MIMO system (4.8)

Acronyms

DL       Deep Learning
DRL      Deep Reinforcement Learning
LSTM     Long Short Term Memory
LSTMSNN  Long Short Term Memory Supported Neural Network
MPC      Model Predictive Control
NN       Neural Network
P        Proportional Controller
PI       Proportional Integral Controller
PID      Proportional Integral Derivative Controller
RL       Reinforcement Learning

Acknowledgements

It gives me immense pleasure to express my deep sense of gratitude and thankfulness to Dr. Bhushan Gopaluni who, through more than 700 emails, many meetings, phone calls and Skype conversations over two years, chiselled me from a curious student into a competent researcher and developer. Dr. Bhushan's vision, passion and aspiration are infectious. Whenever I had a narrow idea he transformed it into a long-term, broad, visionary one and patiently repeated it and helped me until I got it. Whenever I sent him an unstructured draft he patiently gave me feedback and corrections until I produced a good one. He also gave me plenty of freedom in approaching the research problem we were working on and in granting me permission to take courses relevant to my research. He gave me opportunities that helped me gain the skills needed for the project. The two years at UBC were full of learning, and it would not have been possible without Dr. Bhushan. His teachings have helped me not only during the project period but will inspire me in the future as well.

Next, I am greatly thankful to Dr. Philip Loewen for helping me throughout my masters project and for giving me the kind opportunity to pursue research in such an esteemed field (machine learning + process control). He was very encouraging and has been a constant support to me.
I am very thankful for his detailed comments on my writing and for his encouragement during my presentation. His comments will help me throughout my career.

I am highly grateful to Dr. Ryozo Nagumune for his time in reviewing my thesis and for giving me valuable feedback and corrections.

I would like to thank Dr. Mark Schmidt for helping provide the platform I needed to improve my machine learning skills. I want to thank my department for being very responsive and supportive; in particular I want to express my gratitude to Marlene Chow. I am thankful to Dr. Peter Englezos (Head of the Department of Chemical and Biological Engineering), Dr. Elod Gyenge (Graduate Advisor), Rockne Hewko (Graduate and Postdoctoral Studies) and Gina Abernathy for helping me throughout my masters.

I would like to thank Smriti Shyammal, McMaster University, for being very supportive during my ups and downs. His intuitive recommendations, suggestions and guidance, supported with case studies, thoughtful discussions and examples, were among the best things during the course of my masters degree. I have to thank my lab mates Lee Rippon, Quigang Lee, Yiting and Hui, who were available whenever I had questions. I would like to thank Mitacs for the financial support. I also want to express my thanks to Kevin Shen and Zilun Peng for helpful discussions. I am also indebted to Dhruv Gupta, University of Wisconsin Madison, and Raj Ganesh for being a constant support.

Last but not least, I would like to thank my mother Packiavathy Geetha for all the struggles and hard work she put into raising me. Thanks for all her sacrifices to educate me. I also want to thank my father for being supportive of my family. A big thank you to my brother for being very supportive of me and of my family in my absence from home. Thanks to Ebenezer, Annie and Mohan for being supportive. Finally, I am always thankful to the Lord Almighty for all the opportunities and success he has given me in life.

Dedication

To my mother Packiavathy Geetha.

Chapter 1
Introduction

1.1 Overview

One of the dreams of the field of process control is to enable controllers to understand the plant dynamics fully and then act optimally to control the plant. This is challenging because online learning, which comes relatively easily to human beings through evolution, is hard for machines.

The current approaches to control are either classical control or optimization-based control. These approaches do not take intelligent decisions; they obey rules prescribed to the controller by a control expert. Conventional approaches require an expert with relevant domain knowledge to transfer that knowledge to the controller via a control law and other mathematical derivations. Optimization-based controllers such as Model Predictive Control, on the other hand, look ahead into the future and take action considering future errors, but suffer from the fact that the optimization step takes time to return the optimal control input, especially for complex, high-dimensional systems. The drawbacks of current approaches to process control are discussed in detail in the limitations section below.

1.2 Limitations

There are many limitations to the current approaches to industrial process control. A classical controller requires very careful analysis of the process dynamics. It requires a model of the process, derived either from first principles or empirically. If the system under consideration is complex, then a first-principles model requires advanced domain knowledge to build. In certain cases, conventional approaches involve deriving a control law that meets the desired criteria (either to track a set point or to operate economically). They involve tuning the controllers with respect to the system to be controlled.
Model maintenance is very difficult and rarely achieved. Optimization-based controllers, on the other hand, look ahead into the future and take future errors into account. However, computing optimal control actions via optimization takes time, especially for non-linear, high-dimensional, complex systems. They cannot handle uncertainty about the true system dynamics or noise in the observation data. A further burden of optimization-based controllers such as MPC is the estimation of hidden states, since those states are often not readily available. MPC also demands tuning of the prediction horizon, the control horizon, the weights given to each output, and so on. As a prerequisite, MPC requires an accurate model of the system to start with. Moreover, over the course of time the controller will not possess an updated model, and its performance will start to deteriorate.

1.3 Motivation

Given the above limitations, can we build a learning controller that does not require 1) intensive tuning, 2) a model of the plant dynamics, or 3) model maintenance, and that 4) takes instant control actions? This work aims to develop such next-generation intelligent controllers.

1.4 Challenges

It is challenging to learn the entire trajectory of a plant's output along with the set point. Deploying these learning controllers in production could be risky, so it must be ensured that the controller has already been trained offline before it is implemented on the real plant. The action space and state space in process control applications are continuous, hence function approximators are used with reinforcement learning to handle the continuous state and action spaces. Since deep neural networks have been found to be effective function approximators and have recently found great success in image [1], speech [7] and language understanding [8], they are commonly used as function approximators in reinforcement learning. However, to speed up the training of deep neural networks, they must be trained on a Graphics Processing Unit (GPU) to parallelize the computation. This can be tricky, since care must be taken to manage the memory allotted to computations on the GPU. The reinforcement learning based controller involves training two neural networks simultaneously, and tuning those two networks for the first time was challenging because of the enormous number of options available as tunable parameters (hyperparameters). In addition, running simulations takes a lot of time, so re-tuning the hyperparameters to retrain the network was also challenging.

1.5 Encouraging progress

In recent years, there has been significant progress in the fields of computer vision and natural language processing following the success of deep learning. The deep learning era started in 2012, after a deep network won the image classification challenge by a huge margin compared to other approaches [1]. In machine translation, conventional probabilistic approaches were replaced with deep neural networks [53]. From then on, self-driving cars started to take advantage of deep nets [54]. They use deep learning to learn the optimal control behaviour required to drive the car without human help. This is a form of learning control problem. Human-level control has also been attained in games [2] and physical tasks [3]. Computers were able to beat human beings at classic Atari games [2]. Deep neural nets have even learnt the complex game of Go [15].
This is again posed as a learning control problem. The game of Go has about 1.74 × 10^172 states, which is more than the number of atoms in the universe. The question is: can we take advantage of these powerful models to build a learning controller that overcomes the limitations of current control approaches? Our work aims to answer this question. Supporting the use of such powerful models, the number of calculations per second available has been growing exponentially over the years.

Figure 1.1: Success of deep learning and motivation for a learning-based controller: self-driving cars, the complex game of Go, learning from humans via VR, and human-level control in games [43], [44], [45], [46]

1.6 Contributions and Outline

We present two principal contributions.

1) Optimization-based control: We propose a new deep learning architecture to learn the complete behaviour of MPC. Once the deep neural network is learnt, it can take action instantaneously, without having to estimate hidden states and with no explicit prediction or optimization, since this information is learnt during training.

2) Classical control: We have developed a learning-based controller that can learn the plant dynamics and determine control actions just by interacting with the plant. These controllers do not require tuning and are adaptive in nature. The learnt controller is also fast in returning optimal control actions.

The thesis is organized as follows: Chapter II provides background information about machine learning, deep learning, MPC and reinforcement learning. Chapter III discusses our contribution to optimization-based control. Chapter IV discusses our work on learning-based controller design using deep reinforcement learning. Chapter V presents conclusions and future directions for this work.

The work discussed in Chapter III is based on supervised learning, using data from MPC to learn the control behaviour, whereas the work discussed in Chapter IV is based on reinforcement learning. The supervised learning approach with MPC is preferred when MPC has to be applied to a complex system, because the optimization step for a complex system takes time, and in the supervised approach that optimization is learnt offline from MPC data. The reinforcement learning approach is preferred when 1) it is tedious to develop or derive a plant model, 2) a controller that can adapt to changes in the plant is needed, and 3) the control behaviour is to be learnt with function approximators, since the learnt control policy (the mapping from states to actions) represented by a function approximator is fast to evaluate.

Chapter 2
Background

The thesis has two parts: 1) using deep learning to learn the control behaviour of Model Predictive Control, and 2) using deep learning and reinforcement learning to learn the optimal behaviour online, just by interacting with the plant. Hence, this chapter discusses the background on machine learning, neural networks, Model Predictive Control (MPC) and deep reinforcement learning. For detailed information about these topics we recommend the deep learning book by Goodfellow et al. [16] and the reinforcement learning book by Sutton and Barto [17].

2.1 Machine Learning

Let us discuss machine learning with an example. Consider the data set given in Table 2.1, showing the time taken to move a box 50 m versus the mass of the box.

Table 2.1: Sample input output data

Box Mass (kg)    Time taken to move 50 m (seconds)
1                2.08
2                4.95
3                5.63
4                7.16
5                11.76
6                14.39
7                17.03
8                16.06
9                18.46
10               20

Now let us visualize this data as shown in Figure 2.1.
Figure 2.1: Plot of the data given in Table 2.1

The data is given only for 10 specific boxes. What if we want to find the time taken to move a 5.5 kg box? In this case we have to learn from the data. Learning from the data essentially means learning the parameters, or weights, of a machine learning model. Since the data visualized in Figure 2.1 look linear, a linear model is enough to fit them:

\[ y = \theta_1 x + \theta_2 \quad (2.1) \]

where θ_1 and θ_2 are the parameters or weights of the equation. Equation (2.1) is known as the linear regression machine learning model. The parameters θ_1 and θ_2 are learnt using optimization techniques, which will be discussed later in this chapter. Once θ_1 and θ_2 are learnt, the data can be plotted along with the machine learning model, in this case a line, as follows:

Figure 2.2: Plot of the data along with the line learnt using an optimization technique

Knowing the line learnt from the data we had, we can answer questions about the data set even for box masses we do not have. Here θ_1 is the slope of the line and θ_2 is its intercept. For instance, for the above data set and the linear regression model, the parameters learnt using optimization are θ_1 = 2.2 and θ_2 = 0. The time taken to move a 5.5 kg box 50 m would then be 2.2 × 5.5 + 0 = 12.1 s. To learn the parameters of this linear model, we supervise our algorithm with input and corresponding output data, hence the name supervised learning. We will discuss this further in the next section.

Figure 2.3: Machine learning framework

In general, the machine learning framework can be summarized by the flow diagram in Figure 2.3. First we have training data, and we use an optimization technique to learn the parameters. The learnt parameters can then be used with new data to predict the corresponding outputs.
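To make the box example concrete, the sketch below fits the line of equation (2.1) to the data in Table 2.1 by ordinary least squares. This is a minimal illustration in Python/NumPy, not the procedure used in the thesis; the recovered slope and intercept should be close to the values θ_1 ≈ 2.2, θ_2 ≈ 0 quoted above.

```python
import numpy as np

# Data from Table 2.1: box mass (kg) and time to move the box 50 m (s).
mass = np.arange(1, 11, dtype=float)
time = np.array([2.08, 4.95, 5.63, 7.16, 11.76, 14.39, 17.03, 16.06, 18.46, 20.0])

# Design matrix [x, 1] so that y ~ theta1 * x + theta2, as in equation (2.1).
X = np.column_stack([mass, np.ones_like(mass)])

# Ordinary least squares: minimize ||X theta - y||^2 over theta.
theta, *_ = np.linalg.lstsq(X, time, rcond=None)
theta1, theta2 = theta
print(f"slope theta1 = {theta1:.2f}, intercept theta2 = {theta2:.2f}")

# Predict the time for a 5.5 kg box, as in the text.
print("predicted time for a 5.5 kg box:", theta1 * 5.5 + theta2)
```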
2.2 Supervised Learning

We have seen a simple example of machine learning; now let us go through supervised learning in detail. Many practical problems can be formulated as a mapping f : X → Y, where X is an input space and Y is an output space. In the example above, the input space is mass and the output space is time. In most cases it is not easy to come up with the mapping f manually. The supervised learning framework takes advantage of the fact that it is often relatively easy to obtain sample pairs (x, y) ∈ X × Y such that y = f(x). For the example data set, this corresponds to collecting, for each box in the input space, the time taken to move it 50 m, annotated by humans.

Supervised Learning Objective. Precisely, we have a "training data set" of n examples (x_1, y_1), ..., (x_n, y_n) made up of independent and identically distributed (i.i.d.) samples from a data generating distribution D; i.e., (x_i, y_i) ∼ D for all i. We frame learning the mapping f : X → Y as searching over a set of candidate functions and finding the one that is most consistent with the training examples. D is the distribution from which the data are generated, and we are interested in finding the function f* (from the set of candidate functions) that best approximates it. Concretely, we consider a particular set of functions F and choose a loss function, a scalar L(x, ŷ, y) that expresses the disagreement between a predicted label ŷ = f(x) for some f ∈ F and the true label y, and we seek

\[ f^* = \arg\min_{f \in \mathcal{F}} \; \mathbb{E}_{(x,y) \sim D} \, L(f(x), y) \quad (2.2) \]

In other words, we are interested in the function f* that minimizes the expected loss over the distribution D from which the data are generated. Once this function is determined, we keep only the function itself and use it to map new inputs from X to Y; the initial training data can be discarded.

Unfortunately, the above optimization objective cannot be solved directly, because we do not have all possible samples (or elements) of D, and hence the expectation cannot be evaluated analytically without making assumptions about the form of D, L or f. Since the available samples from D are i.i.d., we can approximate the expectation in equation (2.2) by averaging the loss over the available training data:

\[ f^* \approx \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) \quad (2.3) \]

Even though we approximate the expectation using only the available training data, this is a good approximation to equation (2.2), especially if we have a large amount of data, and it serves as a good proxy for the original objective.

Overfitting. Sometimes the objective in equation (2.3) learns so much about the training data that it starts to fit it exactly without learning the general pattern of the data. This scenario is called overfitting.

Figure 2.4: Left: linear function learnt on data (red dots); right: example of overfitting [47]

Figure 2.4 illustrates overfitting. The red dots represent the data points and the blue line the function f* learnt from the data using optimization. The trend of the data looks linear, so a linear function, as shown on the left, is enough to describe the data; the deviation of the data points away from the blue line can be regarded as noise. On the right, in contrast, the function (blue line) interpolates all the points without learning the general pattern of the underlying data. The right-hand figure is an example of a function overfitting the data.

Regularization. Consider a function f that maps x_i to y_i exactly on the training data but returns zero for any other input. This is a possible minimizer of equation (2.3), since the minimum is attained whenever ŷ = y on the training set, yet we expect a high loss for points of D that are not in the training data. The problem is that the optimization objective only asks for weights that bring ŷ and y close on the training data. To make the objective generalize beyond the training data, an additional term R, called a regularization term, is added:

\[ f^* \approx \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + R(f) \quad (2.4) \]

where R is a scalar-valued function. It adds a penalty on the weights of the model to reduce the model's freedom; the model is then less likely to fit the noise in the training data, and its ability to generalize improves. Two common forms of the regularizer R are:

- L1 regularization (also called Lasso)
- L2 regularization (also called Ridge)

L1 regularization shrinks some parameters exactly to zero.
It is called L1 regularization because the penalty is the sum of the absolute values (the 1-norm) of the model's weights/parameters; as a result, some variables end up playing no role in the model.

L2 regularization forces the parameters to be small, the more so the larger the penalty weight. It is called L2 regularization because the regularization term R(f) is the squared 2-norm of the weights, so minimizing the overall objective together with the regularization term shrinks the parameters/weights of the model.

Linear Regression. Consider a 2-dimensional input data set of 100 samples (X = R²), each corresponding to a scalar output (Y = R). Since this is linear regression, the set of functions F we consider is the set of linear maps from X to Y, i.e., F = {w^T x + b | w ∈ R², b ∈ R}. Since our input has 2 dimensions (2 variables), 2 weights are needed, plus one more weight as a bias term. Hence our hypothesis space has 3 parameters (w_1, w_2, b), where w = [w_1, w_2].

The next step is to learn the weights w_1, w_2 and b, so a loss function has to be chosen. For regression, the commonly used loss is the squared loss (the squared difference between the predicted value and the target value), L(ŷ, y) = (ŷ − y)². As we saw earlier, we add a regularization term, R(w, b) = λ(w_1² + w_2²), which favours parameters that are small. Combining everything, the optimization objective is

\[ \min_{w_1, w_2, b} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \left( w^{T} x_i + b - y_i \right)^2}_{\text{fit the training data}} \; + \; \underbrace{\lambda \left( w_1^2 + w_2^2 \right)}_{\text{regularization}} \quad (2.5) \]

With models of the form y_i = w x_i + b it is common to regularize only the weights w, leaving the bias b free to be as large or small as it wants, with no penalty. The variance penalty for leaving the bias (intercept) unregularized is likely to be smaller than the reward we reap in terms of decreased bias, since most linear models do have some nonzero intercept. Also, the bias term does not interact multiplicatively with the inputs; it merely allows the model to offset itself away from the origin. The bias typically requires less data to fit accurately than the weights: each weight specifies how two variables interact and requires observing both variables in a variety of conditions to fit well, whereas the bias controls only a single variable. This means we do not induce too much variance by leaving the bias unregularized, while regularizing the bias can introduce a significant amount of underfitting.

Figure 2.5: Data flow diagram of the supervised learning framework in machine learning

Figure 2.5 shows the data flow diagram of the supervised learning framework. Here x is the training input and y is the corresponding label/output of the training data. We have three decisions to make: 1) the function f, the mapping from input x to output y (a neural network in this dissertation); 2) the loss function L, the squared error between the predicted output of the machine learning model and the actual output (the problems in this dissertation are all formulated as regression, so the squared error is used throughout); and 3) the regularization function R, which is the L2 regularizer, again because this dissertation only considers regression formulations. Once the predicted output is computed from the weights and the machine learning model, the data loss is calculated, and the overall loss is the sum of the data loss and the regularization loss.
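As an illustration of the regularized objective (2.5), the sketch below solves it in closed form on synthetic 2-dimensional data, penalizing only w_1 and w_2 and leaving the bias free, as recommended in the text. This is a minimal sketch; the data, the true parameters and the value of λ are invented for the example, and the normal-equations route is just one convenient way to minimize (2.5).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data as in the text: 100 samples, X in R^2, y in R.
n = 100
X = rng.normal(size=(n, 2))
true_w, true_b = np.array([1.5, -2.0]), 0.7
y = X @ true_w + true_b + 0.1 * rng.normal(size=n)

lam = 0.1  # regularization weight lambda in equation (2.5)

# Augment X with a column of ones for the bias, and penalize only w1, w2 (not b).
# Setting the gradient of (2.5) to zero gives (A^T A / n + lam * P) theta = A^T y / n,
# where P zeroes out the penalty on the bias entry.
A = np.column_stack([X, np.ones(n)])
P = np.diag([1.0, 1.0, 0.0])
theta = np.linalg.solve(A.T @ A / n + lam * P, A.T @ y / n)
w, b = theta[:2], theta[2]
print("learnt weights:", w, "learnt bias:", b)
```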
Linear regression in neuron representation. We described linear regression in the previous section. Since we will deal with neural networks later in this chapter, linear regression can be represented in the diagrammatic notation popular for neurons, as follows:

Figure 2.6: Neural network representation of linear regression on a 1-dimensional data set

Figure 2.6 shows the neural network representation of linear regression. In this case the model is a weighted combination of the inputs and the weights, with θ_1 and θ_2 as the weights or parameters. The activation function, represented here by a slanting line, is just the identity function (o_1 = u_1). More advanced activation functions will be used later in this chapter for neural networks.

Logistic regression in neuron representation. Now consider using linear regression for a binary classification task with labels 1 and 0. The drawback of linear regression here is that the output is not bounded between 0 and 1. This is addressed by logistic regression, a simple non-linear model for classification tasks. Its neuron representation is as follows:

Figure 2.7: Logistic regression on a 1-dimensional input data set

Figure 2.7 shows logistic regression in neural network representation. First, as in linear regression, a weighted combination of the inputs and weights is formed, with θ_1 and θ_2 as the weights or parameters. Since this output is not bounded between 0 and 1, a sigmoid activation function, σ(x) = 1/(1 + e^{−x}), is applied; it squashes the output to lie between 0 and 1. The predicted output ŷ_i of the logistic regression in Figure 2.7 is

\[ \hat{y}_i = \frac{1}{1 + e^{-(\theta_1 + \theta_2 x_i)}} \quad (2.6) \]

The objective is to find the parameters θ that minimize the loss function

\[ L(x, \hat{y}, y) = - \sum_{k=1}^{K} y_k \log \hat{y}_k \quad (2.7) \]

Equation (2.7) is the cross-entropy between two distributions, H(p, q) = −Σ_x p(x) log q(x). Since we interpret the output of logistic regression as the probabilities of the two classes (0 and 1), we are effectively minimizing the negative log probability of the correct class. This is also consistent with a probabilistic interpretation of the loss as maximizing the log likelihood of the class y conditioned on the input x.

We have seen linear regression and logistic regression. The goal now is to learn all the components of the parameter vector θ using optimization techniques. Let us take a digression to discuss the optimization techniques used in machine learning.
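As a small illustration of equations (2.6) and (2.7), the sketch below evaluates the logistic regression prediction and its cross-entropy loss for a 1-dimensional input. This is a minimal sketch only; the parameter values and the tiny data set are arbitrary choices for the example, and no training is performed here.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1), as used in equation (2.6).
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta1, theta2, x):
    # Logistic regression prediction: yhat = sigma(theta1 + theta2 * x).
    return sigmoid(theta1 + theta2 * x)

def cross_entropy(y, yhat):
    # Binary form of equation (2.7): -[y log(yhat) + (1 - y) log(1 - yhat)].
    return -(y * np.log(yhat) + (1.0 - y) * np.log(1.0 - yhat))

# Arbitrary example parameters and a small labelled data set.
theta1, theta2 = -1.0, 0.8
x = np.array([0.5, 2.0, -1.0])
y = np.array([0.0, 1.0, 0.0])

yhat = predict(theta1, theta2, x)
print("predicted probabilities:", yhat)
print("mean cross-entropy loss:", cross_entropy(y, yhat).mean())
```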
2.3 Optimization

In linear regression we saw how the vector θ can be learnt from the data by minimizing the squared error between the predicted output and the actual labelled output. Specifically, θ* = arg min_θ g(θ), where θ is the parameter vector to be learnt and g is the sum of the data loss and the regularization loss. Let us look at some of the gradient-based techniques used to find the optimal parameters θ.

Derivative-free optimization. The simplest technique one could think of is to evaluate g(θ) for various θ. One approach is 'trial and error': draw a number of θ values at random from some distribution, evaluate g(θ) for each, and simply pick the one that minimizes g. Another method is to apply perturbations to θ and continuously monitor the loss g(θ) (the error between predicted and actual output). However, neither the trial-and-error approach nor the weight-perturbation approach scales, as the neural networks discussed in this dissertation have hundreds of thousands of parameters.

First-order methods. Instead of searching for the optimal parameters of the machine learning model by brute force, we can make assumptions about g that let us compute the parameters efficiently. If we assume g to be differentiable then, since our loss function is convex (for one parameter its graph is a parabola), the function is minimized at the point where the slope of the loss with respect to the parameter is zero. In other words, the slope or gradient ∇_θ g at the minimum is zero. The vector ∇_θ g is composed of the first derivatives of g, which is why such techniques are called first-order methods.

Figure 2.8: Loss function vs. parameter

We can use the gradient to find the minimum efficiently: if we move away from the direction of the gradient, we approach the minimum. In Figure 2.8, at θ_0 (the point of initialization) the gradient is positive (positive slope). Since the slope is positive, we take a step (of size ε, the so-called learning rate) in the opposite, negative direction, resulting in the point θ_2. This process continues until we reach the value of θ at which the loss is minimal. This strategy is the basis of the gradient descent algorithm: 1) evaluate the gradient of the loss function with respect to the parameters at the current point, and 2) update the parameters in the direction opposite to the gradient. Gradient descent requires computing the gradient over all data samples, which may not be practical when the data set is large. Hence we usually split the data set into batches, compute the gradient on a batch, and then update the parameters. In the extreme case of batch size 1 this is called Stochastic Gradient Descent (SGD); the mini-batch version is summarized in Algorithm 1.

Algorithm 1: Stochastic Gradient Descent
Given a starting point θ ∈ dom g
Given a step size ε ∈ R+
repeat
  1. Sample a mini-batch of m examples (x_1, y_1), ..., (x_m, y_m) from the training data
  2. Estimate the gradient ∇_θ g(θ) ≈ ∇_θ [ (1/m) Σ_{i=1}^{m} L(f_θ(x_i), y_i) + R(f_θ) ] using backpropagation
  3. Compute the update direction: δθ := −ε ∇_θ g(θ)
  4. Update the parameters: θ := θ + δθ
until convergence
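The sketch below applies Algorithm 1 to the regularized linear model of equation (2.5). It is a minimal NumPy illustration, not the training code used in this thesis; the batch size, learning rate, number of epochs and the synthetic data are arbitrary choices, and the bias is left unregularized as discussed earlier.

```python
import numpy as np

def sgd_ridge(X, y, lam=0.1, lr=0.05, batch=10, epochs=200, seed=0):
    """Mini-batch SGD (Algorithm 1) for the objective in equation (2.5)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        # Shuffle the indices and split them into mini-batches of size `batch`.
        for idx in np.array_split(rng.permutation(n), n // batch):
            xb, yb = X[idx], y[idx]
            err = xb @ w + b - yb                              # prediction error on the batch
            grad_w = 2 * xb.T @ err / len(idx) + 2 * lam * w   # data loss + L2 penalty
            grad_b = 2 * err.mean()                            # bias is left unregularized
            w -= lr * grad_w                                   # step against the gradient
            b -= lr * grad_b
    return w, b

# Example usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + 0.7 + 0.1 * rng.normal(size=100)
print(sgd_ridge(X, y))
```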
0.05) as a safety margin.Gradually decaying the learning rate over time is a good practice. This canbe achieved by multiplying the learning rate by a small factor (e.g. 0.1) onceevery few iterations to lower the learning rate. The convergence rates ofthis heuristic is discussed in [55] .Polyak averaging also provides consistentaveraging [18]. In Polyak averaging the average parameter vector, θ¯ (e.g.θ¯ = 0.999θ¯+0.001θ) is computed after every parameter update. At test timeit uses θ¯.Other optimization techniques. Usually the performance of gradientdescent techniques can be improved by modifying the update step which isline number 3 in Algorithm 1. The first popular improvement that was madeto gradient descent is the momentum update. It is framed in a way to progressalong slow towards the direction of the gradient consistently. If the gradientis denoted as ∇θg(θ) for the gradient vector, the update step δθ is found byupdating an additional step with the variable v := αv+g (initialized at zero).The update step now becomes δθ := −v. The variable v encourages thegradient update to proceed in a consistent (not identical) direction becauseit contains an exponentially-decaying sum of previous gradient directions. Itcomes with inspiration from physics, where the gradient is interpreted as theforce F on a particle with position θ, hence this increments position directly,which is supported by Newton's second law F = dpdt, where the mass m isassumed to be constant and p = mv is the momentum term. The momentum242.3. Optimizationupdate technique with a running estimate of the mean (first moment). Basedon this, many similar methods of increased efficiency have been proposed.ˆ Instead of the intermediate variable v as in momentum step, the in-termediate variable may have the update rule, r := r + g  g of sumof squared gradients (where  is element wise multiplication insteadof vector or matrix multiplication). The update is given as follows:δθ := − γ+√r g, where γ is added to prevent the division by zero,usually the value of γ is set to 10−5. This update is known as Adagradupdate [19]ˆ Instead of running mean first moment if second moment which varianceis used as given by: r := ρr + (1 − ρ)g  g, where ρ is usually set to0.99. The update is known as RMSProp update [20]ˆ If the update step of Adagrad and RMSProp is combined, which isthe estimates of both first and second running moments, the updateis called the Adam update [21]. As in Adagrad and RMSProp, γ isusually set to 10−5 and ρ is usually set to 0.99. This update takes us tothe optimum point with less oscillation. Throughout the dissertationduring training the Adam method is employed for gradient descentupdate.Cross-validation. We saw several other parameters called ’hyper parame-ters’ apart from the regular parameters or weights we wanted to learn. Someof the hyper parameters are the learning rate , the regularization weightsλ, the number of hidden units in neural network (see below) etc. Usually252.3. Optimizationto determine good parameters, the available data set is split into trainingand validation samples. Once the model is trained on training set, it willbe tested on validation set to see how well it performs. Usually this is doneby comparing the predicted value of validation sample to true value of vali-dation sample output. One familiar cross validation technique is the k-foldcross validation. 
Cross-validation. Besides the parameters or weights we want to learn, we have seen several other parameters, called 'hyperparameters': the learning rate ε, the regularization weight λ, the number of hidden units in a neural network (see below), and so on. To determine good hyperparameters, the available data set is usually split into training and validation samples. Once the model is trained on the training set, it is tested on the validation set to see how well it performs, typically by comparing the predicted values on the validation samples with their true outputs. One familiar technique is k-fold cross validation: the data are split into k subsets, each subset in turn acts as the validation set, and finally the mean of all the validation errors is computed to assess the model. For example, 5-fold cross-validation divides the data into 5 folds. Figure 2.9 shows a visualization of k-fold cross validation.

Figure 2.9: How training and validation sets are chosen for k-fold cross validation. The folds extend to the right until the last fold is reached. [48]

Supervised learning - summary. The supervised learning approach supervises the learning algorithm with known data. We are given a data set of n samples (x_1, y_1), ..., (x_n, y_n), where (x_i, y_i) ∈ X × Y, and we must choose the three components listed below:

1. The family of mapping functions F, where each f ∈ F maps X to Y; it could be linear regression, logistic regression, neural networks, etc.
2. The loss function L(x, ŷ, y), which evaluates the error between a true label y and a predicted label ŷ = f(x). Loss functions differ slightly between regression and classification problems.
3. The scalar regularization function R(f), which helps the model generalize better.

Since this dissertation focuses on problems formulated as regression, L2 regularization is used for R(f), and the loss L is always the squared difference between the predicted and actual output.

Neural network regression. The neuron representations of linear and logistic regression can be extended to neural networks as in Figure 2.10:

Figure 2.10: A neural network on a 1-dimensional input data set

In Figure 2.10, a weighted combination of the inputs and weights is formed first, followed by a sigmoid activation function (σ(x) = 1/(1 + e^{−x})), which indicates whether the neuron is activated or not (1 if activated). The θ's are the weights or parameters. The final layer has a linear activation because we are interested in using the neural network for regression.

In Fig. 2.10, the output ŷ_i is given by

\[ \hat{y}_i = \theta_5 + \theta_6 \, \sigma(\theta_1 + \theta_2 x_i) + \theta_7 \, \sigma(\theta_4 + \theta_3 x_i) \quad (2.8) \]

where:

- the θ's are the parameters;
- x_i is the input of the i-th sample;
- the activation function (the σ in the green rounded rectangle) is the sigmoid function σ(x) = 1/(1 + e^{−x});
- u_11, u_12 are the outputs of the first-layer neurons before activation;
- o_11, o_12 are the outputs of the first-layer neurons after activation;
- o_21 is the output of the second-layer neuron after activation;
- since a linear activation is used in the final layer, ŷ_i = o_21 = u_21.

Our goal now is to learn the parameters θ of the neural network. If g is the loss function, the parameters are computed using the gradient descent update

\[ \theta_{t+1} = \theta_t - \alpha \, \frac{\partial g(\theta)}{\partial \theta} \quad (2.9) \]

The gradient of the loss g with respect to a parameter is

\[ \frac{\partial g(\theta)}{\partial \theta_j} = -2 \left( y_i - \hat{y}_i(x_i, \theta) \right) \frac{\partial \hat{y}_i(x_i, \theta)}{\partial \theta_j} \quad (2.10) \]

The value of ŷ_i(x_i, θ) is computed during a forward pass (discussed below), and the derivative of ŷ_i with respect to each parameter is computed during the backpropagation step discussed below.

2.3.1 Forward Pass

In the forward pass, each element of θ has a known value. Then, from Fig. 2.10, the values of u, o and ŷ_i are given by:

- u_11 = θ_1 × 1 + θ_2 × x_i
- u_12 = θ_4 × 1 + θ_3 × x_i
- o_11 = 1 / (1 + e^{−u_11})
- o_12 = 1 / (1 + e^{−u_12})
- u_21 = θ_5 × 1 + θ_6 × o_11 + θ_7 × o_12
- ŷ_i = o_21 = u_21
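The forward pass above translates directly into code. The sketch below evaluates equation (2.8) for the small network of Fig. 2.10; it is a minimal illustration, and the parameter values are arbitrary numbers chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(theta, x):
    """Forward pass of the small network in Fig. 2.10 (equation 2.8)."""
    t1, t2, t3, t4, t5, t6, t7 = theta
    u11 = t1 * 1.0 + t2 * x                  # pre-activation of the first hidden neuron
    u12 = t4 * 1.0 + t3 * x                  # pre-activation of the second hidden neuron
    o11, o12 = sigmoid(u11), sigmoid(u12)    # hidden activations
    u21 = t5 * 1.0 + t6 * o11 + t7 * o12     # linear output layer, so y_hat = u21
    return u21, (u11, u12, o11, o12)         # cache intermediates for backpropagation

theta = np.array([0.1, -0.3, 0.5, 0.2, 0.0, 1.0, -1.0])  # arbitrary example values
y_hat, cache = forward(theta, x=2.0)
print("prediction:", y_hat)
```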
2.3.2 Backpropagation

The next step is to compute the partial derivatives of ŷ_i with respect to the parameters θ. Each ŷ_i can be written as a function of the previous layer's outputs and weights as

\[ \hat{y}_i = \theta_5 \times 1 + \theta_6 \times o_{11} + \theta_7 \times o_{12} \quad (2.11) \]

From this equation, the gradients of ŷ_i with respect to θ_5, θ_6, θ_7 are:

- ∂ŷ_i/∂θ_5 = 1
- ∂ŷ_i/∂θ_6 = o_11
- ∂ŷ_i/∂θ_7 = o_12

Similarly, since u_12 = θ_3 × x_i + θ_4 × 1, the derivative of ŷ_i with respect to θ_3 is given in equation (2.12). It is essential to note that θ_3 influences ŷ_i only through u_12:

\[ \frac{\partial \hat{y}_i}{\partial \theta_3} = \frac{\partial \hat{y}_i}{\partial o_{12}} \, \frac{\partial o_{12}}{\partial u_{12}} \, \frac{\partial u_{12}}{\partial \theta_3} \quad (2.12) \]

The o's in equation (2.12) are obtained during the forward pass. The gradients with respect to θ_1, θ_2, θ_4 are computed in the same way.

Once ŷ_i and the derivatives of ŷ_i with respect to θ are computed, the gradient descent update (2.9) is used to update the parameters θ, and the forward pass and update step are repeated until the optimal parameters (minimal loss) are reached.

Neural network implementation via a graph object. Since the computation above involves backpropagating errors to update the parameters, the entire operation is organized as a graph. The operations can be thought of as a linear list of operations; more precisely, the function that maps input to output can be viewed as a directed acyclic graph (DAG) of operations, in which the data and parameter vectors flow along the edges and the nodes represent differentiable transformations. The implementation of backpropagation in recent deep learning frameworks such as TensorFlow and PyTorch uses a graph object that records the connectivity of the operations (gates and layers) along with the list of operations. Each graph and node object has two functions, as discussed earlier: forward(), which performs the forward pass, and backward(), which performs the gradient computation for the update via backpropagation. forward() visits all the nodes in topological order and propagates the computation forward to predict the output using the known weights; backward() visits the nodes in reverse order and carries out the gradient computation for the update step. Each node computes its output with some function during its forward() operation, and in the backward() operation it is given the gradient of the loss with respect to its output and computes the gradients with respect to its inputs and the weights attached to it. The local derivatives of each node are implemented analytically (finite differences are typically used only to check them). During the backward pass, a node takes the gradient at its output, multiplies it by its own local gradient, and passes the result on to its child nodes, which repeat the same operation recursively.

Let us illustrate this by describing the implementation of a tanh layer. The forward function is y = tanh x. The backward function is given ∂g/∂y (the running Jacobian product) by the graph object, where g is the loss at the end of the graph, and chains the Jacobian of this layer onto the running product, i.e., the matrix product ∂g/∂x = (∂y/∂x)(∂g/∂y). Since the Jacobian ∂y/∂x of this layer has elements only along the diagonal, we can implement this efficiently by rewriting the matrix multiplication as ∂g/∂x = eye(∂y/∂x) ⊙ ∂g/∂y, where ⊙ is element-wise multiplication and eye(∂y/∂x) is the diagonal of the Jacobian as a vector (for tanh this is the vector 1 − y ⊙ y, since d/dx tanh x = 1 − (tanh x)²). The ∂g/∂x returned via the graph object from this layer is then passed on to the node that computed x. The recursion continues until it reaches the inputs of the graph.
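A minimal sketch of such a node is shown below, following the tanh example just described. The forward()/backward() interface mirrors the description in the text rather than the API of any particular framework, and the usage example with a squared-error gradient is an arbitrary illustration.

```python
import numpy as np

class TanhNode:
    """One differentiable node of the computation graph: y = tanh(x)."""

    def forward(self, x):
        self.y = np.tanh(x)          # cache the output for use in backward()
        return self.y

    def backward(self, dg_dy):
        # Chain rule: dg/dx = (1 - y*y) * dg/dy, the element-wise Jacobian product.
        return (1.0 - self.y * self.y) * dg_dy

# Usage: forward pass, then push the loss gradient back through the node.
node = TanhNode()
x = np.array([0.5, -1.0, 2.0])
y = node.forward(x)
dg_dy = 2 * (y - np.array([0.0, 0.0, 1.0]))   # gradient of a squared-error loss w.r.t. y
print("dg/dx =", node.backward(dg_dy))
```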
Figure 2.11: A different representation of a neural network as used in the literature

Figure 2.11 shows a different representation of a neural network, as used in the literature. It will be useful for understanding the recurrent neural networks discussed later in this chapter. Note that the networks on the left-hand and right-hand sides are not exactly equal; the purpose of the figure is to show how a neural network can be drawn using only circles.

2.4 Neural Networks

We have seen how neural networks can be used to perform a regression task: we defined a differentiable function that transforms the inputs x into predicted outputs ŷ, and we optimized the model with respect to the loss and the parameters using gradient descent methods. We now look at the objective in more detail and extend the model to recurrent neural networks and Long Short-Term Memory (LSTM).

2.4.1 Simple Neural Networks

Neural networks consist of a series of repeated matrix multiplications and element-wise non-linearities. Some commonly used non-linearities are:

- Sigmoid: σ(x) = 1/(1 + e^{−x})
- ReLU: σ(x) = max{0, x}
- tanh: σ(x) = (e^x − e^{−x})/(e^x + e^{−x})

For a vector x = (x_1, ..., x_N), we write σ(x) = (σ(x_1), ..., σ(x_N)) as the natural element-wise extension. A 2-layer neural network, for instance, is implemented as f(x) = W_2 σ(W_1 x), where σ is an element-wise non-linearity (e.g. the sigmoid) and W_1, W_2 are matrices; a 3-layer network takes the form f(x) = W_3 σ(W_2 σ(W_1 x)), and so on (a short code sketch is given at the end of this subsection). If no non-linearity is used between the layers, the entire neural network collapses into a simple linear model, because the weights of each layer get multiplied by the weights of the next, forming a series of matrix products, i.e., a single weighted combination of the input. The last layer of the network has either a sigmoid or other squashing non-linearity, to push the output towards 0 and 1 for classification tasks, or the identity function in the case of regression.

Figure 2.12: Compact way of representing neural networks with just an input layer, hidden layer and output layer

Figure 2.12 will be useful in Chapter 3 for constructing the more complex recurrent neural networks discussed later in this chapter.

Biological inspiration. A crude model of biological neurons is the motivation for neural networks. Each row of a parameter or weight matrix W corresponds to one neuron and its synaptic connection strengths to its inputs. Positive weights correspond to excitation, negative weights to inhibition, and a weight of zero represents no dependence. The weighted sum of the inputs and weights reaches the neuron at the cell body. The sigmoid activation function draws its motivation from the fact that, since it squashes the output to [0, 1], it can be interpreted as indicating whether a neuron has fired or not.

Figure 2.13: The real neural network in the brain merged and connected with the artificial neural network [49], [50]

As Figure 2.13 suggests, artificial neural networks take their inspiration from the neural network in the brain. The connections between neurons are the edges and represent the weights or parameters of the neural network model; the weights determine how strongly a signal is transferred to other neurons (0 means no signal is transferred).
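As promised above, here is a minimal sketch of the 2-layer and 3-layer forms f(x) = W_2 σ(W_1 x) and f(x) = W_3 σ(W_2 σ(W_1 x)). The layer sizes and random weights are arbitrary, ReLU is used as the element-wise non-linearity, and bias terms are omitted exactly as in the formulas.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
D, H1, H2, O = 4, 8, 8, 1          # arbitrary sizes: input, two hidden layers, output
W1 = rng.normal(scale=0.1, size=(H1, D))
W2 = rng.normal(scale=0.1, size=(H2, H1))
W3 = rng.normal(scale=0.1, size=(O, H2))

x = rng.normal(size=D)

two_layer = W2 @ relu(W1 @ x)               # f(x) = W2 sigma(W1 x)
three_layer = W3 @ relu(W2 @ relu(W1 @ x))  # f(x) = W3 sigma(W2 sigma(W1 x))
print(two_layer.shape, three_layer.shape)
```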
Weather prediction, for instance, requires time series of temperature and pressure data in the form of sequences to be processed as inputs so as to predict the future temperature. A recurrent neural network (RNN) uses a recurrence formula of the form ht = fθ(ht−1, xt), where f is a function that we describe in more detail below and the same parameters θ are used at every time step. The function fθ is a connectivity pattern that can process a sequence of vectors x1, x2, ..., xT of arbitrary length. The hidden vector ht can be thought of as a running summary of all vectors xt′ for t′ < t, and the recurrence formula updates this summary based on the latest input. For initialization, we can either treat h0 as a vector of parameters and learn the starting hidden state, or simply set h0 = 0. There are many forms of recurrence, and the exact mathematical form of the map (ht−1, xt) → ht varies from architecture to architecture; we will discuss this later in this section.

The simplest recurrent neural network uses a recurrence of the form

ht = tanh(W [xt; ht−1]),    (2.13)

where [xt; ht−1] denotes the concatenation of the current input and the previous hidden vector, and the tanh function is applied element-by-element as outlined in Section 2.4.1.

Figure 2.14: A recurrent neural network

From Figure 2.14, the neural network can be thought of as a feed forward neural network rolled out in time. From the figure you can see that θx is preserved as the same parameter in each and every time step, and the same holds for the other parameters θ along the time steps. The parameters stay constant because, during training, memory limitations mean the network can only be unrolled over a fixed number of time steps (many gradients have to be stored along the entire unrolled sequence). At test time, since the same parameters are reused at every step, the network can be rolled out for any number of steps into the future for prediction.

The current input and the previous hidden vector are concatenated and transformed linearly by the matrix product with W. This is equivalent to ht = tanh(Wxh xt + Whh ht−1), where the two matrices Wxh and Whh concatenated horizontally are equivalent to the matrix W above. These equations omit the bias vector, which should be included. Instead of the tanh non-linearity, a ReLU or sigmoid non-linearity can also be used. If the input vectors xt have dimension D and the hidden states have dimension H, then W is a matrix of size [H × (D+H)]. Interpreting the equation, the new hidden state at each time step is a linear combination of the elements of xt and ht−1, squashed by a non-linearity. The simple RNN has a neat form, but its simple additive and multiplicative interactions result in weak coupling [22, 23], especially between the inputs and the hidden states. The functional form of the simple RNN also leads to undesirable dynamics during backpropagation [24]: the gradients either explode (if the gradient value is greater than 1) or vanish (if the gradient value is less than 1), especially when propagated over long time periods. The exploding gradient problem can be overcome with gradient clipping [25], i.e., clipping or rescaling the gradient when it exceeds a threshold. The vanishing gradient problem, on the other hand, cannot be addressed by gradient clipping.

Figure 2.15: Block representation of RNN. RED: input layer, GREEN: hidden layer, BLUE: output layer.
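As a small illustration of the recurrence in equation (2.13), the following numpy sketch rolls the same parameter matrix W forward over a short input sequence. The dimensions, the random initialization and the input data are purely illustrative.

import numpy as np

D, H = 3, 5                              # input and hidden dimensions (illustrative)
W = 0.1 * np.random.randn(H, D + H)      # single matrix acting on the concatenation [x_t; h_{t-1}]

def rnn_step(x_t, h_prev):
    # One step of the simple RNN in equation (2.13): h_t = tanh(W [x_t; h_{t-1}]).
    # Splitting W into [W_xh, W_hh] gives the equivalent form tanh(W_xh x_t + W_hh h_{t-1}).
    return np.tanh(W @ np.concatenate([x_t, h_prev]))

h = np.zeros(H)                          # h_0 = 0 initialization
xs = [np.random.randn(D) for _ in range(4)]
for x_t in xs:                           # the same parameters W are reused at every time step
    h = rnn_step(x_t, h)                 # h_t acts as a running summary of x_1, ..., x_t
print(h)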
2.4.3 Long Short-Term Memory

Figure 2.16: LSTM internal structure [30]

The motivation for Long Short-Term Memory [26] (LSTM) is to address some of the limitations of the RNN, especially during training, i.e., the vanishing and exploding gradient problems. Its recurrence formula allows the inputs xt and ht−1 to interact in a more computationally complex manner. The interaction includes multiplicative and additive interactions over time steps that effectively propagate gradients backwards in time [26]. LSTMs maintain a memory vector ct in addition to a hidden state vector ht. The LSTM can choose to read from, write to, or reset the cell using explicit gating mechanisms at each time step. The LSTM recurrence update is as follows:

(i; f; o; g) = (sigm; sigm; sigm; tanh)(W [xt; ht−1])    (2.14)

ct = f ⊙ ct−1 + i ⊙ g    (2.15)

ht = o ⊙ tanh(ct)    (2.16)

Here [xt; ht−1] is the concatenation of the input and the previous hidden state, the product W [xt; ht−1] is split into four blocks of size H that are passed through the indicated non-linearities, and ⊙ denotes element-wise multiplication. If the input dimensionality is D and the hidden state has H units, then the matrix W has dimensions [4H × (D + H)]. The sigmoid function sigm and tanh are applied element-wise. The i ∈ R^H is the input gate that controls whether each memory cell is updated, the f ∈ R^H is the forget gate that controls whether the cell is reset to zero, and the o ∈ R^H is the output gate that controls whether the local cell state is revealed in the hidden state vector. To keep the model differentiable, the activations of these gates are based on the sigmoid function and hence range smoothly between zero and one. The vector g ∈ R^H ranges between −1 and 1 and is used to additively modify the memory contents. This additive interaction is what makes the LSTM stand out from the simple RNN: during backpropagation the additive operation distributes gradients equally, which allows gradients on c to flow backwards in time for long periods without vanishing or exploding.

Figure 2.17: Sample RNN architecture for time series prediction.

Time series prediction with RNN. Figure 2.17 shows a sample RNN architecture for time series based prediction. Consider a data set where we have time series data of daily temperatures over the last couple of years, and our goal is to predict the temperature on the next day given the previous 4 days of temperature. The input can then be fed as a series of windows of size 4, and the output used to train the network is the temperature on the day after day 4, i.e., 4 days of temperature as input, from xt−3 to xt, and the next day's temperature as output, yt. Here, the loss function will be L(θ) = Σ_{i=1}^{n} (y_t^i − x_{t+1}^i)², where y_t^i is the predicted output from the RNN and x_{t+1}^i is the labelled output in the training data.

The first part of our work, discussed in Chapter III, involves learning of control behaviour using supervised learning. Having seen machine learning, specifically supervised learning, and the various models and optimization techniques involved in supervised learning, we now turn to Model Predictive Control, the knowledge of which is essential for Chapter III.

2.5 Model Predictive Control

Model (Based) Predictive Control (MBPC or MPC) originated in the late 1970's and has been constantly developed since then, from simple prediction strategies to the incorporation of economics into MPC constraints. The term Model Predictive Control does not designate a specific control strategy or algorithm but a very broad range of control methods that explicitly use a model of the process to obtain the control signal by minimizing an objective function.
The various MPC design and formulation choices lead to controllers that have practically the same structure and present adequate degrees of freedom. In general, the predictive control family involves the following key ideas:

• Use a model (empirical or first principle) to predict the process output for some number of future time steps (the prediction horizon).

• Calculate a sequence of control actions (equal in length to the control horizon) by minimizing an objective function.

• Use the predicted future error to compute the control actions, apply only the first control action to the plant, and then re-calculate as above at the next time step. This strategy is called the receding horizon strategy.

The different MPC algorithms that have been developed vary in the model that is used to represent the process, the noise model, and the cost function to be minimized. MPC has found application in a diverse set of industries, ranging from chemical process industries to clinical applications.

2.5.1 MPC Strategy

Figure 2.18: MPC strategy. [52]

Now let us look at the strategy that makes MPC a well suited controller.

1. The future outputs over a predefined horizon N, called the prediction horizon, are predicted at each instant t using the defined process model. The predicted outputs y(t+k|t) for k = 1, ..., N depend on the known values up to instant t and on the future control inputs u(t+k|t), k = 0, ..., N−1, which are to be sent to the plant and are yet to be calculated.

2. The goal is to keep the process output close to the reference set point or trajectory w(t+k); hence the future control inputs are calculated by optimizing an objective function. The objective function is usually a quadratic function, commonly the squared difference between the predicted output and the reference set point. The control effort and other (economic or operational) constraints are also included in the objective function.
Even for a simple quadratic objective, the presence of constraints means an iterative optimization method has to be used. Performance is tuned in the objective function by choosing weights for the different process outputs, especially if the system is a Multi Input Multi Output system.

3. Once the control input u(t|t) is sent to the process, the remaining control signals calculated from minimizing the objective function are discarded. The reason is that at the next instant y(t+1) is already known, and step 1 can be repeated with this new value. All the sequences are then brought up to date, and u(t+1|t+1) is calculated using the receding horizon concept discussed earlier.

2.5.2 MPC Components

To set up an MPC, we must stipulate the horizon, an explicit model, and an optimization method:

• Horizon

• Explicit use of model

• Optimization

Figure 2.19: Basic structure of MPC. [52]

Standard Model

The standard model we use in MPC is a state space model given by,

Xk+1 = A Xk + B Uk    (2.17)

Yk = C Xk + D Uk    (2.18)

where A, B, C, D are constant matrices whose dimensions are compatible with the dimensions of the state X, control U, and output Y.

Horizon

MPC uses a prediction horizon (the number of time steps to look forward) to predict ahead, and a control horizon (the number of time steps over which we have control moves) to take control actions. The prediction horizon and control horizon can be visualized using Figure 2.20.
Figure 2.20: Prediction and Control Horizon

Hence, if we denote P as the prediction horizon and M as the control horizon, the state space model can be rolled forward in time as,

Xk+1 = A Xk + B Uk

X̂k+1|k = A X̂k|k + B Uk

X̂k+2|k = A X̂k+1|k + B Uk+1

...

X̂k+M+1|k = A X̂k+M|k + B Uk+M

X̂k+M+2|k = A X̂k+M+1|k + B Uk+M

After the control horizon the input does not change, so the term B Uk+M is common to the remaining predictions. The decision variables in the optimization problem are the control moves at all time instants up to the control horizon.

Optimization Objective of MPC

We define,

Ŷk:k+P = [Ŷk+1, Ŷk+2, ..., Ŷk+P]ᵀ,  Ŷref_k:k+P = [Ŷref_k+1, Ŷref_k+2, ..., Ŷref_k+P]ᵀ,  Uk:k+M = [Uk, Uk+1, ..., Uk+M]ᵀ    (2.19)

where Ŷ is the predicted output, Ŷref is the reference set point and U is the control input. The objective function of MPC can then be written as,

(Ŷk:k+P − Ŷref_k:k+P)ᵀ H (Ŷk:k+P − Ŷref_k:k+P) + Uᵀ_k:k+M Q Uk:k+M

where H and Q are weighting matrices. Either Uk or ∆Uk = Uk+1 − Uk can be used as decision variables.

Constraints. To protect the physical actuators and to avoid excessive fluctuations, input constraints are imposed. The constraints are given as,

Umin ≤ U ≤ Umax    (2.20)

∆Umin ≤ ∆U ≤ ∆Umax    (2.21)

We also impose constraints on the output for stability,

Ymin ≤ Y ≤ Ymax

Coincidence Points. If we want the output to reach a certain value at a certain time, we can pose this requirement as an equality constraint. The definition of a coincidence point is Yref_{k+i} = Y^SP, where Y^SP is the set point.

Tuning parameters. The available tuning parameters are M, P and the weighting matrices H and Q. P is chosen based on the steady state behaviour of the system under consideration. H and Q are chosen based on system knowledge, and H, Q and M can be tuned. Infinite control horizon problems are stable [56].

Implementation of MPC. The way we implement MPC is that at t = k the optimization problem gives Uk, Uk+1, ..., Uk+M. We take Uk and implement that control move. At the next time step we obtain the additional information Yk+1, so the Uk's are recalculated and the solution of the optimization problem provides Uk+1, ..., Uk+M; we implement the control move Uk+1 alone and discard the rest. This procedure is repeated until the control horizon is reached; after the control horizon, no further control moves are performed. To be implemented online in a real plant, when the sampling time is very small, the optimization calculations should be fast enough to deliver control moves at rates compatible with the sampling rate.
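To make the receding-horizon implementation described above concrete, here is a minimal numpy sketch for a scalar state-space model. It uses only the unconstrained quadratic objective (no input or output constraints), so the optimal input sequence has a closed-form least-squares solution instead of requiring an iterative solver; the model coefficients, horizons and weights are illustrative.

import numpy as np

# Scalar state-space model x_{k+1} = A x_k + B u_k, y_k = C x_k (illustrative values).
A, B, C = 0.9, 0.5, 1.0
P, M = 20, 10                      # prediction and control horizons
q = 0.05                           # input weighting (Q = q*I, H = I)

# Prediction matrices: y_hat = F * x_k + G @ U, with the input held at U[M-1] after M moves.
F = np.array([C * A ** (i + 1) for i in range(P)])
G = np.zeros((P, M))
for i in range(P):
    for t in range(i + 1):
        G[i, min(t, M - 1)] += C * A ** (i - t) * B

def mpc_move(x, y_ref):
    # Minimize (y_hat - y_ref)^T (y_hat - y_ref) + q * U^T U in closed form.
    U = np.linalg.solve(G.T @ G + q * np.eye(M), G.T @ (y_ref - F * x))
    return U[0]                    # receding horizon: apply only the first move

# Closed-loop simulation against a constant set point.
x, y_ref = 0.0, np.ones(P)
for k in range(30):
    u = mpc_move(x, y_ref)         # recompute at every step using the newly measured state
    x = A * x + B * u
print(C * x)                       # output approaches the set point of 1

With the constraints (2.20) and (2.21) included, the same loop would call a quadratic programming solver at every step in place of the closed-form solve.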
The optimization problems re-quired to implement MPC are Quadratic programming problems, for whichfast solution algorithms are well developed.Handling of Model Mismatch If one can assume that the differencein the predicted and the measured Y values is due to the model mismatchdk alone, the offset in the estimated states can be reduced by including dkwith the predicted values of Yk at each time instant, with the assumptionthat dk+i = dk, k to k+M , and when we move to next step, use dk+1, newlycalculated and follow the assumption for the time steps ahead up to M . Thiscorrection is equivalent to using the integrating effect, in a PI controller andthis will eliminate the offset.The second part of our work discussed in Chapter IV involves learning ofcontrol behavior using reinforcement learning. Earlier we saw about machinelearning and MPC. Now we will discuss reinforcement learning, which will beapplied to process control in chapter IV, where we develop a learning basedcontroller.502.6. Reinforcement Learning2.6 Reinforcement LearningReinforcement Learning TermsAgentActionEnvironmentStateFigure 2.21: Reinforcement learning flow diagram2.6.1 IntroductionReinforcement learning is entirely driven by a reward hypothesis, for whichdetails will be discussed later. It is a sequential decision making problem inwhich an agent takes an action at time step t and learns as it performs thataction at each and every time step, with the goal of maximizing the cumula-tive reward. The decision maker is the agent and the place the agent interactsor acts is the environment. The agent receives observations (or states) fromthe environment and takes an action according to a desired (or to be learnt)512.6. Reinforcement Learningpolicy at each time step. Once the action is performed, the environmentthen provides a reward signal (scalar) that indicates how well the agent hasperformed. The general goal of reinforcement learning is to maximize theagent’s future reward, given all past experience of the agent in the environ-ment. The reinforcement learning problem is usually modelled as a MarkovDecision Process (MDP) that comprises a state space and action space, re-ward function and an initial state distribution that satisfies the Markov prop-erty that P (st+1, rt+1|s1, a1, r1, .....st, at, rt) = P (st+1, rt+1|st, at). where st, atand rt are states, action and reward at time t respectively and P is a jointdistribution conditioned on state, action and reward until time t.2.6.2 Markov Decision ProcessesMarkov Decision Processes (MDPs) are used to better explain or to modelthe environment where the agent interacts. In our work and in this disser-tation we consider fully observable MDPs. The MDP is defined as a tuple(S,D,A, Psa, γ, R) consisting of :ˆ S: Set of possible states of the worldˆ D: An initial state distributionˆ A: Set of possible actions that the agent can performˆ Psa: The state transition probabilities (which depend on the action)ˆ γ: A constant in (0, 1) that is called the discount factor, (used to giveless effect on future reward and more weight to the current reward)522.6. Reinforcement Learningˆ R: A scalar-valued reward function defined on S × A → R, whichindicates how well the agent is doing at each time step.In an MDP, there is an initial state s0 that is drawn from the initialstate distribution. The agent interacts in the environment through action,and at each time step, resulting in the next successor state st+1. At eachstep by taking actions, the system visits a series of states s0, s1, s2, ... 
in theenvironment such that the sum of discounted rewards as the agent visits isgiven by:R(s0, a0) + γR(s1, a1) + γ2R(s2, a2) + ... (2.22)The goal of the reinforcement learning agent is to take actions a0, a1, ...such that the expected value of the rewards given in equation (2.22) is max-imized. The agent’s goal is to learn a good policy (which is a mapping fromstates to actions) such that during each time step for a given state, the agentcan take a stochastic or deterministic action pi(s) that will result in maximumexpected sum of rewards, compared to other possible sequences of actions:Epi[R(s0, pi(s0)) + γR(s1, pi(s1)) + γ2R(s2, pi(s2)) + ...] (2.23)2.6.3 Learning FormulationHere we consider only model-free reinforcement learning methods. The model-free reinforcement learning framework aims at directly learning the systemfrom past experience without any prior knowledge of the environment dy-namics. A policy could be a function or a discrete mapping from states toactions and is used to select actions in a MDP. We consider both stochastic532.6. Reinforcement Learninga ∼ piθ(a|s) and deterministic a = piθ(s) policies. Usually function basedpolicies are used because if the state and action space is large then the dis-crete mapping (in the form of a table) is not efficient to capture all thestates and actions. Hence, we consider function based parametric policies,in which we choose an action a (whether stochastically or deterministically)according to a parameter vector θ ∈ Rn. Given parametric policy, the agentcan interact with the MDP to give a series of states, actions and rewardsh1:T = s1, a1, r1, ....sT , aT , rT . We use the return, rγt which is the total dis-counted reward from time step t onwards, rγt =∑∞k=t γk−tr(sk, ak) where0 < γ < 1. The agent’s objective function is given by J(pi) = E[rγ1 |pi]. Inmost process control applications, the state and action spaces are continuoushence we use neural networks to approximate the policy.2.6.4 Expected ReturnAs discussed earlier, the goal of reinforcement learning is to maximize theexpected return of a policy piθ with respect to the expected cumulative return.J(θ) = ZγEH∑k=0γkrk (2.24)where γ ∈ (0, 1) is the discount factor, H is the planning horizon, and Zγis a normalization factor. To achieve this, we want to find a policy thatcan help us achieve the maximum reward. Here, we consider actor-criticbased policy gradient algorithms. The policy gradient algorithms can handlecontinuous state and action spaces. They adjust the parameters θ of thepolicy in the direction of the performance gradient ∇θJpiθ maximizing theobjective function J(piθ) which is the value function (value function will be542.6. Reinforcement Learningdiscussed later).Reinforcement Learning problems can be formulated in two ways, Valuebased and policy based reinforcement learning. We now briefly describe bothoptions.2.6.5 Value-Based Reinforcement LearningReinforcement learning uses the value functions which is used to evaluategoodness/ badness of states. For an agent following a policy pi(s), the valuefunction V pi(s) is given by:V pi(s) = Epi[Rt|st = s] (2.25)To incorporate action in the value function, action value function is used. Itis used to evaluate goodness/ badness of states along with the actions. Theaction value function Qpi(s, a) when the agent takes action a for state s, andfollowing a policy pi, is given by:Qpi(s, a) = Epi[Rt|st = s, at = a] (2.26)where Epi denotes the expectation over agent’s experiences that are collectedby agent following the policy pi. 
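As a small numerical illustration of the discounted return in equation (2.22) and of the value function in equation (2.25), the sketch below estimates V^pi(s0) by averaging discounted returns over sampled rollouts. The toy environment, the fixed policy and all constants are purely illustrative.

import numpy as np

def discounted_return(rewards, gamma=0.99):
    # Sum_k gamma^k r_k for one sampled trajectory, as in equation (2.22).
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def mc_value_estimate(env_step, policy, s0, gamma=0.99, horizon=50, n_rollouts=500):
    # Monte Carlo estimate of V^pi(s0): average discounted return over many rollouts.
    returns = []
    for _ in range(n_rollouts):
        s, rewards = s0, []
        for _ in range(horizon):
            a = policy(s)
            s, r = env_step(s, a)        # environment returns the next state and a reward
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return np.mean(returns)

# Toy environment: a scalar state nudged toward a set point of 1.0 by the action.
def env_step(s, a):
    s_next = 0.9 * s + 0.1 * a + 0.01 * np.random.randn()
    return s_next, -(s_next - 1.0) ** 2  # reward penalizes tracking error

policy = lambda s: 1.0                    # a fixed (not necessarily optimal) policy
print(mc_value_estimate(env_step, policy, s0=0.0))

The action value Q^pi(s, a) of equation (2.26) can be estimated in the same way by fixing the first action and then following pi.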
As stated earlier, we consider model-freereinforcement learning such that the value or action-value function (which isa neural network) is updated using a sample backup as the agent interacts.This means, that at each time step of the agent’s interaction, an action issampled from the agent’s current policy. Then the successor state is sampledfrom the environment and the corresponding rewards are computed. Theaction value function is updated as the agent gets reward by interacting withthe environment.552.6. Reinforcement Learning2.6.6 Policy based Reinforcement LearningThe value based approach above can handle discrete states and actions. Forcontinuous states and actions an optimization based approach is required tolearn the continuous policy function. Policy gradient algorithms rely uponoptimizing parameterized policies with respect to the future cumulative re-ward. Policy gradient directly optimize a parametrized (parametrized withneural network) policy by gradient ascent such that the optimal policy willmaximize the agent’s cumulative reward at each and every time step. ForValue-based methods local optima convergence is not guaranteed, hence pol-icy gradient methods are useful because they are guaranteed to converge inthe worst case at least to a locally optimal policy. Also the policy gradientalgorithms can handle high dimensional continuous states and actions. Thedisadvantage of policy gradient algorithms is that it is difficult to attain aglobally optimal policy.In this work, we consider deterministic policy gradient algorithm for learn-ing the control behaviour of the plant. Hence, let us investigate the detailsof the deterministic policy gradient approach.2.6.7 Policy Gradient TheoremThe policy gradient theorem [58] is the basis for a series of deterministicpolicy gradient algorithms. We first consider stochastic policy gradient the-orem, then gradually migrate from probabilities and stochasticity to a moredeterministic framework. The stochastic policy gradient theorem, says that562.6. Reinforcement Learningfor stochastic policies, one has∇θJ(θ) = Es∼ρpi ,a∼piθ[∇θlogpiθ(s,a)Qpiθ (s,a)] (2.27)The deterministic policy gradient theorem says that for deterministic policyis given as:∇θJ(θ) = Es∼ρpi [∇θµθ(s)∇aQµ(s, a)|a=µθ(s)] (2.28)2.6.8 Gradient Ascent in Policy SpaceSince we are interested in maximizing the expected return by choice of policyinstead of gradient descent which searches for minimum, we have to searchfor the maximum using gradient ascent. The gradient step to increase theexpected return is given as,θk+1 = θk + αk∇θJ(θ)|θ=θk (2.29)Here θ0 is the initial policy parameter and θk denotes the policy parameterafter k updates. αk denotes the learning rate which is related to step-size.2.6.9 Stochastic Policy GradientThe policy gradient algorithm searches for a locally optimal policy (locallyoptimum) by ascending the gradient of J(θ) of the policy, such that, δθ =α∇θJ(θ), and α is the learning rate for gradient ascent. If we consider aGaussian policy whose mean is a linear combination of state features µ(s) =φ(s)T θ.In the equation of the stochastic policy gradient theorem expressed inequation 2.27, the equation for the ∇θlogpiθ(s, a), that is called the score572.6. 
Reinforcement Learningfunction, is given by:∇θlogpiθ(s, a) = (a− µ(s))φ(s)σ2(2.30)From equation 2.27, the Qpiθ(s, a) is approximated using a linear functionapproximator (linear regression or linear regression with basis functions),where the critic estimates Qw(s, a), and in the stochastic policy gradient, thefunction approximator is compatible such thatQw(s, a) = ∇θlogpiθ(a|s)Tw (2.31)where the w parameters are chosen to minimize the mean squared errorbetween the true and the approximated action-value function.∇θJ(θ) = Es∼ρpi ,a∼piθ [∇θlogpiθ(s, a)Qw(s, a)] (2.32)If we use the off-policy stochastic actor-critic along with the stochastic gra-dients setting. we get the following expression for the stochastic off-policypolicy gradient:∇θJ(θ)β(piθ) = Es∼ρβ ,a∼β[piθ(a|s)βθ(a|s)∇θlogpiθ(a|s)Qpi(s, a)] (2.33)where β(a|s) is the behaviour off-policy (that is used during learning) whichis different from the parameterized stochastic policy, β(a|s) = piθ(a|s) that isbeing learnt.2.6.10 Deterministic Policy GradientThe stochastic gradient framework can be extended to deterministic policygradients [12] by removing the stochasticity involved, similar to the reduction582.7. Function Approximationfrom the stochastic policy gradient theorem to the deterministic policy gra-dient theorem. In deterministic policy gradients having continuous actions,the goal is to move the policy in the direction of gradient of Q rather thantaking the policy to the global maximum. Following that in the deterministicpolicy gradients, the parameters are updated as follows:θk+1 = θk + αEs∼ρµk[∇θQµk(s, µθ(s))] (2.34)The function approximator Qw(s, a) is consistent with the deterministic pol-icy µθ(s) if :∇aQw(s, a)|a=µθ(s) = ∇Tθµθ(s)w (2.35)where ∇θµθ(s) is the gradient of the actor function approximator. Using thefunction approximator, we get the expression for the deterministic policygradient theorem as shown in the equation below:∇θJ(θ) = Es∼ρβ [∇θµθ(s)∇aQw(s, a)|a=µθ(s)] (2.36)We consider the off-policy deterministic actor-critic where we learn a deter-ministic target policy µθ(s) from trajectories to estimate the gradient usinga stochastic behaviour policy pi(s, a) that is considered only during training.∇θJβ(µθ) = Es∼ρβ [∇θµθ(s)∇aQw(s, a)|a=µθ(s)] (2.37)2.7 Function ApproximationThe action-value function has to be learnt in reinforcement learning setup.If the number of states and actions are large, it is impossible to store themall in a table. Hence, function approximators are used. Since, deep learning592.7. Function Approximationfound a great success in the recent years (after breakthroughs in image classi-fication, speech recognition, machine translation, etc.), deep neural networksare used as a function approximators to learn the action-value function, i.e.,to approximate Q in the policy gradient theorem. Policy gradient methodsuse function approximators in which the policy is represented by a functionapproximator and is updated with respect to the policy parameters duringthe learning process by following a stochastic policy.In this dissertation, we have only used neural networks as function approx-imators. Both the stochastic and the deterministic policies can be approxi-mated with different neural network approximators, with different parame-ters w for each network. The main reason for using function approximatorsis to learn the action value function that cannot be stored in a gigantic tableespecially when the number of states and actions are large. 
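As a minimal sketch of the deterministic policy gradient update in equation (2.36), assume a linear policy mu_theta(s) = theta^T phi(s) with scalar actions and, purely for illustration, a critic whose gradient with respect to the action is known in closed form; in practice the critic Q_w(s, a) is learned as described in the next section. The feature map, learning rate and sampled states are all illustrative.

import numpy as np

def features(s):
    return np.array([1.0, s, s ** 2])           # illustrative state features phi(s)

def policy(theta, s):
    return theta @ features(s)                   # deterministic policy mu_theta(s)

# Illustrative critic: Q(s, a) = -(a - a_star(s))**2 with a_star(s) = 2 - s,
# so dQ/da = -2 (a - a_star(s)). A learned critic would replace this.
def dQ_da(s, a):
    return -2.0 * (a - (2.0 - s))

theta, alpha = np.zeros(3), 0.05
states = np.random.uniform(-1, 1, size=200)      # states sampled under the behaviour policy
for _ in range(500):
    batch = np.random.choice(states, size=32)
    # Equation (2.36): grad J ~ E[ grad_theta mu_theta(s) * dQ/da evaluated at a = mu_theta(s) ].
    grad = np.mean([features(s) * dQ_da(s, policy(theta, s)) for s in batch], axis=0)
    theta = theta + alpha * grad                 # gradient ascent on J(theta), as in equation (2.29)

print(theta)   # approaches the coefficients of a_star(s) = 2 - s, i.e. roughly [2, -1, 0]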
To learn actionvalue function using function approximators, policy evaluation methods liketemporal difference learning are used.2.7.1 Temporal Difference LearningWe use Temporal Difference (TD) Learning to find the action value function.TD learning learns the action value function using reward, current and suc-cessive states and actions. Temporal Difference methods have been widelyused policy evaluation reinforcement learning based algorithm. TD learningapplies when the transition probability P and the reward function rpi are notknown and we simulate the RL agent with a policy pi.The off-policy TD algorithm known as Q-learning was one of the famousmodel-free reinforcement learning based algorithms [59]. Here the learnt602.7. Function Approximationaction value function is a direct estimate of the optimal action-value functionQ∗. For learning the action value function, again a loss has to be definedand we will have squared error loss with reward and from action value atnext time step. This squared loss will be useful for learning the parametersof action value function neural network. It will be discussed in detail in thenext section.2.7.2 Least-Squares Temporal Difference LearningThe least squares temporal difference algorithm (LSTD) [51] takes advan-tage of sample complexity and learns the action value function quickly. Asdiscussed earlier, in order to avoid storing many state action pair in a table,for large state and action spaces, for both continuous and discrete MDPs, weapproximate the Q function with a parametric function approximator (neuralnetwork) as follows:Qˆpi =k∑i=1pii(s, a)wi = φ(s, a)Tw (2.38)where w is the set of weights or parameters, φ is a (|S||A| × k) matrix whererow i is the vector φi(s, a)T , and we find the set of weights w that canyield a fixed point in the value function space such that the approximate Qaction-value function becomes:Qˆpi = φwpi (2.39)Assuming that the columns of φ are independent, we require that Awpi = bwhere b is b = φTR and A is such thatA = φT (φ− γP piφ) (2.40)612.8. Actor-Critic AlgorithmsWe can find the function approximator parameters w from:wpi = A−1b (2.41)2.8 Actor-Critic AlgorithmsIn policy gradient actor-critic [31], the actor adjusts the policy parame-ters θ by gradient ascent to maximize the action value function, for bothstochastic and deterministic gradients. The critic learns the w parametersof the LSTD function approximator to estimate the action-value functionQw(s, a) u Qpi(s, a). The estimated Qw(s, a) that we substitute instead ofthe true Qpi(s, a) may introduce bias in our estimates if we do not ensurethat the function approximator is compatible for both stochastic and deter-ministic gradients.The function approximator for stochastic policy gradient is Qw(s, a) =∇θlogpiθ(a|s)Tw and the function approximator for deterministic policy gra-dient is Qw(s, a) = wTaT∇θµθ(s), where ∇θµθ(s) is the gradient of the actor.Our approach in this work is off policy actor critic. To estimate the policygradient using an off-policy setting we use a different behaviour policy tosample the trajectories. We use the off-policy actor critic algorithm to sam-ple the trajectories for each gradient update using a stochastic policy. Withthe same trajectories, the critic also simultaneously learns the action-valuefunction off-policy by temporal difference learning.622.9. 
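The critic's LSTD estimate used above is normally formed from samples gathered along the behaviour policy's trajectory: A is accumulated as a sum of phi(s_t, a_t)(phi(s_t, a_t) − gamma phi(s_{t+1}, a_{t+1}))^T and b as a sum of phi(s_t, a_t) r_t, after which w = A^{-1} b as in equation (2.41). A minimal sketch with purely illustrative feature and reward data is given below.

import numpy as np

def lstd_weights(phis, rewards, next_phis, gamma=0.95, ridge=1e-6):
    # Sample-based LSTD: solve A w = b with A = sum phi (phi - gamma*phi')^T and b = sum phi r.
    k = phis.shape[1]
    A = np.zeros((k, k))
    b = np.zeros(k)
    for phi, r, phi_next in zip(phis, rewards, next_phis):
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A + ridge * np.eye(k), b)   # small ridge term guards invertibility

# Illustrative data: features of (s, a) pairs along one trajectory of a fixed policy.
T, k = 200, 4
phis = np.random.randn(T, k)
next_phis = np.roll(phis, -1, axis=0)                  # phi(s_{t+1}, a_{t+1}); wrap-around ignored here
rewards = phis @ np.array([1.0, -0.5, 0.0, 0.2])       # some reward signal
w = lstd_weights(phis, rewards, next_phis)
print(w)    # Q_hat(s, a) = phi(s, a)^T w, as in equation (2.38)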
Exploration in Deterministic Policy Gradient Algorithms2.9 Exploration in Deterministic PolicyGradient AlgorithmsSince we are learning a deterministic policy while learning the reinforcementlearning agent our agent does not explore well. Hence, stochastic off-policyin a deterministic setting will make sure that there is enough exploration inthe state and action space. The advantage of off-policy exploration is thatthe agent will learn a deterministic policy in deterministic policy gradientsbut will choose its actions according to a stochastic behaviour off-policy. Theexploration rate should be set carefully, domain knowledge is required to setthe exploration rate. Also, manual tuning of exploration (σ parameters) canalso help the agent learn to reach a better policy, as it can explore more.In our work, we consider obtaining trajectory of states and actions usingstochastic off-policy (which makes sure of enough exploration) and therebylearn a deterministic policy.2.10 Improving Convergence ofDeterministic Policy GradientsWe used the following optimization based techniques to improve the conver-gence of deterministic policy gradient algorithms.632.10. Improving Convergence of Deterministic Policy Gradients2.10.1 Momentum-based Gradient AscentClassic MomentumThe classic momentum technique was already seen in the first order basedderivative optimization section in section 2.3. Classical momentum (CM) foroptimizing gradient descent update, is a technique for accelerating gradientascent. It can accumulate a velocity vector in directions of consistent im-provement in the objective function. For our objective function J(θ), theclassical momentum approach is given by:vt+1 = µvt + ∇θJ(θt) (2.42)θt+1 = θt + vt+1 (2.43)where, θ is our vector of policy parameters,  > 0 is the learning rate, andwe use a decaying learning rate of  = at+b, where a and b are the parametersthat we run a grid search over to find optimal values, t is the number oflearning trials on each task, and µ is the momentum parameter µ ∈ [0, 1].2.10.2 Nesterov Accelerated GradientWe also used Nesterov Accelerated Gradient (NAG) for optimizing gradientdescent update in our policy gradient algorithms. It is a first order gradientbased method that has a better convergence rate guarantee. The gradientupdate of NAG is given as:vt+1 = µvt + ∇θJ(θt + µvt) (2.44)642.11. SummaryWe used both classic momentum and nesterov accelerated gradient tospeed up the learning.2.11 SummaryLet us summarize the work flow involved in training a machine learning basedmodel.Data preparation The first task is to collect training data to be able tostart training the model. The input data is represented as x and output datais represented as y in this dissertation. For validating our model we split thedata into three sets. 80% data as training data, 10% as validation data andthe remaining 10% as test data. We use gradient descent based optimizationtechnique to learn the parameters from the training set. For verifying thehyper parameters, the validation data set will be used. Finally to evaluatethe performance of the model, the test data set will be used.Data preprocessing Data preprocessing helps in rapidly finding the op-timal parameters. In this dissertation, the mean shift was done to have zeromean and the individual samples of the features were divided by the vari-ance, to have variance of 1. 
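A minimal sketch of the preprocessing just described (zero mean, unit variance scaling), together with the slicing of a time series into windows as discussed for the recurrent models in Section 2.4.3, is given below. The synthetic series, the window length of 4 and the variable names are illustrative.

import numpy as np

def standardize(X):
    # Zero-mean, unit-variance scaling of each feature column.
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8               # small constant avoids division by zero
    return (X - mean) / std, mean, std        # keep mean/std to transform validation and test data identically

def window_slices(series, window=4):
    # Split a 1-D series into (input window, next value) pairs for RNN training.
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Illustrative usage on a synthetic temperature-like series.
series = np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.randn(500)
X, y = window_slices(series, window=4)
X_scaled, mu, sigma = standardize(X)
print(X_scaled.shape, y.shape)                # (496, 4) and (496,)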
Also, chapter 3 involves training for recurrentneural networks, which involved processing of data as time steps in the formof window slices.Architecture Design Coming up with an effective neural architectureinvolves understanding the underlying structure with supporting theories.652.11. SummaryIn process control based applications, data are stored with respect to time.A common neural network model to learn time series data is the recurrentneural network. Hence, in Chapter 3 we have proposed a recurrent neuralnetwork based architecture to learn optimal control behaviour of MPC. Thenumber of layers and the number of hidden units can be chosen manually tominimize the validation error.Optimization Throughout the dissertation the optimization technique usedfor performing gradient descent update is the Adam optimizer [21] with learn-ing rate of 10−3. Along with Nesterov (NAG) and Momentum (CM) basedtechnique to perform gradient ascent while computing action value function.We also made sure that the learning rate decays over time, we achieved thisby multiplying the learning rate by 0.1 after every 200 iterations. To makesure our optimization algorithm actually searches for minima, we run thealgorithm so as to over fit the data and get 0 training error. In this way weconfirm that bug did not exist in the code. We also store weight files onceevery few iterations, in this way you can decide to choose the weights thathad lower validation error.Hyper parameter Optimization There is no deterministic way of com-ing up with better hyper parameters. The best way to choose hyper param-eters is to randomly choose different hyper parameters and then pick thehyper parameter that has minimum training and validation error.Evaluation Once the model is trained on the training data set, it has tobe tested on the test data set, to check for test error of the trained machine662.11. Summarylearning model. Once we identify the best hyper parameter using hyperparameter optimization, we can use the hyper parameter to learn the entiresample (with training, validation and test data combined.) The machinelearning model is then ready for evaluation and testing.The reminder of this dissertation is organized around the two projectsthat leverage the application of deep learning in process control to solvesome of its important challenges.67Chapter 3Offline Policy Learning - Adeep learning architecture forpredictive controlReferred Conference Publication:The contributions of this chapter have been submitted in:Steven Spielberg Pon Kumar, Bhushan Gopaluni, Philip Loewen,A deeplearning architecture for predictive control, Advanced Control of ChemicalProcesses, 2018 [submitted].Conference Presentation:The contributions of this chapter have also been presented in:Steven Spielberg Pon Kumar, Bhushan Gopaluni, Philip Loewen,A deeplearning architecture for predictive control, Canadian Chemical EngineeringConference, Alberta, 2017.In this chapter we develop approaches for learning the complex behaviourof Model Predictive Control (MPC). The learning is done offline. In the nextchapter we will see how to learn the complex control policies just by inter-acting with the environment. Model Predictive Control (MPC) is a popular683.1. Introductioncontrol strategy that computes control action by solving an optimization ob-jective. However, application of MPC can be computationally demandingand sometimes requires estimation of the hidden states of the system, whichitself can be rather challenging. 
In this work, we propose a novel Deep NeuralNetwork (NN) architecture by combining standard Long Short Term Mem-ory (LSTM) architecture with that of NN to learn control policies from amodel predictive controller. The proposed architecture, referred to as LSTMsupported NN (LSTMSNN), simultaneously accounts for past and presentbehaviour of the system to learn the complex control policies of MPC. Theproposed neural network architecture is trained using an MPC and it learnsand operates extremely fast a control policy that maps system output di-rectly to control action without having to estimate the states.We evaluatedour trained model on varying target outputs, various initial conditions andcompared it with other trained models which only use NN or LSTM.3.1 IntroductionControlling complex dynamical systems in the presence of uncertainty is achallenging problem. These challenges arise due to non-linearities, distur-bances, multivariate interactions and model uncertainties. A control methodwell-suited to handle these challenges is MPC. In MPC, the control actionsare computed by solving an optimization problem that minimizes a cost func-tion while accounting for system dynamics (using a prediction model) andsatisfying input-output constraints. MPC is robust to modeling errors [29]and has the ability to use high-level optimization objectives [32]. However,693.1. Introductionsolving the optimization problem in real time is computationally demandingand often takes lot of time for complex systems. Moreover, MPC sometimesrequires the estimation of hidden system states which can be challengingin complex stochastic non-linear systems and in systems which are not ob-servable. Also, standard MPC algorithms are not designed to automaticallyadapt the controller to model plant mismatch. Several algorithms exist tospeed up the MPC optimization through linearization [33] of the non-linearsystem and through approximation of the complex system by a simpler sys-tem [33],[34],[35]. However, these algorithms do not account for the completenon-linear dynamics of the system. In this chapter, we propose using the deepneural network function approximator to represent the complex system andthe corresponding MPC control policy. Once the control policies of MPC arelearned by the proposed deep neural network, no optimization or estimationis required. The deep neural network is computationally less demanding andruns extremely fast at the time of implementation.We propose a novel neural network architecture using standard LSTMsupported by NN models (LSTMSNN). The output of LSTMSNN is a weightedcombination of the outputs from LSTM and NN. This novel neural networkarchitecture is developed in such a way that its output depends on past con-trol actions, the current system output and the target output. The LSTMpart of LSTMSNN architecture uses past control actions as input to capturethe temporal dependency between control actions. The NN part of LSTM-SNN architecture uses the current system output and the target output asinputs to predict the control action. Our simulations show that the proposeddeep neural network architecture is able to learn the complex nonlinear con-703.2. Related Worktrol policies arising out of the prediction and optimization steps of MPC.We use a supervised learning approach to train LSTMSNN. First, weuse MPC to generate optimal control actions and system output for a giventarget trajectory. Then, we use these data to train LSTMSNN and learn theMPC control policies. 
Once the control policies are learned, the LSTMSNNmodel can completely replace the MPC. This approach has the advantage ofnot having to solve any online optimization problems. In this chapter, werestrict ourselves to developing the LSTMSNN model.Our main contribution is the combined LSTM and NN architecture thatcan keep track of past control actions and present system information totake near optimal control actions. Since LSTMSNN uses deep neural net-works accounting for past and present information, we can train complex,high-dimensional states (system with large number of states) in stochasticnon-linear systems using this approach. One of the unique features of ourmethod is that the network is trained using an MPC. The trained LSTMSNNModel is computationally less expensive than MPC because it does not in-volve an optimization step and does not require estimation of hidden states.In implementation, we use a Graphical Processing Unit (GPU) to parallelizethe computation and the prediction of optimal control actions is done ratherrapidly.3.2 Related WorkModel predictive control (MPC) is an industrially successful algorithm forcontrol of dynamic systems [37],[38] but it suffers from high computational713.3. Modelcomplexity when controlling large dimensional complex non-linear systems -(1) the optimization step that determines the optimal control action can takea very long time, especially for higher dimensional non-linear systems; (2) theestimation of states can be burdensome. To overcome this, various functionalapproximation techniques were developed for use with MPC [39],[40]. Inaddition, model free methods such as reinforcement learning techniques [42]were also developed. NN were used to approximate the MPC prediction step[34] and the optimization cost function. NN were also used to approximatethe nonlinear system dynamics [35] and then MPC was performed on theneural network model. However, NN generated system predictions tend to beoscillatory. Other approaches include [36] using a two tiered recurrent neuralnetwork for solving the optimization problem based on linear and quadraticprogramming formulations. However, these approaches involve estimation ofhidden states at each instant. LSTMs are effective at capturing long termtemporal patterns [30] and are used in a number of applications like speechrecognition, smart text completion, time series prediction, etc. Our approachdiffers from others in that LSTMSNN is based on the past MPC controlactions and the current system output. Moreover, our approach does notinvolve the burden of estimating the hidden states that characterize systemdynamics.3.3 ModelIn this work, we propose a novel neural network architecture called LSTM-SNN. We use LSTMSNN (weighted linear combination of LSTM and NN)723.3. Modelto learn the complex behaviour of MPC. Block diagrams for the trainingand testing phases are shown in Figure 3.1 and Figure 3.2. In the trainingphase shown in Figure 3.1, the LSTMSNN neural network model learns theMPC controlling the system through the data from set point, MPC optimalcontrol action and plant output. Once the model has learnt, it can be usedto generate optimal control actions required to control the plant as given inFigure 3.2.The different neural network models used for training MPC behaviourare discussed below.Figure 3.1: Computation flow of the training processFigure 3.2: Computation flow using the trained model to control the system.733.3. 
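The subsections that follow describe each part of this model in detail. As a rough illustration of how such a combined model can be assembled and trained on MPC-generated data, consider the following tf.keras sketch; the layer sizes, branch structure and the random placeholder arrays standing in for MPC data are illustrative and do not reproduce the exact configuration reported later in Table 3.2.

import numpy as np
from tensorflow.keras import layers, Model

seq_len = 5                                     # past control actions fed to the LSTM branch
u_past = layers.Input(shape=(seq_len, 1))       # past MPC control actions
y_info = layers.Input(shape=(2,))               # current plant output and target output

lstm_branch = layers.Dense(1)(layers.LSTM(32)(u_past))
nn_branch = layers.Dense(1)(layers.Dense(32, activation='relu')(y_info))

# Learned weighted linear combination of the two branches.
u_next = layers.Dense(1, use_bias=False)(layers.Concatenate()([lstm_branch, nn_branch]))

model = Model(inputs=[u_past, y_info], outputs=u_next)
model.compile(optimizer='rmsprop', loss='mse')  # squared error against the MPC control action

# Training data would come from MPC runs: (past actions, y[k], y_target[k]) -> u[k+1].
U_past = np.random.randn(1000, seq_len, 1)      # placeholder arrays standing in for MPC data
Y_info = np.random.randn(1000, 2)
U_next = np.random.randn(1000, 1)
model.fit([U_past, Y_info], U_next, epochs=2, batch_size=64, verbose=0)

At deployment time the trained model replaces the optimizer: at every step the current output, target and recent actions are fed in and the predicted action is applied to the plant.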
3.3.1 Long Short Term Memory (LSTM)

The LSTM with a sequence length of 5 is trained with data generated from MPC. There is a helpful overview of LSTM in Chapter 2, Section 2.4.3. The inputs to the model are past MPC control actions, the plant output and the target. The output is the control action to be taken at the next time step. Fig. 3.3 illustrates the LSTM model. In Fig. 3.3, ℓ denotes the final feature values to be considered in the LSTM sequence; for instance, u[k−ℓ] denotes the final past control action in the LSTM sequence. The details of the model are given in Table 3.2.

Figure 3.3: Architecture of LSTM-only model.

3.3.2 Neural Network (NN)

The inputs to our NN are the previous system output and the target output. The output of the NN is the control action at the next time step. The considered NN model has 3 layers. Fig. 3.4 illustrates the NN model. The details of the NN model architecture are given in Table 3.2.

Figure 3.4: Architecture of NN-only model

3.3.3 LSTMSNN

This is the newly proposed architecture combining the LSTM (sequence length of 5) and NN (3 layers) parts. The motivation for this design is that MPC control actions depend on both the current system output and past input trajectories. Hence, the LSTMSNN output is a weighted combination of the LSTM (which takes past inputs into account) and the NN (which uses the current plant output and required set point) to learn the optimal control action given past input trajectories, the current plant output and the required set point. The best configuration of LSTMSNN, obtained after tuning by simulation, is shown in Fig. 3.5. The details of the LSTMSNN model are given in Table 3.2.

Figure 3.5: LSTMSNN model

3.4 Experimental setup

3.4.1 Physical system

The physical system we chose to study is the manufacture of paper in a paper machine. The target output is the desired moisture content of the paper sheet. The control action is the steam flow rate. The system output is the current moisture content. The difference equation of the system is given by,

y[k] = 0.6 y[k−1] + 0.05 u[k−4]    (3.1)

The corresponding transfer function of the system is,

G(z) = 0.05 z^{-4} / (1 − 0.6 z^{-1})    (3.2)

The time step used for simulation is 1 second.

3.4.2 Implementation

The neural networks are trained and tested using the deep learning framework Tensorflow [41]. In Tensorflow, the networks are constructed in the form of a graph and the gradients of the network required during training are computed using automatic differentiation. Training and testing were done on a GPU (NVIDIA Geforce 960M) using CUDA [6]. We ran the MPC and generated artificial training data in Matlab.

3.4.3 Data collection

We required three variables for training: the target output, the system output, and the control action. All the data used for training was artificially generated in Matlab. We chose the target outputs and then used MPC to find the optimal control actions and resulting system outputs. There are three different sets of data used in this study. Each set of data had 100,000 training points.

Figure 3.7: Excerpts of length 500 from three types of training data.
Blue is the target output, red the system output and green the control action. The binary, sinusoidal, and combined sequences are shown on the top left, top right, and bottom respectively.

Table 3.1: MPC details

Variable             Value
Platform             Matlab (Model Predictive Control Toolbox)
Prediction horizon   20
Control horizon      10
Sampling time        1 s

The first set is a pseudorandom binary sequence (RandomJump). The target output randomly jumps to a value in the range (−3,−0.2) ∪ (0.2, 3). The target output then remains constant for 10 to 100 time steps before jumping again. The second set is a pseudorandom sinusoidal sequence: for each 1000 time steps, a sine function with a period of (10, 1000) time steps was chosen, and Gaussian noise was added to the data with a signal-to-noise ratio of 10. The last set of data is the combination of the first two sets, a pseudorandom binary-sinusoidal sequence (SineAndJump); at each time step, the target outputs from the two data sets were added. The motivation for using pseudorandom binary data for training is that a step function is the building block for more complex functions: any function can be approximated by a series of binary steps, and, furthermore, physical systems are often operated with a step function reference. The motivation to train on sinusoidal data is that our model should learn periodic behaviour. The details of the MPC setup are given in Table 3.1.

3.5 Simulation Results

We conducted experiments to show the effect of the data set on LSTMSNN's performance, the effectiveness of LSTMSNN relative to other models, and the robustness of LSTMSNN under various initial conditions and target outputs. We trained LSTMSNN using RMSprop. We also used the two parts of LSTMSNN, LSTM and NN, as comparison models in our experiments; they are denoted as LSTM-only and NN-only. Details of the model architectures are in Table 3.2. Note that LSTMSNN uses a NN with many layers because we found that a simpler structure does not perform well at test time. We think this is because our training data sets are large, which prevents a simpler structure from getting the most out of the training data. However, decreasing the size of the training data decreases LSTMSNN's performance at test time, especially under varying target outputs. So there is a trade off between the complexity of the architecture and the size of the training data.

Table 3.2: Details of comparison models

Model       Optimizer                     Number of layers                      Sequence length   Learning rate
LSTMSNN     RMSprop                       2 layers of LSTM and 4 layers of NN   5                 0.001
LSTM-only   RMSprop                       2                                     5                 0.001
NN-only     Stochastic Gradient Descent   3                                     N/A               0.001

Our first set of experiments shows the effect of the training data on LSTMSNN's performance. We test the performance of LSTMSNN by specifying an initial condition. LSTMSNN generates a control action in each time step and gets a system output; the previous control action, the system output and a constant target output are the input to LSTMSNN in the next time step. We run this process for 1000 time steps. We use mean square error (MSE)
We compare the performance of LSTMSNN trained on RandomJump with that trained on SineAndJump, as shown in Table 3.3.

Table 3.3: Performance comparison by testing on different data sets
Data set    | Target output | MSE   | OE
RandomJump  | 2             | 0.02  | 0.01
SineAndJump | 2             | 0.06  | 0.01
RandomJump  | 5             | 0.176 | 0.18
SineAndJump | 5             | 0.078 | 0.01
RandomJump  | 10            | 1.28  | 1.8
SineAndJump | 10            | 0.01  | 0.01

Although training on RandomJump outperforms SineAndJump when the target output is the constant 2, training on SineAndJump allows LSTMSNN to maintain a low OE as the target output increases, and it is more important for a controller to maintain a small OE during the process. We think the size and diversity of the data set cause the performance difference: SineAndJump allows LSTMSNN to learn the optimal control actions under more system and target outputs, whereas RandomJump does not. The output responses of the trained LSTMSNN model for constant set points of 5 and 10 are shown in Figures 3.8 and 3.9.

Our second set of experiments shows the effectiveness of LSTMSNN by comparing it with NN-only and LSTM-only. In this experiment, each model starts with the same initial condition and receives a system output after making a decision on the control action in each time step. The current system output, the fixed target output and the current control action are the input to LSTMSNN and LSTM-only at the next time step. The current system output and the fixed target output are the input to NN-only at the next time step, because the NN outputs the next control action without using past control actions, as described in Section 3.3.2. We run this process for 1000 time steps for each model. We use SineAndJump to train each model. Results are shown in Table 3.4.

Table 3.4: Performance comparison between methods
Model     | Target output | MSE    | OE
LSTMSNN   | 2             | 0.06   | 0.01
NN-only   | 2             | 0.07   | 0.23
LSTM-only | 2             | 5.56   | Did not converge
LSTMSNN   | 5             | 0.078  | 0.01
NN-only   | 5             | 0.53   | 0.7
LSTM-only | 5             | 62.42  | Did not converge
LSTMSNN   | 10            | 0.01   | 0.01
NN-only   | 10            | 2.21   | 1.3
LSTM-only | 10            | 167.78 | Did not converge

Figure 3.8: Comparison of the NN-only model and the LSTMSNN model tracking a set point of 5.

Figure 3.9: Comparison of the NN-only model and the LSTMSNN model tracking a set point of 10. The set point of 10 was not shown to the model during training.

Figures 3.8 and 3.9 show the system outputs from LSTMSNN and NN-only. NN-only has a constant gap between its system output and the target output, whereas LSTMSNN does not. The difference in OE between LSTMSNN and NN-only also reflects this flaw of NN-only. It is very promising that LSTMSNN maintains an OE close to 0.01 as the time step increases. LSTMSNN performs much better than the other models in all other cases. A target output of 10 is completely outside the range of our training data, but LSTMSNN can still produce system outputs with low MSE and OE. The reason is that LSTMSNN chooses control actions using the trend of the past control actions through the LSTM as well as the present system output through the NN.
The weighted combination of LSTM and NN in LSTMSNN has learnt the time series and the mapping (from outputs to control actions) so well that, even though the data set range is −3 to +3, it is able to make the system output reach a target output of 10 with little oscillation.

The third set of experiments shows LSTMSNN's ability to handle various initial conditions and varying target outputs. We use SineAndJump to train LSTMSNN. We give LSTMSNN a set of random initial conditions and set a fixed target output. From Figure 3.10, we can see that LSTMSNN is able to adjust and produce a system output close to the target output after some time steps. We also give LSTMSNN varying target outputs, and we can observe from Figures 3.11 and 3.12 that LSTMSNN can accurately follow the varying target output. This demonstrates that LSTMSNN is a robust model. The system outputs of LSTMSNN are very good when the varying target output is smooth. We did find that LSTMSNN's system outputs are not as good when the target output has some sharp corners; in those cases, LSTMSNN performs poorly at the sharp corners but starts to perform well again when the target output becomes smooth.

Figure 3.10: LSTMSNN's system output under various initial conditions. The green line is the target output. The other colours denote system outputs under different initial conditions.

Figure 3.11: LSTMSNN's system output under varying target outputs (top: a linear set point change; bottom: a quadratic set point change). The green line is the target output. The blue line is LSTMSNN's system output.

Figure 3.12: LSTMSNN's system output under varying target outputs (top: a cubic set point change; bottom: a set point of 10, which was not shown to the model during training). The green line is the target output. The blue line is LSTMSNN's system output.

3.6 Conclusion

We have developed a novel neural network architecture to learn the complex behaviour of MPC. This approach eliminates the burden of state estimation, optimization and prediction in MPC, since these are learnt as abstract information in the hidden layers and hidden units of the network. Also, since our implementation uses the GPU, computing control actions is relatively fast. Hence, the proposed approach can be used to learn the behaviour of MPC offline, and the trained model can be deployed in production to control a system. We believe these results are promising enough to justify further research involving complex systems.

Chapter 4
Online Policy Learning using Deep Reinforcement Learning

Referred Conference Publication:
The contributions of this chapter have been published in:
Steven Spielberg Pon Kumar, Bhushan Gopaluni and Philip D.
Loewen, Deep Reinforcement Learning Approaches for Process Control, Advanced Control of Industrial Processes, Taiwan, 2017.

The contributions of this chapter have also been presented in:
Presentation:
Steven Spielberg Pon Kumar, Bhushan Gopaluni, Philip Loewen, Deep Reinforcement Learning Approaches for Process Control, American Institute of Chemical Engineers Annual Meeting, San Francisco, USA, 2016.

It is common in the process industry to use controllers ranging from proportional controllers to advanced Model Predictive Controllers (MPC). However, the classical controller design procedure involves careful analysis of the process dynamics, development of an abstract mathematical model, and finally, derivation of a control law that meets certain design criteria. In contrast to the classical design process, reinforcement learning offers the prospect of learning appropriate closed-loop controllers by simply interacting with the process and incrementally improving control behaviour. The promise of such an approach is appealing: instead of using a time-consuming human control design process, the controller learns the process behaviour automatically by interacting directly with the process. Moreover, the same underlying learning principle can be applied to a wide range of different process types: linear and nonlinear systems; deterministic and stochastic systems; single input/output and multi input/output systems. From a system identification perspective, both model identification and controller design are performed simultaneously. This controller differs from traditional controllers in that it assumes no predefined control law but rather learns the control law from experience. With the proposed approach, the reward function indirectly serves as an objective function. The drawbacks of standard control algorithms are: (i) a reasonably accurate dynamic model of the complex process is required, (ii) model maintenance is often very difficult, and (iii) online adaptation is rarely achieved. The proposed reinforcement learning controller is an efficient alternative to standard algorithms, and it automatically allows for continuous online tuning of the controller.

The main advantages of the proposed algorithm are as follows: (i) it does not involve deriving an explicit control law; (ii) it does not involve deriving first-principles models, as the process behaviour is learnt by simply interacting with the environment; and (iii) learning the policy with deep neural networks is rather fast.

The chapter is organized as follows: Section 4.2 describes the system configuration from a reinforcement learning perspective, Section 4.3 outlines the technical approach of the learning controller, Section 4.4 describes the implementation, Section 4.5 empirically verifies the effectiveness of our approach, and conclusions follow in Section 4.6.

4.1 Related Work

Reinforcement learning (RL) has been used in process control for more than a decade [10]. However, the available algorithms do not take into account the improvements made through recent progress in the field of artificial intelligence, especially deep learning [1]. Remarkably, human-level control has been attained in games [2] and physical tasks [3] by combining deep learning and reinforcement learning [2]. Recently, these controllers have even learnt the control policy of the complex game of Go [15]. However, the current literature is primarily focused on games and physical tasks. The objective of these tasks differs slightly from that of process control.
In process control, the task is to keep the outputs close to a prescribed set point while meeting certain constraints, whereas in games and physical tasks the objective is more generic: in games, the goal is to take actions that eventually win the game; in physical tasks, to make a robot walk or stand. This chapter discusses how the success in deep reinforcement learning can be applied to process control problems. In process control, action spaces are continuous, and reinforcement learning for continuous action spaces had not been studied until [3]. This work aims at extending the ideas in [3] to process control applications.

4.2 Policy Representation

A policy is the RL agent's behaviour. It is a mapping from a set of states S to a set of actions A, i.e., π : S → A. We consider a deterministic policy, with both states and actions in a continuous space. The following subsections provide further details about the representations.

4.2.1 States

A state s consists of features describing the current state of the plant. Since the controller needs both the current output of the plant and the required set point, the state is the plant output and set point tuple ⟨y, y_set⟩.

4.2.2 Actions

The action a is the means through which the RL agent interacts with the environment. The controller input to the plant is the action.

4.2.3 Reward

The reward signal r is a scalar feedback signal that indicates how well the RL agent is doing at step t. It reflects the desirability of a particular state transition that is observed by performing action a starting in the initial state s and resulting in a successor state s'. Figure 4.1 illustrates the state-action transitions and corresponding rewards of the controller in a reinforcement learning framework. For a process control task, the objective is to drive the output toward the set point while meeting certain constraints. This specific objective can be fed to the RL agent (controller) by means of a reward function. Thus, the reward function plays a role similar to that of the objective function in a Model Predictive Control formulation.

Figure 4.1: Transition of states and actions.

The reward function has the form

r(s, a, s') = \begin{cases} c, & \text{if } |y_i - y_i^{set}| \le \varepsilon \ \forall i \\ -\sum_{i=1}^{n} |y_i - y_i^{set}|, & \text{otherwise} \end{cases}    (4.1)

where i represents the ith component in a vector of n outputs in MIMO systems, and c > 0 is a constant. A high value of c leads to a larger value of r when the outputs are within the prescribed radius ε of the set point, and typically results in quicker tracking of the set point.

The goal of learning is to find a control policy π that maximizes the expected value of the cumulative reward R. Here, R can be expressed as the time-discounted sum of all transition rewards r_i from the current action up to a specified horizon T (where T may be infinite), i.e.,

R(s_0) = r_0 + \gamma r_1 + \dots + \gamma^T r_T    (4.2)

where r_i = r(s_i, a_i, s'_i) and γ ∈ (0, 1) is a discount factor. The sequence of states and actions is determined by the policy and the dynamics of the system. The discount factor ensures that the cumulative reward is bounded, and captures the fact that events occurring in the distant future are likely to be less consequential than those occurring in the more immediate future. The goal of RL is to maximize the expected cumulative reward.
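The reward in (4.1) translates directly into code. A minimal Python sketch is given below; the values of c and ε are tuning choices, and the ones shown are only illustrative.

# Sketch: the set point tracking reward of (4.1) for n outputs.
import numpy as np

def reward(y, y_set, c=10.0, eps=0.05):
    """y, y_set: arrays of the n outputs and their set points (SISO or MIMO)."""
    err = np.abs(np.asarray(y, dtype=float) - np.asarray(y_set, dtype=float))
    if np.all(err <= eps):
        return c                     # all outputs inside the eps-band: bonus c > 0
    return -float(err.sum())         # otherwise: negative sum of absolute errors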
4.2.4 Policy and Value Function Representation

The policy π is represented by an actor using a deep feed-forward neural network parameterized by weights W_a; thus, the actor is represented as π(s, W_a). This network is queried at each time step to take an action given the current state. The value function is represented by a critic using another deep neural network parameterized by weights W_c; thus, the critic is represented as Q(s, a, W_c). The critic network predicts the Q-value of each state-action pair, and the actor network proposes an action for the given state. At run time, the learnt deterministic policy is used to compute the control action at each step, given the current state s, which is a function of the current output and the set point of the process.

4.3 Learning

Figure 4.2: Learning overview.

The overview of control and learning in the control system is shown in Fig. 4.2. The algorithm for learning the control policy is given in Algorithm 1. The learning algorithm is inspired by [3], with modifications to account for set point tracking and other recent advances discussed in [4], [5]. In order to facilitate exploration, we add noise sampled from an Ornstein-Uhlenbeck (OU) process [14], as discussed in [3]. Apart from the exploration noise, the random initialization of the system output at the start of each episode ensures that the policy does not get stuck in a local optimum. An episode is terminated when it runs for 200 time steps or when it has tracked the set point to within ε for five consecutive time steps. For systems with higher values of the time constant τ, more time steps per episode are preferred. At each time step the actor π(s, W_a) is queried, and the tuple ⟨s_i, s'_i, a_i, r_i⟩ is stored in the replay memory RM. The motivation for using a replay memory is to break up the correlated samples used for training. The learning algorithm uses an actor-critic framework. The critic network estimates the value of the current policy by Q-learning, and thus provides the loss function for learning the actor. The loss function of the critic is given by L(W_c) = E[(r + γ Q_t − Q(s, a, W_c))^2]. Hence, the critic network is updated using this loss, with the gradient given by

\frac{\partial L(W_c)}{\partial W_c} = E\left[ (r + \gamma Q_t - Q(s, a, W_c)) \, \frac{\partial Q(s, a, W_c)}{\partial W_c} \right]    (4.3)

where Q_t = Q(s', π(s', W_a^t), W_c^t) denotes the target value, W_a^t and W_c^t are the weights of the target actor and target critic respectively, and ∂Q(s, a, W_c)/∂W_c is the gradient of the critic with respect to the critic parameters W_c. The discount factor γ ∈ (0, 1) determines the present value of future rewards. A value of γ = 1 considers all rewards from future states to be equally influential, whereas a value of 0 ignores all rewards except that of the current state; rewards in future states are discounted by γ. This is a tunable parameter that controls how far into the future the learning controller looks. We chose γ = 0.99 in our simulations. The expectation in (4.3) is over the mini-batches sampled from RM. The learnt value function thus provides a loss function to the actor, and the actor updates its policy in a direction that improves Q. Hence, the actor network is updated using the gradient

\frac{\partial J(W_a)}{\partial W_a} = E\left[ \frac{\partial Q(s, a, W_c)}{\partial a} \, \frac{\partial \pi(s, W_a)}{\partial W_a} \right]    (4.4)

where ∂Q(s, a, W_c)/∂a is the gradient of the critic with respect to the sampled actions of the mini-batches from RM, and ∂π(s, W_a)/∂W_a is the gradient of the actor with respect to its parameters W_a. Thus, the critic and the actor are updated in each iteration, resulting in policy improvement. The layers of the actor and critic neural networks are batch normalized [11].
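The replay memory RM can be as simple as a fixed-size first-in-first-out buffer with uniform sampling. A minimal Python sketch follows; the prioritized sampling of [5] that is adopted later in this section is omitted, and apart from the capacity of 10^4 and the mini-batch size of 64 reported in Section 4.4, the details are assumptions.

# Sketch: a fixed-size replay memory with uniform mini-batch sampling.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)    # oldest tuples are dropped first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))   # one transition tuple <s, a, s', r>

    def sample(self, n=64):
        # Uniform random mini-batch (prioritized replay would weight these samples).
        return random.sample(list(self.buffer), min(n, len(self.buffer)))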
Algorithm 2: Learning Algorithm
1:  W_a, W_c ← initialize random weights
2:  Initialize the replay memory RM with random policies
3:  for episode = 1 to E do
4:    Reset the OU process noise N
5:    Specify a set point y_set at random
6:    for step = 1 to T do
7:      s ← ⟨y_t, y_set⟩
8:      a ← action u_t = π(s, W_a) + N_t
9:      Execute action u_t on the plant
10:     s' ← ⟨y_{t+1}, y_set⟩, r ← reward  (state and reward at the next instant)
11:     Store the tuple ⟨s, a, s', r⟩ in RM
12:     Sample a mini-batch of n tuples from RM
13:     Compute y_i = r_i + γ Q_t^i for all i in the mini-batch
14:     Update the critic:
15:       W_c ← W_c + α (1/n) Σ_i (y_i − Q(s_i, a_i, W_c)) ∂Q(s_i, a_i, W_c)/∂W_c
16:     Compute ∇_p^i = ∂Q(s_i, a_i, W_c)/∂a_i
17:     Clip the gradient ∇_p^i by (4.5)
18:     Update the actor:
19:       W_a ← W_a + α (1/n) Σ_i ∇_p^i ∂π(s_i, W_a)/∂W_a
20:     Update the target actor:
21:       W_a^t ← τ W_a + (1 − τ) W_a^t
22:     Update the target critic:
23:       W_c^t ← τ W_c + (1 − τ) W_c^t
24:   end for
25: end for

Target network: Similar to [3], we use separate target networks for the actor and the critic. We freeze each target network and replace it with the corresponding learnt network once every 500 iterations. This makes the learning an off-policy algorithm.

Prioritized experience replay: While drawing samples from the replay buffer RM, we use prioritized experience replay as proposed in [5]. Empirically, we found that this accelerates learning compared to uniform random sampling.

Inverting gradients: Sometimes the actor neural network π(s, W_a) produces an output that exceeds the action bounds of the specific system. Hence, the bound on the output layer of the actor neural network is enforced by controlling the gradient passed from the critic to the actor. To avoid making the actor return outputs that violate the control constraints, we invert the gradients of ∂Q(s, a, W_c)/∂a as proposed in [4], using the transformation

\nabla_p = \nabla_p \cdot \begin{cases} (p_{max} - p)/(p_{max} - p_{min}), & \text{if } \nabla_p \text{ suggests increasing } p \\ (p - p_{min})/(p_{max} - p_{min}), & \text{otherwise} \end{cases}    (4.5)

where ∇_p denotes an entry of the parameterized gradient ∂Q(s, a, W_c)/∂a, p is the corresponding entry of the actor output (the action), and p_max, p_min set the maximum and minimum actions of the agent. This transformation makes sure that the gradients are clipped so that the output of the actor network, i.e., the control input, stays within the range given by p_min and p_max, the lower and upper bounds on the action space. It serves as an input constraint on the RL agent.

4.4 Implementation and Structure

The learning algorithm was written in Python and implemented on a machine running an Ubuntu Linux distribution with 16 GB of RAM. The deep neural networks were built using Tensorflow [41]. The high-dimensional matrix multiplications were parallelized on a Graphics Processing Unit (GPU); we used an NVIDIA 960M GPU.

The details of our actor and critic neural networks are as follows. We used Adam [21] to learn the actor and critic neural networks. The learning rate used for the actor was 10^{-4} and for the critic 10^{-3}. For the critic we added L2 regularization with penalty λ = 10^{-2}. The discount factor used to learn the critic was γ = 0.99. We used the rectified non-linearity as the activation function for all hidden layers in the actor and the critic. The actor and critic neural networks each had 2 hidden layers, with 400 and 300 units respectively. The Adam optimizer was trained with mini-batches of size 64. The size of the replay buffer is 10^4.
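To make these choices concrete, the following is a minimal, self-contained Python sketch (not the thesis code) of the actor and critic described above, together with one mini-batch update following (4.3)-(4.4). The layer sizes, learning rates, L2 penalty, discount factor and soft target update follow the text and Algorithm 2; the tanh output bound, the target-update rate τ and the state/action dimensions are assumptions, and batch normalization, prioritized replay and gradient inversion are omitted.

# Sketch: actor/critic networks and one DDPG-style update per (4.3)-(4.4).
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

state_dim, action_dim = 2, 1      # state <y, y_set> and one control input (SISO case)
gamma, tau = 0.99, 0.001          # discount factor; soft-update rate (assumed value)

def make_actor():
    s = layers.Input(shape=(state_dim,))
    h = layers.Dense(400, activation="relu")(s)
    h = layers.Dense(300, activation="relu")(h)
    a = layers.Dense(action_dim, activation="tanh")(h)   # bounded output (assumption)
    return Model(s, a)

def make_critic():
    s = layers.Input(shape=(state_dim,))
    a = layers.Input(shape=(action_dim,))
    h = layers.Concatenate()([s, a])
    h = layers.Dense(400, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-2))(h)
    h = layers.Dense(300, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-2))(h)
    return Model([s, a], layers.Dense(1)(h))

actor, critic = make_actor(), make_critic()
target_actor, target_critic = make_actor(), make_critic()
target_actor.set_weights(actor.get_weights())
target_critic.set_weights(critic.get_weights())
actor_opt = tf.keras.optimizers.Adam(1e-4)
critic_opt = tf.keras.optimizers.Adam(1e-3)

def train_step(s, a, r, s2):
    """One critic and actor update on a mini-batch of transitions."""
    # Critic: regress Q(s, a) toward r + gamma * Q_t(s', pi_t(s'))  -- eq (4.3)
    q_target = r + gamma * target_critic([s2, target_actor(s2)])
    with tf.GradientTape() as tape:
        critic_loss = (tf.reduce_mean(tf.square(q_target - critic([s, a])))
                       + tf.add_n(critic.losses))            # L2 penalty terms
    critic_opt.apply_gradients(zip(
        tape.gradient(critic_loss, critic.trainable_variables),
        critic.trainable_variables))
    # Actor: ascend dQ/da * dpi/dWa, i.e. minimize -Q(s, pi(s))  -- eq (4.4)
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    actor_opt.apply_gradients(zip(
        tape.gradient(actor_loss, actor.trainable_variables),
        actor.trainable_variables))
    # Soft target updates, as in steps 20-23 of Algorithm 2.
    for net, tgt in ((actor, target_actor), (critic, target_critic)):
        tgt.set_weights([tau * w + (1 - tau) * tw
                         for w, tw in zip(net.get_weights(), tgt.get_weights())])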
4.5 Simulation Results

We tested our learning based controller on a paper-making machine (a Single Input Single Output system) and a high-purity distillation column (a Multi Input Multi Output system). The output and input responses from the learned policies are illustrated in Figs. 4.3 - 4.12. The results are discussed below.

4.5.1 Example 1 - SISO System

A 1 × 1 process is used to study this approach to learning the policy. The industrial system we choose to study is the control of a paper-making machine. The target output, y_set, is the desired moisture content of the paper sheet. The control action, u, is the steam flow rate, and the system output, y, is the current moisture content. The differential equation of the system is given by

y(t) - 0.6\, \dot{y}(t) = 0.05\, \dot{u}(t)    (4.6)

where \dot{y}(t) and \dot{u}(t) are the output and input derivatives with respect to time. The corresponding transfer function of the system is

G(s) = \frac{0.05\, s}{1 - 0.6\, s}    (4.7)

The time step used for simulation is 1 second.

The system is learned using Algorithm 1. A portion of the learning curve for the SISO system given in (4.6) is shown in Fig. 4.3, and the corresponding learned policy is shown in Fig. 4.4. The learned controller is also tested on various other cases, such as the response to a set point change and the effect of output and input noise; the results are shown in Figs. 4.5 - 4.10. For learning set point changes, the learning algorithm was shown integer-valued set points in the range [0, 10]. It can be seen from Fig. 4.10 that the agent has learned how to track even set points that it was not shown during training.

Figure 4.3: Learning curve: rewards to the reinforcement learning agent for system (4.6) over time. Here 1 episode is equal to 200 time steps.

Figure 4.4: Output and input response of the learnt policy for system (4.6).

4.5.2 Example 2 - MIMO System

Our approach is tested on a high-purity distillation column Multi Input Multi Output (MIMO) system described in [57]. The first output, y_1, is the distillate composition, the second output, y_2, is the bottom composition, the first control input, u_1, is the boil-up rate and the second control input, u_2, is the reflux rate. The differential equations of the MIMO system are

\tau \dot{y}_1(t) + y_1(t) = 0.878\, u_1(t) - 0.864\, u_2(t)
\tau \dot{y}_2(t) + y_2(t) = 1.0819\, u_1(t) - 1.0958\, u_2(t)    (4.8)

where \dot{y}_1(t) and \dot{y}_2(t) are derivatives with respect to time. The output and input responses of the learned policy for this system are shown in Fig. 4.11 and Fig. 4.12.

We choose the number of steps per episode to be 200 for learning the control policies. When systems with larger time constants are encountered, we usually require more steps per episode. The reward formulation serves as a tuner to adjust the performance of the controller, how set points should be tracked, and so on. When implementing this approach on a real plant, effort can often be saved by warm-starting the deep neural network through prior training on a simulated model. Experience also suggests using exploration noise with higher variance in the early phase of learning, to encourage high exploration.
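For the exploration noise, a discretized Ornstein-Uhlenbeck process [14] can be used as sketched below in Python. This is only a sketch: the parameters θ and σ and the schedule for shrinking σ over episodes are tuning choices, not values reported in the thesis.

# Sketch: Ornstein-Uhlenbeck exploration noise, reset at the start of each episode.
import numpy as np

class OUNoise:
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.3, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.ones(action_dim) * mu

    def reset(self, sigma=None):
        """Call at the start of each episode; optionally pass a smaller sigma later on."""
        self.x[:] = self.mu
        if sigma is not None:
            self.sigma = sigma

    def sample(self):
        # Mean-reverting random walk: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, 1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x.copy()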
Figure 4.5: Output and input response of the learnt system with output noise of σ² = 0.1 for the SISO system (4.6).

Figure 4.6: Output and input response of the learnt system with output noise of σ² = 0.3 for the SISO system (4.6).

Figure 4.7: Output and input response of the learnt system with input noise of σ² = 0.1 for the SISO system (4.6).

Figure 4.8: Output and input response of the learnt SISO system (4.6) with a set point change and output noise of σ² = 0.1.

Figure 4.9: Output and input response of the learnt SISO system (4.6) with a set point change and output and input noise of σ² = 0.1.

Figure 4.10: Output and input response of the learnt SISO system (4.6) for set points unseen during training.

Figure 4.11: Output and input response of the first output variable, with output noise of σ² = 0.01, of the 2 × 2 MIMO system (4.8).

The final policies for the learned SISO system given in (4.6) are the result of 7,000 iterations of training, collecting about 15,000 tuples and requiring
about 2 hours of training on an NVIDIA 960M GPU. For learning the MIMO system given in (4.8), the compute time on the same GPU was about 4 hours, requiring 50K tuples. The learning time remains dominated by the mini-batch training at each time step.

Figure 4.12: Output and input response of the second output variable, with output noise of σ² = 0.01, of the 2 × 2 MIMO system (4.8).

4.6 Conclusion

The use of deep learning and reinforcement learning to build a learning controller was discussed. We believe process control will see development in the area of reinforcement learning based controllers.

Chapter 5
Conclusion & Future Work

We have developed an artificial intelligence based approach to process control using deep learning and reinforcement learning. We posed our problem both as a supervised learning problem and as a reinforcement learning problem. This framework supports the development of control policies that learn directly by interacting with the plant's output. All our methods require only the plant output in order to return optimal control actions, and hence they are relatively fast compared to currently available controllers. Our approach avoids the need for hand-crafted feature descriptors, controller tuning, deriving control laws, and developing mathematical models. We believe industrial process control will see rapid and significant advances from deep learning and reinforcement learning in the near future. Possibilities for future work include the following.

MPC + Reinforcement Learning: In Chapter 3 we combined MPC with deep learning to build a controller that can learn offline and then control the plant. Our future work aims to learn from the MPC and the plant online. MPC relies on an accurate model of the plant for best performance; hence, applying reinforcement learning to the MPC and the plant will help the controller adapt to changes in the plant and in the assumed model.

Observation of hidden states: Our new architecture, LSTMSNN, was used to learn the behaviour of MPC. Our observation regarding hidden states is that LSTMSNN learns information about them through the hidden neurons available at different levels. It would be instructive to study the information in these hidden units and how the hidden states are learnt, through relevant visualization and theoretical justification.

Bibliography

[1] Krizhevsky, A., I. Sutskever, and G. E. Hinton. "Imagenet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, Lake Tahoe, pp. 1097-1105, 2012.

[2] Mnih, V., K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg and D. Hassabis. "Human-level control through deep reinforcement learning." Nature 518, no. 7540: 529-533, 2015.

[3] Lillicrap, T. P., J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971, 2015.

[4] Hausknecht, M., and P. Stone. "Deep reinforcement learning in parameterized action space." arXiv preprint arXiv:1511.04143, 2015.

[5] Schaul, T., J.
Quan, I. Antonoglou, and D. Silver. “Prioritized experi-ence replay.” arXiv preprint arXiv:1511.05952, 2015.[6] NVIDIA, NVIDIA CUDA—Programming Guide, December 2008.116Bibliography[7] Hinton, G., L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly,A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury.“Deep neural networks for acoustic modeling in speech recognition: Theshared views of four research groups.” IEEE Signal Processing Magazine29, no. 6: 82-97, 2012.[8] Yao K., G. Zweig, M. Y. Hwang, Y. Shi, and D. Yu. “Recurrent neuralnetworks for language understanding.” Interspeech, Lyon, France, pp.2524-2528, 2013.[9] Mnih, V., K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D.Wierstra, and M. Riedmiller. ”Playing atari with deep reinforcementlearning.” arXiv preprint arXiv:1312.5602, 2013.[10] Hoskins, J. C., and D. M. Himmelblau “Process control via artificialneural networks and reinforcement learning”. Computers & chemicalengineering, 16(4), 241-251.[11] Ioffe, S. and C. Szegedy “Batch normalization: Accelerating deepnetwork training by reducing internal covariate shift” arXiv preprintarXiv:1502.03167, 2015.[12] Silver, D., G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Ried-miller. “Deterministic policy gradient algorithms” Proceedings of the31st International Conference on Machine Learning, Beijing, China, pp.387-395, 2014.[13] Bertsekas, D. P., and J. N. Tsitsiklis. “Neuro-dynamic programming: an117Bibliographyoverview.” Proceedings of the 34th IEEE Conference on Decision andControl, Louisiana, USA, vol. 1, pp. 560-564. IEEE, 1995.[14] Uhlenbeck, G. E., and L. S. Ornstein. “On the theory of the Brownianmotion.” Physical review 36, no. 5 : 823, 1930.[15] Silver, D., A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. V. D. Driess-che,J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S.Dieleman, D. Grewe,J. Nham, N. Kalchbrenner, I. Sutskever, T. Lilli-crap, M. Leach, K. Kavukcuoglu, T. Graepel and D. Hassabis, “Master-ing the game of Go with deep neural networks and tree search”. Nature.Jan 28;529(7587):484-489, 2016.[16] Goodfellow, I., Y. Bengio, and A. Courville. Deep learning. MIT press,2016.[17] Sutton, R. S., and A. G. Barto. Reinforcement learning: An introduc-tion. Vol. 1, no. 1. Cambridge: MIT press, 1998.[18] Polyak, B. T., and A. B. Juditsky. “Acceleration of stochastic approxi-mation by averaging.” SIAM Journal on Control and Optimization 30,no. 4: 838-855, 1992.[19] Duchi, J., E. Hazan, and Y. Singer. “Adaptive subgradient methodsfor online learning and stochastic optimization.” Journal of MachineLearning Research 12, no.: 2121-2159, 2011.[20] Tieleman, T. and G. Hinton, Lecture 6.5 RmsProp: Divide the gradi-ent by a running average of its recent magnitude. COURSERA: NeuralNetworks for Machine Learning, 2012118Bibliography[21] Kingma, D., and J. Ba. “Adam: A method for stochastic optimization.”arXiv preprint arXiv:1412.6980, 2014.[22] Sutskever, I., J. Martens, and G. E. Hinton. “Generating text with recur-rent neural networks.” In Proceedings of the 28th International Confer-ence on Machine Learning (ICML-11), Washington, USA, pp. 1017-1024,2011.[23] Wu, Y., S. Zhang, Y. Zhang, Y. Bengio, and R. R. Salakhutdinov. “Onmultiplicative integration with recurrent neural networks.” In Advancesin Neural Information Processing Systems, Spain, pp. 2856-2864. 2016.[24] Bengio, Y., P. Simard, and P. Frasconi. “Learning long-term dependen-cies with gradient descent is difficult”. 
IEEE Transactions on NeuralNetworks, 5(2):157-166, 1994.[25] Pascanu, R., T. Mikolov, and Y. Bengio. ”On the difficulty of train-ing recurrent neural networks.” In International Conference on MachineLearning, Atlanta, USA, pp. 1310-1318. 2013.[26] Hochreiter, S., and J. Schmidhuber. “Long short-term memory.” Neuralcomputation 9, no. 8: 1735-1780, 1997.[27] LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. “Gradient-based learn-ing applied to document recognition.” Proceedings of the IEEE 86, no.11: 2278-2324, 1998.[28] Bergstra, J., and Y. Bengio. “Random search for hyper-parameter opti-mization.” Journal of Machine Learning Research 13: 281-305, 2012.119Bibliography[29] Mayne, D. Q., M. M. Seron, and S. V. Rakovic. “Robust model predic-tive control of constrained linear systems with bounded disturbances.”Automatica 41, no. 2 : 219-224, 2005.[30] Greff, K., R. K. Srivastava, J. Koutnk, B. R. Steunebrink, and J.Schmidhuber. “LSTM: A search space odyssey.” IEEE transactions onneural networks and learning systems, 2017.[31] Degris, T., M. White, and R. S. Sutton. “Off-policy actor-critic.” arXivpreprint arXiv:1205.4839, 2012.[32] Todorov, E., T. Erez, and Y. Tassa, “MuJoCo: A physics engine formodel-based control” IEEE/RSJ International Conference on IntelligentRobots and Systems, Portugal, pp. 5026-5033, 2012[33] Kuhne, F., W. F. Lages, and J. M. G. D. Silva Jr. “Model predictivecontrol of a mobile robot using linearization.” In Proceedings of mecha-tronics and robotics, pp. 525-530, 2004.[34] Kittisupakorn, P., P. Thitiyasook, M. A. Hussain, and W. Daosud. “Neu-ral network based model predictive control for a steel pickling process.”Journal of Process Control 19, no. 4 : 579-590, 2009.[35] Piche, S., J. D. Keeler, G. Martin, G. Boe, D. Johnson, and M. Gerules.“Neural network based model predictive control.” In Advances in NeuralInformation Processing Systems, pp. 1029-1035. 2000.[36] Pan, Y., and J. Wang. ”Two neural network approaches to model predic-tive control.” In American Control Conference, Washington, pp. 1685-1690. IEEE, 2008.120Bibliography[37] Alexis, K., G. Nikolakopoulos, and A. Tzes, “Model predictive quadro-tor control: attitude, altitude and position experimental studies”. IETControl Theory & Applications, 6(12), pp.1812-1827, 2012.[38] Wallace, M., S. S. P. Kumar, and P. Mhaskar. “Offset-free model predic-tive control with explicit performance specification.” Industrial & Engi-neering Chemistry Research 55, no. 4: 995-1003, 2016.[39] Zhong, M., M. Johnson, Y. Tassa, T. Erez, and E. Todorov. ”Valuefunction approximation and model predictive control.” IEEE Sympo-sium on Adaptive Dynamic Programming And Reinforcement Learning(ADPRL), pp. 100-107. IEEE, 2013.[40] Akesson, B. M., and H. T. Toivonen. ”A neural network model predictivecontroller.” Journal of Process Control 16 no.9: 937-946, 2006.[41] Abadi, M., A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kud-lur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah,M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker,V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M.Wattenberg, M. Wicke, Y. Yu, and X. Zheng “Tensorflow: Large-scalemachine learning on heterogeneous distributed systems.” arXiv preprintarXiv:1603.04467, 2016.[42] Spielberg, S. P. K., R. B. Gopaluni, and P. D. Loewen. 
“Deep reinforce-ment learning approaches for process control.” In Advanced Control121Bibliographyof Industrial Processes (AdCONIP), Taiwan, 6th International Sympo-sium, pp. 201-206. IEEE, 2017.[43] Hassabis, D., “AlphaGo: using machine learning to master the ancientgame of Go”, https://blog.google/topics/machine-learning/alphago-machine-learning-game-go/, January, 2016[44] Karpathy, A., “Deep Reinforcement Learning: Pong from Pixels”, http://karpathy.github.io/2016/05/31/rl/, May 31, 2016[45] Levy, K., “Here’s What It Looks Like To Drive In One Of Google’s Self-Driving Cars On City Streets”, http://www.businessinsider.com/google-self-driving-cars-in-traffic-2014-4, April, 2014.[46] Welinder, P., B. Mcgrew, J. Schneider, R. Duan, J. Tobin, R. Fong,A. Ray, F. Wolski, V. Kumar, J. Ho, M. Andrychowicz, B. Stadie,A. Handa, M. Plappert, E. Reinhardt, P. Abbeel, G. Brockman, I.Sutskever, J. Clark & W. Zaremba., “Robots that learn”, https://blog.openai.com/robots-that-learn/, May 16, 2017.[47] Stelling, S., “Is there a graphical representa-tion of bias-variance tradeoff in linear regression?”,https://stats.stackexchange.com/questions/19102/is-there.., November29, 2011.[48] Macdonald K., “k-Folds Cross Validation inPython”, http://www.kmdatascience.com/2017/07/k-folds-cross-validation-in-python.html, July 11, 2017.122Bibliography[49] Preetham V. V., “Mathematical foundation for Activation Func-tions in Artificial Neural Networks”, https://medium.com/autonomous-agents/mathematical-foundation-for-activation.., August 8, 2016.[50] Klimova, E., “Mathematical Model of Evolutionof Meteoparameters: Developing and Forecast”,http://masters.donntu.org/2011/fknt/klimova/.., 2011.[51] Boyan, J. A. “Least-squares temporal difference learning.” InternationalConference on Machine Learning, Bled, Slovenia, pp. 49-56. 1999.[52] Camacho, E. F. and C. Bordons, “Model Predictive Control”, SpringerScience & Business Media. 1998.[53] Cho, K., B. V. Merrienboer, D. Bahdanau, and Y. Bengio. “On theproperties of neural machine translation: Encoder-decoder approaches.”arXiv preprint arXiv:1409.1259, 2014.[54] Bojarski, M., D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P.Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao,K. Zieba. “End to end learning for self-driving cars.” arXiv preprintarXiv:1604.07316, 2016.[55] Schmidt, M., “Proximal gradient and stochastic sub gradient”, MachineLearning, Fall-2016. https://www.cs.ubc.ca/ schmidtm/Courses/540-W16/L7.pdf[56] Postoyan, R., L. Busoniu, D. Nesic and J. Daafouz, “Stability Analy-sis of Discrete-Time Infinite-Horizon Optimal Control With Discounted123BibliographyCost,” in IEEE Transactions on Automatic Control, vol. 62, no. 6, pp.2736-2749, 2017.[57] Patwardhan, R. S., and R. B. Goapluni. “A moving horizon approach toinput design for closed loop identification.” Journal of Process Control24.3: 188-202, 2014.[58] Sutton, R. S., D. A. McAllester, S. P. Singh, and Y. Mansour. “Policygradient methods for reinforcement learning with function approxima-tion.” In Advances in neural information processing systems, Denver,United States, pp. 1057-1063. 2000.[59] Precup, D., R. S. Sutton, and S. Dasgupta. ”Off-policy temporal-difference learning with function approximation.” International Confer-ence on Machine Learning, Massachusetts, USA, pp. 417-424. 2001.124
