Monotone Optimal Policies for Quasivariational Inequalities Arising in Optimal Portfolio Liquidation

by

Daniel J. Crawford

B.A.Sc. Engineering Physics, University of British Columbia, 2010

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Applied Science in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

December 2014

© Daniel J. Crawford, 2014

Abstract

This thesis studies the Hamilton-Jacobi-Bellman quasivariational inequality (HJBQVI), the corresponding optimal value function, and discrete schemes useful for approximating this value function. Moreover, the structural properties of the optimal policy of a particular discrete scheme are studied. The motivation is to find a convergent approximating scheme for the otherwise complicated HJBQVI whose monotone policy structure can be exploited in a stochastic gradient estimation scheme to approximate optimal policy function parameters. In order to motivate this approach, we consider the problem of optimal liquidation of a single risky asset portfolio as an impulse control problem. The model is defined over continuous time, continuous state, and compact action sets, and the optimal liquidation value and strategy are found from the viscosity solution of a HJBQVI. It is shown that the optimal strategy is monotone in the number of shares owned and the time remaining to liquidation. This structural result is exploited to estimate the optimal policy via a reinforcement learning method based on the simultaneous perturbation stochastic approximation (SPSA) algorithm. The optimal policy can be estimated without knowledge of the parameters of the underlying model.

Preface

This dissertation is original, unpublished, independent work by the author, D. Crawford. Portions of this thesis which use notions from previous work have been indicated with rigorous citation.

Works submitted for publication:

D. Crawford, V. Krishnamurthy, "Monotone Optimal Policies for Quasivariational Inequalities Arising in Optimal Portfolio Liquidation".

D. Crawford, V. Krishnamurthy, "Monotone Optimal Policies in Portfolio Liquidation Problems".

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication
1 Introduction
  1.1 Related Work
  1.2 Main Results and Organization
2 Background
  2.1 Discrete Time, State, and Action Optimal Control
  2.2 Continuous Time, State, and Action Optimal Control
  2.3 Viscosity Solutions
  2.4 Reinforcement Learning
3 Finite Horizon Portfolio Liquidation Model
  3.1 Liquidation Dynamics
  3.2 HJBQVI Characterization of Optimal Policy
  3.3 Main Result
4 Proof of Main Result
  4.1 The Denumerable MDP Representation of the HJBQVI
  4.2 Convergence of the Approximating Scheme
  4.3 Optimal Policy Structure
5 Numerical Results
  5.1 Finite Difference Approximations: Ground Truth Estimates
  5.2 Countable State, Action, and Time MDP
  5.3 Policy Search Algorithm for Unknown Price Evolution Structure
6 Conclusion
7 Tables and Figures
  7.1 Tables
  7.2 Figures
Bibliography
A Proofs
  A.1 Proof of Theorem 4.2.1
  A.2 Proof of Theorem 4.2.2
  A.3 Proof of Lemma 4.3.1
  A.4 Proof of Theorem 4.3.1

List of Tables

Table 7.1 Simulated Liquidation Problem Parameters

List of Figures

Figure 7.1 Solving for the Value Function from Generator Function (5.1), Snapshot at Constant Shares Owned
Figure 7.2 Small action taken ("Sell 10 shares")
Figure 7.3 Maximum action taken ("Sell 50 shares")
Figure 7.4 Value function over time and price, with constant shares held
Figure 7.5 Value function over time and price, with constant shares held
Figure 7.6 Optimal Liquidation Strategy - Constant Price P = 10
Figure 7.7 Optimal Liquidation Strategy - Constant Price P = 60
Figure 7.8 SPSA: Sub-Optimal Liquidation Strategy - Constant Price P = 10
Figure 7.9 SPSA: Sub-Optimal Liquidation Strategy - Constant Price P = 60

Glossary

HJB: Hamilton-Jacobi-Bellman
QVI: Quasivariational Inequality
HJBQVI: Hamilton-Jacobi-Bellman Quasivariational Inequality
MDP: Markov Decision Process
SPSA: Simultaneous Perturbation Stochastic Approximation
USC: Upper Semi-continuous
LSC: Lower Semi-continuous
WLOG: Without Loss of Generality

Acknowledgments

Special thanks to Ann, Murray, Mark, Peeti, Steph, Bridget, Woody and all of my close friends for the endless support during my degree.

Thank you to all of my laboratory colleagues for numerous insightful talks about mathematics and engineering, and for the less insightful - and equally rewarding - talks about the stresses of studying mathematics and engineering.

A very sincere thank you to my supervisor Dr.
Vikram Krishnamurthy, for pushing me to excel in a difficult area of research, and for always being available to discuss the direction of my research.

Dedication

To my grandparents.

Chapter 1

Introduction

In financial frameworks, situations arise where an agent must convert a position in a risky asset into cash. This could be due to sudden unexpected expenses, disinterest in the market, or any foreseeable reason why holding a position in the asset is unwanted by the asset holder. Some popular, yet naïve, strategies for liquidating this position are to sell the position immediately, to sell uniform portions of shares up to a liquidation deadline, or to wait until the liquidation deadline to sell all shares. However, given this liquidation deadline and some a priori knowledge of the price dynamics, the decision maker has the opportunity to construct an optimized strategy in order to maximize the total profit from the process.

This thesis studies optimal liquidation strategies of a single risky asset as an impulse control problem. The problem can be formulated as a Hamilton-Jacobi-Bellman quasivariational inequality (HJBQVI), a continuous time, state, and action stochastic optimal control formulation which combines the impulse control tools developed by Bensoussan and Lions [1] with continuous-time dynamic programming methods. The HJBQVI has since been used in numerous applications, for example in physics, engineering, automatic control, and finance. This thesis follows the application of the HJBQVI to optimal impulse control liquidation strategies developed in [2]. In this model, the price of the asset is assumed to follow a geometric Brownian motion, and no risk-free asset, such as a guaranteed interest savings account or optional bond holding, is permitted. The model discussed in [2] is extended in this work to include market impact incurred through the selling actions of the agent (see also [3], [4], [5], [6], [7], [2], and [8], although the latter considers a more general geometric Brownian motion whose drift coefficient is a one-jump Markov process rather than a constant). The market impact representation used in this work is adapted from [9], and is considered a non-decaying, permanent price impact. Refer to [5], [10] for discussion of price impact (temporary, delayed, permanent). The impact term is included to assist in the structural proofs contained in the following sections.

After describing the impulse control HJBQVI, it is shown that the optimal value function arises from an optimal liquidation strategy which is a monotone function of the number of shares of the asset owned, the price of the asset, and the amount of time remaining until the liquidation horizon. The arguments leading to this conclusion hinge on the assumption that the immediate and terminal investor actions give rise to a reward function that is concave in the sell action and non-decreasing in the asset price. It is shown that this structural result can be derived from a corresponding discretized Markov Decision Process (MDP) (see [11], [12] for background). Further, it is shown that in the limiting case of the discretization in time these properties continue to hold, thereby establishing a structural result for the optimal policy of the HJBQVI.
Once this method for proving monotone structures ofHJBQVI impulse problems is established, it can be adapted for a variety of otherproblems not necessarily in a financial framework.The goal of this work is to show that the inherently complicated HJBQVI de-cision model can be probed to yield a policy structure result. When we have thisresult, the study of this mathematical construct can be adjusted from the classi-cal optimal control treatment and instead studied from a more modern engineeringpoint of view in the form of applying reinforcement learning techniques.1.1 Related WorkFor a development of risk theory and decision making in finance, see [13] for a basicoverview, and [1], [14] for in-depth treatments of optimal stopping time and optimalimpulse control processes. Since the primary interest is in analyzing the structureof the optimal policy, our model assumes no bid-ask spread, and no trading fre-quency penalty. For work closely considering these factors see [2], [3], [15], [16], and[5]. There has been a push to work on more general stochastic processes to modelthe price of financial instruments, most generally through jump-diffusion processes([17], [18], [19], [15], [20], [21], [22], [23] ) where prices follow discontinuous pathswith jumps modeled by Le´vy processes such as a compound Poisson process. Theideas of this thesis do not immediately consider jump discontinuities in price, but2the ideas covered here with permanent price impact extend directly to these moregeneral cases as the jump intensities typically do no depend on the investing agent’sactions. For an excellent reference on reinforcement learning algorithms, stochasticsearch, stochastic optimization, stochastic gradient methods, and especially the si-multaneous perturbation stochastic approximation algorithm applied in this work,see [24].1.2 Main Results and OrganizationA general background is given in Sec. 2. It is split into four sections. Sec 2.1 intro-duces the discrete time, space, and action Markov Decision Process (MDP). MDPsare a mathematical tool used to model decision making in situations where out-comes are partially random and partially under the control of an agent. Moreover,the underlying state can also be stochastic. In this sense, MDPs are discrete time,state, and action stochastic control processes. An MDP is used in this work as anapproximating tool to its more complex continuous time, state, and action cousin,which, as mentioned above, is formulated as the HJBQVI. This more complicatedmodel is explained in both background and derivation (although not in great depth,as this could take up the majority of this thesis) in Sec. 2.2. In impulse controlproblems, there are often jumps in the value function solution to the HJBQVI. Forthis, the concept of viscosity solutions is introduced in Sec. 2.3, and is used laterin the convergence proofs when comparing value function of the MDP to the valuefunction of the HJBQVI. Finally, since the major contribution of this work is toincorporate modern engineering reinforcement learning techniques into the complexmathematical discipline of optimal impulse control, Sec. 2.4 introduces the con-cept of reinforcement learning, it’s applications, and various algorithms which arein prominent use today.In Sec.3, the continuous-time portfolio liquidation problem is formulated infinite-continuous time. 
The portfolio model contains a three dimensional state spaceover a positive real cube and includes the price of the asset, the cash in pocket, andthe quantity of shares of the asset. The optimal portfolio liquidation value functionis shown to satisfy the HJBQVI. The setup closely follows [2] in order to constructthe dynamic programming principle. The main result of the work is stated: thevalue function of the HJBQVI can be approximated by the value function of a de-numerable MDP whose optimal policy is monotone in price, shares owned, and time3until liquidation.Sec. 4 aims to prove the main result. This is done in three steps: A countablestate,action, and time MDP is formulated on a discretized grid of the problemvariables from Sec. 3. The value function of this MDP is then shown to convergeto the value function of the HJBQVI in the limiting case. The value functionconvergence ultimately uses a viscosity solution argument (see Sec. 2.4), which is initself an approximating argument. However, we argue that this approximating erroris negligible as it is extensively used in the literature. The structure of the optimalliquidation policy of the discrete problem is analyzed and it is shown that there existsmonotone characteristics of the optimal liquidation action in time until liquidation,the price of the asset, and the quantity of shares owned. Proof of this structuralresult is given and follows from concavity of the value function with respect tothe asset price and shares owned variables, and the treatment of the HJBQVI asa sequential allocation process. Given this result, combined with the convergenceresults above, this shows that the HJBQVI has a value function that is equal tothe value function under convergence of a discrete MDP with this monotonic actioncharacteristics. We maintain that this is a new result and may be extended to moregeneral decision processes in future work.In Sec. 5, numerical simulations are provided when assuming that the price evo-lution distribution is known. A finite difference scheme is introduced along with aniterative method for numerically estimating the optimal value function and optimalliquidation strategy of the HJBQVI. Simulation parameters are chosen to emulatereal world scenarios and resemble those used in [2]. The approximating countablestate and action MDP is modeled using the same parameters and the results areshown to be consistent with the finite difference approximation. When the price evo-lution distribution is not known and therefore transition probabilities of the MDPformulation are unknown, the mentioned machine learning approach is used, exploit-ing the monotone structure results from Sec. 4. More specifically, the reinforcementlearning method of simultaneous perturbation stochastic approximation (SPSA) isused to estimate the optimal monotone strategy in this case.4Chapter 2BackgroundThis introductory section begins by giving an overview of Markov Decision Pro-cesses (MDPs) used traditionally in finite state, action, and discrete time optimalcontrol and decision making problems. These mathematical models are heavily usedin engineering, robotics, automated control, manufacturing (for instance) when theunderlying process which a user bases descisions on is stochastic in nature, andwhen outcomes are partially random. Following this, a continuous state, action,and time optimal constrol/decision model is introduced in the form of a Hamilton-Jacobi-Bellman quasivariational inequality (HJBQVI). 
This is a somewhat complexformulation (compared to the previous MDP model) in that we now have to usestochastic calculus, uniqueness/existence theorems, as well as the notion of viscos-ity solutions to derive optimal control strategies and value functions. Finally, theconcept of reinforcement learning algorithms is introduced. These are the typesof algorithms the author has been interested in applying to the otherwise complexHJBQVI problem, in an attempt to approximate the optimal control policy whennot all paremeters of the underlying model are known, or if the state, time, and/oraction discretizations are simply too fine for using classical MDP approximations.2.1 Discrete Time, State, and Action Optimal ControlThis section is adapted from [12] and defines the basic structure of finite horizonMarkov Decision Processes in countable state, action, and time. MDPs are mod-els for sequential decision making when the controlled system involves uncertainty.There are five components which describe a MDP:5i. Decision epochs: an increasing time variable, e.g. n = 0,1,2, .... in the infinitehorizon case or m ∈ N∩ [0,T ] for some T <∞ in the finite horizon case.ii. System states: a stochastic process which represents the underlying systemstate. This could be the condition of some machinery, the price of a stock ona financial exchange, etc. Typically states are given a symbolic ordering s.t.S = {si}i∈N where each symbol is understood by the decision maker.iii. Actions/decisions: a set of functions that constitute a decision maker policy.The application of these action functions is how the decision maker affects thesytem in order to maximize some reward function or minimize some cost func-tion.iv. Immediate costs/rewards and terminal cost/rewards: These are functions map-ping from state and action to some value function which is assessed by thedecision maker to be ”good” or ”bad”, depending on the desired outcome of thecontrol process.v. State Transition Probability Functions: in the MDP formulation, it is assumedthat the state can transition according to a Markov Chain with underlying(known or unknown) transition kernels. Moreover, we assume that the statetransitions are also affected by the decision maker’s actions, such that certaintransitions could be considered more or less favorable for the system dependingon the user’s external input.The goal of a decision maker is to use a MDP combined with a dynamic programmingalgorithm to decide which actions are optimal at each decision epoch, in any systemstate, in order to maximize or minimize some value criterion. In the case of this work,decision epochs 1, ...,N are considered, with N <∞. Let pi = (d1,d2, ...,dN−1),di ∈T ⊂ Rn be a randomized policy for each decision epoch n = 1, ...,N − 1, where didenotes the control decision applied when in epoch i. Let the admissible actionsbe D. Let the state of the system be donoted {Xn}n=1,...,N take values in someset S ⊂ Rn. Let cn(Xn,dn) ∈ R be the immediate cost of applying action dn whenthe system is in state Xn at epoch n. Let CN (XN ) be the terminal cost when thecontrolled system is left in state XN at the final epoch. Finally, assume the systemis in state Xi = x ∈ S at epoch i and that some action di = d ∈Rn is applied. At thenext epoch, the system will transition to another state Xj = x′ ∈ S probabilisticallywith the transition kernel given by P(Xj = x′|Xi = x,di = d) = P(x′,x,d). 
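To make the five components above concrete before formulating the objective and Bellman's recursion (2.3)-(2.5) below, the following Python sketch solves a toy finite-horizon MDP by backward induction. It is a minimal sketch under assumed placeholder data: the state space, action set, rewards, and transition kernel are randomly generated for illustration, and all names and dimensions are hypothetical rather than taken from this thesis.

```python
import numpy as np

def backward_induction(P, c, C_T, N):
    """Finite-horizon backward induction for a discrete MDP.

    P   : array (N-1, S, A, S), P[n, x, a, x2] = Pr(X_{n+1} = x2 | X_n = x, d_n = a)
    c   : array (N-1, S, A),    immediate reward c_n(x, a)
    C_T : array (S,),           terminal reward C_N(x)
    N   : number of epochs; decisions are taken at epochs 1, ..., N-1

    Implements v_n(x) = max_a [ c_n(x, a) + sum_x' P(x'|x, a) v_{n+1}(x') ],
    with terminal condition v_N(x) = C_N(x), i.e. the recursion (2.3)-(2.5).
    """
    S = C_T.shape[0]
    v = np.zeros((N, S))
    d = np.zeros((N - 1, S), dtype=int)
    v[N - 1] = C_T                       # terminal condition v_N(x) = C_N(x)
    for n in range(N - 2, -1, -1):       # sweep backward over decision epochs
        Q = c[n] + P[n] @ v[n + 1]       # state-action values Q_n(x, a), shape (S, A)
        v[n] = Q.max(axis=1)             # value update, cf. (2.3)
        d[n] = Q.argmax(axis=1)          # greedy decision rule, cf. (2.4)
    return v, d

# Toy instance: 3 states, 2 actions, horizon N = 5; every number is illustrative.
rng = np.random.default_rng(0)
N, S, A = 5, 3, 2
P = rng.random((N - 1, S, A, S))
P /= P.sum(axis=-1, keepdims=True)       # normalize into transition kernels
c = rng.random((N - 1, S, A))
C_T = rng.random(S)
v, d = backward_induction(P, c, C_T, N)
```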
Given these components, a total expected cost function can be formulated for policy π and an initial state x ∈ S such that

v^\pi(x) = \mathbb{E}\Big\{ \sum_{n=1}^{N} c_n(X_n, d_n) + C_N(X_N) \,\Big|\, X_0 = x, \pi \Big\}.   (2.1)

The optimization problem is to find π* = (d*_1, ..., d*_{N-1}), d*_i ∈ R^n, such that

v^{\pi^*}(x) = v(x) := \sup_{\pi \in T} v^{\pi}(x),   (2.2)

for an initial state x ∈ S. Given π = {d_1, ..., d_{N-1}} ∈ T, the possible actions at each epoch are d_i ∈ D. This value function, and the coinciding optimal policy π*, are the solution of the following stochastic dynamic programming recursion called Bellman's equation [12]:

v_n(x) = \max_{a \in D} Q_n(x, a),   (2.3)

d^*_n(x) = \arg\max_{a \in D} Q_n(x, a),   (2.4)

where

Q_n(x, a) = c_n(x, a) + \sum_{x' \in S} P(x', x, a)\, v_{n+1}(x'),   (2.5)

and v_N(x) = C_N(x).

Since this work aims to probe the structural properties of the decision process, the monotonic structure of an optimal policy is now described. A policy π* is monotonic in a certain variable if the policy elements d*_i are nonincreasing (or nondecreasing) in that variable, for all i ∈ {1, ..., N−1}. For example, π* is monotonically nondecreasing in x if d*_n(x) ≤ d*_n(x + δ) for any δ ≥ 0.

There are two methods used for proving this structural result in this work: one uses a result from a sequential allocation framework, and the other uses super- and submodularity arguments. For the latter, define supermodularity as follows:

Definition 2.1.1. A function F(x, y) : X × Y → R is supermodular in (x, y) if F(x_1, y_1) + F(x_2, y_2) ≥ F(x_1, y_2) + F(x_2, y_1) for all x_1, x_2 ∈ X and y_1, y_2 ∈ Y with x_1 > x_2, y_1 > y_2. If the inequality is reversed, the function F(·, ·) is called submodular.

As will be shown in Sec. 4, the methodology used to prove the monotonic structure of a policy proceeds in two steps:

1. The monotonicity of the optimal value function v_n(x) is shown.

2. The state-action reward function Q_n(·, ·) defined in (2.5) is shown to be supermodular in (x, n), which, combined with step 1, yields the monotonic structure of the optimal policy.

This structural result is shown for an MDP which simulates the Hamilton-Jacobi-Bellman quasivariational inequality (discussed in the next section). An MDP that simulates a HJBQVI is an MDP whose value function converges to the value function of the HJBQVI, but not necessarily its policy. In essence, the summary of this work is that, given a HJBQVI, we can formulate an MDP with a specific optimal policy structure that converges in value to the original problem. This structure can then be exploited in numerical schemes by utilizing reinforcement learning techniques such as the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm, giving a machine learning approach to solving the HJBQVI when certain state properties are unknown.

2.2 Continuous Time, State, and Action Optimal Control

This work utilizes the notion of a quasivariational inequality (QVI), originally described by [1] in 1982, which has since been used in many areas including finance, engineering, and operations research. The formal framework for QVIs is given as follows:

Definition 2.2.1. (Formal QVI Framework)
Let A denote a second-order elliptic or parabolic differential operator on a set O ⊂ R^n or on Q = O × ]0, T[.
Consider the following non-linear operator on a measurablefunction v:v→M(v) (2.6)with the following increasing propertyv1 ≤ v2⇒M(v1)≤M(v2) (2.7)8The goal is to find a function u(x,t),x ∈ O, t ∈]0,T [ such thatAu−f ≤ 0, u−M(u)≤ 0,(Au−f)(u−M(u)) = 0 in O×]0,T [,(QVI)and the addition of limit conditions, and if A is evolutionary, initial conditions.Next, given the formal definition of this mathematical tool, it is shown how wecan utilize the QVI in an impulse control problem in a dynamical system. Considerthe state of some system of interest given by the stochastic differential equation(SDE)dXt = g(Xt, t)dt+σ(Xt, t)dBt, (2.8)where g(·, ·) is called the drift of process X and σ(·, ·) is called the diffusion ofprocess X. It is assumed that the coefficients b(·, ·) and σ(·, ·) are Lipschitz, suchthat ∃K > 0 s.t.|σ(t,x)−σ(t,y)| ≤K|x−y|, |b(t,x)− b(t,y)| ≤K|x−y|,for all t ≥ 0 and x,y ∈ O. If the Lipschitz condition holds, and for each constantR> 0 there is some CR such that|σ(s,0)|+ |b(s,0)| ≤ CRfor all s≤R, then SDE (2.8) is an exact SDE and ∃! solution Xt : [0,T ]×O 7→ Rn.To control process X with an impulse control, it is assumed that at certain impulseinstances the state undergoes jumps refered to as impulses. The decision processis then the collection of pairs of impulse instances and impulse intensities, calledan impulse control. Denote the impulse control strategy as α= {τn, ζn}n=0,1,..., andgiven a finite horizon T < 0,P(limn→∞τn ≥ T)= 0. (2.9)9The state dynamics under impulse strategy α becomesdXt = g(Xt, t)dt+σ(Xt, t)dBt, for τi ≤ t < τi+1,Xτi =Xτ−i+ ζi,X0 = x,where Xτi is given byXτi =Xτi−1 +τi∫τi−1g(Xs,s)ds+τi∫τi−1σ(Xs,s)dBs.The goal is to maximize some economic function that is scenario specific. Theeconomic function is tailored by specifying rewards dependent on the impulse controland system state throughout the problem time interval [0,T ], with the option of alsospecifying a terminal reward dependent on the state and final action. In general theeconomic criterion for impulse control is writtenJα(x) = ET∫0f(Xs,s)ds+∑i:τi≤Tc(ζi,Xτi)e−βτi +CT (XT ) , (2.10)where f(·, ·) is the continuous running cost of the problem, c(·, ·) is the cost/rewardof a single impulse action, β > 0 is some time value discount factor, and CT (·) isthe terminal reward/cost dependent on the final state of the system. The decisionmaker wants to either minimize (in the case of c(·, ·) being a detriment), or maximize(in the case of c(·, ·) being favorable) the economic function J s.t.v = supα∈AJα, (2.11)α∗ = arg supα∈AJα, (2.12)where A is a system dependent set of admissible impulse actions which the decisionmaker defines to be allowable at each point in O× [0,T ], v is the optimal valuefunction (called just the value function), and α∗ is the optimal impulse strategy.The goal of this research is to study this optimal impulse strategy through variousdiscretizations and convergences, and say that the value function v corresponds to adiscrete MDP that has an optimal policy which has an exploitable structure. More10details about this structure is described in the next section, outlining the work donethus far. To relate this maximization/minimization problem to (QVI), define adynamic programming approach to finding the value function v. 
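Before turning to that dynamic programming characterization, note that for any fixed admissible impulse strategy the criterion (2.10) can be estimated directly by Monte Carlo simulation of the controlled state (2.8). The sketch below does this under strong simplifying assumptions: the impulse times and sizes are fixed deterministic placeholders, and the drift, diffusion, running reward, impulse reward, and terminal reward are illustrative choices, not the liquidation model of Chapter 3.

```python
import numpy as np

def evaluate_impulse_criterion(x0, impulses, T=1.0, beta=0.1,
                               n_steps=1000, n_paths=10000, seed=1):
    """Monte Carlo estimate of the impulse-control criterion J^alpha(x) in (2.10)
    for one fixed impulse strategy alpha = {(tau_i, zeta_i)}, 0 < tau_i < T.

    The drift g, diffusion sigma, running reward f, impulse reward c, and
    terminal reward C_T below are illustrative placeholders.
    """
    g     = lambda x, t: 0.05 * x              # drift g(x, t)
    sigma = lambda x, t: 0.20 * x              # diffusion sigma(x, t)
    f     = lambda x, t: 0.0                   # running reward f(x, t)
    c     = lambda zeta, x: -0.01 * abs(zeta)  # reward/cost of one impulse
    C_T   = lambda x: x                        # terminal reward C_T(x)

    dt = T / n_steps
    rng = np.random.default_rng(seed)
    X = np.full(n_paths, float(x0))
    J = np.zeros(n_paths)
    schedule = sorted(impulses)
    j = 0
    for k in range(n_steps):
        t = k * dt
        # apply any impulse scheduled in [t, t + dt): the state jumps by zeta
        while j < len(schedule) and t <= schedule[j][0] < t + dt:
            tau, zeta = schedule[j]
            X = X + zeta                               # X_tau = X_{tau-} + zeta
            J += np.exp(-beta * tau) * c(zeta, X)      # discounted impulse reward
            j += 1
        J += f(X, t) * dt                              # running reward term
        dB = rng.normal(0.0, np.sqrt(dt), n_paths)
        X = X + g(X, t) * dt + sigma(X, t) * dB        # Euler-Maruyama step
    return float((J + C_T(X)).mean())

# Example: two impulses of size +0.2 at t = 0.3 and t = 0.7 (all values illustrative).
print(evaluate_impulse_criterion(x0=1.0, impulses=[(0.3, 0.2), (0.7, 0.2)]))
```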
[Bensoussan andLions, 1982] show that under certain assumptions the value function v is the uniquesolution to the quasivariational inequality〈Av,u−v〉 ≥ 〈f,u−v〉, ∀v :O× [0,T ], u≤Mu0≤ v ≤Mv,(2.13)where Mu(t,x) :O× [0,T ] 7→ O× [0,T ] is an intervention operator defined asMu(t,x) = k+ inf(τi,ζi)∈A,x+ζ∈Ou(t−,x+ ζ) + c(ζi,Xτi), (2.14)where k ≥ 0 is a fixed impulse cost, and c(·, ·) is the immediate reward/cost asdescribed above. The operatorAu can be thought of as an infinitessimal, no-impulse,continuation regime, in which the state evolves without interruption according tothe state SDE. Therefore we define A as followsAu(x) = limt↓0Ex,y,p[u(t,Xt)]−u(t,x)t,and for the general SDE described in (2.8), for some function f ∈ C2(Rn× [0,T ]),by an application of Itoˆ’s formulaAf(t,x) =∑igi(t,x)∂f∂xi+12∑i,j(σσT )i,j(t,x)∂2f∂xi∂xj. (2.15)Given these control systems definiton, many authors have shown that the valuefunction v is the unique viscosity solution the the followingmax{∂v∂t−Av,Mv−v}= 0 on [0,T )×O, (2.16)together with the terminal conditionmax{v−CT (XT ),Mv−v}= 0 on {T}×O. (2.17)Equations (2.16) and (2.17) are referred to as the Hamilton-Jacobi-Bellman quasi-variational inequality (HJBQVI) for the impulse control problem (2.11) and (2.12).112.3 Viscosity SolutionsThe main result of this work relies on the concept of viscosity solutions, a topic thor-oughly discussed in [25]. Viscosity solutions are used when uniqueness, existence,and stability results are sought after for partial differential equations where singu-larities, discontinuities or regions of non-differentiability may occur. Further, weare considering applications of impulse control strategies with the Hamilton-Jacobi-Bellman PDE which is inherently discontinuous in regions of impulse. The formaldefinition of a viscosity solution, a weak solution to a PDE is as follows. Take theequationF (x,u(x), ∂u∂x ,∇u(x)) = 0 (2.18)over some domain Ω. Here, ∇u(x) is the Hessian matrix. F (·, ·, ·) is called degenerateelliptic if for any two symmetric matrices X and Y , X−Y is positive definite, and∀x ∈ Ω and p ∈ Rn,F (x,u(x),p,X)≥ F (x,u(x),p,Y ). (2.19)For a given upper semicontinuous function u¯(x) : Ω 7→Rn, u(·) is called the viscositysubsolution if at any point xˆ ∈Ω, and for any C2 function φ(·) : Ω 7→Rn s.t. φ(xˆ) =u(xˆ) and φ(x) ≥ u(x) for x ∈ B(, xˆ), a ball of arbitrary radius  > 0 around pointxˆ, the following holdsF (Xˆ,φ(xˆ), ∂φ(xˆ)∂x ,∇φ(xˆ))≤ 0. (2.20)Conversely, given a lower semicontinuous function u¯(x) : Ω 7→ Rn, u(·) is called theviscosity supersolution if at any point xˆ ∈ Ω, and for any C2 function φ(·) : Ω 7→ Rns.t. φ(xˆ) = u(xˆ) and φ(x) ≤ u(x) for x ∈ B(, xˆ), a ball of arbitrary radius  > 0around point xˆ, the following holdsF (Xˆ,φ(xˆ), ∂φ(xˆ)∂x ,∇φ(xˆ))≥ 0. (2.21)Finally, a continuous function u(x) : Ω 7→ Rn is a viscosity solution of the degen-erate elliptical equation (2.18) if u(·) is both a viscosity supersolution and a vis-cosity subsolution. 
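Returning briefly to the impulse-control structure of Sec. 2.2, the nonlocal term that distinguishes the HJBQVI (2.16)-(2.17) from a standard HJB equation is the intervention operator M defined in (2.14). As a concrete illustration of how Mv can be evaluated on a discrete state grid (a step that reappears in the approximating schemes of Chapter 4), the following sketch uses the cost-minimization form of (2.14) with a fixed impulse cost k; the one-dimensional grid, candidate impulse set, cost function, and linear interpolation rule are all illustrative assumptions.

```python
import numpy as np

def intervention_operator(u, x_grid, zetas, imp_cost, k=0.0):
    """Evaluate the intervention operator (2.14) on a one-dimensional state grid.

    u        : values u(t, x) at a fixed time slice, one entry per point of x_grid
    x_grid   : increasing array of admissible states (a discretized set O)
    zetas    : candidate impulse sizes zeta
    imp_cost : function c(zeta, x), the cost of a single impulse
    k        : fixed cost per impulse

    Returns Mu(x) = k + min over admissible zeta of [u(x + zeta) + c(zeta, x)],
    where u(x + zeta) is obtained by linear interpolation and impulses that
    leave the grid are discarded.  Points with no admissible impulse keep
    Mu = +inf, meaning no intervention is possible there.
    """
    Mu = np.full_like(u, np.inf)
    for i, x in enumerate(x_grid):
        for zeta in zetas:
            x_new = x + zeta
            if x_grid[0] <= x_new <= x_grid[-1]:     # admissibility: x + zeta stays in O
                u_new = np.interp(x_new, x_grid, u)
                Mu[i] = min(Mu[i], k + u_new + imp_cost(zeta, x))
    return Mu

# Illustrative use: quadratic value slice, impulses shifting the state to the left.
x_grid = np.linspace(0.0, 10.0, 101)
u = (x_grid - 4.0) ** 2
Mu = intervention_operator(u, x_grid, zetas=[-2.0, -1.0, -0.5],
                           imp_cost=lambda z, x: 0.1 * abs(z), k=0.05)
```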
In the next sections, the concept of viscosity solutions to the derived HJBQVI will be exploited to show an approximate convergence result for the discretization scheme used to establish the monotone structure of the ε-optimal liquidation policy, where ε-optimal indicates the optimal policy found from the viscosity-convergent value function of the corresponding MDP.

2.4 Reinforcement Learning

The ultimate goal of this work is to apply a reinforcement learning technique to the optimal liquidation impulse control problem, an approach that has yet to be attempted for an impulse control problem represented by a HJBQVI. This section gives a broad definition of reinforcement learning, its applications, and the use of the SPSA algorithm, for reasons which will be pointed out in the following.

Reinforcement learning is an area of machine learning concerned with learning how to take actions or apply controls in a given environment so as to maximize some given reward structure or minimize some given cost structure. Compared to dynamic programming, where the decision making agent has knowledge of the underlying (possibly stochastic) system process, a reinforcement learning technique is a form of approximate dynamic programming, used when either the state space is too large or the system dynamics are not completely known.

The SPSA algorithm belongs to this reinforcement learning class. This overview is adapted from [26] and [24]. SPSA is a method for optimizing stochastic processes when one or more of the underlying processes are governed by unknown parameters. First and foremost, SPSA is an algorithm for locating maxima or minima of optimization problems of the form

d^* = \arg\max_{d \in D} v(d),   (2.22)

where v(·) is some cumulative cost or reward functional dependent on the parameter d. Notice how this formulation compares to the optimal control problem defined in Sec. 2.1. The algorithm uses the iterative scheme

d_{n+1} = d_n - \gamma_n \widehat{\nabla} v(d_n),   (2.23)

where γ_n is a sequence of positive step sizes eventually converging to zero, and d_{n+1} = ((d_{n+1})_1, ..., (d_{n+1})_k) is a k-dimensional vector which could, for example, represent an action process. Given a random perturbation vector Δ_n and a scaling value c_n, the SPSA algorithm estimates the i-th entry of the gradient of the cumulative value function via

\widehat{\nabla} v(d_n)_i = \frac{v(d_n + c_n \Delta_n) - v(d_n - c_n \Delta_n)}{2\, c_n\, (\Delta_n)_i},   (2.24)

which means that only two simulations of the value function are needed per iteration, regardless of the dimension k. Compare this to classical estimation algorithms such as the Kiefer-Wolfowitz algorithm, which require on the order of k simulations per gradient estimate. As the state and action spaces of the MDP (Sec. 2.1) use increasingly fine discretizations in order to approximate the HJBQVI value function (Sec. 2.2), it is easy to see that the SPSA algorithm remains tractable while other stochastic gradient estimation algorithms such as Kiefer-Wolfowitz become impractical. This is the motivation behind using SPSA; further details involving convergence and performance can be found in the cited references and are not discussed in this thesis. In the numerical results portion (Sec. 5) the algorithm is written out explicitly to give a more detailed look at the methodology.

This concludes the brief introduction of MDPs, the Hamilton-Jacobi-Bellman quasivariational inequality, viscosity solutions, and the SPSA reinforcement learning / stochastic gradient estimation algorithm.
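As a concrete complement to this overview, the following sketch implements the SPSA recursion (2.23)-(2.24) for a minimization problem, using the Bernoulli ±1 perturbations and decaying gain sequences that are standard choices in the SPSA literature (cf. [24]). The gain constants and the noisy test objective are illustrative placeholders, not the tuned values used in the liquidation experiments of Sec. 5.

```python
import numpy as np

def spsa_minimize(v, d0, n_iter=200, a=0.1, c0=0.1, alpha=0.602, gamma=0.101, seed=0):
    """Minimal SPSA iteration implementing the updates (2.23)-(2.24).

    v  : noisy objective, callable on a k-dimensional parameter vector
    d0 : initial parameter vector d_0
    The gain sequences gamma_n = a / n^alpha and c_n = c0 / n^gamma, and the
    Bernoulli +/-1 perturbations, are common choices; all constants here are
    illustrative.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(d0, dtype=float).copy()
    k = d.size
    for n in range(1, n_iter + 1):
        gain_n = a / n ** alpha                    # step size gamma_n in (2.23)
        c_n = c0 / n ** gamma                      # perturbation size c_n in (2.24)
        delta = rng.choice([-1.0, 1.0], size=k)    # random perturbation Delta_n
        # two-sided simultaneous perturbation: only two evaluations of v per iteration
        g_hat = (v(d + c_n * delta) - v(d - c_n * delta)) / (2.0 * c_n * delta)
        d -= gain_n * g_hat                        # gradient step (2.23)
    return d

# Illustrative use: minimize a noisy quadratic in 5 dimensions.
noisy_quadratic = lambda d: np.sum((d - 1.0) ** 2) + np.random.normal(scale=0.01)
print(spsa_minimize(noisy_quadratic, d0=np.zeros(5)))
```

For the maximization form (2.22), the sign of the update step is simply flipped.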
The next section beings the mainwork by the author on the thesis, and In the following section a detailed financialapplication is developed and explored, and the main result of this work is described.14Chapter 3Finite Horizon PortfolioLiquidation ModelAll processes are defined on a standard probability space (Ω,F ,P). Assume the priceof a single risky asset evolves according to a geometric brownian motion studied in,for example, [6],[2],[27], over the time interval t ∈ [0,T ], where T > 0 is a terminalportfolio liquidation horizon. Define a sigma algebra generated by a one-dimensionalBrownian motion Bt as Ft = σ(Bs : 0≤ s≤ t). Consider an impulse control strategyα = {τn, ζn}n=0,1,..., with each τn an Ft-stopping time and ζn an Fτn-measurablerandom variable, where Fτn = σ(Bs : 0≤ s≤ τn). Each τn indicates a trading impulseinstant, and each ζn indicates a trade amount. In this work only sell actions areconsidered as the agent is trying to liquidate a position and therefore ζn > 0 for alln. Given an impulse control strategy α, the impacted price process we use to studythe liquidation problem is written as a controlled scalar-valued price process Pt s.t.dPt = µ(Pt)dt+σ(Pt)dBt−∑τi≤Tm(ζi), P0 = p, (3.1)where Bt is a standard one-dimensional Brownian motion, p > 0 is a given constant.The market impact function m(·) is a bounded function described in [9]. Assumethe following about the price impact functionAssumption 3.0.1. (Market Impact)1. m(ζ) is a nondecreasing function of trading action ζ > 02. m(ζ1)≥m(ζ2) for all ζ1 ≥ ζ2 > 0153. m(0+) = 04. m(ζ)≥ 0 ∀ζ ≥ 05. The application of any market impact term cannot exceed the asset price s.t.(3.1) is always non-negativeWell-definedness of (3.1) follows from the coefficients of (3.1) meeting the stan-dard Lipschitz and linear growth conditions and the price impact function m(ζi), i=0,1, ... is bounded for all ζi.For now, given any impulse instant τn ∈ [0,T ], assume the following simple instan-taneous reward structurec(Xτ−n ,Yτ−n ,Pτ−n , ζn) = c(Pτ−n , ζn). (3.2)In general, take the following assumptions on c(·, ·)Assumption 3.0.2. (Instantaneous reward function.)1. infζ∈R+c(p,ζ)∼ Lζ > 0, for some L > 0,2. c(·, ·) is an increasing function in the asset price Pt and an increasing,concave function in the sell action ζ.3. c(·, ·) is sufficiently bounded in the price variable to ensure future expectationsare bounded.This assumption will be exploited in later sections to prove structural results ofthe optimal liquidation policy.So far the state of the system involves only the controlled price process. Moti-vated by the portfolio liquidation problem, we now introduce additional state pro-cesses. The portfolio description closely follows [2] up to the formulation of theHJBQVI for the impulse control problem. These states will allow the definitionof admissible impulse strategies and allow formulation of applicable solution algo-rithms.3.1 Liquidation DynamicsBeyond the asset price dynamics, the amount of shares owned by the trading agent asa function of time will need to be considered in order to impose realistic constraints16on the liquidation problem. Call the shares process Yt taking values in R+ ∪{0}.Typically we consider integer numbers of shares of an asset tradable through someexchange, but for the purposes of this work we consider values in some compactsubset of the real line. 
Given the impulse control definition for {τn, ζn}n=0,1,..., thedynamics of Yt is given byYs = Yτn for τn ≤ s < τn+1 and Yτn+1 = Yτn− ζn+1, (3.3)and for Ys = 0,s ≤ T , Yt = 0 for all s ≤ t ≤ T such that when the initial positionin the asset has been liquidated, it will remain liquidated (no purchasing action ispossible).Next define the amount of cash the investor has in pocket at time t ∈ [0,T ]. Letthis value be Xt taking values in R+. The cash in pocket is static between tradingtimes so thatXt =Xτn , τn ≤ t < τn+1, n≥ 0. (3.4)Given an impulse strategy α = {ζk, τk} such that ∆Yt = −ζn+1 occurs at time t =τn+1, then ∆Xt ≡ Xt−Xt− = −∫P (s)dYs. As such, we clearly ignore illiquidityeffects, frequency trading penalties, and bid-ask spreads in this formulation. So forthis impulse control we haveXτn =Xτ−n + ζnPτn =Xτn−1 + ζnPτn , n≥ 1. (3.5)It is clear now that we are only considering impulsive control without a continuouscontrol, as is evident from our control definition and the price dynamics definition(no continuous control terms appear in either the drift or diffusion terms).In order to apply a procedure to find an impulse control strategy in a realisticmarket scenario, a solvency constraint requires defining. Consider a liquidationfunction L(x,y,p) = x+yp, representing a full sell action such that no shares of theasset remain. The first constraint is to require the portfolio’s liquidation value tosatisfy L(Xt,Yt,Pt) ≥ 0 for all t ∈ [0,T ]. Given this criterion, define the solvencyregionS = {(x,y,p) ∈ R+×R+×R+ : L(x,y,p)> 0}, (3.6)17with boundary and closure ∂S = ∂yS ∪∂LS andS¯ = S ∪∂S, (3.7)respectively. Here we have∂yS = {(x,y,p) ∈ R+×R+×R+ : L(x,y,p)≥ 0,y = 0}, and∂LS = {(x,y,p) ∈ R+×R+×R+ : L(x,y,p) = 0}.Given (t,x,y,p) ∈ [0,T ]×S¯, we define an impulse control α as followsDefinition 3.1.1. (Conditions for an Admissible Impulse Control) The couple pro-cess {ζk, τk}k≥1 is admissible if the following properties are satisfied:1. {Xs,Ys,Ps} follow state laws (3.1), (3.3), (3.4), and (3.5) and remain in (3.7)for all liquidation impulses, for all time t≤ s≤ T2. 0< τi < τi+1, i≥ 1,3. τi is an Ft = σ(Bs : 0≤ s≤ k) stopping time for t≥ 0.4. ζi is Ft-measurable, and ζi ≤ Yτi , i ≥ 1, where Ft is the right-continous fil-tration,5. P[limi→∞τi ≤ T]= 0 for all T ∈ [0,∞),Given an initial state St = (Xt,Yt,Pt) = (x,y,p), call the set of all admissiblestrategies A(x,y,p).3.2 HJBQVI Characterization of Optimal PolicyAssociate with the above controlled price process (3.1) an objective functionvα(t,x,y,p) = Ex,y,p[∑τic(Pατ−i, ζi)]. (3.8)Our goal is to maximize our expected return and find a function v∗(t,x,y,p) suchthatv∗(t,x,y,p) = supα∈A(x,y,p)vα(t,x,y,p) (3.9)18and the corresponding optimal impulse strategyα∗ = arg supα∈A(x,y,p)vα(t,x,y,p). (3.10)The standard derivation (see for example [28], [29], [4], or [14]) of the correspond-ing dynamic programming principle shows that the optimal value function (3.9) isthe unique viscosity solution v(t,x,y,p) of the following Hamilton-Jacobi-Bellmanquasivariational inequality (HJBQVI)max{∂v∂t+Lv,Mv−v}= 0, in [0,T )×S¯, (3.11)with terminal conditionmax{v−L(XT ,YT ,PT ),Mv−v}= 0, on {T}×Sˆ. (3.12)As a note on the existence of the sup mentioned in (3.10), first define the followingtheorem:Theorem 3.2.1. (Arzela´-Ascoli Theorem)Given a compact metric space Ω, a subset F ⊂ C(Ω) is the space of continuouscomplex valued functions on Ω. F is compact if and only if F is closed, bounded,and equicontinuous.Proof. 
The proof is a standard result in functional analysis.Since (x,y,p) ∈ S¯ is compact, and we have placed bounds on α s.t. at all in-stances αt = (τi, ζi)⇒ ζi ≤ yt and we can reasonably state that yt < ymax for somefinite real valued constant ymax, then α is a closed and bounded function over acompact space. Showing that α is equicontinuous allows us to say that α is a con-tinous function, and thus the sup exists for all state combinations [0,T ]×S¯.Lemma 3.2.1. (Equicontinuity of Optimal Action Function) The functionα∗(x,y,p) = arg supα∈A(x,y,p)vα(t,x,y,p)is equicontinous, i.e., as α∗ : [0,T ]×S¯ 7→R+, α∗ is said to be equicontinuous at point19x ∈ [0,T ]×S¯ if for every  > 0, x has a neghborhood Ux such thatdR+(α∗(x),α∗(u))< for all u ∈ Ux and α∗ ∈ R+. Here d·(·, ·) is the distance function.Proof. The proof follows from the application of the Uniform Boundedness Principlewhen acknowledging that the optimal value function α∗ is a set of continuous linearoperators.In (3.11), the infinitesimal generator L and intervention operator M are de-fined in (3.14) and (3.16) below. Let L(XT ,YT ,PT ) = CT (YT ,PT ) be the terminalliquidation reward function, which assumes the followingAssumption 3.2.1. (Terminal reward function). CT (·) is a concave function ofthe terminal number of shares YT , and an increasing function of the terminal pricestate PT state s.t.CT (XT ,YT ,PT ) = CT (YT ,PT ),CT (·, ·) : R+×R+ 7→ R,CT (YT + δy,PT )≥ CT (YT ,PT ), ∀YT ,PT ∈ R+, δy ≥ 0 and CT (0,PT ) = 0, (3.13)CT (YT ,PT + δp)≥ CT (YT ,PT ), ∀YT ,PT ∈ R+, δp ≥ 0 and CT (YT ,0) = 0.This can be thought of as a price-dependent, final trade for what has not beensold at the end of the trading interval.Lφ(t,x,y,p) is the differential infinitesimal generator function for the uncon-trolled price process (3.1) which follows the following definitionDefinition 3.2.1. Let Pt be a drift-diffusion process in R+ corresponding to (3.1).The infinitesimal generator L of for a function of Pt is given byLφ(t,x,y,p) = limt↓0Ex,y,p[v(t,Xt,Yt,Rt)]−v(t,x,y,p)tfor all p ∈ R+. Define the set of functions f : R+→ R such that the limit exists atp as DL(p) while DL denotes the set of all functions in which the limit exists for allp ∈ R+.Notice that v(t,x,y,p) ∈ DL from (3.8), (3.9), and (3.1). Since the impulseactions and immediate/terminal rewards are bounded, v(·, ·, ·, ·) is bounded for all20(t,x,y,p) ∈ [0,T ]× S¯. Given Definition 3.2.1 the explicit generator function forv(t,x,y,p) exists and is given in general asLv(t,x,y,p) =12σ(p)2∂2∂p2v(t,x,y,p) +µ(p)∂∂pv(t,x,y,p).Since a geometric Brownian motion process is sought from the process (3.1), thegenerator function is given byLv(t,x,y,p) =12σ2p2∂2∂p2v(t,x,y,p) +µp∂∂pv(t,x,y,p). (3.14)(3.15)Remark: The structural results in Sec. 3 are not dependent on this particularformulation. The functions µ(Pt) = µPt and σ(Pt) = σPt are chosen specifically tomatch the literature consensus on using the geometric Brownian motion to representthe price evolution of a risky asset. Other interpretations are valid, as long as forµ(Pt) and σ(Pt) there exists constants Cµ,Cσ > 0 such that for each x,y ∈ R+|µ(x)−µ(y)| ≤ Cµ|x−y|, |σ(x)−σ(y)| ≤ Cσ|x−y|,and m(·) is bounded, so that SDE (3.1) admits a solution. 
The geometric Brownianmotion trivially satisfies this requirement.The operator M is an intervention operator such that M maps a function v :[0,T ]×S¯ 7→ [0,T ]×S¯ to the function Mv : [0,T ]×S¯ 7→ [0,T ]×S¯ given byMv(t,x,y,p) = supe∈C(x,y,p)v(t,Γ(x,y,p,e)),(x,y,p) ∈ S¯,e ∈ R, t ∈ [0,T ].(3.16)and terminal condition v(T,x,y,p) =CT (y,p). This is the result of the instantaneoussell order ζ at some time t, so that the state process jumps from St− = (x,y,p) toSt = Γ(x,y,p,ζ) = (x+ ζp,y− ζ,p−m(ζ)) ∈ S¯. The admissible impulse action setC(x,y,p) is given byC(x,y,p) ={e ∈ R+ :(Γ(x,y,p,e) ∈ S¯)}.213.3 Main ResultTheorem 3.3.1. Consider the HJBQVI (3.11)-(3.12). Assume 3.0.1, 3.0.2 and3.2.1 hold. The optimal value function v∗(t,x,y,p) defined in (3.9) coincides withthe optimal value function under time, state, and action convergence of a denu-merable Markov Decision Process with the same cost functions and price dynamics.Furthermore, this MDP has an optimal policy that is nondecreasing with respect toshares owned Yt, the price of the underlying asset Pt, and the time accumulatedduring the liquidation process t. Note that the optimal policy of the discrete schemeis not necessarily convergent to the optimal policy of the HJBQVI.Proof. The proof of Theorem 3.3.1 is given in the following steps 1−3:1. A countable state, action, and time MDP exists with transition probabili-ties derived from the distribution of (3.1), and with identical immediate andterminal reward functions as (3.2) and (3.13).2. The truncated trade decision epochs defined by the MDP from Step A ap-proach the HJBQVI impulse trade instances when the truncation is removed,and the optimal value function of the MDP from Step A converges to theoptimal value function of HJBQVI (3.11) under time, state, and action dis-cretization.3. The optimal policy of the MDP is monotone increasing in the shares ownedYt, price of the underlying asset Pt, and time elapsed t, thus giving the desiredstructural result.The details of each step 1−3 of the proof of Theorem 3.3.1 is given in Sec. 4.Showing the HJBQVI has a monotone result is useful in application because itcan be exploited numerically when searching for the optimal policy via a Simulta-neous Perturbation Stochastic Approximation (SPSA) type algorithm, explored inSec. 4.22Chapter 4Proof of Main Result4.1 The Denumerable MDP Representation of theHJBQVIBefore defining the MDP, consider the following state and time discretization forthe solvency region (3.7) and the HJBQVI (3.11).This section adapts a discrete derivation and convergence analysis from [30].Consider a time discretization to the HJBQVI (3.11) as follows. Let the time stepbeh= T/m,m ∈ N\{0} (4.1)and let Tm = {ti = ih, i = 0, ...,m} be the uniform gird over the interval [0,T ]. Sothe time discretization of (3.11) leads to the following backward scheme:vh(ti,x,y,p) = max{E[vh(ti+1,Xti+1 ,Yti+1 ,Pti+1)],Mvh(ti,Γ(x,y,p,ζ))}= max{E[vh(ti+1,Xti+1 ,Yti+1 ,Pti+1)], supζ∈C(x,y,p)vh(ti,Γ(x,y,p,ζ))}(4.2)vh(tm,x,y,p) = max{CT (x,y,p), supζ∈C(x,y,p)CT (y− ζ,p−m(ζ))}. (4.3)for i = 0, ...,m− 1, tm = T , and (x,y,p) ∈ S¯ . Since this definition is implicit due23to the nonlocal obstactle M, the usual way to remedy this issue is to introducean iterative scheme by considering a sequence of optimal stopping problems (seementions of this, for example, in [30], [31], [2]) s.t. for k ∈ N\{0}:vh,k+1(ti,x,y,p) =max{E[vh,k+1(ti+1,Xti+1 ,Yti+1 ,Pti+1)], supζ∈C(x,y,p)vh,k(ti,Γ(x,y,p,ζ))}(4.4)vh,k+1(tm,x,y,p) =max{CT (x,y,p), supζ∈C(x,y,p)CT (y− ζ,p−m(ζ))}. 
(4.5)This iterative scheme is used temporarily to handle the implicit scheme by approxi-mating the discretized value function, and later discarded by taking many iterations.A more detailed use of this approach is discussed in the proofs section (Appendix B).For state discretization, define a finite, localized subset of the admissible statespace as a uniform grid, denoted byS¯loc = S¯ ∩X ×Y ×P, (4.6)whereX = [xmin,xmax], Y = [ymin,ymax], P = [pmin,pmax]. (4.7)Here, X has increments of size (xmax− xmin)/l, l ∈ N\{0}, and similar for Y, P.Each state variable does not need to be have the same discretization parameter l,but for ease in notation and implementation assume they are equivalent. Define theregular gridZl = {(x,y,p) ∈ X ×Y ×P : (x,y,p) ∈ S¯loc}. (4.8)Define a similar grid for the admissible controlsCM,R(x,y,p) = {ζi = ζmin+iM(ζmax− ζmin) : 0≤ i≤M,Γ(x,y,p,ζi) ∈ S¯loc}, (4.9)24where ζmin < ζmax ∈ R+ and M ∈ N\{0} are fixed constants, andR := min(|xmin|, |xmax|, |ymin|, |ymin|, |pmax|) . (4.10)We assume the projection on the price variableΠ[0,pmax] : R+→ [0,pmax] (4.11)p→ p1[0,pmax] +pmax1]pmax,+∞[ (4.12)in order to constrain the price to pmin = 0 and pmax to some fixed positive constant.Next, approximate the expectation value from (4.4) and by the following Karhunen-Loe`ve quantizer method (again, adapted from [30])E[vh(ti+1,Xti+1 ,Yti+1 ,Pti+1)]∼EN,R[vh(ti+1,Xti+1 ,Yti+1 ,Pti+1 ] =Nd∑i=1Pi1,...,id(N)vh,n(t,Z0,s,x,y,pN,R (t)) ∀s≤ t(4.13)whereZ0,s,x,y,pN,R (t) =x,y,∏p∈[pmin,pmax](p exp{(µ− σ22 )(t−s) +σWNi1,...,id(N)(t−s)}) ,WNi1,...,id(N)(t) =d(N)∑k=1√2Tpi(k− 12)xik sin(pitT(k−12)),1≤ ik ≤Nk,d(N)∏k=1Nk =N,andPi1,...,id(N) =d(N)∏j=1P(xij ).Here (xij ) is theNk quantizer of the standard normal distribution.WNi1,...,id(N)(t) isthe quantizer of sizeN that is the optimal decomposition ofN =N1×N2× ...×Nd(N)giving the optimal d(N) and (Nk) for the given N . Pi1,...,id(N) is the weight associ-ated with the quantizer xik and is the product of the weights associated with the25quantization of the normal distribution. The optimal grid (xin) and weights Pi1,...,idcan be found online at http://www.quantize.maths-fi.com/downloads.In summary, we have the following notation for the approximating scheme tothe HJBQVI (3.11)1. h : the time step2. n : iterative scheme index3. N : the Brownian motion discretization parameter4. M : the number of admissible transactions5. R : the boundaries of the localized, discretized solvency region S¯loc.It is now possible to write the final iterative approximation scheme for theHJBQVI using the new expectation value approximation definition in (4.4) and(4.5)vh,n+1(ti,x,y,p) =max{EN,R[vh,n+1(ti+1,Xti+1 ,Yti+1 ,Pti+1)], supζ∈C(x,y,p)vh,n(ti,Γ(x,y,p,ζ))}(4.14)vh,n+1(tm,x,y,p) =max{CT (x,y,p), supζ∈C(x,y,p)CT (y− ζ,p−m(ζ))}. (4.15)starting from vh,0 = EN,R[CT (Y0,t,x,y,pT ,P0,t,x,y,pT )].Rewrite the discrete time and state impulse operator as follows:MM,Rvh(ti,x,y,p) = supζ∈CM,R(x,y,p)v(ti,Γ(x,y,p,ζ)),(x,y,p) ∈ Zl, ti ∈ Tm.(4.16)Next, construct the countable state, countable action MDP to approximate theHJBQVI (3.11) through emulation of the discrete scheme (4.2) and (4.3), using26the time, state, and action discretizations defined above. Let the controlled MDPcash, shares, and price process triplet be {Sn} = {Xn,Yn,Pn} for n ∈ Tm = {ti =ih, i= 0, ...,m}, where tm = T is the finite liquidation horizon. WLOG rename timeintervals s.t. n ∈ Tm := {0,1, ...,T}. Let Sn ∈ Zl be the set of all possible statesat time n, assuming the same discretization as in (4.8). 
Let the control space atepoch n be CM,R(Sn),n = 0,1, ...,T − 1 as defined in (4.9). Let {αn},n = 1, ...,Tbe a control process consisting of the liquidating agent’s control decisions at eachepoch n= 0,1, ...,T −1, with αn(x,y,p) ∈ CM,R(x,y,p) . Let Pn(i, j,αn), i ∈ Zl,αn ∈CM,R(x,y,p), j ∈ Zl,n= 0,1, ...,T −1 be the probability of transitioning from state ito state j given that action a has been made in period n, given byPn(i, j,a) = P(Pn+1 = j|Pn = i,αn = a,Fn) = p(i, j,a). (4.17)with Fn = σ(Bm : 0≤m≤ n), with reference to the density function implied by (3.1).The solvency constraint is automatically satisfied as all control actions αn(x,y,p),(n,x,y,p)∈{0, ...,T}×Zl belong to the discrete admissible policy set CM,R(x,y,p).An admissible policy α is a set of T functions α= {α0(x0,y0,p0),α1(x1,y1,p1), ...,αT−1(xT−1,yT−1,pT−1}, with αn(xn,yn,pn) : Zl 7→ CM,R(xn,yn,pn). So the investorusing policy α in state (x,y,p) at epoch n would apply action αn(x,y,p). Comparethis strategy α in the discrete problem to the impulse strategy α = {τ0, ζ0, ...} - inthe discrete case there are now predetermined, truncated action epochs, so keepingtrack of impulse instances is no longer needed.Let cn(i, j,αm) ∈ R+ denote the immediate reward at time n when the state isi= (x,y,p)∈Zl, action α∈ CM,R(i) is taken, and the state transitions to state j ∈Zl.At the end of the liquidation period at interval T , a final reward will be gained bythe investor for all remaining shares of the asset in the portfolio.Define the following MDP components for the portfolio liquidation problem:Definition 4.1.1. (Countable state, countable action, discrete time MDP compo-nents for the portfolio liquidation problem)For (x,y,p) ∈ Zl,1. Decision Epochs: n ∈ Tm,2. States: (x,y,p) ∈ S¯loc,3. Actions: αn(x,y,p) ∈ CM,R from (4.9), n ∈ Tm,274. Immediate Rewards: cn(x,y,p,αn(x,y,p)) ∈ R+, with Assumption 3.0.25. Terminal Reward: CT (x,y,p) = CT (y,p) ∈ R+, with Assumption 3.2.16. Transition Probabilities: Pn(i, j,a) defined in (4.17)4.2 Convergence of the Approximating SchemeWe will show in this section that this countable state MDP can be used as an ap-proximationg model for the same liquidation problem as the HJBQVI. This sectionuses a large part of [30] to show convergence analysis results, and keeps notation con-sistent. As a result, this section summarizes convergence results for approximatingschemes to the HJBQVI described by the authors.First , the impulse instance truncation convergence is shown. Assume the sametruncation scheme (4.18), and the corresponding value function vn, which describesthe value function when the trading agent is allowed at most up to n impulse op-erations In the following lemma, the convergence of the value functions vN towardsthe initial value function v is shown.Lemma 4.2.1. (Impulse truncation convergence) For all (t,x,y,p) ∈ [0,T ]×S¯ andl ∈ N\{0}, the truncated value function defined in (4.19) converges to the valuefunction of the non-truncated HJBQVI (3.9) such thatliml→∞vN (t,x,y,p) = v(t,x,y,p).Proof. See Appendix B.Theorem 4.2.1. Define φn(t,x,y,p) iteratively as a sequence of optimal stoppingproblems s.t.φn+1(t,x,y,p) = supτ∈Tt,TE[Mφn(τ,X0,t,x,y,pτ )],φ0(t,x,y,p) = v0(t,x,y,p),where Tt,T is the set of all Fs-stopping times in [t,T ], t≤ s≤ T . Thenφn(t,x,y,p) = vn(t,x,y,p).Proof. 
See Theorem 3.1 of [30].28Given j ∈ N\{0}, introduce the following subsets of A(x,y,p), the set of theadmissible impulse control strategies:Aj(x,y,p) := {α= (τi, ζi)i=0,...,j ∈ A(x,y,p)}, (4.18)and the corresponding value function vj , which describes the value function whenthe trading agent is allowed at most up to j impulse operations:vα,h,N,M,Rj (t,x,y,p) = Ex,y,p∑τi≤τjc(Pατ−i, ζi) , (t,x,y,p) ∈ [0,T ]×S¯.Now, given the state, time, action, and decision epoch truncations of the MDPcomponents, the MDP value function can be written asvα,h,N,M,R(n,x,y,p) = Ex,y,p,α{T−1∑n=0[cn(Xn,Yn,Pn,αn(Xn,Yn,Pn)] +CT (YT ,PT ))},where α= (n,αn)n=0,1,...,T ∈AT (x,y,p) and (t,x,y,p)∈ [0,T ]×S¯loc. This is a similarvalue function as in the continuous scenario (3.9), but in this altered framework thereare only a finite number of decision epochs while having a continuum of impulseintervention times in the previous interpretation. Note that in the time discretizationand trade interval truncation limit (3.9) is obtained. The goal is to find an admissiblestrategy α∗ s.t.vα∗,h,N,M,R(n,x,y,p) = maxα∈AT (x,y,p)vα,h,N,M,R(n,x,y,p)(x,y,p) ∈ Zl,α ∈ CM,R(x,y,p),(4.19)andα∗(n,x,y,p) = arg maxα∈AT (x,y,p)vα,h,N,M,R(n,x,y,p)(x,y,p) ∈ Zl,α ∈ CM,R(x,y,p).(4.20)Theorem 4.2.2. (Local Uniform Value Function Convergence)For all (t,x,y,p) ∈ [0,T )×S¯, l ∈N\{0}, given discretization definitions (4.8), (4.9),(4.10), and (4.1), the value function vh,R,N,Ml (t,x,y,p) on Tm×Zl of the discrete,epoch truncated MDP (4.19) which emulates the control structure of the HJBQVIdiscrete scheme (4.2) and (4.3) converges locally uniformly to the viscosity sub- and29supersolution v¯ and¯v, respectively, of the value function v(t,x,y,p) of the HJBQVI(3.9), which are equivalent s.t. v¯ =¯v = v(t,x,y,p), i.e.,lim(t′,x′,y′,p′)→(t,x,y,p)(h,M,N,R)→(0,+∞,+∞,+∞)(t′,x′,y′,p′)∈Tm×Zlvh,N,M,Rl (t′,x′,y′,p′) = v(t,x,y,p). (4.21)Proof. See proof in Appendix B which follows the proof methodology of [30], or seealso [2], [27], or [32] for similar treatments.It is now possible to define a countable state, countable action MDP whosevalue function converges to the value function of the original HJBQVI (3.11), sincewe have convergence in truncated sell instances from Lemma 4.2.1, and convergencein state, action, and time discretizations from Theorem 4.2.2. In the next sectionwe use these results to determine the structure of the optimal policy (4.20) of theMDP discretization scheme (Definition 4.1.1) for the HJBQVI.4.3 Optimal Policy StructureThe optimal liquidation total reward function and optimal liquidation strategyvα∗n (x,y,p) and α∗ = {α∗0,α∗1, ...,α∗T−1}, respectively are obtained from the follow-ing equations, where each decision rule α∗n,n = 0,1, ...,T − 1 are the solutions toBellman’s recursive equation, which can be rewritten and solved using a backwardsinduction algorithm s.t.vn(x,y,p) = maxαn∈CM,R(x,y,p)Qn(x,y,p,αn), (4.22)α∗n(x,y,p) = arg maxαn∈CM,R(x,y,p)Qn(x,y,p,αn), (4.23)where Q(·, ·, ·, ·) is the state-action reward function given byQn(x,y,p,αn) ={c(p,αn)+∑j∈Zl[Pn(j,(x,y,p),αn)×vn+1(x+pαn,y−αn,pj)]},(4.24)and with vT (x,y,p) = CT (y,p).The following structural result has been adapted from [11], where the probability30of receiving an investment opportunity is given as a condition on the HJBQVI itself.At the initial time n= 0, there are y0 = y¯ shares to liquidate over the time inter-val Tm = {0,1, ...,T}. 
At each integer epoch n ∈ Tm the trader decides how manyshares to sell αn ∈ [0,yn],0≤ yn≤ y¯, where yn is the number of shares owned at epochn. The reward for selling αn shares is given by (5.5), and the admissible actionsare given by AT (x,y,p) defined in (4.18). Apply assumption 3.0.2 and c(p,0) = 0.WLOG assume a constant price p and write c(α,p) = c(a,p) = c(a). When a timevariable is used in an expression such as vn(xn,yn,pn), then the subscripts are gen-erally ignored and written as vn(x,y,p). Let vn(x,y,p) denote the maximal expectedadditional profit attainable when there are T −n decision epochs to go and y sharesof the asset left in the portfolio, when a selling opportunity is at hand.From the HJBQVI definition (3.11), a selling opportunity is at hand whenmax{∂v∂t+Lv,Mv−v}=Mv−v = 0. (4.25)In the discrete time scheme, let the probability that a selling opportunity is at handbe q be given by the fraction of the time between decision epochs i, i+ 1 ∈ Tm s.t.condition (4.25) holds. Thenqn ∈ [0,1],n ∈ Tm. (4.26)For simplicity let qn = q for all n ∈ Tm, as the numerical value of q has no impacton the structural argument.The function vn(·, ·, ·, ·) satisfies the optimality equationvn(x,y,p) = max0≤α≤y{c(α,p) + v¯n+1(x,y−α,p−m(α))} , n ∈ Tm,vT (x,y,p) = CT (y,p),vT+1(x,y,p) = 0,(4.27)wherev¯m(x,y,p) =T∑i=T−mq(1− q)ivm−1(x,y,p).Here v¯m is the expected maximal addition sum of rewards when y asset shares31remain to sell at price p, there are T −m decision epochs to go, and it is not yetknown if a selling opportunity is available. By conditioning on whether or not aselling opportunity occurs, obtainv¯m(x,y,p) = qvm(x,y,p) + (1− q)v¯m+1(x,y,p).Begin by showing that v inherits the concavity property of c(·, ·) in y.Lemma 4.3.1. (Concave Value Function)Given the discrete MDP (Definition 4.1.1) with immediate and terminal reward func-tions defined by Assumptions 3.0.2 and 3.2.1 and price impact function defined byAssumption 3.0.1, the value function vn(x,y,p) defined in (4.27) is a nondecreasing,concave function of y.Proof. See Appendix B.Next let the process αn(x,y,p) to be the value of α (or the smallest value ofα if there is more than one) that maximizes the right hand side of (4.27). Thenαn(x,y,p) is the optimal number of shares to sell at time epoch n when there are yshares remaining in the portfolio, a selling opportunity is present, and the price ofthe asset is p. Theorem 4.3.1 below gives the structure of the optimal policy, butfirst consider the following assumptionAssumption 4.3.1. (Probability of Incurring Market Impact) Assume that after asell action a is made, the market impact term m(a) : R+→ R+ does not occur withprobability 1. Instead, assume that there is a function f(a,p) : CM,R×Zl∩P → [0,1]given byfm(a,p) =0 if a= 0pm(p) if a > 0,(4.28)where pm(·) : Zl ∩P → [0,1] is the market impact probability function, given that asell action has been made.Theorem 4.3.1. (Optimal Policy Structure)Assume the MDP (Definition 4.1.1) with market impact function m(·) defined byAssumption 3.0.1 and with immediate and terminal reward functions defined byAssumptions 3.0.2 and 3.2.1. If the MDP has a value function defined by (4.22)32that is a nondecreasing and concave function of y and is convergent to the valuefunction of HJBQVI (3.11) by Lemma 4.2.1 and Theorem 4.2.2, having investmentopportunity parameter qn defined in (4.26), and market impact probability functiongiven by Assumption 4.3.1, then the optimal policy (4.23) of MDP (Definition 4.1.1)has the following structural properties:(i.) 
α∗n(x,y,p) is a nondecreasing function of y,(ii.) α∗n(x,y,p) is a nondecreasing function of n.(iii.) α∗n(x,y,p) is a nondecreasing function of p.Proof. See Appendix B.Intuitively, Theorem 4.3.1 states that the number of shares sold at a time n willincrease if there are more shares owned, since the end goal is a complete liquida-tion by the horizon. Similarly, if the number of decision epochs that have passedincreases, the investor will want to sell a larger number of shares to meet the liq-uidation deadline. Moreover, a higher asset price will entice the decision maker tomore readily sell more shares than if the price is lower.Lemma 4.3.1 and Theorem 4.3.1 state that as long as the instantaneous and ter-minal reward functions follow Assumptions 3.2.1 and 3.0.2 (are nondecreasing, con-cave functions of the trading action), then the optimal liquidation actions are nonde-creasing in the number of shares held Y , and nondecreasing in the total time elapsedn. Lemma 4.2.1 and Theorem 4.2.2 shows that the countable state, countable ac-tion MDP from definition 4.1.1 is a valid approximation as it adheres to the impulsetruncation, state, action, and time discretizations, and converges to the originalHJBQVI (3.11) when discretization parameters (h,M,N,R)→ (0,+∞,+∞,+∞).As such, it has been shown that the HJBQVI has the same value function as adiscretized MDP with a monotone optimal policy when converged.33Chapter 5Numerical Results5.1 Finite Difference Approximations: Ground TruthEstimatesTo solve the HJBQVI problem numerically, an appropriate discretization is needed.Let the localized problem conform to the uniform grid DL≡ [x0, ...,xL]× [y0, ...,yL]×[p0, ...,pL] where x0≡ xmin,xL≡ xmax,y0≡ ymin,etc. A similar discretization schemeis used as in [27] where∂hp =v(x,y,p+h,t)−v(x,y,p)h∂2,hpp =v(x,y,p+h,t) + 2v(x,y,p)−v(x,y,p−h)h2,with (x,y,p+h),(x,y,p), and (x,y,p−h) ∈ S¯. Consider the time discretization {ti}with t0 = 0 and tN = T such that i = TN . A new discrete generator function isobtainedLhφ(s) =−12σ2p2∂2,hpp φ(s)−µp∂hpφ(s). (5.1)The state discretized HJBQVI then becomesmax{∂v∂t+Lhv,v−Mv}= 0, in R+. (5.2)34Assume zero Neumann boundary conditions on the boundary such that∂v∂p(x,y, ·,pmax) = 0, ∀x,y ∈ [x0, ...,xL]× [y0, ...,yL]. (5.3)The localized, discretized problem (5.2) can be solved using the following iterativemethod over discrete time steps:LhvT = 0 ∀p ∈ [pmin,pmax],max{vn+1−vn+Lvn+1,Mvn−vn+1}= 0, ∀n≥ 0,(5.4)with boundary condition (5.3) and constraintmax{vT −CT (YT ,PT ),MvT−1−vT }= 0, on {T}×Sˆ.The right hand side of (5.2) is equated to zero and the value function is solved for,with results given in Fig. 7.1, with values used in the iteration process. The shapeof this value function can be compared to that of the discrete MDP, shown in theSec. 4-B, and demonstrates the convergence of the value function studied in Lemma4.2.1 and Theorem 4.2.2.5.2 Countable State, Action, and Time MDPTo derive the dynamic programming algorithm for the countable state, action andtime finite horizon MDP [12] is used, which is the discrete time equivalent to theHJBQVI. To develop this algorithm, first the transition probabilities are defined.Given that the system in state (xn,yn,pn) at epoch n, the probability of tran-sitioning to the state (xn+1,yn+1,pn+1) = (x¯, y¯, p¯), given that at time n the actionαn = a∈An(x,y,p) was applied, is obtained by solving for the diffusion density whilealso considering the deterministic, permanent price impact function. 
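The explicit lognormal transition density is written out in the next paragraph. As a computational aside, the following is a minimal sketch of how one such transition row could be tabulated on the truncated price grid; it is not the implementation used for the reported results, and the permanent-impact function m(·), the uniform grid, and all parameter values are illustrative assumptions only.

```python
# Hedged sketch: tabulating one row of the one-step price transition
# probabilities Pn(p_bar, p, a) for the discretised MDP.  The impact
# function m and the grid below are illustrative assumptions.
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def transition_probs(p, a, price_grid, mu, sigma, dt, m):
    """Probability of moving from price p to each bin of a uniform
    price_grid after selling a shares: the post-impact price p - m(a)
    is treated as geometric Brownian motion over one interval dt."""
    p_eff = p - m(a)                                  # permanent impact
    assert p_eff > 0, "admissible actions keep the post-impact price positive"
    half = 0.5 * (price_grid[1] - price_grid[0])      # half bin width
    probs = []
    for p_bar in price_grid:
        lo, hi = max(p_bar - half, 1e-12), p_bar + half
        # ln(P_{n+1}/p_eff) ~ N((mu - sigma^2/2) dt, sigma^2 dt)
        z = lambda u: ((math.log(u / p_eff) - (mu - 0.5 * sigma ** 2) * dt)
                       / (sigma * math.sqrt(dt)))
        probs.append(norm_cdf(z(hi)) - norm_cdf(z(lo)))
    total = sum(probs)
    return [q / total for q in probs]                 # renormalise over the grid

# Illustrative call, with values in the spirit of Table 7.1:
grid = [float(p) for p in range(1, 61)]
row = transition_probs(p=10.0, a=5.0, price_grid=grid,
                       mu=0.1, sigma=0.3, dt=1.0, m=lambda a: 0.1 * a)
```

Each row is renormalised so that the probability mass falling outside the truncated grid is not lost, in the spirit of the localisation of the state space used throughout.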
Since (3.1) ad-mits a unique solution (the coefficients of the stochastic differential equation areLipschitz and following linear growth [33] (Sec. 5.2)), given Fn = σ(Bm : 0≤m≤ n)the transition probabilities are given by35P((xn+1,yn+1,pn+1) = (x¯, y¯, p¯)|(xn,yn,pn) = (x,y,p),αn = a,Fn)= P(Pn+1−Pn = p¯−p+m(a))=1σ√2pi∆np¯+i∫p¯−i1uexp−(ln(u)− ln(p−m(a))− (µ− σ22 )∆n)22σ2∆ndu≡ Pn(p¯,p,a)where ∆n is the time difference between decision epochs n+1 and n (if not equal to1). Denote p+i and p−i to be the upper and lower values of the price discretizationp ∈ {p0,p1, ...,pT } s.t. p+i = pi+1/(2T ) and p−i = pi−1/(2T ). Note that if no actionis taken at decision epoch n and defining the market impact term m(a) = 0 if a= 0,the transition probability becomesP((xn+1,yn+1,pn+1) = (x¯, y¯, p¯)|(xn,yn,pn) = (x,y,p),αn = 0,Ft)= P(Pn+1−Pn = p¯−p+m(0))= P(Pn+1−Pn = p¯−p)=1σ√2pi∆np¯+i∫p¯−i1uexp−(ln(u)− ln(p− (µ− σ22 )∆n)22σ2∆ndu≡ Pn(p¯,p,0)So given an action a in state p, the probability of successfully transitioning to anew state p¯ is given by Pn(i, j,a), where Pn(i, j,0) < Pn(i, j,a), for all i, j ∈ Zl,n ∈{0,1, ...,T −1} and a0≥ a1⇒ Pn(p¯,p,a0)≥ Pn(p¯,p,a1) under the natural assumptionthat m(a0)≥m(a1).36Assume that the instantaneous reward function is given by cn(p,a) = cpa for apositive constant c. Similarly, the final reward function is given by CT (YT ,PT ) =CYT for another positive constant C. Note that these functions adhere to ourassumptions 3.0.2 and 3.2.1. A discounting factor δ is used to attenuate future ex-pected rewards, which does not affect the structural result. The discounted optimalvalue function is given byVn(s) = maxa∈A(s)Qn(x,y,p,a) = maxa∈A(s){c(p,a)++ δ×∑pj∈Zl[Pn(pj ,p,a)×Vn+1(x+pa,y−a,pj)]}.Assume a uniform discretization for each state variable so that hx = hy = hp = h.Table 1 in Appendix A provides a list of parameter values used in this simulation.Adjust Assumption 3.0.2 to the immediate reward function as followsAssumption 5.2.1. (Concave immediate reward function)Assume an immediate reward function cn(x,y,p,αn(x,y,p)) = cn(p,αn(x,y,p)) equiv-alent to (3.2), which is nondecreasing concave in α(·).For example, c(·, ·) can be written as an intuitive ”reward with transaction cost”functionc(Pn,α(·)) = Pnα(·)−K(α(·)), (5.5)where K(a) is a variable transaction cost incurred when initiating a trade of quan-tity a > 0. It is clear that cn(·, ·) is an increasing function in pn and increasing insell action a if K(a)< pna. This reward structure will be used to simulate the MDPin Sec. 4.The liquidation horizon is taken as an arbitrary amount of time from the initialstate, and has arbitrary units. The drift and diffusion coefficients, as well as themin/max state values are chosen to approximately emulate the values used in Test1 of [2]. The instantaneous reward is taken as unity to match the expectation thatselling a share at a given price will give you exactly that price in return. Theterminal reward is chosen as less than unity to coerce the trader into pre-terminaltrades. This emphasizes the optimal action structure as is shown below in Fig. 7.637The transition probabilities for various actions (ranging from the ”no trade”action to the ”maximum trade” action) are represented in Fig. 7.2 and Fig. 7.3.Notice that the transition probabilities begin to cluster around a final price whichis lower when larger and larger trade actions are applied. 
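As a companion to the sketch above, the discounted backward recursion just written down can be implemented directly. The following minimal sketch reuses the hypothetical transition_probs helper and drops the cash account x from the state, since the rewards cn(p,a) = cpa and CT(YT,PT) = CYT used in this section do not depend on it; grid choices and parameter values are again illustrative, and this is not the code used to produce the figures.

```python
# Hedged sketch of the discounted backward recursion V_n, reusing the
# transition_probs helper sketched above.  The shares grid is assumed to
# be the integers 0, 1, ..., y_max, so that index (y - a) corresponds to
# holding y - a shares after selling a.
def backward_induction(T, y_max, price_grid, mu, sigma, dt, m,
                       c=1.0, C=0.6, delta=0.9):
    """Returns value[n][y][pi] and policy[n][y][pi] for n = 0, ..., T-1."""
    n_p = len(price_grid)
    shares_grid = list(range(y_max + 1))
    # terminal reward C_T(y, p) = C * y, independent of the price here
    value = {T: [[C * y for _ in range(n_p)] for y in shares_grid]}
    policy = {}
    for n in range(T - 1, -1, -1):
        value[n] = [[0.0] * n_p for _ in shares_grid]
        policy[n] = [[0] * n_p for _ in shares_grid]
        for y in shares_grid:
            for pi, p in enumerate(price_grid):
                best_q, best_a = float("-inf"), 0
                for a in range(y + 1):                # candidate sells 0..y
                    if p - m(a) <= 0:                 # outside the solvency region
                        continue
                    probs = transition_probs(p, a, price_grid, mu, sigma, dt, m)
                    cont = sum(q * value[n + 1][y - a][pj]
                               for pj, q in enumerate(probs))
                    q_val = c * p * a + delta * cont  # state-action reward Q_n
                    if q_val > best_q:
                        best_q, best_a = q_val, a
                value[n][y][pi] = best_q
                policy[n][y][pi] = best_a
    return value, policy
```

(For efficiency the rows of transition_probs could be precomputed once per action rather than inside the loop.) Returning to Fig. 7.2 and Fig. 7.3: the clustering of probability mass at lower final prices for larger actions is visible directly in such tabulated rows.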
This coincides with ourassumption that large trade orders have a greater negative impact on the asset pricecompared with smaller trades. On the other hand, a ”no trade” action allows theprocess to follow a typical geometric brownian motion transition pattern (in thiscase with a positive drift term), and transitions from lower prices to higher pricesare more common. These transition probabilities are now used in the discrete timeMDP to find optimal actions dependent on number of trade intervals remaining andthe number of shares owned by the trading agent.Fig. 7.4 and Fig. 7.5 show the form of the value function over varying price andtime taking snapshots in shares owned. As was shown in Lemma 4.3.1, the valuefunction is monotonically decreasing in time and monotonically increasing in thenumber of shares owned.Finally, the corresponding liquidation strategies are given in Fig. 7.6 and Fig.7.7, in cross sections of constant price. The anticipated threshold nature of theoptimal policy is apparent, and dependent on the price process. The shape of thepolicy appears exponential in both dimensions, and this fact will be exploited in thenext section by using the SPSA algorithm.5.3 Policy Search Algorithm for Unknown PriceEvolution StructureUp to this point, the state transition matrix is assumed to governed solely by known,constant drift and diffusion coefficients µ and σ. In practice, these coefficients maybe difficult to estimate effectively. An on-line machine learning type algorithm isproposed to aid the agent in formulating an optimal threshold strategy under thisuncertainty. This is done by approximating the parameters which describe the op-timal threshold policy. Here we assume the MDP (Definition 4.1.1) has immediateand terminal reward functions defined by Assumptions 3.0.2 and 3.2.1, respectively,and has a value function (4.22) that is a nondecreasing and concave function of y andp and convergent to the value function of HJBQVI (3.11) with investment oppor-tunity parameter qn defined in (4.26). Under these conditions, from Theorem 4.3.1the optimal policy α∗n(x,y,p) defined in (4.23) is a nondecreasing function of p, y,38and n, and a suitable policy gradient algorithm can be used to estimate the param-eters of the underlying threshold policy. The Simultaneous Perturbation StochasticApproximation (SPSA) algorithm is utilized, a gradient estimation methodologywhich allows the agent to apply the dynamic programming principles described inprevious sections under this new assumed uncertainty. Detailed information aboutthis algorithm can be found in [24], and its use in this thesis has been adaptedfrom [34]. As opposed to other stochastic approximation methods based on finitedifference-based algorithms, SPSA is beneficial because it reduces the number of lossmeasurements required to obtain a specified level of accuracy in the optimizationprocedure. When applying an optimization process to optimal action in financialmarkets, the difference in efficiency of the algorithm can lead directly to monetaryloss for the agent and as such the quickest method is crucial. 
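Before stating the procedure formally, a brief code-level sketch of the simultaneous-perturbation idea may be helpful: two noisy evaluations of the trade cost, taken at simultaneously perturbed parameter vectors, yield a gradient estimate for every component at once. The simulator simulate_trade_cost(phi) below is an assumed black box (it would run the liquidation under the threshold policy with parameter vector phi and return the resulting cost); it and the step-size constants are illustrative, not part of the thesis model.

```python
# Hedged sketch of one SPSA iteration, in the spirit of Algorithm 1 below.
# simulate_trade_cost is an assumed, hypothetical black-box simulator.
import random

def spsa_step(phi, n, simulate_trade_cost,
              Delta=0.1, eps=0.05, gamma=0.6, beta=0.6, b=10.0):
    """Perturb all components of phi simultaneously with a random +/-1
    direction, form a two-measurement gradient estimate, and take one
    stochastic-approximation step."""
    delta_n = Delta / (n + 1) ** gamma              # perturbation size
    eps_n = eps / (n + 1 + b) ** beta               # decreasing step size
    omega = [random.choice((-1.0, 1.0)) for _ in phi]
    cost_plus = simulate_trade_cost([p + delta_n * w for p, w in zip(phi, omega)])
    cost_minus = simulate_trade_cost([p - delta_n * w for p, w in zip(phi, omega)])
    grad = [(cost_plus - cost_minus) / (2.0 * delta_n * w) for w in omega]
    return [p - eps_n * g for p, g in zip(phi, grad)]

# Illustrative use with the six-parameter exponential threshold model
# fitted later in this section:
# phi = [1.0] * 6
# for n in range(1000):
#     phi = spsa_step(phi, n, simulate_trade_cost)
```

Only two cost evaluations are needed per iteration regardless of the dimension of phi, which is the efficiency advantage emphasised above.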
Algorithm 1 belowuses the SPSA algorithm to generate a sequence of estimates φˆn,n = 1,2, ..., thatconverges to a local maximum of the optimal threshold descriptor φ∗ with policyαφ∗(x,y,p) for the discrete MDP used to describe the optimal portfolio liquidationprocedure in Definition 3.1.39Algorithm 1 SPSA Algorithm for Computing a Threshold PolicyAssume that the optimal liquidation policy is characterized by a threshold switch-ing curve as described in the previous section.Step 1: Choose initial threshold coefficients φˆ0 and threshold policy αφˆ0 .Step 2: For iterations n= 0,1,2, ...,1. Evaluate trade cost vn(φˆn) computed fromv∗T (sN ) = CT (YT ,PT ), ∀sN ∈ S¯and compute the gradient estimate∇ˆφvn(φˆn) =vn(φˆn+ ∆nωn)−vn(φˆn−∆nωn)2∆nωn,ωn ={−1 with probability 0.5+1 with probability 0.5.Here, ∆n = ∆/(n+ 1)γ denotes the gradient step size with 0.5≤ γ ≤ 1 and∆> 0.2. Update threshold coefficients φˆn via the stochastic approximation algorithmφˆn+1 = φˆn− n+1∇ˆφvn(φˆn),n = /(n+ 1 + b)β,0.5≤ β ≤ 1, , b > 0.(5.6)The above SPSA algorithm picks a single random direction ωn along whichdirection the derivative is evaluated at each batch n. Unlike the Kiefer-Wolfowitzfinite difference algorithm to evaluate the gradient estimate ∇ˆφvn in (5.6), SPSArequires only 2 batch simulations, i.e., the number of evaluations is independent ofdimension of parameter φ. Because the stochastic gradient algorithm (5.6) convergesto local optima, it is necessary to try several initial conditions φˆ0. The computationalcost at each iteration is linear in the dimension of θ and is independent of theobservation size.For fixed θ, the samples vn(µθ) are simulated independently and have identicaldistribution. Thus, the proof that θn = θφˆn generated by Algorithm 1 convergesto a local optimum of E[vn(µθ)] with probability one, is a straightforward applica-tion of techniques in [35](which gives general convergence methods for Markoviandependencies).40The model used to fit the threshold structure of the optimal policy is a tripleexponential with maximal points at time T , ymax, and pmax (since the policy ismonotone increasing in each state variable) s.t.A(t,y,p) =∥∥∥∥(θ1eθ2(y−ymax),θ3eθ4(t−T ),θ5eθ6(p−pmax))∥∥∥∥,where ‖ · ‖ is the L2-norm. So φ= {θ1,θ2,θ3,θ4,θ5,θ6}. The sub-optimal monotonepolicy resulting from this algorithm for the unknown parameter scenario are givenin Fig. 7.8 and Fig. 7.9, and can be compared to the optimal policy simulatedearlier in Fig. 7.6 and Fig. 7.7.41Chapter 6ConclusionA Hamilton-Jacobi-Bellman quasivariational inequality in a optimal portfolio liq-uidation scenario has been formulated with immediate market impact considered.This continuous time, state, and action space problem has been discretized usinguniform grids, and approximated using a countable MDP that can be solved usinga backward induction algorithm. The optimal decision rule and value function areexplored and it is found that the value function is concave, nondecreasing in thenumber of shares owned, the price of the underlying asset, and the time until theliquidation horizon. Further, the optimal policy is found to be nondecreasing inthe number of shares owned, the asset price, and the time elapsed. Finally, it isshown that the monotonic and threshold characteristics of the discretized decisionprocess are maintained when converging to the HJBQVI, thus presenting a methodfor demonstrating optimal policy structural characteristics of an otherwise compleximpulse control problem. 
In a situation where it is unsure how to model the as-set price evolution (i.e., the model’s drift and diffusion coefficients are unknown),a reinforcement learning technique called the Simultaneous Perturbation Stochas-tic Approximation algorithm is applied. This algorithm is able to approximate thefunctional structure of the threshold policy when determining the MDP solution.Therefore, a method for applying a machine learning approach to the solution struc-ture of a Hamilton-Jacobi-Bellman Quasivariational Inequality has been presented.In future work it is worthwhile considering bid-ask spreads and penalizing rapidtransaction requests as explored in [2]. Moreover, including an option to buy andsell shares of multiple stocks may provide new ways to minimize risk through ex-ploiting correlations (or anti-correlations) in stock price evolutions. Other on-line42learning techniques may be explored if alternative scenarios deserve application,and multiple-investor quasivariational inequality applications with game theoreticstructures may have interesting machine learning applications as well.43Chapter 7Tables and Figures7.1 TablesParameter Description ValueT liquidation horizon 10µ drift coefficient 0.1σ diffusion coefficient 0.3h discretization parameter 6xmin minimum cash in pocket 0xmax maximum cash in pocket 1e5ymin minimum shares owned 0ymax maximum shares owned 50pmin minimum price of asset 0pmax maximum price of asset 60c instantaneous reward constant 1C terminal reward constant 0.6∆n time discretization size 1δ discounting factor on expected future value 0.9κ price impact constant 0.1Table 7.1: Simulated Liquidation Problem Parameters447.2 FiguresFigure 7.1: Solving for the Value Function from Generator Function (5.1),Snapshot at Constant Shares Owned45Figure 7.2: Small action taken (”Sell 10 shares”)Figure 7.3: Maximum action taken (”Sell 50 shares”)46Figure 7.4: Value function over time and price, with constant shares held.Figure 7.5: Value function over time and price, with constant shares held.47Figure 7.6: Optimal Liquidation Strategy - Constant Price P = 10.Figure 7.7: Optimal Liquidation Strategy - Constant Price P = 60.48Figure 7.8: SPSA: Sub-Optimal Liquidation Strategy -Constant Price P = 10.Figure 7.9: SPSA: Sub-Optimal Liquidation Strategy -Constant Price P = 60.49Bibliography[1] Alain Bensoussan and Jacques Lions, Impulse control and quasi-variationalinequalities, Gaunthier-Villars, 1984. → pages 1, 2, 8[2] F. Guilbaud, M. Mnif, and H. Pham, “Numerical methods for an optimalorder execution problem,” ArXiv e-prints, June 2010. → pages 1, 2, 3, 4, 15,16, 24, 30, 37, 42[3] H. Pham, Continuous-time Stochastic Control and Optimization withFinancial Applications, Stochastic Modelling and Applied Probability (Book61). Springer, first edition, 2009. → pages 1, 2[4] M. Ohnishi and M. Tsujimura, “An impulse control of a geometric brownianmotion with quadratic costs,” European Journal of Operational Research, vol.168, no. 2, pp. 311 – 321, 2006. → pages 1, 19[5] A. Schied and T Scho¨neborn, “Risk aversion and the dynamics of optimalliquidation strategies in illiquid markets,” Finance and Stochastics, vol. 13,no. 2, pp. 181–204, 2009. → pages 1, 2[6] F. Black and M. Scholes, “The Pricing of Options and Corporate Liabilities,”Journal of Political Economy, vol. 81, no. 3, pp. 637–54, May-June 1973. →pages 1, 15[7] G. Pages, H. Pham, and J. 
Printems, “Optimal quantization methods andapplications to numerical problems in finance,” in Handbook of Computationaland Numerical Methods in Finance, SvetlozarT. Rachev, Ed., pp. 253–297.BirkhÃďuser Boston, 2004. → pages 1[8] R. Rishel and K. Helmes, “A variational inequality sufficient condition foroptimal stopping with application to an optimal stock selling problem,” SIAMJournal on Control and Optimization, vol. 45, no. 2, pp. 580–598, 2006. →pages 1[9] J. Getheral, “No-dynamic-arbitrage and market impact,” QuantitativeFinance, vol. 10, no. 7, pp. 749–759, sep 2010. → pages 2, 1550[10] E. Moro, J. Vicente, L. Moyano, A. Gerig, J. Farmer, G. Vaglica, F. Lillo, andR. Mantegna, “Market impact and trading profile of large trading orders instock markets,” Papers, arXiv.org, Aug. 2009. → pages 2[11] Sheldon Ross, Introduction to Stochastic Dynamic Programming, AcademicPress, 1991. → pages 2, 30[12] Martin Puterman, Markov Decision Processes: Discrete Stochastic DynamicProgramming, John Wiley & Sons, Inc., New York, NY, USA, 1st edition,1994. → pages 2, 5, 7, 35[13] N. Bauerle and U. Rieder, Markov Decision Processes with Applications toFinance, Springer, 2011. → pages 2[14] A. Friedman, “Optimal stopping problems in stochastic control,” SIAMReview, vol. 21, no. 1, pp. 71–80, 1979. → pages 2, 19[15] I. Kharroubi, J. Ma, H. Pham, and J. Zhang, “Backward sdes withconstrained jumps and quasi-variational inequalities,” The Annals ofProbability, vol. 38, no. 2, pp. 794âĂŞ840, 2010. → pages 2[16] L. Rogers and S. Singh, “The cost of illiquidity and its effects on hedging,”Mathematical Finance, vol. 20, no. 4, pp. 597–615, 2010. → pages 2[17] Bernt Øksendal and Agnes Sulem, Applied Stochastic Control of JumpDiffusions, Springer, 2007. → pages 2[18] Nicola Bruti-Liberati, Numerical Solution of Stochastic Differential Equationswith Jumps in Finance, Finance Discipline Group, UTS Business School,University of Technology, Sydney, 2007. → pages 2[19] M. Davis, X. Guo, and G. Wu, “Impulse control of multidimensional jumpdiffusions.,” SIAM J. Control and Optimization, vol. 48, no. 8, pp. 5276–5293,2010. → pages 2[20] A. Bensoussan, R. Liu, and S. Sethi, “Optimality of an (s,s) policy withcompound poisson and diffusion demands: A quasi-variational inequalitiesapproach,” SIAM Journal on Control and Optimization, vol. 44, no. 5, pp.1650–1676, 2005. → pages 2[21] R. Bass, “Stochastic differential equations with jumps. probability survey,”Trans. Amer. Math. Soc, pp. 1–19, 2004. → pages 2[22] P. Glasserman and N. Merener, “Convergence of a discretization scheme forjump-diffusion processes with state-dependent intensities,” in Proceedings ofthe Royal Society: Series A, 2001, pp. 111–127. → pages 251[23] R. Seydel, Impulse Control for Jump Diffusions: Viscosity Solutions of QuasiVariational Inequalities and Applications in Bank Risk Management, 2010. →pages 2[24] J. Spall, Introduction to Stochastic Search and Optimization: Estimation,Simulation, and Control, Wiley, 2003. → pages 3, 13, 39[25] H. Ishii M. Crandall and P. Lions, “User’s guide to viscosity solutions ofsecond order partial differential equations,” Bulletin of the AmericanMathematical Society, vol. 27, no. 12, pp. 1–69, 1992. → pages 12[26] L. Prashanth S. Bhatnagar, H. Prasad, Stochastic Recursive Algorithms forOptimization, Springer, first edition, 2013. → pages 13[27] J. Chancelier, B. Øksendal, and A. 
Sulem, “Combined stochastic control andoptimal stopping, and application to numerical approximation of combinedstochastic and impulse control,” Stochastic Financial Mathematics, pp.149–172, 2002. → pages 15, 30, 34[28] M. Jose Junca Pelaez, “Optimal execution strategy: Price impact andtransaction cost,” 2011. → pages 19[29] V. Zakamouline, “Optimal portfolio selection with both fixed andproportional transaction costs for a crra investor with finite horizon,” inDiscussion Paper 2002, Institute of Finance and Management Science.Norwegian School of Economics and Business Administration, 2002. → pages19[30] M. Gaigi, V. Vath, M. Mnif, and S. Toumi, “Numerical approximation for aportfolio optimization problem under liquidity risk and costs,” 2013. → pages23, 24, 25, 28, 30, 56, 57[31] I. Kharroubi and H. Pham, “Optimal portfolio liquidation with execution costand risk,” SIAM Journal of Financial Mathematics, vol. 1, pp. 897–931, 2010.→ pages 24[32] Zhuliang Chen and Peter A. Forsyth, “A numerical scheme for the impulsecontrol formulation for pricing variable annuities with a guaranteed minimumwithdrawal benefit (gmwb),” 2008. → pages 30[33] B. Øksendal, Stochastic Differential Equations - An Introduction withApplications, Springer, 2003. → pages 35[34] V. Krishnamurthy, “Bayesian sequential detection with phase-distributedchange time and nonlinear penalty - a POMDP lattice programmingapproach.,” IEEE Transactions on Information Theory, vol. 57, no. 10, pp.7096–7124, 2011. → pages 3952[35] H.J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithmsand Applications, Applications of mathematics. Springer, 2003. → pages 40[36] V. Ly Vath, M. Mnif, and H. Pham, “A model of optimal portfolio selectionunder liquidity risk and price impact,” Finance and Stochastics, vol. 11, no. 1,pp. 51–90, 2007. → pages 55, 60[37] M. Ngo and V. Krishnamurthy, “Optimality of threshold policies fortransmission scheduling in correlated fading channels,” Transactions onCommunications, vol. 57, no. 8, pp. 2474–2483, Aug. 2009. → pages 6353Appendix AProofsA.1 Proof of Theorem 4.2.1Proof. Given l ∈ N\{0}, from the definition of Al(x,y,p) in (4.18),Al(x,y,p)⊂Al+1(x,y,p)⊂A(x,y,p).The above equation implies that the corresponding value functions are monotoneincreasing in lvl(t,x,y,p)≤ vl+1(t,x,y,p)≤ v(t,x,y,p).So in the limitliml→∞vl(t,x,y,p)≤ v(t,x,y,p). (A.1)Next, choose  > 0 and α= {τ1, τ2, ...;ζ1, ζ2, ...) ∈ A(x,y,p) such that, from (3.8)E(x,y,p)∑τi≤Tc(Pτ−i, ζi)= vα(t,x,y,p)≥ v(t,x,y,p)− . (A.2)With Pαt diffusing under impulse control α. Next choose l s.t.αl = (τ1, ζ1, ..., τl−1, ζl−1, τ¯),54where τl−1 < τ¯ 0. 
We can also assume such a reward function stucture, andassume it holds throughout the sketch as Assumption 3.0.2 is maintained.Consider the approximating scheme to HJBQVI (3.11), with time discretizationfrom (4.1):Sh(t,x,y,p,vh(t,x,y,p),vh) = 0, (t,x,y,p) ∈ [0,T ]×S¯, (A.4)55where Sh : [0,T ]×S¯ ×G([0,T ])×S¯)→ R is defined bySh(t,x,y,p,z,φ)≡max[z−E[φ(t+h,X0,t,x,y,pt+h ,Y0,t,x,y,pt+h ,P0,t,x,y,pt+h )],z−Mφ(t,x,y,p)] if t ∈ [0,T −h]max[z−E[φ(T,X0,t,x,y,pT ,Y0,t,x,y,pT ,P0,t,x,y,pT )],z−Mφ(t,x,y,p)] if t ∈ (T −h,T )max[r−CT (YT ,PT ),z−Mφ(t,x,y,p)] if t= T.(A.5)The notation Ξ0,t,x,y,pt′ indicates the value of a state process Ξt(ω) at time t′ whenstarting at time t in state (x,y,p) with no control action applied.Scheme (A.5) is formulated as a backward scheme for vh as follows:vh(T,x,y,p) = max[CT (y,p),Mv(T,x,y,p)], for t= T,vh(t,x,y,p) = max[E[vh(t+h,X0,t,x,y,pT ,Y0,t,x,y,pT ,P0,t,x,y,pT )],Mvh(t,x,y,p)], for t ∈ [0,T −h],vh(t,x,y,p) = vh(T −h,x,y,p), for t ∈ (T −h,T ).(A.6)This approximating scheme, as noted in [30], is a priori implicity due to the nonlocal obstacle term M. The cure is to iterate the scheme as a sequence of optimalstopping problems as follows:vh,n+1(T,x,y,p) = max[CT (y,p),Mv(T,x,y,p)] for t= T,vh,n+1(t,x,y,p) = max[E[vh,n+1(t+h,X0,t,x,y,pt+h ,Y0,t,x,y,pt+h ,P0,t,x,y,pt+h )],Mvh,n(t,x,y,p)], for t ∈ [0,T −h],vh,n+1(t,x,y,p) = vh,n+1(T −h,x,y,p), for t ∈ (T −h,T ).(A.7)Next, considering the state and action discretizations (4.8) and (4.9), respectively,along with the iterated optimal stopping problem treatment of scheme (A.5) (namely,(A.7)) and the approximation of the expected value arising in the scheme given by56(4.13), [30] states that the approximating scheme (A.5) becomesSh,R,N,M (t,x,y,p,z,φ)≡max[z−E [φ(t+h,X0,t,x,y,pt+h ,Y0,t,x,y,pt+h ,P0,t,x,y,pt+h )],z−MM,Rφ(t,x,y,p)] if t ∈ [0,T −h]max[z−E [φ(T,X0,t,x,y,pT ,Y0,t,x,y,pT ,P0,t,x,y,pT )],z−MM,Rφ(t,x,y,p)] if t ∈ (T −h,T )max[r−CT (YT ,PT ),z−MM,Rφ(t,x,y,p)] if t= T.(A.8)The iterative optimal stopping problem scheme, combined with the discrete state,action and time scheme (A.8) yield the following inductive schemevh,n+1(tm,x,y,p) = maxCT (y,p),∑ζ∈CM,R(x,y,p)v(T,Γ(x,y,p,ζ))vh,n+1(ti,x,y,p)= max[EN,R[vh,n+1(ti+1,X0,t,x,y,pti+1 ,Y0,t,x,y,pti+1 ,P0,t,x,y,pti+1 )], supζ∈CM,R(x,y,p)vh,n(ti,Γ(x,y,p,ζ))](A.9)where MM,R is given by (4.16). The development of this numerical scheme is themain focus of [30], and the proof of the convergence of the value function vh,n from(A.9) to the value function v which solves the HJBQVI (3.11) is accomplished when(A.9) satisfies monotonicity, stability, and consistency properties. Since this proof isexplained in thourough detail in [30], only the propositions will be provided, withoutproof.Proposition A.2.1. (Monotonicity) For all h> 0,(t,x,y,p)∈ [0,T ]×S¯,g ∈R, and φ,ψ ∈G([0,T ]) with φ≤ ψ,Sh,R,N,M (t,x,y,p,φ)≥ Sh,R,N,M (t,x,y,p,ψ).Proposition A.2.2. (Consistency) Take N = 1hq s.t. q > 2.1. Let Z0,t′,x,y,pt′+h = (X0,t′,x,y,pt′+h ,Y0,t′,x,y,pt′+h ,P0,t′,x,y,pt′+h ) and z= (x,y,p). For all (t,x,y,p)∈57[0,T )×S¯ and Lipschitz function φ ∈ C1,2([0,T )×S¯):limsup(h,t′,z′)→(0,z)(M,N,R)→+∞maxφ(t′,z′)−EN,R[φ(t′+h,Z0,t′,x,y,pt′+h )]h,(φ(t′,z′)−MM,Rφ(t′,z′))≤max{(∂φ∂t+Lφ)(t,z),Mφ(t,z)−φ(t,z)}andliminf(h,t′,z′)→(0,z)(M,N,R)→+∞maxφ(t′,z′)−EN,R[φ(t′+h,Z0,t′,x,y,pt′+h )]h,(φ(t′,z′)−MM,Rφ(t′,z′))≥max{(∂φ∂t+Lφ)(t,z),Mφ(t,z)−φ(t,z)}2. 
For all z = (x,y,p) ∈ S¯ and Lipschitz function φ ∈ C1,2([0,T ]×S¯)limsup(h,t′,z′)→(0,z)(M,N,R)→+∞max{φ(t′,z′)−CT (y′,p′),MM,RCT (y′,p′)}≤max{φ(T,z)−CT (y,p),Mφ(T,z)−φ(T,z)}andliminf(h,t′,z′)→(0,z)(M,N,R)→+∞max{φ(t′,z′)−CT (y′,p′),MM,RCT (y′,p′)}≥max{φ(T,z)−CT (y,p),Mφ(T,z)−φ(T,z)} .Lemma A.2.1. For all (t,x,y,p) ∈ Tm×S¯loc,limn→+∞vh,R,N,Mn (t,x,y,p) = vh,R,N,M (t,x,y,p)Proposition A.2.3. (Iteratated optimal stopping problems) Define φh,R,N,Mn itera-tively as a sequence of optimal stopping problemsφh,R,N,Mn+1 (t,x,y,p) = supτ∈Sht,TEN,R[MM,Rφh,R,N,Mn (τ,Z0,t,x,y,pN,R (τ))] ∀(t,x,y,p) ∈ Tm×S¯locφh,R,N,M0 (t,x,y,p) = vh,R,N,M0 (t,x,y,p) for all (t,x,y,p) ∈ Tm×S¯loc.58where Sht,T is the set of all Fs-stopping times in [t,T ], t≤ s≤ T , with time discretiza-tion h. Given (t,x,y,p) ∈ Tm×S¯loc, it is shown thatφh,R,N,Mn (t,x,y,p) = vh,R,N,Mn (t,x,y,p).Corollary A.2.1. Given (t,x,y,p) ∈ Tm×S¯loc,limn→+∞φh,R,N,Mn (t,x,y,p) = vh,R,N,M (t,x,y,p).Proposition A.2.4. (Stability) ∀h > 0,∃! solution vh,R,N,Mn (t,x,y,p) ∈ G([0,T ]×S¯) to the approximating scheme (A.5) and the sequence (vh,R,N,Mn )h is uniformlybounded in G([0,T ]× S¯), i.e., ∃w ∈ G([0,T ]× S¯) such that |vh,R,N,Mn | ≤ w for allh > 0.Proposition A.2.5. 1. Let O⊂ S¯. A locally bounded function v¯ on [0,T )×S¯ isa viscosity subsolution (resp.¯v is a supersolution) of (3.11) in [0,T )×O if forall (t¯, x¯, y¯, p¯)∈ [0,T )×O and φ∈C1,2([0,T )×S¯ s.t. (v¯−φ)(t¯, x¯, y¯, p¯) = 0 (resp.(¯v−φ)(¯t, x¯, y¯, p¯) = 0) and (t¯, x¯, y¯, p¯) is a maximum of v¯−φ (resp. a minimumof¯v ∗−φ) on [0,T )×O. Thenmax{∂φ∂t(t¯, x¯, y¯, p¯) +Lφ(t¯, x¯, y¯, p¯),Mv¯(t¯, x¯, y¯, p¯)− v¯(t¯, x¯, y¯, p¯)}≤ 0, in [0,T )×O,(A.10)max{∂φ∂t(t¯, x¯, y¯, p¯) +Lφ(t¯, x¯, y¯, p¯),M¯v(t¯, x¯, y¯, p¯)−¯v(t¯, x¯, y¯, p¯)}≥ 0, in [0,T )×O,(A.11)2. A locally bounded function u on [0,T )×S¯ is a constrained viscosity solutionof (3.11) in [0,T )×S if u is a viscosity subsolution of (3.11) in [0,T )×S¯ anda viscosity supersolution of (3.11) in [0,T )×S.Proposition A.2.6. (Convergence) For all (t′,x′,y′,p′) ∈ Tm×Zl and (t,x,y,p) ∈[0,T )×S,lim(t′,x′,y′,p′)→(t,x,y,p)(h,M,N,R)→+(0,+∞,+∞,+∞)vh,R,N,M (t′,x′,y′,p′) = v(t,x,y,p),where vh,R,N,M (t′,x′,y′,p′) is the solution to the approximating scheme (A.6), andv(t,x,y,p) is the solution the the HJBQVI (3.11)59Proof. A sketch of this proof is given as follows, using the propositions determinedthus far:1. Assume v¯(t,x,y,p) and¯v(t,x,y,p) are sub- and supersolutions respectively ofthe value function v(t,x,y,p) solving (3.11). These solutions are defined asfollows:v¯(t,x,y,p) = limsup(t′,x′,y′,p′)→(t,x,y,p)(h,M,N,R)→+(0,+∞,+∞,+∞)vh,R,N,M (t′,x′,y′,p′)¯v(t,x,y,p) = liminf(t′,x′,y′,p′)→(t,x,y,p)(h,M,N,R)→+(0,+∞,+∞,+∞)vh,R,N,M (t′,x′,y′,p′).(A.12)2. Since v¯(t,x,y,p) is upper semicontinuous and¯v(t,x,y,p) is lower semicontin-uous, the comparison principle shown in [36] gives v¯(t,x,y,p)≤¯v(t,x,y,p) forall (t,x,y,p) ∈ [0,T ]×S. However, from (A.12) it is clear that v¯(t,x,y,p ≥¯v(t,x,y,p) for all (t,x,y,p) ∈ [0,T ]×S, so it must be the case that v¯(t,x,y,p=¯v(t,x,y,p) = v(t,x,y,p), the unique continuous solution of (3.11), and as suchvh,R,N,M converges uniformly to v.3. Consider v¯. Let (t¯, x¯, y¯, p¯) strict global maximum of v¯−φ on [0,T ]×S¯, whereφ∈C1,2([0,T ]×S¯, such that v¯(t,x,y,p)−φ(t,x,y,p)≤ v¯(t¯, x¯, y¯, p¯)−φ(t¯, x¯, y¯, p¯) =0 on [0,T ]×S¯.4. 
Through the combination of the stability, monotonicity, and consistency prop-erties shows that for the subsolution v¯:0≤max[∂φ∂t(t¯, x¯, y¯, p¯) +Lφ(t¯, x¯, y¯, p¯),Mφ(t¯, x¯, y¯, p¯)−φ(t¯, x¯, y¯, p¯)].and likewise that for the supersolution¯v:0≥max[∂φ∂t(t¯, x¯, y¯, p¯) +Lφ(t¯, x¯, y¯, p¯),Mφ(t¯, x¯, y¯, p¯)−φ(t¯, x¯, y¯, p¯)],which, by Propostion A.2.5, shows that v¯ and¯v are true sub- and supersolutionsto (3.11), and when combined with the convergence of the approximating schemevh,R,N,M to both sub- and supersolutions, shows that the approximating schemedoes indeed converge to the solution of HJBQVI (3.11)This completes the proof of Theorem 4.2.260A.3 Proof of Lemma 4.3.1Proof. The proof is by induction on n. Because vT (x,y,p) = CT (y,p) is concavein y and p by Assumption 3.2.1, assume that vi(x,y,p) is concave in y and p fori= T −1, ...,n+ 1.In order to show that vn(·, ·, ·) is concave in y, it is required that for 0< λ < 1,vn(x,λy1 + (1−λ)y2,p)≥ λvn(x,y1,p) + (1−λ)vn(x,y2,p).Now, for some α1 ≤ y1 and α2 ≤ y2,vn(x,y1,p) = c(α1,p) + v¯n+1(x,y1−α1,p−m(α1)),vn(x,y2,p) = c(α2,p) + v¯n+1(x,y2−α2,p−m(α2)).However, because λα1 + (1−λ)α2 ≤ λy1 + (1−λ)y2, it follows from (4.27) thatvn(x,λy1 + (1−λ)y2),p)≥ c(λα1 + (1−λ)α2,p−m(λα1 + (1−λ)α2))+ v¯n+1(x,λ(y1−α1) + (1−λ)(y2−α2),p−m(λα1 + (1−λ)α2))≥ λc(α1,p) + (1−λ)c(α2,p) +λv¯n+1(x,y1−α1,p−m(α1))+ (1−λ)v¯n+1(x,y2−α2,p−m(α2))= λvn(x,y1,p) + (1−λ)vn(x,y2,p),where the second inequality follows from the concavity of c(·, ·), and the concavityof v¯n+1(·, ·, ·), the latter of which follows from the induction hypothesis. Thereforevn(x,y,p) is concave in y.61A.4 Proof of Theorem 4.3.1Proof. To prove (i), let α¯ = αn(x,y,p). Then, by its definition, it follows that forα < α¯c(α¯,p) + v¯n+1(x,y− α¯,p−m(α¯))> c(α,p) + v¯n+1(x,y−α,p−m(α)). (A.13)For  > 0, it is shown next that αn(x,y+ ,p)≥ α¯ by proving that whenever α < α¯,c(α¯,p) + v¯n+1(x,y+ − α¯,p−m(α¯))≥ c(α,p) + v¯n+1(x,y+ −α,p−m(α)).From (A.13) this will be proven if it can be shown that whenever α < α¯,v¯n+1(x,y−α,p−m(α))− v¯n+1(x,y− α¯,p−m(α¯))≥ v¯n+1(x,y+ −α,p−m(α))− v¯n+1(x,y+ − α¯,p−m(α¯)).This follows from the concavity of vn in y, which implies the concavity of v¯n+1.Therefore (i) is proven.To prove that αn(x,y,p) ≥ αn−1(x,y,p), let α¯ = αn(x,y,p). If a¯ = y, then theresult is immediate; so suppose that α¯ < y. Fix  < y− α¯ and let α∗ = αn(x,y− a¯−,p−m(α¯)). Note that by part (i) of Thm. 4.3.1, it follows that α∗ ≤ α¯. Now,c(α¯+ ,p) +vn(x,y− α¯− ,p−m(α¯+ ))= c(α¯+ ,p) + c(α∗,p)+ v¯n+1(x,y−α− −α∗,p−m(α+ )−m(α∗))≤ c(α¯,p) + c(α∗+ ,p) + v¯n+1(x,y− α¯− −α∗,p−m(α¯+α∗+ ))≤ c(α¯,p) +vn(x,y− α¯,p−m(α¯)), (A.14)where the first inequality follows from the concavity of c in y and p and the fact62that α∗ ≤ α¯. By using the relationshipv¯n(x,y,p) = qvn(x,y,p) + (1− q)v¯n+1(x,y,p),it follows thatc(α¯+ ,p) + v¯n(x,y− α¯− ,p−m(α¯+ ))− c(α¯,p)− v¯n(x,y− α¯,p−m(α¯))= q[c(α¯+ ,p) +vn(x,y− α¯− ,m(α¯+ ))− c(α¯,p)−vn(x,y− α¯,p−m(α¯))]+ (1− q)[c(α¯+ ,p) + v¯n+1(x,y− α¯− ,m(α¯+ ))− c(α¯,p)− v¯n+1(x,y− α¯,p−m(α¯))].The first term in the square brackets is non-positive by (A.14), and the second isnon-positive because a¯= an(Y ). Thereforec(α¯+ ,p) + v¯n(x,y− α¯− ,p−m(α¯+ ))≤ c(α¯,p) + v¯n(x,y− α¯,p−m(α¯)),implying thatαn−1(x,y,p)≤ α¯= αn(x,y,p).The proof of (iii.) does not utilize the sequential allocation methodology usedabove in the proofs of (i.) and (ii.), since the price state variable p differs fromshares y and time index n in that it is stochastic in nature. 
Therefore we makean argument involving the supermodularity adapted from [37] of the state-actionreward function given in (4.24) in variables (a,p).Definition A.4.1. (Supermodularity)A function F (x,y) :X×Y →R is supermodular in F (x1,y1)+F (x2,y2)≥F (x1,y2)+F (x2,y1) for all x1,x2 ∈X and y1,y2 ∈ Y , such that x1 > x2,y1 > y2. Similarily, ifthe inequality is reversed, the function F (·, ·) is called submodular.The methodology to the proof is given in two steps:1. Monotonicity of the optimal reward function vn(·, ·, ·) in p2. Supermodularity: show that the state-action reward function Qn(x,y,p,a) is63supermodular in (a,p) using mathematical induction. The monotonicity andtherefore the nondecreasing structure of the optimal liquidation policy α∗n(·, ·, ·)given in (4.23) follows.Step 1 is shown in the following lemma.Lemma A.4.1. The optimal cost to go function vn(x,y,p) defined by (4.22) isincreasing in the price of the asset p.Proof. It is shown that vn(x,y,p) is monotone in p by induction. VT (x,y,p) isincreasing in p since C(·, ·) is increasing from Assumption 3.2.1. From the definitionof vn(x,y,p) given by (4.22) and (4.24), the monotonicity of vn(x,y,p) in p followsimmediately by induction.The next theorem outlines Step 2.Theorem A.4.1. If the terminal reward function C(·, ·) is an increasing functionin p and is integer convex such thatC(y,p+ 2)−C(y,p+ 1)≥ C(y,p+ 1)−C(y,p) ∀y,p≥ 0, (A.15)then the state-action reward fucntion Qn(x,y,p,a) is supermodular in (a,p), i.e.Qn(x,y,p,a)−Qn(x,y,p,0)≤Qn(x,y,p+ 1,a)−Qn(x,y,p+ 1,0), ∀a≤ 0. (A.16)and as a result, the optimal liquidation policy α∗n(x,y,p) is nondecreasing with respectto the asset price p.Proof. First, rewrite the state-action reward function Qn(x,y,p,a) given by (4.24)when considering (4.28) such thatQn(x,y,p,a) = c(p,a) +∑p′∈Zl∩PP(p′|p)×[fm(a,p)vn+1(x+pa,y−a,p′−m(a)) + (1−fm(a,p))vn+1(x+pa,y−a,p′)]= c(p,a) +∑p′∈Zl∩PP(p′|p)× (vn+1(x+pa,y−a,p′−m(a))−vn+1(x+pa,y−a,p′))+∑p′∈Zl∩PP(p′|p)vn+1(x+pa,y−a,p′).(A.17)64From (A.17) we can see that Qn(x,y,p,a) is supermodular in (a,p) iff vn(x,y,p) hasincreasing differences in p. The following lemma completes the proof of (iii.)Lemma A.4.2. If the terminal reward function C(·, ·) is an increasing functionand satisfies (A.15) then the optimal value function vn(x,y,p) given by (4.22) hasincreasing differences in the price state state:vn(x,y,p+ 2)−vn(x,y,p+ 1)≥ vn(x,y,p+ 1)−vn(x,y,p), (A.18)for all x,y,p ∈ S¯loc. The result is that the state-action reward function Qn(x,y,p,a)satisfies (A.16), and as such is supermodular in (a,p).Proof of (A.18) by mathematical induction: First notice that (A.18) holds forn= T due to (A.15). Assume (A.18) holds for n= k. We prove that (A.18) holds forn= k−1. Let vk−1(x,y,p+2) =Qk−1(x,y,p+2,a2),vk−1(x,y,p+1) =Qk−1(x,y,p+1,a1),vk−1(x,y,p) = Qk−1(x,y,p,a0) for some a2,a1,a0 ∈ CM,R which are optimalcontrols at time k− 1 for states (x,y,p+ 2),(x,y,p+ 1), and (x,y,p) respectively.We must show thatQk−1(x,y,p+ 2,a2)−Qk−1(x,y,p+ 1,a1)−Qk−1(x,y,p+ 1,a1) +Qk−1(x,y,p,a0)≤ 0mQk−1(x,y,p+ 2,a2)−Qk−1(x,y,p+ 1,a2)︸ ︷︷ ︸A+Qk+1(x,y,p+ 1,a2)−Qk−1(x,y,p+ 1,a1)︸ ︷︷ ︸≤0 by optimality.−Qk−1(x,y,p+ 1,a1) +Qk−1(x,y,p+ 1,a0)︸ ︷︷ ︸≤0 by optimality.−(Qk+1(x,y,p+ 1,a0)−Qk−1(x,y,p,a0)︸ ︷︷ ︸B≤ 0.It follows from the induction hypothesis thatA=∑p′∈Zl∩PP(p′|p)[fm(a,p)(vk(x,y,p′+ 1)−vk(x,y,p′))+ (1−fm(a,p))(vk(x,y,p′+ 2)−vk(x,y,p′+ 1))]≤∑p′∈Zl∩PP(p′|p)(vk(x,y,p′+ 1)−vk(x,y,p′)).Similarly, B ≥∑p′∈Zl∩PP(p′|p)(vk(x,y,p+ 1)− vk(x,y,p)), so A−B ≤ 0. 
Therefore vn(x,y,p) satisfies (A.18), which implies that Qn(x,y,p,a) is supermodular in (a,p) due to (A.17). As such, part (iii.) of Theorem 4.3.1 has been shown.