{"Affiliation":[{"label":"Affiliation","value":"Applied Science, Faculty of","attrs":{"lang":"en","ns":"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool","classmap":"vivo:EducationalProcess","property":"vivo:departmentOrSchool"},"iri":"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool","explain":"VIVO-ISF Ontology V1.6 Property; The department or school name within institution; Not intended to be an institution name."},{"label":"Affiliation","value":"Electrical and Computer Engineering, Department of","attrs":{"lang":"en","ns":"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool","classmap":"vivo:EducationalProcess","property":"vivo:departmentOrSchool"},"iri":"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool","explain":"VIVO-ISF Ontology V1.6 Property; The department or school name within institution; Not intended to be an institution name."}],"AggregatedSourceRepository":[{"label":"AggregatedSourceRepository","value":"DSpace","attrs":{"lang":"en","ns":"http:\/\/www.europeana.eu\/schemas\/edm\/dataProvider","classmap":"ore:Aggregation","property":"edm:dataProvider"},"iri":"http:\/\/www.europeana.eu\/schemas\/edm\/dataProvider","explain":"A Europeana Data Model Property; The name or identifier of the organization who contributes data indirectly to an aggregation service (e.g. 
Europeana)"}],"Campus":[{"label":"Campus","value":"UBCV","attrs":{"lang":"en","ns":"https:\/\/open.library.ubc.ca\/terms#degreeCampus","classmap":"oc:ThesisDescription","property":"oc:degreeCampus"},"iri":"https:\/\/open.library.ubc.ca\/terms#degreeCampus","explain":"UBC Open Collections Metadata Components; Local Field; Identifies the name of the campus from which the graduate completed their degree."}],"Creator":[{"label":"Creator","value":"Aprem, Anup","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/creator","classmap":"dpla:SourceResource","property":"dcterms:creator"},"iri":"http:\/\/purl.org\/dc\/terms\/creator","explain":"A Dublin Core Terms Property; An entity primarily responsible for making the resource.; Examples of a Contributor include a person, an organization, or a service."}],"DateAvailable":[{"label":"DateAvailable","value":"2017-12-04T23:13:56Z","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/issued","classmap":"edm:WebResource","property":"dcterms:issued"},"iri":"http:\/\/purl.org\/dc\/terms\/issued","explain":"A Dublin Core Terms Property; Date of formal issuance (e.g., publication) of the resource."}],"DateIssued":[{"label":"DateIssued","value":"2017","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/issued","classmap":"oc:SourceResource","property":"dcterms:issued"},"iri":"http:\/\/purl.org\/dc\/terms\/issued","explain":"A Dublin Core Terms Property; Date of formal issuance (e.g., publication) of the resource."}],"Degree":[{"label":"Degree","value":"Doctor of Philosophy - PhD","attrs":{"lang":"en","ns":"http:\/\/vivoweb.org\/ontology\/core#relatedDegree","classmap":"vivo:ThesisDegree","property":"vivo:relatedDegree"},"iri":"http:\/\/vivoweb.org\/ontology\/core#relatedDegree","explain":"VIVO-ISF Ontology V1.6 Property; The thesis degree; Extended Property specified by UBC, as per https:\/\/wiki.duraspace.org\/display\/VIVO\/Ontology+Editor%27s+Guide"}],"DegreeGrantor":[{"label":"DegreeGrantor","value":"University of 
British Columbia","attrs":{"lang":"en","ns":"https:\/\/open.library.ubc.ca\/terms#degreeGrantor","classmap":"oc:ThesisDescription","property":"oc:degreeGrantor"},"iri":"https:\/\/open.library.ubc.ca\/terms#degreeGrantor","explain":"UBC Open Collections Metadata Components; Local Field; Indicates the institution where thesis was granted."}],"Description":[{"label":"Description","value":"Due to large scale use of online social media there has been growing interest in modeling and analysis of data from online social media. The unifying theme of this thesis is to develop a set of mathematical tools for detection, estimation and control in online social media. The following are the main contributions of this thesis: Chapter 2 deals with nonparametric change detection for dynamic utility maximization agents. Using the revealed preference framework, necessary and sufficient conditions for detecting the change point are derived. In the presence of noisy measurements, we construct a decision test to check for dynamic utility maximization behaviour and the change point. Experiments on the Yahoo! Tech Buzz dataset show that the framework can be used to detect changes in ground truth using online search data. Chapter 3 studies engagement dynamics and sensitivity analysis of YouTube videos. Using machine learning and sensitivity analysis techniques it is shown that the video view count is sensitive to 5 meta-level features. In addition, changing the meta-level features after the video has been posted increases the popularity of the video. We also examine how the social dynamics of a YouTube channel affect its popularity. The results are empirically validated on a real-world dataset consisting of about 6 million videos spread over 25 thousand channels. Chapter 4 considers the problem of scheduling advertisements in live personalized online social media. Broadcasters aim to opportunistically schedule advertisements (ads) so as to generate maximum revenue. 
The problem is formulated as a multiple stopping problem and is addressed in a partially observed Markov decision process (POMDP) framework. Structural results are provided on the optimal ad scheduling policy. By exploiting the structure of the optimal policy, optimum linear threshold policies are computed using a stochastic gradient algorithm.\r\nThe proposed model and framework are validated on a Periscope dataset and it was found that the revenue can be improved by 25% in comparison to currently employed periodic scheduling.","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/description","classmap":"dpla:SourceResource","property":"dcterms:description"},"iri":"http:\/\/purl.org\/dc\/terms\/description","explain":"A Dublin Core Terms Property; An account of the resource.; Description may include but is not limited to: an abstract, a table of contents, a graphical representation, or a free-text account of the resource."}],"DigitalResourceOriginalRecord":[{"label":"DigitalResourceOriginalRecord","value":"https:\/\/circle.library.ubc.ca\/rest\/handle\/2429\/63811?expand=metadata","attrs":{"lang":"en","ns":"http:\/\/www.europeana.eu\/schemas\/edm\/aggregatedCHO","classmap":"ore:Aggregation","property":"edm:aggregatedCHO"},"iri":"http:\/\/www.europeana.eu\/schemas\/edm\/aggregatedCHO","explain":"A Europeana Data Model Property; The identifier of the source object, e.g. the Mona Lisa itself. 
This could be a full linked open date URI or an internal identifier"}],"FullText":[{"label":"FullText","value":"Detection, Estimation and Control in Online Social Media\nby\nAnup Aprem 2017\nM.E., Indian Institute of Science, 2012\nA THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY\nin\nThe Faculty of Graduate and Postdoctoral Studies\n(Electrical and Computer Engineering)\nTHE UNIVERSITY OF BRITISH COLUMBIA\n(Vancouver)\nDecember 2017\n\u00a9 Anup Aprem 2017\n\nAbstract\n\nDue to large scale use of online social media there has been growing interest in modeling and analysis of data from online social media. The unifying theme of this thesis is to develop a set of mathematical tools for detection, estimation and control in online social media. The following are the main contributions of this thesis:\n\u2022 Chapter 2 deals with nonparametric change detection for dynamic utility maximization agents. Using the revealed preference framework, necessary and sufficient conditions for detecting the change point are derived. In the presence of noisy measurements, we construct a decision test to check for dynamic utility maximization behaviour and the change point. Experiments on the Yahoo! Tech Buzz dataset show that the framework can be used to detect changes in ground truth using online search data.\n\u2022 Chapter 3 studies engagement dynamics and sensitivity analysis of YouTube videos. Using machine learning and sensitivity analysis techniques it is shown that the video view count is sensitive to 5 meta-level features. In addition, changing the meta-level features after the video has been posted increases the popularity of the video. We also examine how the social dynamics of a YouTube channel affect its popularity. The results are empirically validated on a real-world dataset consisting of about 6 million videos spread over 25 thousand channels.\n\u2022 Chapter 4 considers the problem of scheduling advertisements in live personalized online social media. 
Broadcasters aim to opportunistically schedule advertisements (ads) so as to generate maximum revenue. The problem is formulated as a multiple stopping problem and is addressed in a partially observed Markov decision process (POMDP) framework. Structural results are provided on the optimal ad scheduling policy. By exploiting the structure of the optimal policy, optimum linear threshold policies are computed using a stochastic gradient algorithm. The proposed model and framework are validated on a Periscope dataset and it was found that the revenue can be improved by 25% in comparison to currently employed periodic scheduling.\n\nLay Summary\n\nThis thesis studies three problems in online social media:\n\u2022 A framework for detecting changes in ground truth, such as onset of epidemic diseases, using online search data. The proposed framework is quite general; the only assumption is that the data is generated by users with rational behaviour.\n\u2022 Estimating the parameters that make YouTube videos popular. In addition, we examine how the social dynamics of a YouTube channel affect the popularity of its videos.\n\u2022 A framework for scheduling advertisements in personalized live social media applications like Periscope so as to maximize the advertisement revenue. Numerical results show that the proposed framework performs 25% better than existing methods.\n\nPreface\n\nThe work presented in this thesis is based on the research and development conducted in the Statistical Signal Processing Laboratory at the University of British Columbia (Vancouver). The research work presented in the chapters of this dissertation is performed by the author with feedback and assistance provided by Prof. Vikram Krishnamurthy. The author is responsible for writeup, problem formulation, research development, data analyses and numerical studies presented in this dissertation with frequent suggestions, technical and editorial feedback from Prof. Vikram Krishnamurthy. 
The ELM formulation in Chapter 3 is due in part to Dr. William Hoiles. The dataset used in Chapter 3 is from BroadBandTV Corp. The mathematical proofs in Chapter 4 were by Prof. Krishnamurthy. For Chapter 4, Sujay Bhatt provided valuable insights into the application and formulation of the problem.\nThe work presented in different chapters of the thesis has appeared in several publications which are listed below. In these publications, all co-authors contributed to the editing of the manuscript.\n\u2022 The work of Chapter 2 has been presented in the following publications:\n\u2013 [Journal Paper] A. Aprem and V. Krishnamurthy, Utility Change Point Detection in Online Social Media: A Revealed Preference Framework, IEEE Transactions on Signal Processing, vol.64, no.7, pp.1869-1880\n\u2013 [Conference Paper] A. Aprem and V. Krishnamurthy, A Data Centric Approach to Utility Change Detection in Online Social Media, IEEE Conf. on Acoustics, Speech, and Signal Processing, New Orleans, LA, USA, March 5-9, 2017\n\u2022 The work of Chapter 3 has been presented in the following publications:\n\u2013 [Journal Paper] W. Hoiles, A. Aprem and V. Krishnamurthy, Engagement dynamics and sensitivity analysis of YouTube videos, IEEE Transactions on Knowledge and Data Engineering, vol.7, pp.1426-1437\n\u2022 Materials in Chapter 4 have appeared in the following publications and pre-prints for possible publication\n\u2013 [Conference Paper] V. Krishnamurthy, A. Aprem and S. Bhatt, Multiple Stopping Time Problems: Structural Results, 54th Annual Allerton Conference on Communication, Control, and Computing, 2016\n\u2013 [Journal Paper] V. Krishnamurthy, A. Aprem and S. Bhatt, Opportunistic Advertisement Scheduling in Live Social Media: A Multiple Stopping Time POMDP Approach, https:\/\/arxiv.org\/abs\/1611.00291\n\u2022 Although not presented in this thesis, the discussion and results in Chapter 2 were inspired by the work presented in the following publication\n\u2013 [Journal Paper] W. Hoiles, V. 
Krishnamurthy and A. Aprem, PAC Algorithms for Detecting Nash Equilibrium Play in Social Networks: From Twitter to Energy Markets, IEEE Access, Special Section: Socially Enabled Networking and Computing.\n\nTable of Contents\n\nAbstract\nLay Summary\nPreface\nTable of Contents\nList of Tables\nList of Figures\nAcknowledgements\nDedication\n1 Introduction\n1.1 Overview\n1.1.1 Utility change detection in online social media\n1.1.2 Engagement dynamics and sensitivity analysis of YouTube videos\n1.1.3 Interactive advertisement scheduling in personalized live social media\n1.2 Main Contributions\n1.2.1 Utility change detection in online social media\n1.2.2 Engagement dynamics and sensitivity analysis of YouTube videos\n1.2.3 Interactive advertisement scheduling in personalized live social media\n1.3 Related Works\n1.3.1 Utility change detection in online social media\n1.3.2 Engagement dynamics and sensitivity analysis of YouTube videos\n1.3.3 Interactive advertisement scheduling in live personalized social media\n1.4 Thesis outline\n2 Utility Change Detection in Online Social Media\n2.1 Introduction\n2.2 Background: Utility maximization and Revealed preference\n2.2.1 Revealed preference in a noisy setting\n2.3 Revealed preference: Utility change point detection (deterministic case)\n2.3.1 Utility change model\n2.3.2 Recovery of minimum perturbation and base utility function\n2.3.3 Comparison with classical change detection\n2.4 Revealed preference: Utility change point detection in noise\n2.4.1 Estimation of unknown change point\n2.4.2 Recovering the linear perturbation coefficients for minimum false alarm probability\n2.5 Dimensionality reduction: Revealed preference for big data\n2.6 Numerical results\n2.6.1 Detection of unknown change point in the presence of noise\n2.6.2 Yahoo! Buzz Game\n2.6.3 Youstatanalyzer database\n2.7 Closing remarks\n2.8 Proof of theorems\n2.8.1 Proof of Theorem 2.3.1\n2.8.2 Negative dependence of random variables\n2.8.3 Proof of Theorem 2.2.2\n2.8.4 Proof of Theorem 2.4.1\n2.8.5 Proof of Lemma 2.8.2\n2.8.6 Proof of Lemma 2.8.3\n2.8.7 CUSUM algorithm for utility change point detection\n3 Engagement Dynamics and Sensitivity Analysis of YouTube videos\n3.1 Introduction\n3.2 Sensitivity analysis of YouTube meta-level features\n3.2.1 Extreme learning machine (ELM)\n3.2.2 Sensitivity analysis (Background)\n3.2.3 Sensitivity of YouTube meta-level features and predicting view count\n3.2.4 Sensitivity to meta-level optimization\n3.3 Social interaction of the channel with YouTube users\n3.3.1 Causality between subscribers and view count in YouTube\n3.3.2 Scheduling dynamics in YouTube\n3.3.3 Modeling the view count dynamics of videos with exogenous events\n3.3.4 Video playthrough dynamics\n3.4 Closing remarks\n3.5 Supplementary material\n3.5.1 Description of YouTube dataset\n3.5.2 Background: Statistical learning algorithms\n4 Interactive Advertisement Scheduling in Personalized Live Social Media\n4.1 Sequential multiple stopping and stochastic dynamic programming\n4.1.1 Optimal multiple stopping: POMDP formulation\n4.1.2 Belief state formulation of the objective\n4.1.3 Stochastic dynamic programming\n4.2 Optimal multiple stopping: Structural results\n4.2.1 Definitions\n4.2.2 Assumptions\n4.2.3 Main result: Optimality of threshold policies\n4.3 Stochastic gradient algorithm for estimating optimal linear threshold policies\n4.3.1 Structure of optimal linear threshold policies for multiple stopping\n4.3.2 Simulation-based stochastic gradient algorithm for estimating linear threshold policies\n4.4 Numerical examples: Interactive advertising in live social media\n4.4.1 Synthetic data\n4.4.2 Real dataset: Interactive ad scheduling on Periscope using viewer engagement\n4.4.3 Large state space models & Comparison with SARSOP\n4.5 Closing remarks\n4.6 Proof of theorems\n4.6.1 Preliminaries and Definitions\n4.6.2 Value iteration\n4.6.3 Proofs\n5 Conclusion\n5.1 Summary of findings\n5.2 Directions for future research\nBibliography\n\nList of Tables\n\n2.1 Comparison of revealed preference with classical change detection algorithms\n3.1 Performance and feature sensitivity\n3.2 Sensitivity to meta-level optimization\n3.3 Sensitivity of various traffic sources to meta-level optimization\n3.4 Fraction of channels satisfying the Granger causality hypothesis\n3.5 Metadata of YouTube channel and video\n3.6 Dataset summary\n3.7 YouTube dataset categories (out of 6 million videos)\n3.8 Popularity distribution of videos in the dataset\n3.9 Optimization summary statistics\n4.1 BIC model order selection\n4.2 Comparison of expected reward and the computational complexity\n\nList of Figures\n\n1.1 Meta-level features of a YouTube video\n2.1 Upper bound on the CDF of M\n2.2 Estimated change point detection and comparison to CUSUM algorithm\n2.3 Buzz scores and trading price for WIFI and WIMAX\n2.4 Recovered utility function\n3.1 View count and subscriber count of YouTube videos\n3.2 Sensitivity of meta-level features\n3.3 ELM predictive accuracy\n3.4 Granger causality test\n3.5 View count due to virality, migration and exogenous events\n3.6 Video playthrough dynamics\n3.7 Fraction of YouTube videos in the dataset as a function of the age of the videos\n4.1 Advertisement scheduling in personalized live social media\n4.2 Visual illustration of Theorem 4.2.1\n4.4 Periscope: Plot of likes and QQ-plot\n\nAcknowledgements\n\nI owe my sincere gratitude to Prof. Vikram Krishnamurthy. I am deeply indebted to him for his guidance, encouragement and support that I received during the completion of this work. This feat would not have been possible without his consistent support, fruitful discussions, valuable feedback, and constructive suggestions. It has been a pleasure working under his supervision.\nThe Statistical Signal Processing Lab in ECE department provided an engaging, productive, and stimulating lab environment to conduct research. Special thanks to Dr. William Hoiles for his time, effort and helpful discussions during our collaborations.\nLast but certainly not the least, I am indebted to my parents for their unfailing confidence, encouragement and love. The opportunities that they have given me and their unlimited sacrifices are the reasons for where I am and what I have accomplished so far.\nAbove all I thank the Almighty, for the blessings that He has showered on me which gave me the strength and power to sail through difficult times.\n\nDedication\n\nTo my parents. . .\n\nChapter 1\nIntroduction\n\n1.1 Overview\n\nIn recent years, social media has become ubiquitous for networking and content sharing and has become the preferred means by which humans interact online. For example, close to 1.2 billion users log onto Facebook daily to share content with other people on the social network and Twitter, a micro-blogging website, gets over 50 million tweets a day. Facebook posts and tweets on Twitter typically reflect events as they happen in real life. 
With the large scale usage of social media and the availability of data, online social media have become the new sensors for social behaviour.\nAnother paradigm shift in online social media is that more and more users are creating, sharing and distributing content on the Web. One of the most popular sites for user-generated video content is YouTube. YouTube has grown so popular that around 300 hours of video are uploaded every minute. YouTube earns revenue from user-generated content through advertising. In 2015, YouTube grossed 8.5 billion dollars in advertisement revenue. Through the Partner program YouTube shares the advertising revenue with the content creators. In recent years, due to improved bandwidth and the ability to create content in real-time, live streaming has become a popular means to generate and share content online. Live streaming was made popular by Twitch.tv which deals with live video gaming, play through of video games, and e-sport competitions. The revenue of Twitch.tv is around 3.8 billion for the year 2015, out of which 77% of the revenue was generated from advertisement. New applications like Periscope and Meerkat brought personalized live streaming to smartphones and made it even more popular.\nAs a result of the enormous importance and potential for generating revenue in online social media, there has been growing interest in modeling and analysis of online social media. Statistical inference from social media is an area that has witnessed tremendous progress recently. For example, [1] shows that using online search data the onset of influenza-type disease can be detected with a lag of 1-2 days, outperforming comparable methods from the Centers for Disease Control and Prevention (CDC), which usually has a lag of 1-2 weeks. However, online behaviour in social media is not influenced by economic incentives but rather by intrinsic utility like \u201ccuriosity\u201d, \u201cfun\u201d and \u201cfame\u201d. Hence, these problems are non-standard and traditional model-based methods are not suitable to analyze these problems.\nAdvertising revenue in online social media, for example YouTube videos, is related to the \u201cpopularity\u201d of the content, i.e. the number of views gathered by the video. Among the important factors that affect the popularity of a YouTube video are its meta-level features such as the title, tag, keywords and description of the video. The study of popularity of YouTube videos based on meta-level features is a challenging problem given the diversity of users and content providers.\nIn addition to traditional offline videos in YouTube, live streaming in YouTube and Twitch.tv and personalized live streaming in applications like Periscope also offer opportunities to generate revenue through advertisement. The revenue obtained in live videos depends on the click rate and completion rate of the advertisements. It is well known that the click rate and completion rate depend on the interest in the content of the video at any time. Viewers click and watch the advertisement if it is inserted when the content is interesting. Hence, advertisements need to be scheduled when the interest is high, so as to obtain maximum revenue.\nWith the above applications of online social media, there is a strong motivation to develop a set of mathematical models and algorithmic tools to study online social media. The generation and consumption of social media is driven by humans and hence research in social media requires borrowing techniques from multiple disciplines such as micro-economics, decision theory and machine learning. The thesis is devoted to development of algorithms and procedures that are aimed at detection, estimation and decision making in online social media. 
The rest of this section is devoted to an overview of these topics along with motivation and research goals that have been addressed in this thesis.\n\n1.1.1 Utility change detection in online social media\n\nUtility maximization is the fundamental problem humans face, wherein humans maximize utility given their limited resources of money or attention. However, the majority of content in social media is user-generated with limited or no economic incentives. As explained above, incentives such as \u201cfun\u201d and \u201cfame\u201d are some of the major attributes of the utility function of online social behaviour. It is therefore difficult to analytically characterize the utility function and hence any detection of utility maximization behaviour in online social media needs to be nonparametric in nature. The problem of nonparametric detection of utility maximizing behaviour is the central theme in the area of revealed preferences in microeconomics. This is fundamentally different to the theme used widely in the signal processing literature, where one postulates an objective function (typically convex) and then develops optimization algorithms. In contrast, the revealed preference framework, considered in this thesis, is data centric: given a dataset, revealed preference answers the question of whether the dataset is consistent with utility-maximization behaviour.\nChapter 2 of this thesis considers the extension of the classical revealed preference framework to agents with a \u201cdynamic utility function\u201d. The utility function jump changes at an unknown time instant (change point) by a linear perturbation. Such changes in utility function arise in online search in social media. Online search is currently the most popular method for information retrieval [2]. 
The online search process can be seen as an example of an agent maximizing the information utility, i.e. the amount of information consumed by an online agent given the limited resource on time and attention. There has been a gamut of research which links internet search behaviour to ground truths such as symptoms of illness, political election, or major sporting events [3\u20138]. Detection of utility change in online search, therefore, is helpful to identify changes in ground truth and useful, for example, for early containment of diseases [4] or predicting changes in political opinion [9, 10].\n\nResearch Goals\n\nMotivated by detection of change in ground truth using data from online search, our aim is to develop a nonparametric test to detect the change point and the utility functions before and after the change, which is henceforth referred to as the change point detection problem. In practical settings, the data from online social media is measured in noise. Hence, we need a decision test such that dynamic utility maximization can be detected under noisy measurements. A related important practical issue that we also consider in this thesis is the high dimensionality of data arising in online social media. As an example of high dimensional data arising in online social media, we investigate the detection of the utility maximization process inherent in video sharing via YouTube. This motivates developing computationally efficient algorithms for detecting utility maximization.\n\n1.1.2 Engagement dynamics and sensitivity analysis of YouTube videos\n\nYouTube generates billions in revenue through advertising and through the Partner program shares the revenue with the content creators. The video view count is a key metric of the measure of popularity or \u201cuser engagement\u201d of a video and the metric by which YouTube pays content providers. Hence, of significant importance for content creators are the meta-level features that are most sensitive for promoting video popularity. 
Figure 1.1 shows the meta-level features of a YouTube video. The meta-level features of a YouTube video are the title, tags, keywords and the description of the video.\nFigure 1.1: The meta-level features of a video include the title, tag, keywords and the description of the video.\nMeta-level features, apart from the video content, play an important role in driving traffic to a YouTube video, and thereby increasing the popularity of the video. The title and the description of the video are the critical features through which a video is discovered via YouTube\/Google search. In addition to the title and the description of the video, the tags are used by YouTube to recommend similar videos to users. The thumbnail of the video should be eye-catching and of good quality to attract users to click and watch the video.\n\nResearch Goals\n\nThe key question is: How do meta-level features of a posted video drive user engagement in the YouTube social network? Specifically, what are the most critical features that affect the popularity of the video? However, the content alone does not influence the popularity of a video. YouTube also has a social network layer on top of its media content. The main social component is how the content creators (also called \u201cchannels\u201d) interact with the users. So another key question is: How does the interaction of the YouTube channel with the user affect popularity of videos? Chapter 3 studies both the above questions. In particular, our aim is to examine how the individual video features (through the meta-level data) and the social dynamics contribute to the popularity of a video.\n\n1.1.3 Interactive advertisement scheduling in personalized live social media\n\nPopularity of live video streaming has seen a sharp growth due to improved bandwidth for streaming and the ease of sharing live content on the internet platforms. 
In addition, with the advent of high quality camera smartphones and the widespread deployment of 4G LTE based data services, personalized online video streaming is growing in popularity and includes mobile applications such as Periscope and Meerkat. One of the primary motivations for users to generate live content is that platforms like YouTube, Twitch etc. allow users to generate revenue through advertising and royalties.
Some of the common ways advertisements (ads) are scheduled on pre-recorded video content on social media like YouTube are pre-roll, mid-roll and post-roll, where the names indicate the time at which the ads are displayed. In recent research, Adobe Research (footnote 1) concluded that mid-roll video ads constitute the most engaging ad type for pre-recorded video content, outperforming pre-roll and post-roll ads when it comes to completion rate (the probability that the ad will not be skipped). Viewers are more likely to engage with an ad if they are interested in the content of the video that the ad is inserted into.
When a channel is streaming a live video, the mid-roll ads need to be scheduled manually. Twitch allows only periodic ad scheduling [11], and YouTube and other personalized live services currently offer no automated method of scheduling ads for live channels. The ad revenue in a live channel depends on the click rate (the probability that the ad will be clicked), which in turn depends on the interest in the channel content. Hence, ads need to be scheduled when the interest in the content is high.
Footnote 1: https:\/\/gigaom.com\/2012\/04\/16\/adobe-ad-research\/
Research Goals
Chapter 4 deals with optimal scheduling of ads on personalized live channels in social media by considering viewer engagement, termed active scheduling, to maximize the revenue generated from the advertisements. We model the interest of the live content using a Markov chain [12, 13].
The viewer engagement of the content is not observed directly; however, noisy observations of the viewer engagement are obtained from comments and likes by the viewers. Hence, the problem of computing the optimal policy for scheduling ads on a live channel can be formulated as an instance of a stochastic control problem called the partially observable Markov decision process (POMDP). Typically, computing the optimal policy of a POMDP is computationally intractable. However, by introducing assumptions on the POMDP model, important structural properties of the optimal policy can be determined without brute-force computations. These structural properties can then be exploited to compute the optimal policy.
1.2 Main Contributions
This section briefly summarizes the major novel contributions of the chapters that constitute this thesis, in the order in which they appear. A more detailed description of the contributions and findings of each chapter is provided in the individual chapters.
1.2.1 Utility change detection in online social media
Chapter 2 considers the problem of utility change detection under the revealed preference framework. The main contributions of Chapter 2 are summarized below:
1. In Sec. 2.3, we derive necessary and sufficient conditions for change point detection for dynamic utility maximizing agents under the revealed preference framework.
2. Sec. 2.4 studies the change point detection problem in the presence of noise. In the presence of noisy measurements, we propose a method to detect the change point and construct a decision test.
3. To reduce the computational cost associated with high dimensional data arising in the context of revealed preference, a dimensionality reduction algorithm using the Johnson-Lindenstrauss transform is presented.
4. The results developed are illustrated on the Yahoo! Tech Buzz dataset. Using the results developed in the chapter, several useful insights can be gleaned from these datasets.
First, the changes in ground truth affecting the utility of the agent can be detected by utility maximization behaviour in online search. Second, the recovered utility functions satisfy the single crossing property, indicating strategic substitute behaviour in online search.
1.2.2 Engagement dynamics and sensitivity analysis of YouTube videos
Chapter 3 investigates how the meta-level features and the interaction of the YouTube channel with the users affect the popularity of videos. The main contributions of Chapter 3 are summarized below:
1. The five dominant meta-level features that affect the popularity of a video are: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, and number of keywords. Sec. 3.2 discusses this further.
2. Optimizing the meta-level features (e.g. thumbnail, title, tags, description) after a video has been posted increases the popularity of the video. In addition, optimizing the title increases the traffic due to YouTube search, optimizing the thumbnail increases the traffic from related videos, and optimizing the keywords increases the traffic from related and promoted videos. Sec. 3.2.4 provides details on this analysis.
3. The causal relationship between the subscribers and view count for YouTube channels is also explored. For popular YouTube channels, we found that the channel view count affects the subscriber count, see Sec. 3.3.1.
4. New insights into the scheduling dynamics in YouTube gaming channels are also found. For channels with a dominant periodic uploading schedule, going \u201coff the schedule\u201d increases the popularity of the channel, see Sec. 3.3.2.
5. The generalized Gompertz model can be used to distinguish views due to virality (views from subscribers), migration (views from non-subscribers) and exogenous events, see Sec. 3.3.3.
6. New insights into playlist dynamics.
The early view count dynamics of YouTube videos are highly correlated with the long term \u201cmigration\u201d of viewers to the video. Also, early videos in a game playthrough typically receive more views than later videos in the same playthrough playlist, see Sec. 3.3.4.
7. The number of subscribers of a channel only affects the early view count dynamics of videos in a playthrough, see Sec. 3.3.4.
1.2.3 Interactive advertisement scheduling in personalized live social media
Chapter 4 considers the problem of optimal scheduling of ads in live social media. The main contributions of Chapter 4 are summarized below:
1. We formulate a POMDP framework for the optimal ad-scheduling problem on live personalized channels and show that it is an instance of the optimal multiple stopping problem.
2. Structural results are derived for the optimal multiple stopping policy. It is shown in Sec. 4.2 that the optimal multiple stopping policy is a threshold policy on the space of Bayesian posteriors. In addition, it is shown that the optimal multiple stopping policy satisfies a nesting property.
3. A stochastic gradient algorithm is presented to compute a linear threshold approximation to the threshold policy. Numerical simulations show that the linear threshold policies have a performance close to that of a brute-force POMDP solver, while being computationally cheaper to estimate and implement.
4. Numerical results on a real Periscope dataset show a significant improvement in the expected revenue from using the multiple stopping framework to schedule the advertisements. The revenue can be improved by 25% in comparison to currently employed periodic scheduling and by 10% against heuristic scheduling using the single stopping approach.
1.3 Related Works
This section reviews the literature on topics and advances in the fields related to this thesis.
1.3.1 Utility change detection in online social media
Chapter 2 of this thesis considers the problem of utility change detection in online social media under the revealed preference framework. Revealed preference deals with the problem of nonparametric detection of utility maximizing behaviour. Major contributions to the area of revealed preferences are due to Samuelson [14], Afriat [15], Varian [16], and Diewert [17] in the microeconomics literature. Afriat [15] devised a nonparametric test (called Afriat\u2019s theorem), which provides necessary and sufficient conditions to detect utility maximizing behaviour for a dataset. For an agent satisfying utility maximization, Afriat\u2019s theorem [15] provides a method to reconstruct a utility function consistent with the data. The utility function so obtained can be used to predict the future response of the agent. Varian [18] provides a comprehensive survey of the revealed preference literature.
Despite being originally developed in economics, revealed preference has recently been applied to social networks and signal processing. In the signal processing literature, the revealed preference framework was used for detection of malicious nodes in a social network in [19, 20] and for demand estimation in smart grids in [21]. [22] analyzes social behaviour and friendship formation using revealed preference among high school friends. In online social networks, [23] uses revealed preference to obtain information about products from bidding behaviour in eBay or similar bidding networks.
1.3.2 Engagement dynamics and sensitivity analysis of YouTube videos
Chapter 3 investigates how the meta-level features and the interaction of the YouTube channel with the users affect the popularity of videos. The study of the popularity of YouTube videos based on meta-level features is a challenging problem given the diversity of users and content providers.
Several models characterizing the popularity of YouTube videos are parametric in form, where the view count time series is used to estimate the model parameters. For example, ARMA time series models [24], multivariate linear regression models [25], and modified Gompertz models [26, 27] have been utilized to estimate the future video view counts given past view count time series. Using only the title of the video (one of the meta-level features), [28] considers the problem of predicting whether the view count will be high or low. In a related context, [29, 30] studied the importance of tags for Flickr data. Aside from text based meta-level features (title and tags), in [31] Support Vector Regression (SVR) is proposed to predict the popularity using features of the video frames (e.g. face present, rigidity, color, clutter). It is illustrated in [31] that using the combination of visual features and temporal dynamics results in improved performance of the SVR for predicting view count compared to using only visual features or temporal dynamics alone. In the social context, the uploading behaviour of YouTube content creators was studied in [32]. Specifically, the paper finds that YouTube users within a social network are more popular compared to other users.
1.3.3 Interactive advertisement scheduling in live personalized social media
Chapter 4 considers the problem of optimal scheduling of ads in personalized live social media. The problem of optimal scheduling of ads has been well studied in the context of advertising in television; see [33], [34], [35] and the references therein. However, scheduling ads on live online social media is different from scheduling ads on television in two significant ways [36]: (i) viewer engagement is measured in real time; (ii) revenue is based on ads rather than a negotiated contract.
Prior literature on scheduling ads on social media is limited to ad scheduling in real time for social network games, where the ads are served either to video game consoles in real time over the Internet [37], or in digital games that are played via major social networks [38].
In Chapter 4 we formulate the problem of optimal scheduling of ads as an optimal multiple stopping problem. The problem of optimal multiple stopping has been well studied in the literature; see [39], [40], [41], [42] and the references therein. The optimal multiple stopping problem generalizes the classical (single) stopping problem, where the objective is to stop once to obtain maximum reward. Nakai [39] considers optimal L-stopping over a finite horizon of length N in a partially observed Markov chain. More recently, [42] considers L-stopping over a random horizon.
However, due to the spontaneous (footnote 2) nature of personalized live channels, the duration (or the horizon) is not known a priori. Therefore, we extend the results in Nakai [39] to the infinite horizon case. The extension is both important and non-trivial.
Footnote 2: https:\/\/medium.com\/@mchang\/periscope-and-spontaneous-attention-seeking-43831eefac16
The optimal multiple stopping problem can be contrasted with the recent work on sampling with \u201ccausality constraints\u201d. In sampling with causality constraints, not all the observations are observable. [43] considers the case where an agent is limited to a finite number of observations (sampling constraints) and must adaptively decide the observation strategy so as to perform quickest detection on a data stream. The extension to the case where the sampling constraints are replenished randomly is considered in [44]. In the multiple stopping problem considered in this thesis, there is no constraint on the observations and the objective is to stop L times at states that correspond to maximum reward.
The optimal multiple stopping problem, considered in Chap.
4, is similar to sequential hypothesis testing [45, 46], the sequential scheduling problem with uncertainty [47] and the optimal search problem considered in the literature. [48] and [49] consider the problem of finding the optimal launch times for a firm under strategic consumers and competition from other firms to maximize profit. [50], [51] consider an optimal search problem where the searcher receives imperfect information on a (static) target location and decides optimally to search or interdict by solving a classical optimal stopping problem (L = 1). However, the multiple-stopping problem considered in this thesis is equivalent to a search problem where the underlying process is evolving (Markovian) and the searcher needs to optimally stop L > 1 times to achieve a specific objective.
1.4 Thesis outline
In this section, we present the organization of the thesis. The rest of the thesis is divided into four chapters as outlined below:
\u2022 Chapter 2 considers non-parametric detection of dynamic utility maximizing behaviour. Necessary and sufficient conditions for detecting a linear perturbation in the utility function are derived. In addition, when the dataset is measured in noise, Chapter 2 proposes a procedure to estimate the change point and construct a decision test. Chapter 2 also deals with reducing the computational complexity of detecting utility maximizing behaviour in high dimensional data. Finally, the results are illustrated on real datasets.
\u2022 Motivated by increasing the popularity of YouTube videos and thereby increasing advertising revenue, Chapter 3 studies the sensitivity of meta-level features of YouTube videos. It is found that the popularity is dependent on five dominant meta-level features. In addition, changing the meta-level features improves the popularity of YouTube videos. We characterize how changing the various meta-level features affects the various major traffic sources.
The popularity of the video also depends on how content creators interact with YouTube users. Chapter 3 shows novel insights into the causality between the view count and subscriber count, the scheduling dynamics in gaming channels and playthrough dynamics.
\u2022 Chapter 4 considers the problem of scheduling advertisements in personalized live social media. The interest of the live content is modeled as a Markov chain. The viewer engagement is not observed directly, but noisy observations can be obtained from the comments and likes of the viewers. Hence, we formulate the ad scheduling problem as a POMDP. We derive structural results on the optimal multiple stopping policy. Using the structural results, we compute optimal linear threshold policies using a stochastic gradient algorithm. It is shown using real datasets that the linear threshold policies so obtained outperform current periodic scheduling by 25%.
\u2022 Chapter 5 outlines a summary of findings and provides a direction for future research and development in the fields related to this thesis.
Chapter 2
Utility Change Detection in Online Social Media
2.1 Introduction
Can an observer detect if a dataset (time series) is generated by optimizing a utility function? More generally, given a dataset, can an observer detect if there is a sudden change in the utility function? Such data-driven detection of utility maximization is studied in microeconomics under the framework of revealed preference (see footnote 3).
The first question is answered by Afriat\u2019s Theorem in the revealed preference literature. In Section 2.2, we provide a brief background on the revealed preference framework and Afriat\u2019s Theorem. The novel contribution of this chapter is to address the second question.
In Section 2.3, we extend the classical revealed preference framework of Afriat to agents with a \u201cdynamic utility function\u201d: the utility function changes abruptly (jumps) at an unknown time instant by a linear perturbation. Given the dataset of probes and responses of an agent, the objective is to develop a nonparametric test for the change point detection problem, i.e. to detect the change point (jump time) and the utility functions before and after the change.
Application: Such change point detection problems arise in online search in social media. Online search is currently the most popular method for information retrieval [2] and can be viewed as an agent maximizing the information utility, i.e. the amount of information consumed by an online agent given the limited resources of time and attention. A large body of research links internet search behaviour to ground truths such as symptoms of illness, political elections, or major sporting events [3\u20138]. Hence, a change in the utility in online search corresponds to a change in ground truth or to exogenous events affecting the utility of the agent, such as the onset of a disease or the announcement of a major political decision. Detection of utility change in online search is therefore helpful for identifying changes in ground truth and useful, for example, for early containment of diseases [4] or predicting changes in political opinion [9, 10].
Footnote 3: In signal processing terminology, such problems can be viewed as set-valued system identification of an argmax system.
A related important practical issue that we also consider in this chapter is the application of the revealed preference framework to high dimensional data (\u201cbig-data\u201d).
As an example of high dimensional data arising in online social media, we investigate the detection of the utility maximization process inherent in video sharing via YouTube. Detecting utility maximization behaviour with such high dimensional data is computationally demanding. In this chapter, we use dimensionality reduction through the Johnson-Lindenstrauss lemma to overcome the computational cost associated with high dimensional data.
Remark: The problem we consider is fundamentally different from the theme used widely in the signal processing literature, where one postulates an objective function (typically convex) and then develops optimization algorithms. In contrast, the revealed preference framework considered in this chapter is data centric: given a dataset, we wish to determine if it is consistent with utility maximization, and then detect changes in the utility function based on the observed behaviour.
This chapter is organized as follows: Sec. 2.2 provides a brief background on the revealed preference framework. Sec. 2.3 derives necessary and sufficient conditions for change point detection for dynamic utility maximizing agents under the revealed preference framework. In Sec. 2.4, we study the change point detection problem in the presence of noise. Section 2.5 addresses the problem of high dimensional data arising in the context of revealed preference. Section 2.6 presents numerical results. First, we compare the proposed approach with the popular CUSUM test and present the corresponding ROC curves. Second, we illustrate the results developed on two real world datasets: the Yahoo! Tech Buzz dataset and the Youstatanalyzer dataset.
2.2 Background: Utility maximization and Revealed preference
Utility Maximization: A utility-maximization behaviour (or utility maximizer) is defined as follows:
Definition 2.2.1. An agent is a utility maximizer if, at each time t, for input probe p_t, the output response x_t satisfies
x_t = x(p_t) \u2208 argmax_{p_t\u2032x \u2264 I_t} u(x). (2.1)
Here, u(x) denotes a locally non-satiated (footnote 4) utility function (footnote 5). Also, I_t \u2208 R_+ is the budget of the agent. The linear constraint p_t\u2032x \u2264 I_t imposes a budget constraint on the agent, where p_t\u2032x denotes the inner product between p_t and x.
Consider a dataset D consisting of the probes p_t \u2208 R^m_+ and responses x_t \u2208 R^m_+ of an agent for T time instants:
D = {(p_t, x_t), t = 1, 2, . . . , T}. (2.2)
Revealed preference aims to answer the following question: Is the dataset D in (2.2) consistent with utility-maximization behaviour of an agent? Afriat\u2019s Theorem answers this question.
Theorem 2.2.1 (Afriat\u2019s Theorem [15]). Given the dataset D in (2.2), the following statements are equivalent:
1. The agent is a utility maximizer and there exists a monotonically increasing (footnote 6) and concave (footnote 7) utility function that satisfies (2.1).
2. The following set of inequalities has a feasible solution in u_t and \u03bb_t > 0:
u_s \u2212 u_t \u2212 \u03bb_t p_t\u2032(x_s \u2212 x_t) \u2264 0 \u2200 t, s \u2208 {1, 2, . . . , T}. (2.3)
3. A monotone and concave utility function that satisfies (2.1) is given by:
u(x) = min_{t \u2208 {1, 2, . . . , T}} {u_t + \u03bb_t p_t\u2032(x \u2212 x_t)}. (2.4)
4. The dataset D satisfies the Generalized Axiom of Revealed Preference (GARP), namely for any k \u2264 T, p_t\u2032x_t \u2265 p_t\u2032x_{t+1} \u2200 t \u2264 k \u2212 1 =\u21d2 p_k\u2032x_k \u2264 p_k\u2032x_1.
The remarkable property of Afriat\u2019s Theorem is that it gives necessary and sufficient conditions for the dataset to satisfy utility maximization (2.1). The feasibility of the set of inequalities can be checked using a linear programming solver or by using Warshall\u2019s algorithm with O(T^3) computations [18] [16]. A utility function consistent with the data can be constructed using (2.4). The recovered utility is not unique since any positive monotone transformation of (2.4) also satisfies Afriat\u2019s Theorem.
Footnote 4: Local non-satiation means that for any point x, there exists another point y within an \u03b5 distance (i.e. \u2016x \u2212 y\u2016 \u2264 \u03b5), such that the point y provides a higher utility than x (i.e. u(x) < u(y)). Local non-satiation models the human preference: more is preferred to less.
Footnote 5: The utility function is a function that captures the preference of the agent. For example, if x is preferred to y, u(x) \u2265 u(y).
Footnote 6: In this chapter, we use monotone and local non-satiation interchangeably. Afriat\u2019s theorem was originally stated for a non-satiated utility function.
Footnote 7: Concavity of the utility function models the human preference: averages are better than the extremes. It is also related to the law of diminishing marginal utility, i.e. the rate of utility decreases with x.
[Figure 2.1 plot: title \u201cUpper bound on the CDF of M\u201d; x-axis \u03c6*(y); y-axis Noise CDF; curves: M.C. Sim. of Noise Dist., Analytical Exp.]
Figure 2.1: Upper bound on the CDF of M in (2.10), which constitutes a lower bound on the false alarm probability. The analytical expression in (2.11) is compared with a Monte Carlo evaluation of M.
Remark: In signal processing terminology, Afriat\u2019s Theorem can be viewed as a set-valued system identification method for an argmax nonlinear system with a constraint on the inner product of the input and output of a system. Afriat\u2019s theorem has several interesting consequences, including the fact that if a dataset is consistent with utility maximization, then it is rationalizable by a concave, monotone and continuous utility function. Hence, the preference of the agent represented by a concave utility function can never be refuted based on a finite dataset, see [18]. Further, one can impose monotone and concave restrictions on the utility function with no loss of generality.
2.2.1 Revealed preference in a noisy setting
Afriat\u2019s theorem (Theorem 2.2.1) assumes perfect observation of the probe and response.
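Under this perfect-observation assumption, statement 4 of Theorem 2.2.1 can be checked directly on a dataset. The following is a minimal Python sketch, not the implementation used in this thesis. It uses the standard transitive-closure form of GARP: build the direct revealed preference relation, close it with Warshall\u2019s algorithm, and look for a strict violation. The function name satisfies_garp, the list-of-tuples data layout and the two toy datasets are illustrative assumptions.

```python
def satisfies_garp(probes, responses):
    # probes[t] and responses[t] are equal-length sequences of non-negative numbers.
    T = len(probes)
    # cost[t][s] = p_t' x_s: expenditure on response x_s evaluated at probe p_t.
    cost = [[sum(p * x for p, x in zip(probes[t], responses[s]))
             for s in range(T)] for t in range(T)]
    # Direct relation: x_t is revealed preferred to x_s if x_s was affordable at time t.
    pref = [[cost[t][t] >= cost[t][s] for s in range(T)] for t in range(T)]
    # Warshall transitive closure of the direct relation, O(T^3) operations.
    for k in range(T):
        for i in range(T):
            if pref[i][k]:
                for j in range(T):
                    if pref[k][j]:
                        pref[i][j] = True
    # GARP fails if x_t is (indirectly) preferred to x_s although x_t was
    # strictly cheaper than x_s at probe p_s.
    for t in range(T):
        for s in range(T):
            if pref[t][s] and cost[s][s] > cost[s][t]:
                return False
    return True


# Hypothetical toy data: the first dataset is consistent, the second is not.
consistent = satisfies_garp([(1, 2), (2, 1)], [(0.5, 0.25), (0.25, 0.5)])
violating = satisfies_garp([(1, 2), (2, 1)], [(0, 0.5), (0.5, 0)])
```

The closure step is what gives the O(T^3) cost; a linear programming check of (2.3) would additionally return the scalars u_t and \u03bb_t needed in (2.4).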
However, when the responses of the agent are measured in noise, violation of the inequalities in Afriat\u2019s Theorem could be due either to measurement noise or to the absence of utility maximization. In this section, we construct a decision test to detect utility maximization in the presence of noise.
We assume the additive noise model for measurement errors given by:
y_t = x_t + w_t, (2.5)
where y_t is the noisy measurement of response x_t and w_t \u2208 R^m is independent and identically distributed (i.i.d.) noise. Given the noisy dataset
D_obs = {(p_t, y_t) : t \u2208 {1, . . . , T}}, (2.6)
[19] proposes the following statistical test for testing utility maximization (2.1) in a dataset subject to measurement errors. Let H0 denote the null hypothesis that the dataset D_obs in (2.6) satisfies utility maximization. Similarly, let H1 denote the alternative hypothesis that the dataset does not satisfy utility maximization. There are two possible sources of error:
Type-I errors: Reject H0 when H0 is valid.
Type-II errors: Accept H0 when H0 is invalid. (2.7)
The following statistical test can be used to detect if an agent is seeking to maximize a utility function:
\u222b_{\u03a6\u2217(y)}^{+\u221e} f_M(\u03c8) d\u03c8 \u2277 \u03b3 (decide H0 if greater, H1 if smaller). (2.8)
In the statistical test (2.8):
(i) \u03b3 is the \u201csignificance level\u201d of the test.
(ii) The \u201ctest statistic\u201d \u03a6\u2217(y), with y = [y_1, y_2, . . . , y_T], is the solution of the following constrained optimization problem:
min \u03a6
s.t. u_s \u2212 u_t \u2212 \u03bb_t p_t\u2032(y_s \u2212 y_t) \u2212 \u03bb_t \u03a6 \u2264 0,
\u03bb_t > 0, \u03a6 \u2265 0 for t, s \u2208 {1, 2, . . . , T}. (2.9)
(iii) f_M is the pdf of the random variable M, where
M := max_{t, s : t \u2260 s} [p_t\u2032(w_t \u2212 w_s)].
(2.10) The probability of false alarm or Type-I error, i.e. the probability of rejecting H0 when it is true, is given by P{M \u2265 \u03a6\u2217(y)}.
Below, we derive an analytical expression for a lower bound on the false alarm probability of the statistical test in (2.8). The motivation stems from the following fact: given the significance level of the statistical test in (2.8), a Monte Carlo simulation is required to compute the threshold. However, from an analytical expression for the lower bound on the false alarm probability, we can obtain an upper bound of the test statistic, denoted by \u03a6\u2217(y). Hence, given a dataset D_obs in (2.6), if the solution to the optimization problem (2.9) is such that \u03a6 > \u03a6\u2217(y), then the conclusion is that the dataset does not satisfy utility maximization, for the desired false alarm probability.
Theorem 2.2.2 provides a lower bound on the false alarm probability and the proof is provided in Sec. 2.8.3.
Theorem 2.2.2. If {w_1, w_2, . . . , w_T} in (2.5) are i.i.d. zero mean unit variance Gaussian vectors, then the probability of false alarm in (2.7) is lower bounded by
1 \u2212 \u220f_t { 1 \u2212 [\u221a(2\/\u03c0) \u221a2 \u2016p_t\u2016_2 exp(\u2212\u03a6\u2217(y)^2 \/ (4\u2016p_t\u2016^2))] \/ [\u03a6\u2217(y) + \u221a(\u03a6\u2217(y)^2 + 8\u2016p_t\u2016^2)] }. (2.11)
The key idea in the proof of Theorem 2.2.2 is to bound M in (2.10) by the highest order statistic of a carefully chosen set of random variables which are negatively dependent; see Sec. 2.8.2.
Figure 2.1 compares the upper bound of the cdf (and correspondingly the lower bound on the false alarm probability (2.11)) with the Monte Carlo simulation of the actual distribution of M. As can be seen from Fig. 2.1, the upper bound of the cdf (lower bound on the false alarm probability) is tight in all regimes.
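The Monte Carlo evaluation of M referred to above can be sketched in a few lines. This is an illustrative Python sketch under the assumptions of Theorem 2.2.2 (i.i.d. zero mean, unit variance Gaussian noise), not the simulation code behind Figure 2.1. The function names sample_M and false_alarm_probability, the probe values, the sample count and the seed are all hypothetical choices.

```python
import random


def sample_M(probes, rng):
    # Draw one realization of M = max over t != s of p_t'(w_t - w_s), see (2.10).
    m = len(probes[0])
    noise = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in probes]
    best = float('-inf')
    for t, p in enumerate(probes):
        for s in range(len(probes)):
            if s != t:
                value = sum(p_i * (w_t - w_s)
                            for p_i, w_t, w_s in zip(p, noise[t], noise[s]))
                best = max(best, value)
    return best


def false_alarm_probability(probes, threshold, n_samples=2000, seed=0):
    # Empirical estimate of P{M >= threshold} from n_samples independent draws.
    rng = random.Random(seed)
    hits = sum(sample_M(probes, rng) >= threshold for _ in range(n_samples))
    return hits / n_samples


# Hypothetical probes (T = 3, m = 2) and a small grid of candidate thresholds.
probes = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)]
estimates = [false_alarm_probability(probes, phi) for phi in (0.0, 2.0, 4.0, 8.0)]
```

Because every call reuses the same seed, the estimates are computed on identical noise draws and are therefore non-increasing in the threshold; a threshold for a desired significance level could be read off such a grid.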
The upper bound of the test statistic, \u03a6\u2217(y), can be obtained by setting the analytical expression in (2.11) equal to the desired false alarm probability.
2.3 Revealed preference: Utility change point detection (deterministic case)
In this section, we consider agents with a dynamic utility function.
2.3.1 Utility change model
Consider an agent that maximizes a utility function that changes abruptly by a linear perturbation at a time that is unknown to the observer. The aim is to estimate the jump time (change point) and the utility before and after the change point.
The agent selects a response x at time t to maximize the utility function given by:
u(x, \u03b1; t) = v(x) + \u03b1\u2032x 1{t \u2265 \u03c4}, (2.12)
subject to the budget constraint p_t\u2032x \u2264 I_t. Here, 1{\u00b7} denotes the indicator function and \u03b1 = (\u03b1_1, \u03b1_2, . . . , \u03b1_m)\u2032 denotes the m-dimensional linear perturbation vector. The utility function u(x, \u03b1; t) in (2.12) consists of two components: a base utility function, v(x), and a linear perturbation, \u03b1\u2032x, which occurs at an unknown time \u03c4. The base utility function v(x) is assumed to be monotone and concave. We restrict the components of the vector \u03b1 to be (strictly) greater than 0, so that the utility function u, conditioned on \u03b1, is monotone and concave. The objective is to derive necessary and sufficient conditions to detect the time \u03c4 at which the linear perturbation is introduced to the base utility function.
Motivation: The utility change model in (2.12) is motivated by several reasons. First, the linear perturbation assumption provides sufficient selectivity such that the non-parametric test is not trivially satisfied by all datasets but still provides enough degrees of freedom.
Second, the linear perturbation can be interpreted as the change in the marginal rate of utility relative to a \u201cbase\u201d utility function. In online social media, the linear perturbation coefficients measure the impact of marketing or the severity of the change in ground truth on the utility of the agent. This is similar to the linear perturbation models used to model taste changes [52\u201355] in microeconomics. Finally, in social networks, a linear change in the utility is often used to model the change in utility of an agent based on the interaction with the agent\u2019s neighbours [56]. Compared to the taste change model, our model is unique in that we allow the linear perturbation to be introduced at an unknown time.
Theorem 2.3.1 provides necessary and sufficient conditions to detect the change in utility function according to the model in (2.12) and the proof is in Sec. 2.8.1.
Theorem 2.3.1. The dataset D in (2.2) is consistent with a utility function satisfying the model in (2.12) if we can find sets of scalars {v_t}_{t=1,...,T}, {\u03bb_t > 0}_{t=1,...,T}, {\u03b1_k}_{k=1,...,m}, such that there exists a feasible solution to the following inequalities:
v_t + \u03bb_t p_t\u2032(x_s \u2212 x_t) \u2265 v_s (t < \u03c4) (2.13)
v_t + \u03bb_t p_t\u2032(x_s \u2212 x_t) \u2212 \u03b1\u2032(x_s \u2212 x_t) \u2265 v_s (t \u2265 \u03c4) (2.14)
\u03b1_i \u2264 \u03bb_t p_t^i (\u2200 i, t \u2265 \u03c4), (2.15)
where p_t^i is the ith component of the probe p_t.
The inequalities in (2.13) to (2.15) resemble the Afriat inequalities (2.3). The time instant \u03c4 at which the inequalities are satisfied is the time at which the linear perturbation is introduced.
2.3.2 Recovery of minimum perturbation and base utility function
Computing the linear perturbation coefficients in (2.12) gives an indication of the severity of the change in ground truth or the effect of marketing and advertising in social media. The solution to the following convex optimization problem provides the minimum value of the perturbation coefficients:
min \u2016\u03b1\u2016_2^2 (2.16)
s.t. v_t + \u03bb_t p_t\u2032(x_s \u2212 x_t) \u2265 v_s (t < \u03c4) (2.17)
v_t + \u03bb_t p_t\u2032(x_s \u2212 x_t) \u2212 \u03b1\u2032(x_s \u2212 x_t) \u2265 v_s (t \u2265 \u03c4) (2.18)
\u03b1_i \u2264 \u03bb_t p_t^i (\u2200 i, t \u2265 \u03c4) (2.19)
\u03bb_t > 0, v_1 = \u03b2, \u03bb_1 = \u03b4, (2.20)
where \u03b2 and \u03b4 are arbitrary constants.
The inequalities (2.17) to (2.19) correspond to the revealed preference inequalities (2.13) to (2.15). The normalization conditions (2.20) are required because of the ordinality (footnote 8) of the utility function. The ordinality of the utility function implies that given any set of feasible values {v\u0304_t}, {\u03bb\u0304_t} and {\u03b1\u0304_k} satisfying, for example, (2.18), the following set of inequalities (a scaled and translated version of (2.18)) also holds:
\u03b2(v\u0304_s + \u03b4) \u2212 \u03b2(v\u0304_t + \u03b4) \u2212 \u03b2\u03bb\u0304_t p_t\u2032(x_s \u2212 x_t) + \u03b2\u03b1\u0304\u2032(x_s \u2212 x_t) \u2264 0.
This ambiguity is removed by the normalization conditions in (2.20).
Recall that the base utility function v(x) is the utility function before the linear change.
Corollary 2.3.1. The recovered base utility function is
v\u0302(x) = min_t {v_t + \u03bb_t p\u0303_t\u2032(x \u2212 x_t)}, (2.21)
where
p\u0303_t^i = p_t^i for t < \u03c4, and p\u0303_t^i = p_t^i \u2212 \u03b1_i\/\u03bb_t for t \u2265 \u03c4. (2.22)
In (2.21) and (2.22), {v_t}, {\u03bb_t}, {\u03b1_k} are the solution of (2.16) to (2.20).
Footnote 8: Clearly, for a given probe vector, any positive monotone transformation of u(x) in (2.1) gives the same response vector.
2.3.3 Comparison with classical change detection
Table 2.1 compares the revealed preference framework (this work) with classical change detection algorithms. The key difference is that revealed preference considers a system which maximizes an unknown utility function subject to linear constraints (budget constraint). In comparison, a classical CUSUM type change detection algorithm requires knowledge of a parametrized utility function (see Sec. 2.6.1 for a numerical example when v(x) is a Cobb-Douglas (footnote 9) utility function).
Method | Data model | Change model | Reference
CUSUM | x_t i.i.d. \u223c p_\u03b8 | \u03b8 = \u03b8_0 for t < \u03c4, \u03b8 = \u03b8_1 for t \u2265 \u03c4 | [57]
Semi-supervised\/Supervised Learning | D = {((p_1, I_1), U(p_1, I_1)), . . . , ((p_T, I_T), U(p_T, I_T))}, (p_i, I_i) \u223c P, U(p_i, x_i): optimal response for utility U | Not Applicable | [58\u201360], [61]
Revealed Preference | x_t = argmax_{p_t\u2032x_t \u2264 I_t} u(x_t) | u(x) = v(x) for t < \u03c4, u(x) = v(x) + \u03b1\u2032x for t \u2265 \u03c4 | This work
Table 2.1: Comparison of revealed preference with classical change detection algorithms; see the discussion in Sec. 2.3.3.
The revealed preference problem is related to supervised learning when the parametric class of functions for empirical risk minimization (ERM) is limited to concave and monotone functions [58\u201360]. The change detection problem can be thought of as a multi-class learning problem with the first class being the utility function before the change and the second class being the utility function after the change [58]. However, this chapter provides an algorithmic approach to detect change points by deriving necessary and sufficient conditions.
2.4 Revealed preference: Utility change point detection in noise
Sec. 2.3 dealt with utility change point detection in the deterministic case.
In this section, we consider a dynamic utility maximizing agent (the utility function undergoes a sudden linear perturbation at an unknown time) whose response is measured in noise according to (2.5). The organization below is as follows. Sec. 2.4.1 proposes a procedure to detect the unknown change point in the presence of noise. In Sec. 2.4.2, we formulate a hypothesis test to check whether the dataset is rationalizable by a utility function satisfying the model in (2.12), given the estimated change point from Sec. 2.4.1. Once the unknown change point is estimated and the dataset satisfies the hypothesis test in Sec. 2.4.2, we need to recover the base utility function and the linear perturbation. We will recover the linear perturbation coefficients corresponding to minimum false alarm probability.
Footnote 9: The Cobb-Douglas is a widely used utility function in economics. When m = 2, i.e. the dimension of the probe and response is 2, the utility function can be expressed as u(x) = x1^a x2^b. The utility function is parameterized by a and b.
2.4.1 Estimation of unknown change point
In the presence of noise, the inequalities in (2.13) to (2.15) may not be satisfied for any value of \u03c4. Hence, we consider the following linear programming problem to find the minimum error or \u201cadjustment\u201d such that the inequalities in (2.13) to (2.15) are satisfied.
\u03a6\u03c4 = min \u03a6 (2.23)
s.t. vs \u2212 vt \u2212 \u03bbt p\u2032t(ys \u2212 yt) \u2212 \u03a6 \u2264 0 (t < \u03c4)
vs \u2212 vt \u2212 \u03bbt p\u2032t(ys \u2212 yt) + \u03b1\u2032(ys \u2212 yt) \u2212 \u03a6 \u2264 0 (t \u2265 \u03c4)
\u03b1i \u2212 \u03bbt pit \u2264 0 (\u2200i, t \u2265 \u03c4)
\u03a6 \u2265 0, \u03bbt > 0
The solution of the linear program (2.23) depends on the choice of the change point variable \u03c4. When the data is measured without noise, the inequalities are satisfied with zero error at the correct change point. 
The estimated change point, \u03c4\u02c6, corresponds to the time point with minimum adjustment:
\u03c4\u02c6 = argmin_{1 \u2264 \u03c4 \u2264 T} \u03a6\u03c4 (2.24)
The intuition for (2.24) is that if \u03c4 is the true change point, then the adjustment \u03a6 needs to compensate only for the noise.
2.4.2 Recovering the linear perturbation coefficients for minimum false alarm probability
As in (2.7), define the null hypothesis H0 that the dataset satisfies utility maximization under the model in (2.12), and the alternative hypothesis H1 that the dataset does not satisfy utility maximization under the model in (2.12). Type-I errors and Type-II errors are defined, similarly, as in Sec. 2.2.1.
Consider the following statistical test:
\u222b_{\u03a6\u2217(y)}^{+\u221e} fM(\u03c8) d\u03c8 \u2277_{H1}^{H0} \u03b3. (2.25)
In the statistical test (2.25):
1. \u03b3 is the significance level of the test.
2. The test statistic \u03a6\u2217(y) is the solution of the following constrained optimization problem with \u03c4 = \u03c4\u02c6 (from Sec. 2.4.1).
min \u03a6 (2.26)
s.t. vs \u2212 vt \u2212 \u03bbt p\u2032t(ys \u2212 yt) \u2212 \u03bbt\u03a6 \u2264 0 (t < \u03c4)
vs \u2212 vt \u2212 \u03bbt p\u2032t(ys \u2212 yt) + \u03b1\u2032(ys \u2212 yt) \u2212 \u03bbt\u03a6 \u2264 0 (t \u2265 \u03c4)
\u03b1i \u2212 \u03bbt pit \u2264 0 (\u2200i, t \u2265 \u03c4)
\u03a6 \u2265 0, \u03bbt > 0
The optimization problem (2.26) is similar to the optimization problem (2.23). The inequalities in (2.26) enable us to compute a bound on the test statistic in (2.30).
3. fM is the pdf of the random variable M, where
M := M1 + M2, (2.27)
and M1 and M2 are defined as:
M1 := max_{s,t; s \u2260 t} [p\u2032t(wt \u2212 ws)], (2.28)
M2 := max_{s,t; s \u2260 t, t \u2265 \u03c4} [\u03b1\u2032(wt \u2212 ws)\/\u03bbt], (2.29)
where \u03b1 and \u03bb are the solution of (2.26).
The set of inequalities in (2.26) can be re-written using (2.5) as:
(vs \u2212 vt)\/\u03bbt \u2212 p\u2032t(xs \u2212 xt) \u2264 p\u2032t(wt \u2212 ws) (t < \u03c4),
(vs \u2212 vt)\/\u03bbt \u2212 p\u2032t(xs \u2212 xt) + \u03b1\u2032(xs \u2212 xt)\/\u03bbt \u2264 p\u2032t(wt \u2212 ws) \u2212 \u03b1\u2032(wt \u2212 ws)\/\u03bbt (t \u2265 \u03c4). (2.30)
Therefore, for a dataset satisfying utility maximization under the model in (2.12), the test statistic should satisfy \u03a6\u2217(y) \u2264 M.
The pdf of the random variable M can be estimated via Monte Carlo simulation. Below, we bound the probability of false alarm and recover the linear perturbation coefficients corresponding to minimum false alarm probability.
Theorem 2.4.1. Assume {w1, w2, ..., wT} in (2.5) are i.i.d. zero mean unit variance Gaussian vectors. Given \u03b5 > 0, let the change point \u03c4 and the length of the dataset T be \u03c4 = O(log 1\/\u03b5) and T \u2212 \u03c4 = O(log 1\/\u03b5). Then, the optimization criterion to recover the linear perturbation coefficients with minimum probability of Type-I error is to minimize the Euclidean norm (i.e., \u2016\u03b1\u2016_2) subject to the constraints in (2.26).
The proof is in Sec. 2.8.4. Theorem 2.4.1 provides the motivation for minimizing the Euclidean norm of the linear perturbation coefficients in the optimization problem (2.16) to (2.20). The recovery of the base utility function is similar to that in Sec. 2.3.2.
2.5 Dimensionality reduction: Revealed preference for big data
Classical revealed preference deals with the case m < T (recall m is the dimension of the probe vector and T is the number of observations). Below, we consider the \u201cbig data\u201d domain: m \u226b T. Checking whether a dataset, D, satisfies utility maximization (2.1) can be done by verifying whether GARP (statement 4 of Theorem 2.2.1) is satisfied. 
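To make the GARP check concrete, the following Python sketch tests a small dataset directly and shows where the inner products between probes and responses enter. Statement 4 of Theorem 2.2.1 is not reproduced in this section, so the code assumes the textbook form of GARP: if x_t is revealed preferred to x_s, directly or through a chain of observations, then p_s'x_s <= p_s'x_t. The function name garp_violations and the toy data are illustrative assumptions, not part of the thesis.

```python
def garp_violations(probes, responses):
    # probes[t] and responses[t] are length-m lists for observation t.
    # Returns the list of pairs (t, s) that violate GARP (assumed textbook form).
    T = len(probes)

    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    # cost[t][s] = p_t' x_s: T*T inner products, each over m components.
    cost = [[dot(probes[t], responses[s]) for s in range(T)] for t in range(T)]

    # Direct revealed preference: x_t R0 x_s when p_t' x_t >= p_t' x_s.
    rel = [[cost[t][t] >= cost[t][s] for s in range(T)] for t in range(T)]

    # Transitive closure of R0 (Warshall's algorithm).
    for k in range(T):
        for i in range(T):
            if rel[i][k]:
                for j in range(T):
                    if rel[k][j]:
                        rel[i][j] = True

    # GARP fails for (t, s) if x_t R x_s but x_t is strictly cheaper than x_s at p_s.
    return [(t, s) for t in range(T) for s in range(T)
            if rel[t][s] and cost[s][s] > cost[s][t]]


probes = [[1.0, 2.0], [2.0, 1.0]]
# Responses of a symmetric Cobb-Douglas agent with budget 4: no violation.
print(garp_violations(probes, [[2.0, 1.0], [1.0, 2.0]]))   # []
# Each bundle is strictly cheaper at the other probe: GARP fails both ways.
print(garp_violations(probes, [[1.0, 2.0], [2.0, 1.0]]))   # [(0, 1), (1, 0)]
```

The cost table is the only place where the dimension m appears, which is why reducing the dimension of the probe and response vectors reduces the work of the check.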
For m \u226b T, the computational cost for checking GARP is dominated by the number of computations required to evaluate the inner products in GARP, given by mT^2. The computational cost for evaluating the inner products can be reduced by embedding the m-dimensional probe and observation vectors into a lower dimensional subspace, of dimension k, and checking GARP on the lower dimensional subspace. We use the Johnson-Lindenstrauss Lemma (JL Lemma) to achieve this (footnote 10).
Lemma 2.5.1 (Johnson-Lindenstrauss (JL) [62]). Suppose x1, x2, ..., xn \u2208 \u211d^m are n arbitrary vectors. Then, for any \u03b5 \u2208 (0, 1\/2), there exists a mapping f : \u211d^m \u2192 \u211d^k, k = O(log n\/\u03b5^2), such that the following conditions are satisfied:
(1 \u2212 \u03b5)\u2016xi\u2016^2 \u2264 \u2016f(xi)\u2016^2 \u2264 (1 + \u03b5)\u2016xi\u2016^2 \u2200i (2.31)
(1 \u2212 \u03b5)\u2016xi \u2212 xj\u2016^2 \u2264 \u2016f(xi) \u2212 f(xj)\u2016^2 \u2264 (1 + \u03b5)\u2016xi \u2212 xj\u2016^2 \u2200i, j. (2.32)
To implement JL efficiently, one possible method, from [63], is summarized in Theorem 2.5.1. This method uses a linear map for f and hence can be represented by a projection matrix R. The key idea in [63] is to construct the projection matrix R with elements +1 or \u22121 so that computing the projection involves no multiplications (only additions).
Theorem 2.5.1 ([63]). Let A = [x1, x2, ..., xn]\u2032 denote the n \u00d7 m matrix containing the n vectors, each of dimension m. Given \u03b5, \u03b2 > 0, let R be an m \u00d7 k random binary matrix, with independent and equiprobable elements +1 and \u22121, where
k > (4 + 2\u03b2) \/ (\u03b5^2\/2 \u2212 \u03b5^3\/3) \u00b7 log n. (2.33)
Then, B = (1\/\u221ak) A R, with dimension n \u00d7 k, contains the n projected vectors of dimension k, and with probability at least 1 \u2212 \u03b4, where \u03b4 = 1\/n^\u03b2, the inequalities (2.31) and (2.32) hold.
The linear map f : \u211d^m \u2192 \u211d^k in Theorem 2.5.1 maps the ith row of A to the ith row of B. 
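The construction in Theorem 2.5.1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the implementation used in the thesis: the helper names jl_dimension and sign_projection, the fixed seed, and the toy sizes are hypothetical. jl_dimension evaluates the right-hand side of (2.33), and sign_projection forms B = (1/sqrt(k)) A R with a random +1/-1 matrix R, accumulating each entry by additions and subtractions and scaling once at the end.

```python
import math
import random


def jl_dimension(n, eps, beta):
    # Right-hand side of (2.33); k must exceed this value.
    return (4 + 2 * beta) / (eps ** 2 / 2 - eps ** 3 / 3) * math.log(n)


def sign_projection(rows, k, seed=0):
    # rows: n vectors of dimension m (the rows of A). Returns the n rows of B.
    rng = random.Random(seed)
    m = len(rows[0])
    # R is m x k with independent, equiprobable +1 / -1 entries.
    signs = [[rng.choice((1, -1)) for _ in range(k)] for _ in range(m)]
    scale = 1 / math.sqrt(k)
    projected = []
    for x in rows:
        acc = [0.0] * k
        for xi, sign_row in zip(x, signs):
            for j, s in enumerate(sign_row):
                # Only additions and subtractions inside the loop.
                acc[j] += xi if s == 1 else -xi
        projected.append([scale * v for v in acc])
    return projected


# Toy usage: two vectors in dimension 10 projected to dimension 64.
A = [[1.0] + [0.0] * 9, [0.5] * 10]
B = sign_projection(A, k=64, seed=1)
print(len(B), len(B[0]))
```

Probes and responses would be projected with the same matrix R so that inner products between them can be compared in the lower dimensional space.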
The inequalities in (2.31) and (2.32) hold in a probabilistic sense, i.e. with probability at least 1 \u2212 \u03b4, the following inequality holds for all i, j:
(1 \u2212 \u03b5)\u2016xi \u2212 xj\u2016^2 \u2264 \u2016f(xi) \u2212 f(xj)\u2016^2 \u2264 (1 + \u03b5)\u2016xi \u2212 xj\u2016^2.
Checking the GARP conditions (statement 4 of Theorem 2.2.1) depends only on the relative value of the inner product between the probe and response vectors. Hence, we can scale both the probe and response vectors such that their norms are less than one. In this case, as a consequence of the preservation of the norms of the vectors, the Johnson-Lindenstrauss embedding also preserves the inner product.
Footnote 10: Other dimensionality reduction techniques such as Principal Component Analysis (PCA) are not compatible with GARP.
Corollary 2.5.1. Let xi, xj \u2208 \u211d^m and \u2016xi\u2016 \u2264 1, \u2016xj\u2016 \u2264 1 be such that (2.32) is satisfied with probability at least 1 \u2212 \u03b4. Then,
P((x\u2032i xj \u2212 f(xi)\u2032f(xj)) \u2265 \u03b5) \u2264 \u03b4.
The proof is available, for example, in [64]. The JL embedding of the vectors preserves the inner product to within an \u03b5 fraction of the original value.
Therefore, to check for utility maximization behaviour, we first project the high dimensional probe and response vectors to a lower dimension using JL (Theorem 2.5.1). The inner products in the lower dimensional space are then used for checking the GARP condition for detecting utility maximization, giving a factor of m\/k savings in computation.
2.6 Numerical results
The aim of this section is threefold. First, we illustrate the change point detection algorithm in Sec. 2.4 and show how the revealed preference framework is fundamentally different from classical change detection algorithms. Second, we show that the theory developed in Sec. 2.3 and Sec. 2.4, for utility change point detection, can detect changes in ground truths from online search behaviour. 
Also, the recovered utility functions satisfy the single crossing condition, indicating strategic substitution behaviour (footnote 11) in online search. Third, we show that user behaviour in YouTube satisfies utility maximization. To reduce the computational cost associated with checking the utility maximization behaviour, we use the dimensionality reduction techniques discussed in Sec. 2.5.
Footnote 11: The substitution behaviour in economics says that consumers, constrained by a budget, substitute more expensive items with less costly alternatives.
2.6.1 Detection of unknown change point in the presence of noise
In this section, we present numerical results on change point detection in the presence of noise. Assume that the probe and response vectors are of dimension 2 (i.e. m = 2), and the utility function follows the model in (2.34). The base utility function v(x) is a Cobb-Douglas utility function (footnote 12) with parameters a1 and a2.
v(x) = x1^{a1} x2^{a2}, u(x) = v(x) for t < \u03c4, u(x) = v(x) + \u03b1\u2032x for t \u2265 \u03c4. (2.34)
The response is measured in noise as specified in (2.5).
Fig. 2.2a shows the estimate for \u03a6\u03c4 in (2.23) as a function of \u03c4. The estimated change point (\u03c4\u02c6) is the point at which \u03a6\u03c4 attains its minimum. Fig. 2.2b compares the ROC curve for the revealed preference framework and the CUSUM algorithm for change detection. The CUSUM algorithm for change detection is provided in Sec. 2.8.7. The CUSUM algorithm is used as a reference for comparing the performance of the revealed preference framework presented in this paper. The CUSUM algorithm in Sec. 2.8.7 makes two critical assumptions: (i) knowledge of the utility function before the change; (ii) knowledge of the linear perturbation coefficients, and hence the utility function after the change. The only unknown is the change point at which the utility changed. However, if the linear perturbation coefficients are also unknown, then the CUSUM algorithm in Sec. 
2.8.7 can be modified to search over \u211d^m_+ and select the parameter with the highest likelihood. The critical assumption in CUSUM is the knowledge of the utility function before the change point. One heuristic solution is to estimate the utility function using some initial data, assuming no change point, using Afriat\u2019s Theorem, and then to apply the CUSUM algorithm. Such a procedure is clearly suboptimal. In comparison, the revealed preference procedure in Sec. 2.4 makes no assumption about the base utility function or the linear perturbation coefficients. As can be gleaned from Fig. 2.2b, the performance of the revealed preference algorithm is comparable to the CUSUM algorithm, given the non-parametric assumptions.
Footnote 12: The Cobb-Douglas utility function is one of the most widely used utility functions in the microeconomics literature. One of the main reasons for the popularity of the Cobb-Douglas utility function is its simplicity. The utility maximization problem with a Cobb-Douglas utility function can be solved in closed form. Alternatives to the Cobb-Douglas utility include the linear utility, the min utility function, etc.
[Figure 2.2, panel (a): log \u03a6\u03c4 versus time index. Panel (b): ROC curve, true positive rate versus false alarm rate, for Revealed Preference and CUSUM.]
(a) Plot of \u03a6\u03c4 obtained by solving the optimization problem in (2.23). The estimated change point (\u03c4\u02c6) is the point at which \u03a6\u03c4 attains its minimum. (b) ROC curve.
Figure 2.2: Estimating the utility change point using the revealed preference framework (Fig. 2.2a), where no knowledge of the utility function is assumed. Fig. 2.2b compares the ROC curve of the revealed preference framework with the CUSUM algorithm. The CUSUM algorithm in Sec. 2.8.7 assumes knowledge of the utility function before and after the change point. However, the revealed preference framework considered in this paper assumes no parametric knowledge of the utility function. The plots were generated with 1000 independent simulations. The parameters of the Cobb-Douglas utility v(x) are (a1, a2) = (0.6, 0.4) and \u03b1 = (1, 1), with the change point set to 26. The budget is set to 5. The noise variance is 0.50.
2.6.2 Yahoo! Buzz Game
Here we present an example of a real dataset of the online search process. The objective is to investigate the utility maximization of the online search process and to detect time points at which the utility has changed.
The dataset that we use in our study is the Yahoo! Buzz Game Transactions from the Webscope datasets (footnote 13) available from Yahoo! Labs. In 2005, Yahoo! along with O\u2019Reilly Media started a fantasy market where the trending technologies at that point were pitted against each other. For example, in the browser market there were \u201cInternet Explorer\u201d, \u201cFirefox\u201d, \u201cOpera\u201d, \u201cMozilla\u201d, \u201cCamino\u201d, \u201cKonqueror\u201d, and \u201cSafari\u201d. The players in the game have access to the \u201cbuzz\u201d, which is the online search index, measured by the number of people searching on the Yahoo! search engine for the technology. The objective of the game is to use the buzz and trade stocks accordingly. The interested reader is referred to [65] for an overview of the Buzz game. An empirical study of the dataset [66] reveals that most traders in the Buzz game follow utility maximization behaviour. Hence, the dataset falls within the revealed preference framework, if we consider the buzz as the probe and the \u201ctrading price\u201d (footnote 14) as the response to the utility maximizing behaviour.
Footnote 13: Yahoo! Webscope dataset: A2 - Yahoo! Buzz Game Transactions with Buzz Scores, version 1.0 http:\/\/research.yahoo.com\/Academic Relations
Footnote 14: The trading price is indicative of the value of the stock.
[Figure 2.3, panel (a): Buzz Scores for WIFI and WIMAX. Panel (b): Trading Price of WIFI and WIMAX. Horizontal axis: date, 03\/29 to 05\/03.]
Figure 2.3: Buzz scores and trading price for WIFI and WIMAX in the WIRELESS market from April 1 to April 29, 2005. The change point was estimated as April 18. This corresponds to a new WIFI product announcement. The change is visible in the sudden peak of interest in WIFI around April 18.
We consider a subset of the dataset containing the \u201cWIRELESS\u201d market, which contained two main competing technologies: \u201cWIFI\u201d and \u201cWIMAX\u201d. Figure 2.3 shows the buzz and the \u201ctrading price\u201d of the technologies from April 1 to April 29, 2005. The buzz is published by Yahoo! at the start of each day and the \u201ctrading price\u201d in Figure 2.3 was computed by averaging over the corresponding stock trading prices for the day.
Choose the probe and response vector for this dataset as
pt = [Buzz(WIFI)t Buzz(WIMAX)t]
xt = [Trading price(WIFI)t Trading price(WIMAX)t]. (2.35)
The inner product of the probe and response vector in (2.35) provides the total valuation of the \u201cWIRELESS\u201d market, which forms the budget constraint.
Checking the Afriat inequalities (2.3), we find that the dataset does not satisfy utility maximization for the entire duration from April 1 to April 29. However, the dataset does satisfy utility maximization from April 1 to April 17 (using the Afriat inequalities). Using the inequalities (2.13) to (2.15), we detected a change in utility at (\u03c4 =) April 18. This change point corresponds to Intel\u2019s announcement of a WIMAX chip (footnote 15).
Footnote 15: http:\/\/www.dailywireless.org\/2005\/04\/17\/intel-shipping-wimax-silicon\/
Also, by minimizing the 2-norm of the linear perturbation, i.e. solving the optimization problem (2.16) to (2.20), we find that the recovered linear coefficients which correspond to minimum perturbation are \u03b1 = [0 5.9]. This is intuitive: a positive change in the WIMAX utility, due to the change in ground truth. The recovered utility function, v(x), is shown in Fig. 2.4a and the indifference curve (contour plot) of the base utility is shown in Fig. 2.4b. The recovered base utility function in Fig. 2.4a satisfies the single crossing condition (footnote 16), indicating strategic substitute behaviour in online search. The substitute behaviour in online search can also be noticed from the indifference curve in Fig. 2.4b. This is due to the fact that WIFI and WIMAX were competing technologies for the problem of providing wireless local area network.
Footnote 16: A utility function U(x1, x2) satisfies the single crossing condition if \u2200 x\u20321 > x1, x\u20322 > x2, we have U(x\u20321, x2) \u2212 U(x1, x2) \u2265 0 \u21d2 U(x\u20321, x\u20322) \u2212 U(x1, x\u20322) \u2265 0; see [67]. The single crossing condition implies that argmax_{x1} U(x1, x2) is non-decreasing in x2; this defines substitution behavior. The single crossing condition is an ordinal condition and therefore compatible with Afriat\u2019s Theorem.
[Figure 2.4, panel (a): recovered utility function. Panel (b): contour plot with axes Price (Valuation) WIFI and Price (Valuation) WIMAX.]
Figure 2.4: Fig. 2.4a shows the recovered utility function v(x) using (2.21). The indifference curve of the recovered utility function is shown in Fig. 2.4b. The indifference curve indicates the strategic substitution behaviour in online search; see the discussion in Sec. 2.6.2.
2.6.3 Youstatanalyzer database
We now analyze the utility maximizing behaviour of YouTube users. YouTube is an example of content-aware utility maximization, where the utility depends on the quality of the video present at any point. We measure the quality of the video using two measurable metrics: the number of subscribers and the number of views. We used the Youstatanalyzer database (footnote 17), which is built using the Youstatanalyzer tool [68]. The database is particularly suited for our study of dimensionality reduction in revealed preference (Sec. 2.5), since the database contains statistics of 1 million videos.
Footnote 17: The Youstatanalyzer dataset can be downloaded from http:\/\/www.congas-project.eu\/sites\/www.congas-project.eu\/files\/Dataset\/youstatanalyzer1000k.json.gz. The dataset contains meta-level data for over 1 million videos from the period of July 2012 to September 2013. The complete list of collected parameters is given in Table 1 in [68].
From the database, we aggregated the statistics of all popular videos existing in the time interval from 08 July, 2012 to end of 07 Sept, 2013, having at least 2 subscribers. This time interval was divided into 15 time periods, corresponding to each month of the duration, giving us a total of 15 observations (T = 15). For the revealed preference analysis, the probe is based on the number of subscribers and the response is the number of views during the time period. The aim is to detect utility maximizing behaviour between the number of people subscribed to a particular video and the number of views that the video received. Formally, the probe and the response are chosen as follows:
pt = [1\/#Subscriber(Video1), ..., 1\/#Subscriber(VideoN)], (2.36)
xt = [#Views(Video1), ..., #Views(VideoN)]. (2.37)
The motivation for this definition is that as the number of subscribers to a video increases, the number of views also increases [69]. The inner product of the probe and the response vectors gives the sum of the \u201cview focus\u201d of all videos [70]. Also, a recent study shows that 90% of the YouTube views are due to 20% of the videos uploaded [71]. 
Hence, if we restrict attention to popular videos, the view focus tends to remain constant during a time period, which corresponds to the linear constraint in the revealed preference setting.
The number of videos satisfying the above requirements is 7605, and therefore the dimension of the probe and response is m = 7605. The number of inner product computations required for checking the GARP conditions (statement 4 of Theorem 2.2.1) is mT^2. Hence, we apply the Johnson-Lindenstrauss Lemma (Lemma 2.5.1) to the dataset using the \u201cdatabase friendly\u201d transform (Theorem 2.5.1) in Sec. 2.5. We choose \u03b5 = 0.1, so that the inner products are within 90% of the accuracy. Also, we choose \u03b2 = 0.65, such that the above condition on the inner products holds with probability at least 0.9. Substituting the values of T, \u03b5 and \u03b2 in (2.33), we find that the dimension of the embedded subspace is k = 3800. We see that the GARP condition is satisfied with probability 0.9, which is in line with what we expect. For this example, the number of inner product computations required to check the GARP condition in the lower dimensional subspace is given by kT^2, which is less than the number of computations required to check GARP in the original space by a factor of 2. The lower dimensional utility function obtained in Sec. 2.5 is useful for visualization purposes, and gives a sparse representation of the original utility function.
2.7 Closing remarks
The revealed preference framework is a nonparametric approach to detection of utility maximization. This chapter extended the classical revealed preference framework to dynamic utility maximizing agents. The main result (Theorem 2.3.1) provided necessary and sufficient conditions on the dataset for the existence of a change point at which the utility function jumps by a linear perturbation. In addition, we proposed convex programs to recover the minimum linear perturbation and the base utility function. 
In the presence of noise, the conditions of Theorem 2.3.1 may not be satisfied for any value of the change point. Hence, we provided a procedure for detecting the unknown change point and a hypothesis test for detecting dynamic utility maximization. We then considered the problem of detecting utility maximization behaviour in the big data domain. In order to reduce the computational cost, we proposed dimensionality reduction through the Johnson-Lindenstrauss transform.
The results were illustrated on a real dataset from Yahoo! Tech Buzz. The Yahoo dataset is an example of an online search dataset. The application of the results provided novel insights into the utility maximizing behaviour of agents in online search. The change point reflects the ground truth and the recovered utility function shows the strategic substitution behaviour in online search.
2.8 Proof of theorems
2.8.1 Proof of Theorem 2.3.1
Necessary Condition: Assume that the data has been generated by a utility function satisfying the model in (2.12). An optimal interior point solution to the problem must satisfy the first order optimality conditions:
\u2207xit v(xt) = \u03bbt pit (t < \u03c4) (2.38)
\u2207xit v(xt) + \u03b1i = \u03bbt pit (t \u2265 \u03c4) (2.39)
At time t, the concavity of the utility function implies:
u(xt, \u03b1, t) + \u2207xt u(xt, \u03b1, t)\u2032(xs \u2212 xt) \u2265 u(xs, \u03b1, t) \u2200s. (2.40)
Substituting the first order conditions (2.38) and (2.39) into (2.40) yields
v(xt) + \u03bbt p\u2032t(xs \u2212 xt) \u2265 v(xs) (t < \u03c4) (2.41)
v(xt) + \u03bbt p\u2032t(xs \u2212 xt) \u2212 \u03b1\u2032(xs \u2212 xt) \u2265 v(xs) (t \u2265 \u03c4) (2.42)
Denoting v(xt) = vt yields the set of inequalities (2.13), (2.14). 
(2.15) holds since the utility function v(x) is monotone increasing.
Sufficient Condition: We first construct a piecewise linear utility function V(x) from the lower envelope of the T overestimates, to approximate the function v(x) defined in (2.12),
V(x) = min_t {vt + \u03bbt p\u02dc\u2032t(x \u2212 xt)}, (2.43)
where each element of p\u02dct is defined as
p\u02dcit = pit for t < \u03c4, and p\u02dcit = pit \u2212 \u03b1i\/\u03bbt for t \u2265 \u03c4. (2.44)
To verify that the construction in (2.43) is indeed correct, consider an arbitrary response, x\u02c6, such that p\u2032t x\u02c6 \u2264 p\u2032t xt (footnote 18). We need to show V(x\u02c6) + \u03b1\u2032x\u02c6 \u2264 V(xt) + \u03b1\u2032xt.
Footnote 18: In economics, xt is said to be \u201crevealed preferred\u201d to x\u02c6. Since xt was chosen as the response to the probe pt, the utility at xt should be higher than the utility at x\u02c6.
First, we show that V(xt) = vt, t = 1, ..., T, as follows. From (2.43),
V(xt) = vm + \u03bbm p\u02dc\u2032m(xt \u2212 xm),
for some m. In particular, if m \u2265 \u03c4,
V(xt) = vm + \u03bbm p\u02dc\u2032m(xt \u2212 xm)
= vm + \u03bbm p\u2032m(xt \u2212 xm) \u2212 \u03b1\u2032(xt \u2212 xm)
\u2264 vt + \u03bbt p\u2032t(xt \u2212 xt) (2.45)
= vt.
If the inequality in (2.45) were strict, it would violate (2.42). Similarly, it can be shown that if m < \u03c4, V(xt) = vt. Hence, V(xt) = vt.
Next, we show V(x\u02c6) + \u03b1\u2032x\u02c6 \u2264 V(xt) + \u03b1\u2032xt. If t \u2265 \u03c4,
V(x\u02c6) + \u03b1\u2032x\u02c6 \u2264 vt + \u03bbt p\u02dc\u2032t(x\u02c6 \u2212 xt) + \u03b1\u2032x\u02c6
= vt + \u03bbt p\u2032t(x\u02c6 \u2212 xt) \u2212 \u03b1\u2032(x\u02c6 \u2212 xt) + \u03b1\u2032x\u02c6
= vt + \u03bbt p\u2032t(x\u02c6 \u2212 xt) + \u03b1\u2032xt
\u2264 vt + \u03b1\u2032xt
= V(xt) + \u03b1\u2032xt.
The inequality holds, similarly, for the case t < \u03c4. Therefore, we can construct a utility function consistent with the model in (2.12).
2.8.2 Negative dependence of random variables
Definition 2.8.1 ([72]). 
Random variables X1, ..., Xn, n \u2265 2, are said to be negatively dependent if
P{\u2229_{k=1}^{n} {Xk \u2264 xk}} \u2264 \u220f_{k=1}^{n} P{Xk \u2264 xk},
and
P{\u2229_{k=1}^{n} {Xk > xk}} \u2264 \u220f_{k=1}^{n} P{Xk > xk}.
Negative dependence allows us to bound the joint distribution of the random variables in terms of the marginals.
The variable M in (2.10) is the highest order statistic of the set of random variables \u2133 defined as:
\u2133 := {(p\u2032t(wt \u2212 ws)) : s, t \u2208 {1, 2, ..., T}, s \u2260 t}. (2.46)
Define \u03be \u2282 \u2133 as
\u03be = {p\u20321(w1 \u2212 w2), p\u20322(w2 \u2212 w3), ..., p\u2032T(wT \u2212 w1)}. (2.47)
Lemma 2.8.1. If {w1, w2, ..., wT} in (2.47) are i.i.d. zero mean unit variance Gaussian vectors, then the random variables in \u03be are negatively dependent.
Proof. Each of the random variables in the set \u03be (defined in (2.47)) is Gaussian. Hence, to show negative dependence of the random variables in \u03be, it is sufficient to show that these variables are negatively correlated [72, 73]. Any element in \u03be, p\u2032i(wi \u2212 wi+1), is correlated with either:
1. An element of the form p\u2032i+1(wi+1 \u2212 wi+2): E{(p\u2032i(wi \u2212 wi+1))(p\u2032i+1(wi+1 \u2212 wi+2))} = \u2212p\u2032i pi+1 < 0.
2. An element of the form p\u2032k(wk \u2212 wk+1), k \u2209 {i, i + 1}: E{(p\u2032i(wi \u2212 wi+1))(p\u2032k(wk \u2212 wk+1))} = 0.
So the random variables in \u03be (2.47) are negatively correlated and hence negatively dependent, as defined in Def. 2.8.1.
2.8.3 Proof of Theorem 2.2.2
For any subset \u03be of the random variables, \u03be \u2282 \u2133 = {(p\u2032t(wt \u2212 ws)) : s, t \u2208 {1, 2, ..., T}, s \u2260 t},
P{M \u2264 x} \u2264 P{max_i \u03bei \u2264 x} = P{\u03be1 \u2264 x, ..., \u03beT \u2264 x}.
Choose the set \u03be to be the set defined in (2.47). From Lemma 2.8.1, the random variables in \u03be are negatively dependent, as defined in Def. 2.8.1. 
Hence,
P{M \u2264 x} \u2264 \u220f_i P{\u03bei \u2264 x}.
Each of the terms in \u03be, p\u2032t(wt \u2212 wt+1), is distributed as N(0, 2\u2016pt\u2016^2). Using the standard lower bound for the tail of the Gaussian distribution, we have
P{M \u2264 x} \u2264 \u220f_t {1 \u2212 \u221a(2\/\u03c0) \u00b7 [\u221a2 \u2016pt\u2016 \/ (x + \u221a(x^2 + 8\u2016pt\u2016^2))] \u00b7 exp(\u2212x^2 \/ (4\u2016pt\u2016^2))}.
The false alarm probability is given by 1 \u2212 P{M \u2264 \u03a6\u2217(y)}. Substituting the upper bound for P{M \u2264 \u03a6\u2217(y)}, we get a lower bound for the false alarm probability.
2.8.4 Proof of Theorem 2.4.1
The proof of Theorem 2.4.1 relies on two lemmas, Lemma 2.8.2 and Lemma 2.8.3, which are stated below. Lemma 2.8.2 states that for a \u201csufficient\u201d number of observations the random variables are \u201calmost\u201d positive.
Lemma 2.8.2. Assume {w1, w2, ..., wT} in (2.5) are i.i.d. zero mean unit variance Gaussian random vectors. For \u03b5 > 0, T = O(log 1\/\u03b5) and T \u2212 \u03c4 = O(log 1\/\u03b5), we have:
P{M1 \u2264 0} < \u03b5,
P{M2 \u2264 0} < \u03b5.
Define auxiliary random variables M\u02c61 and M\u02c62, which correspond to the truncated distributions of M1 and M2 as shown below:
fM\u02c6i(x) = fMi(x) 1{x \u2265 0} + P(Mi < 0) \u03b4(x); i = 1, 2, (2.48)
where \u03b4(x) is the delta function. Then, Lemma 2.8.3 states that the expectations of the auxiliary random variables M\u02c6i, i = 1, 2, are close to the expectations of the original random variables Mi, i = 1, 2.
Lemma 2.8.3. Assume {w1, w2, ..., wT} in (2.5) are i.i.d. zero mean unit variance Gaussian random vectors. For \u03b5 > 0, T = O(log 1\/\u03b5) and T \u2212 \u03c4 = O(log 1\/\u03b5), we have:
|E M\u02c61 \u2212 E M1| < 2\u03b5,
|E M\u02c62 \u2212 E M2| < 2\u03b5.
The proofs of Lemma 2.8.2 and Lemma 2.8.3 are provided in Sec. 2.8.5 and Sec. 2.8.6, respectively.
Proof (Theorem 2.4.1). 
For $\Phi^*(y) > 0$, the probability of Type-I error is given by $P\{M \geq \Phi^*(y)\}$:
$$P\{M \geq \Phi^*(y)\} = P\{M_1 + M_2 \geq \Phi^*(y)\}.$$
If $\tau = O(1/\varepsilon)$, by Lemma 2.8.2 and Lemma 2.8.3, the original random variables have a small probability of being less than 0, and the expectation of each truncated distribution is close to that of the original distribution. Hence,
$$P\{M \geq \Phi^*(y)\} = P\{\hat{M}_1 + \hat{M}_2 \geq \Phi^*(y)\}.$$
By the Markov inequality,
$$P\{\hat{M}_1 + \hat{M}_2 \geq \Phi^*(y)\} \leq \frac{E\{\hat{M}_1 + \hat{M}_2\}}{\Phi^*(y)} = \frac{E\{\hat{M}_1\}}{\Phi^*(y)} + \frac{E\{\hat{M}_2\}}{\Phi^*(y)}.$$
Since $\hat{M}_2$ is a positive random variable,
$$\begin{aligned}
\frac{E\{\hat{M}_1\}}{\Phi^*(y)} + \frac{E\{\hat{M}_2\}}{\Phi^*(y)}
&= \frac{E\{\hat{M}_1\}}{\Phi^*(y)} + \frac{1}{\Phi^*(y)} \int_0^{\infty} P(\hat{M}_2 > z)\, dz \\
&\leq \frac{E\{\hat{M}_1\}}{\Phi^*(y)} + \frac{1}{\Phi^*(y)} \int_0^{\infty} \sum_{\substack{s,t \\ t \geq \tau,\ s \neq t}} P\big(\alpha'(w_t - w_s)/\lambda_t > z\big)\, dz \\
&\leq \frac{E\{\hat{M}_1\}}{\Phi^*(y)} + \frac{1}{\Phi^*(y)} \int_0^{\infty} \sum_{\substack{s,t \\ t \geq \tau,\ s \neq t}} \exp\!\left(-\frac{z^2 \lambda_t^2}{4\|\alpha\|^2}\right) dz.
\end{aligned}$$
Hence, this upper bound on the probability of Type-I error is minimized by minimizing $\|\alpha\|^2$.

2.8.5 Proof of Lemma 2.8.2

$$P\{M_1 \leq 0\} = P\Big\{\max_{s,t:\ s \neq t} p_t'(w_t - w_s) \leq 0\Big\}.$$
Choosing the set $\xi \subset \mathcal{M}$ as defined in (2.47), and since the elements of $\xi$ are negatively dependent by Lemma 2.8.1,
$$P\{M_1 \leq 0\} \leq P\{\max_i \xi_i \leq 0\} \leq \prod_i P\{\xi_i \leq 0\}.$$
Each term $\xi_t = p_t'(w_t - w_{t+1})$ is distributed as $\mathcal{N}(0, 2\|p_t\|^2)$. Let $F_{\mathcal{N}(\mu,\sigma^2)}$ be the cdf of a Gaussian random variable with mean $\mu$ and variance $\sigma^2$. Noting that $F_{\mathcal{N}(0,\sigma^2)}(0) = 1/2$, we have
$$\prod_t P\{\xi_t \leq 0\} = \prod_t F_{\mathcal{N}(0, 2\|p_t\|^2)}(0) = \prod_t \frac{1}{2} = \frac{1}{2^T} < \varepsilon.$$
The proof for the second part is similar, by an appropriate choice of a negatively dependent set $\xi$, and is hence omitted.

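As a worked illustration of the last step of Sec. 2.8.5: the bound $P\{M_1 \leq 0\} \leq 2^{-T}$ falls below $\varepsilon$ once $T > \log_2(1/\varepsilon)$, which is the $O(\log 1/\varepsilon)$ requirement in Lemma 2.8.2. The sketch below only evaluates this arithmetic; the function name `min_observations` is a hypothetical helper and is not part of the thesis.

```python
import math


def min_observations(eps):
    """Smallest integer T >= 1 such that 2**(-T) < eps (hypothetical helper)."""
    T = max(1, math.ceil(math.log2(1.0 / eps)))
    if 2.0 ** (-T) >= eps:  # eps is an exact power of two: one more observation is needed
        T += 1
    return T


for eps in (0.25, 0.01, 1e-6):
    T = min_observations(eps)
    print(f"eps={eps:g}: T={T}, bound 2^-T={2.0 ** -T:.3g}")
```

For example, $\varepsilon = 0.01$ requires $T = 7$ observations, since $2^{-7} \approx 0.0078$ while $2^{-6} \approx 0.0156$.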
2.8.6 Proof of Lemma 2.8.3

From the definition of the random variable $\hat{M}_1$ in (2.48),
$$E\{\hat{M}_1\} = \int_0^{+\infty} x f_{M_1}(x)\, dx + P(M_1 < 0) < \int_0^{+\infty} x f_{M_1}(x)\, dx + \varepsilon, \tag{2.49}$$
where the inequality in (2.49) follows from Lemma 2.8.2. The expectation of $M_1$ is given by
$$E\{M_1\} = E\{M_1 \mathbf{1}\{M_1 \geq 0\}\} + E\{M_1 \mathbf{1}\{M_1 \leq 0\}\}. \tag{2.50}$$
To continue with the proof, we derive a lower bound on $E\{M_1 \mathbf{1}\{M_1 \leq 0\}\}$, the second term in (2.50). To compute the lower bound, we proceed by integration by parts:
$$E\{M_1 \mathbf{1}\{M_1 \leq 0\}\} = \int_{-\infty}^{0} x f_{M_1}(x)\, dx = -\int_{-\infty}^{0} P\{M_1 \leq x\}\, dx.$$
Choosing the negatively dependent subset $\xi \subset \mathcal{M}$ defined in (2.47), noting that each $\xi_i$ is distributed as $\mathcal{N}(0, 2\|p_i\|^2)$, and using analytical expressions for bounds on the cdf of the Gaussian density, we obtain
$$E\{M_1 \mathbf{1}\{M_1 \leq 0\}\} \geq -\int_{-\infty}^{0} \prod_i P\{\xi_i \leq x\}\, dx \geq -\varepsilon. \tag{2.51}$$
From (2.49) and (2.51) we get the first part of Lemma 2.8.3. The proof for the second part is similar and hence omitted.

2.8.7 CUSUM algorithm for utility change point detection

Algorithm 1 CUSUM algorithm for utility change point detection
1: Initialize: Set threshold $\rho > 0$. Set cumulative sum $S(0) = 0$. Set decision function $G(0) = 0$.
2: for $t = 1$ to $T$ do
3:   For probe $p_t$ and observed response $y_t$,
4:   $x_t(0) = \arg\max_{\{p_t' x \leq I_t\}} v(x)$, with $v(x)$ as in (2.34).
5:   $x_t(1) = \arg\max_{\{p_t' x \leq I_t\}} v(x) + \alpha' x$.
6:   Likelihood $\ell(y_t, i) = P(y_t \mid x_t(i))$,^19 $i = 0, 1$.
7:   Instantaneous log likelihood $s(t) = \log\big(\ell(y_t, 1) / \ell(y_t, 0)\big)$.
8:   $S(t) = S(t-1) + s(t)$.
9:   $G(t) = \{G(t-1) + s(t)\}^{+}$, where $\{x\}^{+} = \max\{x, 0\}$.
10:  if $G(t) > \rho$ then
11:    Change point estimate $\hat{\tau} = \arg\min_{1 \leq \tau \leq t} S(\tau - 1)$
12:    break
13:  end if
14: end for

^19 In our example, the probability is given by the Gaussian distribution.

Chapter 3
Engagement Dynamics and Sensitivity Analysis of YouTube videos

3.1 Introduction

The YouTube social network contains over 1 billion users who collectively watch millions of hours of YouTube videos and generate billions of views every day. Additionally, users upload over 300 hours of video content every minute. YouTube generates billions in revenue through advertising and, through the Partner program, shares the revenue with the content creators.

The video view count is a key measure of the popularity of a video and the metric by which YouTube pays the content providers.^20 A key question is: How do meta-level features of a posted video (e.g. thumbnail, title, tags, description) drive user engagement in the YouTube social network? However, content alone does not determine the popularity of a video. YouTube also has a social network layer on top of its media content. The main social component is how the content creators (also called “channels”) interact with the users. So another key question is: How does the interaction of the YouTube channel with the user affect the popularity of videos?
In this chapter, we study both of the above questions. In particular, our aim is to examine how the individual video features (through the meta-level data) and the social dynamics contribute to the popularity of a video.

^20 However, recently, view time is gaining more prominence than view count.

Main results: In this chapter, we investigate how the meta-level features and the interaction of the YouTube channel with the users affect the popularity of videos. The main empirical conclusions of this chapter are:

1. The five dominant meta-level features that affect the popularity of a video are: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, and number of keywords. Sec. 3.2 discusses this further.

2. Optimizing the meta-level features (e.g. thumbnail, title, tags, description) after a video has been posted increases the popularity of the video. In addition, optimizing the title increases the traffic due to YouTube search, optimizing the thumbnail increases the traffic from related videos, and optimizing the keywords increases the traffic from related and promoted videos. Sec. 3.2.4 provides details on this analysis.

3. We also explore the causal relationship between the subscriber count and the view count for YouTube channels. For popular YouTube channels, we found that the channel view count affects the subscriber count, see Sec. 3.3.1.

4. We also find new insights into the scheduling dynamics in YouTube gaming channels. For channels with a dominant periodic uploading schedule, going “off the schedule” increases the popularity of the channel, see Sec. 3.3.2.

5. The generalized Gompertz model can be used to distinguish views due to virality (views from subscribers), migration (views from non-subscribers) and exogenous events, see Sec. 3.3.3.

6. New insights into playlist dynamics. The early view count dynamics of a YouTube video are highly correlated with the long-term “migration” of viewers to the video.
Also, early videos in a game playthrough typically have more views than later videos in a game playthrough playlist, see Sec. 3.3.4.

7. The number of subscribers of a channel only affects the early view count dynamics of videos in a playthrough, see Sec. 3.3.4.

All the above results^21 are validated on a YouTube dataset consisting of over 6 million videos across 25 thousand channels. This dataset^22 was provided to us by BroadbandTV Corp. (BBTV). The dataset consists of daily samples of metadata of the YouTube videos on the BBTV platform from April 2007 to May 2015. BBTV is one of the largest multi-channel networks (MCNs) in the world.^23 The results of the chapter allow YouTube partners such as BBTV to adapt their user engagement strategies to generate more views and hence increase revenue.

The organization of the chapter is as follows. The chapter contains two main sections, each of which addresses one of the above questions. In Sec. 3.2, we use several machine learning methods to characterize the sensitivity of the popularity of YouTube videos to meta-level features. In Sec. 3.3, we use time series methods to analyze how the interaction of the channel affects the popularity of the content.

^21 Caveat: It is important to note that the above empirical conclusions are based on the BBTV dataset. These videos cover the YouTube categories of gaming, entertainment, food, music, and sports as described in Table 3.7 of Sec. 3.5.1. Whether the above conclusions hold for other types of YouTube videos is an open issue that is beyond the scope of this thesis.
^22 Sec. 3.5.1 summarizes the key features of the YouTube dataset that we have used.
^23 http://variety.com/2016/digital/news/broadbandtv-mcn-disney-maker-comscore-1201696857/

3.2 Sensitivity analysis of YouTube meta-level features

In this section we apply machine learning methods to study how the meta-level features of a YouTube video impact the view count of the video.
The main machine learning method that we use is the Extreme Learning Machine (ELM). Section 3.2.1 provides a brief background on the ELM. Section 3.2.2 provides a brief background on sensitivity analysis methods that can be applied to a trained ELM. Section 3.2.3 provides the sensitivity analysis results on the BBTV dataset. In Section 3.2.3, we also compare the ELM algorithm with several state-of-the-art machine learning algorithms. Fig. 3.1 illustrates a trace of the subscribers (one of the meta-level features) when the video was posted, and the associated view count 14 days after the video was posted. The machine learning algorithms must be able to address the challenging problem of mapping from such noisy meta-level features (as shown in Fig. 3.1) to the associated view count of a video. Section 3.2.3 shows that the ELM performs well enough to be used both for estimating which meta-level features significantly contribute to the view count of a video and for predicting the view count of videos.

[Figure 3.1: two panels on logarithmic axes. Left panel: View Count versus Video Index. Right panel: Subscribers versus Video Index.]
Figure 3.1: The left figure shows the view count of all videos (arranged in decreasing order of view count) 14 days after the video was posted. The right figure shows the associated subscriber count when the video was posted.

3.2.1 Extreme learning machine (ELM)

The dataset of features (described in Sec. 3.2.3) and view counts is denoted as $D = \{(x_i, v_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^m$ is the feature vector, of dimension $m$, for video $i$, and $v_i$ is the total view count for video $i$. Here, $N$ is the number of videos in the training dataset (the ELM was trained for three categories of videos; for details see Sec. 3.2.3).
The ELM is a single hidden-layer feed-forward neural network, that is, the ELM consists of an input layer, a single hidden layer of $L$ neurons and an output layer. Each hidden-layer neuron can have a unique transfer function. Popular transfer functions include the sigmoid, hyperbolic tangent, and Gaussian. However, any non-linear piecewise continuous function can be utilized. The output layer is obtained by a weighted linear combination of the outputs of the $L$ hidden neurons.

The ELM model presented in [74, 75] is given by:
$$v_i = \sum_{k=1}^{L} \beta_k h_k(x_i; \theta_k), \tag{3.1}$$
where $\beta_k$ is the weight of neuron $k$, $h_k(\cdot\,; \theta_k)$ is the hidden-layer neuron transfer function with parameter $\theta_k$, and $L$ is the total number of hidden-layer neurons in the ELM. Given $D$, how can the ELM model parameters $\beta_k$, $\theta_k$, and $L$ in (3.1) be selected? Given $L$, the ELM trains $\beta_k$ and $\theta_k$ in two steps. First, the hidden-layer parameters $\theta_k$ are randomly initialized. Any continuous probability distribution can be used to initialize the parameters $\theta_k$. Second, the parameters $\beta_k$ are selected to minimize the squared error between the model output and the measured output from $D$. Formally,
$$\beta^* \in \arg\min_{\beta \in \mathbb{R}^L} \big\{ \|H\beta - V\|_2^2 \big\}, \tag{3.2}$$
where $H$ denotes the $N \times L$ hidden-layer output matrix with entries $H_{jk} = h_k(x_j; \theta_k)$ for $k \in \{1, 2, \dots, L\}$ and $j \in \{1, 2, \dots, N\}$, and $V = [v_1, v_2, \dots, v_N]'$ is the target output. The solution to (3.2) is given by $\beta^* = H^{\dagger} V$, where $H^{\dagger}$ denotes the Moore-Penrose generalized inverse of $H$. The major benefit of using the ELM, compared to other single hidden-layer feed-forward neural networks, is that the training only requires the random generation of the parameters $\theta_k$, and the parameters $\beta_k$ can be computed as the solution of a set of linear equations.

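The two training steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: it assumes sigmoid transfer functions with $\theta_k$ consisting of an input weight vector and a bias, draws $\theta_k$ from a standard Gaussian, and uses synthetic data in place of the YouTube dataset. The function names `hidden_layer`, `train_elm` and `predict_elm` are hypothetical.

```python
import numpy as np


def hidden_layer(X, W, b):
    """Sigmoid hidden-layer outputs: H[j, k] = h_k(x_j; theta_k), theta_k = (W[:, k], b[k])."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def train_elm(X, V, L, rng):
    """Two-step ELM training: random hidden parameters, then least-squares output weights."""
    m = X.shape[1]
    W = rng.standard_normal((m, L))  # step 1: random input weights (Gaussian is an assumption)
    b = rng.standard_normal(L)       # step 1: random biases
    H = hidden_layer(X, W, b)        # N x L hidden-layer output matrix
    beta = np.linalg.pinv(H) @ V     # step 2: beta* = H^dagger V, the minimizer in (3.2)
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Model output (3.1) for each row of X."""
    return hidden_layer(X, W, b) @ beta


rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))   # synthetic features, scaled to [0, 1]
V = np.sin(X.sum(axis=1))        # synthetic stand-in for log view counts
W, b, beta = train_elm(X, V, L=50, rng=rng)
pred = predict_elm(X, W, b, beta)
print("training RMSE:", np.sqrt(np.mean((pred - V) ** 2)))
```

Only `beta` is fitted; `W` and `b` stay at their random draws, which is what reduces training to a single linear least-squares problem.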
The computational cost of training the ELM is $O(N^3)$ for constructing the Moore-Penrose inverse [76].

3.2.2 Sensitivity analysis (Background)

There are several sensitivity analysis techniques available in the literature [77, 78], which can be classified into two groups: filter methods and wrapper methods. The filter methods consider only the meta-level features and the view count, without the information available from a machine learning algorithm. The wrapper methods, on the other hand, utilize the information from the machine learning algorithm. Typically, wrapper methods give a more accurate measure of the sensitivity compared to filter methods [77, 78]. However, filter methods are computationally less expensive than wrapper methods and do not require the training and evaluation of the machine learning algorithm. Given the noise present in the meta-level features (Fig. 3.1) and the non-linearity between the meta-level features and view count, filter methods are not suitable for the sensitivity analysis of the meta-level features. Hence, in this section we focus on two wrapper methods suitable for estimating the sensitivity of the view count of YouTube videos to the meta-level features.

For the first method we focus on the ELM (3.1) for evaluating the sensitivity of the meta-level features; however, the method can be used for any machine learning method. Given that the ELM (3.1) is a single hidden-layer feed-forward neural network, it is possible to evaluate the sensitivity of the meta-level features by taking the partial derivative of (3.1) for the trained ELM. Note that this method is utilized to estimate the sensitivity of input features in neural networks [79]. The sum of squares derivatives, denoted by $\mathrm{SSD}_k$ for meta-level feature $x(k)$, is given by:
$$\mathrm{SSD}_k = \sum_{i=1}^{N} \left( \frac{\partial v_i}{\partial x(k)} \right)^2 = \sum_{i=1}^{N} \left( \sum_{l=1}^{L} \beta_l \frac{\partial h_l(x_i; \theta_l)}{\partial x(k)} \right)^2. \tag{3.3}$$
The variable with the largest $\mathrm{SSD}_k$ is the most influential for the prediction of the view count $v$ in (3.1) using the ELM. Note that since the ELM is trained using all the meta-level features, $\mathrm{SSD}_k$ evaluates the average sensitivity to changes in a single meta-level feature with all other features held constant.

A state-of-the-art filter method when there are significant interdependency relationships is the Hilbert-Schmidt Independence Criterion Lasso (HSIC-Lasso) [80]. The main idea of this method is to combine the benefits of the least absolute shrinkage and selection operator (Lasso) with a feature-wise kernel to capture the non-linear input-output dependency. The HSIC-Lasso is given by the solution to the following convex optimization problem:
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2} \Big\| \bar{L} - \sum_{k=1}^{m} \alpha_k \bar{K}^{(k)} \Big\|_F^2 + \lambda \|\alpha\|_1, \tag{3.4}$$
where $\lambda$ is the regularization parameter, $\bar{L} = \Gamma L \Gamma$ and $\bar{K}^{(k)} = \Gamma K^{(k)} \Gamma$ are centered Gram matrices, $K^{(k)}_{i,j} = K(x_{k,i}, x_{k,j})$ and $L_{i,j} = L(v_i, v_j)$ are Gram matrices, $K(\cdot, \cdot)$ and $L(\cdot, \cdot)$ are kernel functions,^24 $\Gamma = I - \frac{1}{N} 1_N 1_N'$ is the centering matrix, $I$ is the identity matrix and $1_N$ is the vector of ones. A measure of the importance of a meta-level feature is then given by the vector $\alpha$.

Both of these methods will be applied to the YouTube dataset to study the sensitivity of the view count of YouTube videos to their meta-level features.

3.2.3 Sensitivity of YouTube meta-level features and predicting view count

In this section, the ELM (3.1) and other state-of-the-art machine learning methods are applied to the YouTube dataset to compute the sensitivity of the view count of a video to the video's meta-level features, based on the feature importance measure $\mathrm{SSD}_k$ (3.3). Videos of different popularity (i.e.
highly popular, popular, and unpopular as defined in Table 3.8 in Sec. 3.5.1) may have different sensitivities to the meta-level features. Hence, we independently perform the sensitivity analysis on the three popularity categories. First we define the meta-level features for each video, then evaluate the meta-level feature sensitivities on the associated view count, and finally provide methods to predict the view count of YouTube videos using various machine learning techniques. The analysis provides insight into which meta-level features are useful for optimizing the view count of a YouTube video.

Meta-level feature construction

Each YouTube video contains four primary components: the Thumbnail of the video, the Title of the video, the Keywords (also known as tags), and the description of the video. However, in typical user searches only a subset of the description is provided to the user. Therefore, we do not consider the contents of the description to significantly affect the view count of the video. The meta-level features are constructed^25 using the Thumbnail, Title, and Keywords.

^24 In Section 3.2.3 we used the Gaussian kernel.
^25 The meta-level features were constructed manually. The features were constructed based on existing literature and features that can be extracted using off-the-shelf software. In addition, an analysis by experts at Broadband TV Corp confirmed that the 54 features are comprehensive for the sensitivity analysis.

For the Thumbnail, 19 meta-level features are computed, which include: the blurriness (e.g. CannyEdge, Laplace Frequency), brightness, contrast (e.g. tone), overexposure, and entropy of the thumbnail. For the Title, 23 meta-level features are computed, which include: word count, punctuation count, character count, Google hits (e.g. if the title is entered into the Google search
engine how many results are found), and the Sentiment/Subjectivity of the title computed using Vader [81] and TextBlob.^26 For the Keywords, 7 meta-level features are computed, which include: the number of keywords, and keyword length. In addition to the above 49 meta-level features, we also include auxiliary user meta-level features: the number of subscribers, resolution of the thumbnail used, category of the video, the length of the video, and the first day view count of the video. Note that our analysis does not consider the video or audio quality of the YouTube video. Our analysis is focused on the sensitivity of the view count to the Thumbnail, Title, Keywords, and auxiliary channel information of the user that uploaded the video. In total, 54 meta-level features are computed for each video. The complete dataset used for the sensitivity analysis is given by $D = \{(x_i, v_i)\}_{i=1}^{N}$, with $x_i \in \mathbb{R}^{54}$ the computed meta-level features for video $i \in \{1, \dots, N\}$, $v_i$ the view count 14 days after the video is published, and $N = 10^4$ the total number of videos used for the sensitivity analysis. Note that the view count $v_i$ is on the log scale (i.e. if a video has $10^6$ views then $v_i = 6$). This is a necessary step as the view counts range from $10^2$ to above $10^7$.

Prior to performing any analysis, we pre-process the meta-level features in the dataset $D$. First, all the meta-level features are scaled to satisfy $x(k) \in [0, 1]$. Note that the meta-level features were not whitened (i.e. the meta-level data was not transformed to have an identity covariance matrix). The second pre-processing step involves removing redundant features in $D$. Feature selection is a popular method for eliminating redundant meta-level features.
In this work, we employ correlation-based feature selection using the Pearson correlation coefficient (which was used for feature selection in [82]) to eliminate the redundant meta-level features. Of the original 54 meta-level features, $m = 29$ meta-level features remain after the removal of the correlated meta-level features. Note that removal of these features does not significantly impact the performance of the machine learning algorithms or the sensitivity analysis results.

^26 http://textblob.readthedocs.io/en/dev/

Meta-level feature sensitivity

Given the dataset $D = \{(x_i, v_i)\}_{i=1}^{N}$ constructed in Sec. 3.2.3, the goal is to estimate which features significantly contribute to the view count of a video. To perform this sensitivity analysis we use five machine learning algorithms: the ELM, Bagged MARS using gCV Pruning [83],^27 Conditional Inference Random Forest (CIRF) [84],^28 Feed-Forward Neural Network (FFNN) [85], and the feature selection method Hilbert-Schmidt Independence Criterion Lasso (HSIC-Lasso) [80]. Each of these models is trained using 10-fold cross validation, and the design parameters of each were optimized via extensive empirical evaluation. We selected the ELM (3.1) to contain $L = 100$ neurons, which ensures that we have sufficient accuracy on the predicted view count given the features $x_i$, while reducing the effects of over-fitting. For the CIRF, the design parameter for randomly selected predictors was set to 6, and for the FFNN we have 10 neurons in the hidden layer. The HSIC-Lasso regularization parameter was set to 100. Given the trained models, the sensitivity of the view count to the meta-level features of a video is computed by evaluating the sum of squares derivatives, $\mathrm{SSD}_k$ (3.3). Fig. 3.2 shows the normalized^29 $\mathrm{SSD}_k$ for the five highest sensitivity meta-level features of these five machine learning methods.

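To make the $\mathrm{SSD}_k$ computation (3.3) concrete, the sketch below evaluates it analytically for an ELM with sigmoid hidden neurons and normalizes by the largest value, as done for Fig. 3.2. It is an illustration under assumptions rather than the code used in the thesis: the sigmoid transfer function, the randomly drawn parameters standing in for a trained model, the synthetic inputs, and the names `elm_output` and `ssd` are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elm_output(X, W, b, beta):
    """ELM output (3.1) with sigmoid neurons, one value per row of X."""
    return sigmoid(X @ W + b) @ beta


def ssd(X, W, b, beta):
    """Sum of squares derivatives (3.3), one value per input feature."""
    H = sigmoid(X @ W + b)             # N x L hidden-layer outputs
    dH = H * (1.0 - H)                 # derivative of the sigmoid at each pre-activation
    grads = (dH * beta) @ W.T          # N x m matrix of dv_i / dx(k)
    return np.sum(grads ** 2, axis=0)  # sum over the N samples


rng = np.random.default_rng(1)
N, m, L = 100, 4, 20
X = rng.uniform(size=(N, m))           # synthetic features in [0, 1]
W = rng.standard_normal((m, L))        # stand-ins for trained parameters
b = rng.standard_normal(L)
beta = rng.standard_normal(L)

scores = ssd(X, W, b, beta)
normalized = scores / scores.max()     # normalize by the largest SSD_k
print(normalized)
```

For models without a closed-form gradient, the same quantity can be approximated by central finite differences on the model output, at the cost of two model evaluations per feature.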
Note that for the HSIC-Lasso we do not use $\mathrm{SSD}_k$ but instead the values of the coefficients $\alpha$ in (3.4), which provide an estimate of the feature sensitivity. Recall from Sec. 3.2.2 that the larger the $\mathrm{SSD}_k$ value or the higher the value of $\alpha_k$, the more sensitive the view count is to variations in the meta-level feature. From Fig. 3.2, the meta-level features with the highest sensitivities are: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, video category, title length, and number of upper-case letters in the title, respectively. Notice that all these methods have the first day view count and number of subscribers as the most sensitive meta-level features, as expected. The FFNN and Bagged MARS, however, do not have the contrast of the video thumbnail as the third most sensitive meta-level feature, unlike the other algorithms. This is because the learning method and learning rate differ between these algorithms, which results in differences in the meta-level feature sensitivity. However, as we can see from Fig. 3.2, the view count of a video is dependent on these eight meta-level features, with the first day view count and number of subscribers being the most sensitive features.

^27 Refer to Sec. 3.5.2.
^28 Refer to Sec. 3.5.2.
^29 The normalization is with respect to the highest value among the computed $\mathrm{SSD}_k$.

As expected, Fig. 3.2 shows that if the first day view count is high then the associated view count 14 days after the video is posted will be high. Additionally, if there is a large number of subscribers to the channel that posted the video, then the associated view count after 14 days is also expected to be large. As expected, the properties of the title and keywords also contribute to the view count of the video, however with less sensitivity than the thumbnail of the video. Therefore, to increase
the view count of a video it is vital to increase the number of subscribers, and focus on the quality of the Thumbnail used. A surprising result is that the differences in the sensitivity of the view count to these meta-level features are negligible across the three popularity classes of videos (i.e. highly popular, popular, and unpopular as defined in Table 3.8). Therefore, regardless of the expected popularity of a video, a channel owner should focus on maximizing the number of subscribers and the quality of the thumbnail to increase the associated view count of a video.

[Figure 3.2: bar chart of Feature Sensitivity (0 to 1) versus Meta-Level Feature $x(k)$, $x(1)$ to $x(8)$, for the ELM, Bagged MARS, CIRF, HSIC-Lasso, and FFNN.]
Figure 3.2: Sensitivity of the meta-level features computed using the sum of squares derivatives $\mathrm{SSD}_k$ (3.3) for the ELM, Bagged MARS, CIRF, and FFNN, and the associated coefficient of the centered Gram matrix for the HSIC-Lasso, using the dataset $D$ defined in Sec. 3.2.3. The meta-level features $k = 1$ to $k = 8$ are associated with: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, video category, title length, and number of upper-case letters in the title, respectively. Similar results are obtained for highly popular, popular, and unpopular videos as defined in Table 3.8.

Predicting the view count of YouTube videos

In this section we illustrate how machine learning methods can be used to predict the view count of a YouTube video.
The machine learning methods used for prediction include: the Extreme Learning Machine (3.1), Feed-Forward Neural Network [85], Lasso, Relaxed Lasso [86], Conditional Inference Random Forest [84], Boosted Generalized Additive Model [87, 88], Bagged MARS using gCV Pruning [83], and Generalized Linear Model with Stepwise Feature Selection using the Akaike information criterion. For each method, the predictive performance and the five highest sensitivity meta-level features are provided.

To perform the analysis we train each model using an identical 10-fold cross validation method with the dataset $D = \{(x_i, v_i)\}_{i=1}^{N}$ with all the meta-level features included. The predictive performance of the machine learning methods is evaluated using the root-mean-square error (RMSE) and the $R^2$ (i.e. the coefficient of determination). Note that for both training and evaluation the view count is pre-processed to be on the log scale (i.e. if the view count is $10^6$, the associated label is $v_i = 6$).

The predictive performance and the five highest sensitivity meta-level features of the machine learning methods are provided in Table 3.1. In Table 3.1 the meta-level feature numbers are identical to those defined in Fig. 3.2. As seen from Table 3.1, the ELM has the lowest RMSE of 0.44, which is comparable to the RMSE of the Conditional Inference Random Forest and Feed-Forward Neural Network, which have 0.47 and 0.48 respectively. The $R^2$ values^30 of the ELM, Feed-Forward Neural Network and Conditional Inference Random Forest are also comparable, at 0.77, 0.79, and 0.80 respectively. Therefore any of these methods could be used to estimate the view count of a YouTube video. A key question is which of the meta-level features $x(k)$ are most sensitive across these machine learning methods.
As seen from the results in Table 3.1, the top two most important features are the first day view count and the number of subscribers, and the majority of methods suggest that the number of Google hits is also an important meta-level feature. Interestingly, the Conditional Inference Random Forest, Boosted Generalized Additive Model, and the Bagged MARS using gCV Pruning do not consider the number of Google hits in the top five most sensitive features and instead use the video category. This is consistent with the result that videos in the “Music” category are the most viewed on YouTube, followed by “Entertainment” and “People and Blogs”. Only the Bagged MARS using gCV Pruning considers the meta-level features of title length and number of upper-case letters in the title to be in the five most sensitive features. This result suggests that the number of Google hits associated with the title significantly contributes to the video’s popularity; however, the view count is not very sensitive to the specific length and number of upper-case letters in the title. Therefore, when performing meta-level feature optimization for a video, a user should focus on the meta-level features of: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, and video category.

^30 The $R^2$ is a popular measure of the goodness of fit. It is given by the ratio of the variation (measured using sum of squares) explained using a model to the total variation in the data. The important property of $R^2$ is that it is bounded between 0 and 1. A high value of $R^2$ implies that the variation in the data can be explained using the model in question.

Table 3.1: Performance and feature sensitivity
Method                         RMSE   R^2    Features x(k)
Extreme Learning Machine       0.44   0.77   1 2 3 4 5
Feed-Forward Neural Network    0.48   0.79   2 1 5 3 6
Lasso                          0.53   0.66   1 2 3 4 5
Relaxed Lasso                  1.14   0.64   1 2 3 4 5
CI Random Forest               0.47   0.80   1 2 6 4 5
Boosted GAM                    0.50   0.77   1 2 6 4 5
Bagged MARS                    0.50   0.77   1 2 6 7 8
GLM with Feature Selection     0.53   0.67   1 2 3 4 5

To estimate the view count of an unpublished video (a video that is about to be posted for the first time) we cannot utilize the most sensitive meta-level feature of the machine learning algorithms, which is the first day view count. Is it still possible to estimate the view count with the remaining meta-level features? To answer this question we compare the performance of the ELM using all 29 meta-level features with that of the ELM using the 28 meta-level features that remain when the first day view count is removed. Fig. 3.3a shows the predicted view count of the ELM trained using 29 meta-level features, and Fig. 3.3b shows the predicted view count using the 28 meta-level features. As expected, Fig. 3.3 illustrates that the predictive accuracy of the ELM decreases if the view count on the first day is removed. Though there is a drop in the predictive accuracy of the ELM trained using the 28 meta-level features, it is still accurate enough to aid in the selection of the meta-level features to increase the view count of a video. Note that similar performance results are obtained for the Feed-Forward Neural Network and Conditional Inference Random Forest when performing the prediction with the first day view count removed. Therefore, these prediction methods can be used to optimize the meta-level features of unpublished videos, where the optimization can focus on the meta-level features of: number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, and video category.

3.2.4 Sensitivity to meta-level optimization

Sec. 3.2.3 described how meta-level features (e.g. number of subscribers) can be used to estimate the popularity of a video.
In this section, we analyze how changing meta-level features after a video is posted impacts the user engagement of the video. Meta-level data plays a significant role in the discovery of content, through YouTube search, and in video recommendation, through the YouTube related videos. Hence, “optimizing” the meta-level data to enhance the discoverability and user engagement of videos is of significant importance to content providers. Therefore, in this section, we study how optimizing the title, thumbnail or keywords affects the view count of YouTube videos.

[Figure 3.3: two panels, (a) and (b).]
Figure 3.3: Predicted view count using an ELM, with the actual view count shown as black dots and the predicted view count as gray dots. Fig. 3.3a illustrates the results for a trained ELM (3.1) using all 29 meta-level features defined in Sec. 3.2.3. Fig. 3.3b illustrates the results for a trained ELM (3.1) using the 28 meta-level features (first day view count removed from the 29 meta-level features defined in Sec. 3.2.3).

To perform the analysis, we utilize the dataset (see Table 3.9 in Sec. 3.5.1), and remove any time-sensitive videos. Time-sensitive videos are those videos that are relevant for a short period of time; the popularity of such videos cannot be improved by optimization. We removed the following two time-sensitive categories of videos: “politics” and “movies and trailers”. In addition, we removed videos (from other categories) which contained the following keywords in their video metadata: “holiday”, “movie”, or “trailers”. For example, holiday videos are not watched frequently during off-holiday times.

Let $\hat{\tau}_i$ be the time at which the meta-level optimization was performed on video $i$ and let $s_i$ denote the corresponding sensitivity.
We characterize the sensitivity to meta-level optimization as follows:\n\ns_i = [ ( \u2211_{t=\u03c4\u02c6_i}^{\u03c4\u02c6_i+6} v_i(t) ) \/ 7 ] \/ [ ( \u2211_{t=\u03c4\u02c6_i\u22126}^{\u03c4\u02c6_i} v_i(t) ) \/ 7 ]   (3.5)\n\nThe numerator of (3.5) is the mean value of the view count over the 7 days after optimization. Similarly, the denominator of (3.5) is the mean value of the view count over the 7 days before optimization. The results are provided in Table 3.2 for optimization of the title, thumbnail, and keywords. As shown in Table 3.2, at least half of the optimizations resulted in an increase in the popularity of the video.\n\nTable 3.2: Sensitivity to Meta-Level Optimization. The table shows that in at least 50% of the videos, meta-level optimization resulted in an increase in the popularity of the video.\nOptimization | Fraction of videos with increased popularity\nTitle change | 0.52\nThumbnail change | 0.533\nKeyword change | 0.50\nNo change (footnote 31) | 0.35\nFootnote 31: \u201cNo change\u201d was obtained by randomly selecting 104 videos which performed no optimization\n\nIn addition, compared to videos with no optimization, the meta-level optimization improves the probability of increased popularity by 45%. This is consistent with the YouTube and BBTV recommendation to optimize meta-level features to increase user engagement. However, some classes of videos benefit from optimizing meta-data much more than others. The effect may arise because small user channels, which have a limited number of videos and subscribers, gain more by optimizing the meta-level data of a video than hugely popular channels such as Sony or CNN. Highly popular channels (e.g. Sony or CNN) upload videos frequently (even multiple times daily), so their video content becomes irrelevant quickly. The question of which class of users gains by optimizing the meta-level features of a video is part of our ongoing research. Table 3.3 summarizes the impact of various meta-level changes on the three major sources of YouTube traffic, i.e. 
YouTube search (footnote 32), YouTube promoted (footnote 33) and traffic from related videos (footnote 34). For those videos where meta-level optimization increased the popularity (the ratio of the mean value of the views after and before optimization is higher than one), we computed the sensitivity for various traffic sources as in (3.5). Table 3.3 summarizes the median statistics of the ratio of the traffic sources after and before optimization. The title optimization resulted in significant improvement (approximately 25%) from the YouTube search. Similarly, thumbnail optimization improved traffic from the related videos and keyword optimization resulted in increased traffic from related and promoted videos.\n\nFootnote 31 (continued): and evaluating s_i 3 months from the date of posting the video.\nFootnote 32: Video views from YouTube search results.\nFootnote 33: Video views from an unpaid YouTube promotion.\nFootnote 34: Video views from a related video listing on another video watch page.\n\nTable 3.3: Sensitivity of various traffic sources to meta-level optimization, for videos with increased popularity. The title optimization resulted in significant improvement (approximately 25%) from the YouTube search. Similarly, thumbnail optimization improved traffic from the related videos and keyword optimization resulted in increased traffic from related and promoted videos.\nOptimization | Related | Promoted | Search\nTitle change | 1.13 | NA (a) | 1.24\nThumbnail change | 1.20 | NA (a) | 1.125\nKeyword change | 1.10 | 1.16 | 1\n(a) Not enough data available: a binomial test to check for the true hypothesis with a 95% confidence interval requires that the sample size, n, should be at least (1.96\/0.04)^2 p(1\u2212p). With p = 0.5, n > 600.\n\nSummary: This section studied the sensitivity of view count with respect to meta-level optimization. The main finding is that meta-level optimization increased the popularity of videos in the majority of cases. In addition, we found that optimizing the title improved traffic from YouTube search. 
Similarly, thumbnail optimization improved traffic from the related videos and keyword optimization resulted in increased traffic from related and promoted videos.\n\n3.3 Social interaction of the channel with YouTube users\n\nIn this section, we use time series analysis methods to determine how the social interaction of a YouTube channel with its viewers affects the view count dynamics. This section is organized as follows. Sec. 3.3.1 characterizes the causal relationship between the subscribers and view count of a channel using the Granger causality test. In Sec. 3.3.2, we investigate how the popularity of the channel is affected by the scheduling dynamics of the channel. When channels deviate from a regular upload schedule, the view count and the comment count of the channel increase. In Sec. 3.3.3, we address the problem of separating the view count dynamics due to virality (view count resulting from subscribers), migration (views from non-subscribers) and exogenous events using a generalized Gompertz model. Finally, in Sec. 3.3.4, we study the effect of video playlists on the view count. The main conclusion outlined in Sec. 3.3.4 is that the dynamics of the view counts in a playlist are highly correlated and that the effects of \u201cmigration\u201d cause the view count of videos to decrease even with an increase in the subscriber count.\n\n3.3.1 Causality between subscribers and view count in YouTube\n\nIn this section the goal is to detect the causal relationship between subscriber and view counts and how it can be used to estimate the next day subscriber count of a channel. The results are of interest for measuring the popularity of a YouTube channel. Fig. 3.4 displays the subscriber and view count dynamics of a popular movie trailer channel in YouTube. It is clear from Fig. 3.4 that the subscribers \u201cspike\u201d with a corresponding \u201cspike\u201d in the view count. 
In this section we model this causal relationship of the subscribers and view count using the Granger causality test from the econometric literature [89]. The main idea of Granger causality is that if the value(s) of a lagged time-series can be used to predict another time-series, then the lagged time-series is said to \u201cGranger cause\u201d the predicted time-series. To formalize the Granger causality model, let s^j(t) denote the number of subscribers to a channel j on day t, and v_i^j(t) the corresponding view count for a video i on channel j on day t. The total number of videos in a channel on day t is denoted by I(t). Define\n\nv\u02c6^j(t) = \u2211_{i=1}^{I(t)} v_i^j(t),   (3.6)\n\nas the total view count of channel j at time t. The Granger causality test involves testing if the coefficients b_i are non-zero in the following equation, which models the relationship between subscribers and view counts:\n\ns^j(t) = \u2211_{k=1}^{n_s} a_k^j s^j(t\u2212k) + \u2211_{k=1}^{n_v} b_k^j v\u02c6^j(t\u2212k) + \u03b5^j(t),   (3.7)\n\nwhere \u03b5^j(t) represents normal white noise for channel j at time t. The parameters {a_i^j}_{i=1,...,n_s} and {b_i^j}_{i=1,...,n_v} are the coefficients of the AR model in (3.7) for channel j, with n_s and n_v denoting the lags for the subscriber and view count time series, respectively. If the time-series D_j = {s^j(t), v\u02c6^j(t)}_{t\u2208{1,...,T}} of a channel j fits the model (3.7), then we can test for a causal relationship between subscribers and view count. In equation (3.7), it is assumed that |a_i| < 1 and |b_i| < 1 for stationarity. The causal relationship can be formulated as a hypothesis testing problem as follows:\n\nH_0 : b_1 = \u00b7\u00b7\u00b7 = b_{n_v} = 0  vs.  H_1 : at least one b_i \u2260 0.   (3.8)\n\nThe rejection of the null hypothesis, H_0, implies that there is a causal relationship between subscriber and view counts. First, we use the Box-Ljung test [90] to evaluate the quality of the model (3.7) for the given dataset D_j. 
If satisfied, then the Granger causality hypothesis (3.8) is evaluated using the Wald test [91]. If both hypothesis tests pass, then we can conclude that the time series D_j satisfies Granger causality, that is, the previous day subscriber and view counts have a causal relationship with the current subscriber count. A key question prior to performing the Granger causality test is what percentage of channels in the YouTube dataset in Sec. 3.5.1 satisfy the AR model in (3.7). To perform this analysis we apply the Box-Ljung test at a confidence level of 0.95 (significance level 0.05). First, we need to select n_s and n_v, the number of lags for the subscriber and view count time series. For n_s = n_v = 1, we found that only 20% of the channels satisfy the model (3.7). When n_s and n_v are increased to 2, the fraction of channels satisfying the model increases to 63%. For n_s = n_v = 3, we found that 91% of the channels satisfy the model (3.7) at a confidence level of 0.95 (significance level 0.05). Hence, in the analysis below we select n_s = n_v = 3. It is interesting to note that the mean value of the coefficients b_i decreases as i increases, indicating that older view counts have less influence on the subscriber count. Similar results also hold for the coefficients a_i. Hence, as expected, the previous day subscriber count and the previous day view count most influence the current subscriber count. The next key question is whether there exists a causal relationship between the subscriber dynamics and the view count dynamics. This is modeled using the hypothesis in (3.8). To test (3.8) we use the Wald test at a confidence level of 0.95 (significance level 0.05) and found that approximately 55% of the channels satisfy the hypothesis. That is, for approximately 55% of the channels that satisfy the AR model (3.7), the view count \u201cGranger causes\u201d the current subscriber count. Interestingly, if different channel categories are accounted for, then the percentage of channels that satisfy Granger causality varies widely, as illustrated in Table 3.4. 
For example, 80% of the Entertainment channels satisfy Granger causality while only 40% of the Food channels satisfy Granger causality. These results illustrate the importance for channel owners of not only maximizing their subscriber count, but also uploading new videos or increasing the views of old videos to increase their channel's popularity (i.e. via increasing their subscriber count). Additionally, since our analysis in Sec. 3.2 illustrates that the view count of a posted video is sensitive to the number of subscribers of the channel, increasing the number of subscribers will also increase the view count of videos that are uploaded by the channel owners.\n\nTable 3.4: Fraction of channels satisfying the hypothesis: view count \u201cGranger causes\u201d subscriber count, split according to category.\nCategory (a) | Fraction\nGaming | 0.60\nEntertainment | 0.80\nFood | 0.40\nSports | 0.67\n(a) YouTube assigns a category to videos, rather than channels. The category of the channel was obtained as the majority category of all the videos uploaded by the channel.\n\n[Figure 3.4 panels: Viewcount versus Time [Days]; Subscribers versus Time [Days].]\nFigure 3.4: Viewcount and subscribers for the popular movie trailer channel VISOTrailers. The Granger causality test for view counts \u201cGranger causes\u201d subscriber count is true with a p-value of 5 \u00d7 10^\u22128.\n\n3.3.2 Scheduling dynamics in YouTube\n\nIn this section, we investigate the scheduling dynamics of YouTube channels. We find the interesting property that for popular gaming YouTube channels with a dominant upload schedule, deviating from the schedule increases the views and the comment counts of the channel. Creator Academy (footnote 35), in its best practice section, recommends uploading videos on a regular schedule to get repeat views. 
The reason for a regular upload schedule is to increase the user engagement and to rank higher in the YouTube recommendation list. However, we show in this section that going \u201coff the schedule\u201d can be beneficial for a gaming YouTube channel with a regular upload schedule, in terms of the number of views and the number of comments.\n\nFootnote 35: YouTube website for helping with channels.\n\nFrom the dataset, we selected (\u2018filtered out\u2019) the video channels with a dominant upload schedule, as follows: the dominant upload schedule was identified by taking the periodogram of the upload times of the channel and then comparing the highest value to the next highest value. If the ratio of the highest value to the next highest value is greater than 2, we say that the channel has a dominant upload schedule. From the dataset containing 25 thousand channels, only 6500 channels have a dominant upload schedule. Some channels, particularly those that contain high amounts of copied videos such as trailers and movie\/TV snippets, upload videos on a daily basis. These have been removed from the above analysis. The expectation is that by doing so we concentrate on those channels that contain only user generated content. We found that channels with gaming content account for 75% of the 6500 channels with a dominant upload schedule (footnote 36) and the main tags associated with the videos were: \u201cgame\u201d, \u201cgameplay\u201d and \u201cvideogame\u201d (footnote 37). We computed the average views when the channel goes off the schedule and found that, on average, when the channel goes off schedule the channel gains views 97% of the time and gains comments 68% of the time. 
This suggests that channels with \u201cgameplay\u201d content have a periodic upload schedule and benefit from going off the schedule.\n\nFootnote 36: This could also be due to the fact that gaming videos account for 70% of the videos in the dataset.\nFootnote 37: We used a topic model to obtain the main tags.\n\n3.3.3 Modeling the view count dynamics of videos with exogenous events\n\nSeveral time-series analysis methods have been employed in the literature to model the view count dynamics of YouTube videos. These include ARMA time series models [24], multivariate linear regression models [25], hidden Markov models [92], normal distribution fitting [93], and parametric model fitting [26, 27]. Though all these models provide an estimate of the view count dynamics of videos, we are interested in segmenting the view count dynamics of a video into those resulting from subscribers, non-subscribers and exogenous events. Exogenous events are due to video promotion on other social networking platforms such as Facebook, or the video being referenced by a popular news organization or celebrity on Twitter. This is motivated by two reasons. First, removing view count dynamics due to exogenous events provides an accurate estimate of the sensitivity of meta-level features in Sec. 3.2. Second, extracting the view count resulting from exogenous events gives an estimate of the efficiency of video promotion. The view count dynamics of popular videos in YouTube typically show an initial viral behaviour, due to subscribers watching the content, and then a linear growth resulting from non-subscribers. The linear growth is due to new users migrating from other channels or due to interested users discovering the content either through search or recommendations (we call this phenomenon migration, similar to [26]). Hence, without exogenous events, the view count dynamics of a video due to subscribers and non-subscribers can be estimated using piecewise linear and non-linear segments. 
In [26], it is shown that a Gompertz time series model can be used to model the view count dynamics from subscribers and non-subscribers if no exogenous events are present. In this chapter, we generalize the model in [26] to account for views from exogenous events. It should be noted that classical change-point detection methods [94] cannot be used here as the underlying distribution generating the view count is unknown. To account for the view count dynamics introduced from exogenous events we use the generalized Gompertz model given by:\n\nv\u00af_i(t) = \u2211_{k=0}^{K_max} w_i^k(t) u(t \u2212 t_k),\nw_i^k(t) = M_k (1 \u2212 e^{\u2212\u03b7_k (e^{b_k (t \u2212 t_k)} \u2212 1)}) + c_k (t \u2212 t_k),   (3.9)\n\nwhere v\u00af_i(t) is the total view count for video i at time t, u(\u00b7) is the unit step function, t_0 is the time the video was uploaded, t_k with k \u2208 {1, ..., K_max} are the times associated with the K_max exogenous events, and w_i^k(t) are Gompertz models which account for the view count dynamics from uploading the video and from the exogenous events. In total there are K_max + 1 Gompertz models, each having parameters t_k, M_k, \u03b7_k, b_k. M_k is the maximum number of requests, not including migration, for an exogenous event at t_k; \u03b7_k and b_k model the initial growth dynamics from the event at t_k; and c_k accounts for the migration of other users to the video. In (3.9) the parameters {M_k, \u03b7_k, b_k}_{k=0} are associated with the subscriber views when the video is initially posted, the parameters {t_k, M_k, \u03b7_k, b_k}_{k=1}^{K_max} are associated with views introduced from exogenous events, and the views introduced from migration are given by {c_k}_{k=0}^{K_max}. Each Gompertz model in (3.9) captures the initial viral growth when the video is initially available to users, followed by a linearly increasing growth resulting from user migration to the video. The parameters \u03b8_i = {a_k, t_k, M_k, \u03b7_k, b_k, c_k}_{k=0}^{K_max} in (3.9) can be estimated by solving 
the following mixed-integer non-linear program:\n\n\u03b8_i \u2208 arg min { \u2211_{t=0}^{T_i} (v\u00af_i(t) \u2212 v_i(t))^2 + \u03bbK },\nK = \u2211_{k=0}^{K_max} a_k,  a_k \u2208 {0, 1},  k \u2208 {0, ..., K_max},   (3.10)\n\nwith T_i the time index of the last recorded views of video i, and a_k a binary variable equal to 1 if an exogenous event is present at t_k. Note that (3.10) is a difficult optimization problem due to the presence of the binary variables a_k [95]. In the YouTube social network, an exogenous event causes a large and sudden increase in the number of views; however, as seen in Fig. 3.5, a few days after the exogenous event occurs the views only result from migration (i.e. a linear increase in total views). Assuming that each exogenous event is followed by a linear increase in views, we can estimate the total number of exogenous events K_max present in a given time-series by first using a segmented linear regression method, and then counting the number of connected linear segments with a slope less than c_max. The parameter c_max is the maximum slope for the views to be considered to result from viewer migration. Plugging K_max into (3.10) results in the optimization of a non-linear program for the unknowns {t_k, M_k, \u03b7_k, b_k, c_k}_{k=0}^{K_max}. This optimization problem can be solved using sequential quadratic programming techniques [96]. To illustrate how the Gompertz model (3.9) can be used to detect exogenous events, we apply (3.9) to the view count dynamics of a video that only contains a single exogenous event. Fig. 3.5 displays the total view count of a video where an exogenous event occurs at time t = 41 days (i.e. t_1 = 41 in (3.9)) after the video is posted (footnote 38). The initial increase in views for the video for t \u2264 7 days results from the 2910 subscribers of the channel viewing the video. 
For 7 \u2264 t \u2264 41, other users that are not subscribed to the channel migrate to view the video at an approximately constant rate of 13 views\/day. At t = 41, an exogenous event occurs causing an increase in the views per day. The difference in viewers, resulting from the exogenous event, is 7174. For t \u2265 43, the views result primarily from the migration of users at approximately 2 views\/day. Hence, using the generalized Gompertz model (3.9) we can differentiate between subscriber views, views caused by exogenous events, and views caused by migration.\n\nFootnote 38: Due to privacy reasons, we cannot detail the specific event. Some of the possible reasons for the sudden increase in the popularity of the video include the following. Another user on YouTube may have mentioned the video, which encourages viewers from that channel to view the video, resulting in a sudden increase in the number of views. Another possibility is that the channel owner or a YouTube Partner like BBTV undertook significant promotional initiatives on other social media sites such as Twitter, Facebook, etc. to promote the channel or video.\n\n[Figure 3.5 plot: Total Views versus Day; legend: Measured, Gompertz.]\nFigure 3.5: Due to an exogenous event on day 41, there is a sudden increase in the number of views. The total view count fitted by the Gompertz model v\u00af_i(t) in (3.9) is shown in black with the virality (exponential) and migration (linear) illustrated by the dotted red lines.\n\n3.3.4 Video playthrough dynamics\n\nOne of the most popular sequences of YouTube videos is the video game \u201cplaythrough\u201d. A video game playthrough is a sequence of videos for which each video has a relaxed and casual focus on the game that is being played and typically contains commentary from the user presenting the playthrough. 
Unlike YouTube channels such as CNN, BBC, and CBC, in which each new video can be considered independent from the others, in a video playthrough the future view counts of videos are influenced by the previously posted videos in the playthrough. To illustrate this effect we consider a video playthrough for the game \u201cBioShock Infinite\u201d, a popular video game released in 2013. The channel, popular for hosting such video playthroughs, contains close to 4500 videos and 180 video playthroughs. The channel is highly popular and has garnered a combined view count close to 100 million views with 150 thousand subscribers over a period of 3 years. Fig. 3.6 illustrates that the early view count dynamics are highly correlated with the view count dynamics of future videos. Both the short term view count and long term migration of future videos in the playthrough decrease after the initial video in the playthrough is posted. This results from one of two reasons: either the viewers purchase the game, or the viewers leave as the subsequent playthroughs become repetitive as a result of game quality or video commentary quality. A unique effect with video playthroughs is that though the number of subscribers to the channel hosting the videos in Fig. 3.6 increases over the 600 day period, the linear migration is still maintained after the initial 50 days after the playthrough is published. Additionally, the slope of the migration is related to the early total view count as illustrated in Fig. 3.6b.\n\n[Figure 3.6a plot: View Count versus Time [Days]; legend: Vid Idx (1, 5, 10, 15, 20, 25), Exp, Pred.]\n(a) Actual and predicted view count of playthrough. We plot the 1st, 5th, 10th, 15th, 20th and 25th video from the playlist containing 25 videos. In the legend, Exp and Pred correspond to the actual and the predicted value using (3.9), respectively. 
The figure shows that the view count decreases for subsequent videos in the playlist.\n\n[Figure 3.6b plot: View Count versus Video Part Number; legend: Migration Rate, Virality Rate.]\n(b) The virality rate specifies the early views due to subscribers, and the migration rate (in units of views\/1000 days) specifies the subsequent linear growth due to non-subscribers.\n\nFigure 3.6: Actual and predicted view count of a playthrough containing 25 YouTube videos for the game \u201cBioShock Infinite\u201d. The predictions are computed by fitting a modified Gompertz model (3.9) to the measured view count for each video in the playthrough.\n\n3.4 Closing remarks\n\nIn this chapter, we conducted a data-driven study of YouTube based on a large dataset (see Sec. 3.5.1 for details). First, by using several machine learning methods, we investigated the sensitivity of the view counts of videos to the videos' meta-level features. It was found that the most important meta-level features are, in order: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, video category, title length, and number of upper-case letters in the title. Additionally, optimizing the meta-data after the video is posted improves the popularity of the video. The social dynamics (the interaction of the channel) also affect the popularity of the channel. Using the Granger causality test, we showed that the view count has a causal effect on the subscriber count of the channel. A generalized Gompertz model was also presented which allows the classification of a video's view count dynamics as resulting from subscribers, migration, and exogenous events. This is an important model as it allows the views to be categorized as resulting from the video or from exogenous events which bring viewers to the video. The final result was a study of the upload scheduling dynamics of gaming channels in YouTube. 
It was found that going \u201coff schedule\u201d can actually increase the popularity of a channel.\n\n3.5 Supplementary material\n\n3.5.1 Description of YouTube dataset\n\nThis chapter uses the dataset provided by BBTV. The dataset contains daily samples of metadata of YouTube videos on the BBTV platform from April 2007 to May 2015, and has a size of around 200 gigabytes. Table 3.5 contains the details of the metadata collected for each YouTube channel and video. The dataset contains around 6 million videos spread over 25 thousand channels. Table 3.6 shows the summary statistics of the videos present in the dataset. Table 3.7 shows the summary of the various categories of the videos present in the dataset. The dataset contains a large percentage of gaming videos. Fig. 3.7 shows the fraction of videos as a function of the age of the videos. There is a large fraction of videos uploaded within a year. Also, the dataset captures the exponential growth in the number of videos uploaded to YouTube.\n\nTable 3.5: Metadata of YouTube channel and video\nChannel | Video\nId | Id\nName | Name\nStart Date | Published Date\nTitle | Title\nVideo Count | Duration\nView Count | View Count\nSubscriber Count | Like Count\nComment Count | Dislike Count\nDescription | Comment Count\nTopic Id | Tags\nThumbnail | Thumbnail\nSampling Time | Sampling Time\nBanner | Category Id\nLanguage | Average View Duration\n- | Average View Time\n- | Click Through Rate\n\nTable 3.6: Dataset summary\nVideos | 6 million\nChannels | 26 thousand\nAverage number of videos (per channel) | 250\nAverage age of videos | 275 days\nAverage number of views (per video) | 10 thousand\n\nTable 3.7: YouTube dataset categories (out of 6 million videos)\nCategory | Fraction\nGaming | 0.69\nEntertainment | 0.07\nFood | 0.07\nMusic | 0.035\nSports | 0.017\n\n
Similar to [26], we define three categories of videos based on their popularity: highly popular, popular, and unpopular. Table 3.8 gives a summary of the fraction of videos in the dataset belonging to each category. As can be seen from Table 3.8, the majority of the videos in the dataset belong to the popular category. A unique feature of the dataset is that it contains information about the \u201cmeta-level optimization\u201d for videos. The meta-level optimization is a change in the title, tags or thumbnail of an existing video in order to increase its popularity. BBTV markets a product that intelligently automates the meta-level optimization. Table 3.9 gives a summary of the statistics of the various meta-level optimizations present in the dataset.\n\n[Figure 3.7 plot: Density versus Age of videos.]\nFigure 3.7: The fraction of videos in the dataset as a function of the age of the videos. There is a significant percentage of newer videos (videos with less age) compared to older videos. Hence, the dataset captures the exponential growth of the number of videos uploaded to YouTube.\n\nTable 3.8: Popularity distribution of videos in the dataset\nCriteria | Fraction\nHighly Popular (Total Views > 10^4) | 0.12\nPopular (150 < Total Views < 10^4) | 0.67\nUnpopular (Total Views < 150) | 0.21\n\nTable 3.9: Optimization summary statistics\nOptimization | # Videos\nTitle change | 21 thousand\nThumbnail change | 13 thousand\nKeyword change | 21 thousand\n\n3.5.2 Background: Statistical learning algorithms\n\nMultivariate Adaptive Regression Splines (MARS)\n\nMARS is an adaptive method for regression. It uses two types of linear basis functions of the form:\n\n(x \u2212 t)_+ = x \u2212 t if x > t, and 0 otherwise;   (t \u2212 x)_+ = t \u2212 x if t > x, and 0 otherwise.   (3.11)\n\nThe functions are piecewise linear, with a knot at the value t. The two functions are called a reflected pair. 
The regression uses a collection of functions C which contains the above functions at each of the observed values x_{i,j}:\n\nC = {(x_j \u2212 t)_+, (t \u2212 x_j)_+};  j = 1, \u00b7\u00b7\u00b7, N,  t \u2208 {x_{1,j}, \u00b7\u00b7\u00b7, x_{m,j}}.   (3.12)\n\nThe MARS model is given by\n\nf(x) = \u03b2_0 + \u2211_{m=1}^{M} \u03b2_m h_m(x),   (3.13)\n\nwhere h_m(x) is a function from the library C or a product of two or more functions from C. Given h_m, the coefficients \u03b2_m can be computed by minimizing the least squares criterion. The construction of h_m is as follows. Start with h_0(x) = 1. Given a model M (with M functions), we add to the model the term of the form\n\n\u03b2_{M+1} h_l(x) (x_j \u2212 t)_+ + \u03b2_{M+2} h_l(x) (t \u2212 x_j)_+,  h_l \u2208 M,\n\nthat produces the largest decrease in the training error. The coefficients \u03b2_{M+1} and \u03b2_{M+2} can be obtained by least squares. At the end (using an appropriate error criterion), we have a large model which typically overfits the data, and hence we apply a backward deletion. For computational reasons, we use the generalized cross-validation criterion given by:\n\ngCV(\u03bb) = \u2211_{i=1}^{N} (v_i \u2212 f_\u03bb(x_i))^2 \/ (1 \u2212 M(\u03bb)\/N)^2.   (3.14)\n\nThe value M(\u03bb) is the effective number of parameters in the model. If there are r independent basis functions and there are K knots then M(\u03bb) = r + 3K.\n\nConditional Inference and Random Forest\n\nRandom forests, an extension of the tree learning algorithm, build a large collection of de-correlated trees and then average them. Algorithm 2 provides a brief description of how to construct a random forest. 
For more details, refer to [97].\n\nAlgorithm 2 Random forest\n1: for b = 1 to B do\n2:   Draw a bootstrap sample of size N from the training data.\n3:   Grow a tree T_b to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached.\n  \u2022 Select p features at random from the m features.\n  \u2022 Pick the best variable among the p.\n  \u2022 Split the node into two daughter nodes.\n4: end for\n\nThe output of Algorithm 2 is a set of trees {T_b}_{b=1}^{B}. The output of a random forest is given by\n\nf(x) = (1\/B) \u2211_{b=1}^{B} T_b(x).   (3.15)\n\nThe basic random forest construction tends to select variables that have many possible splits or many missing values. The conditional inference random forest (CIRF) [84] uses a significance test procedure in order to select variables. In this work, we used the ctree (footnote 39) R package to implement the CIRF.\n\nRegression\n\nA regression model assumes that the output is a linear function of the features X_1, X_2, \u00b7\u00b7\u00b7, X_m, i.e.\n\nY = \u03b1 + \u2211_{i=1}^{m} \u03b2_i X_i.   (3.16)\n\nA Generalized Linear Model (GLM) is a generalization of the regression model in (3.16) given by\n\ng(y) = \u03b1 + \u2211_{i=1}^{m} \u03b2_i X_i,   (3.17)\n\nwhere g(\u00b7) is called the link function. Common link functions include the logistic and exponential. A Generalized Additive Model (GAM) is a generalization of the GLM and is given by\n\ng(y) = \u03b1 + \u2211_{i=1}^{m} f_i(X_i),   (3.18)\n\nwhere the f_i(\u00b7) are unknown smooth functions. Estimating the GLM and GAM can be done by the backfitting algorithm (Algorithm 9.1 in [97]). In the boosted GAM [87, 88], we estimate the model iteratively by adding a function most similar to the gradient of the likelihood with respect to the link function (refer to gradient boosting in [97]). The optimal number of functions for the boosted GAM is obtained through cross validation.\n\nFeature Selection: There are a number of approaches to variable selection in the regression models discussed above. 
A forward step-wise feature selection algorithm is a greedy strategy, wherein we start with a null model and sequentially add features that improve the prediction accuracy. In contrast, in backward step-wise feature selection, we start with the full model and delete features that have the least effect on the fit. Refer to [97] for more details on feature selection for regression models. In the Generalized Linear Model with Stepwise Feature Selection using the Akaike information criterion (AIC), the final model selection using either forward or backward feature selection is done using the AIC.\n\nFootnote 39: Please refer to https:\/\/cran.r-project.org\/web\/packages\/partykit\/vignettes\/ctree.pdf on the usage of ctree. Given the dataset D, as a dataframe in R, the following command trains a CIRF: ctree(viewcount ~ ., data = D)\n\nAnother popular method for feature selection, especially for high dimensional data, is the Least Absolute Shrinkage and Selection Operator (LASSO). LASSO was originally introduced by Robert Tibshirani in 1996 [97]. LASSO is the solution to the following convex optimization problem:\n\n\u03b2\u02c6 = argmin (1\/N) \u2211_{i=1}^{N} (Y_i \u2212 X_i\u2032 \u03b2)^2 + \u03bb \u2016\u03b2\u2016_1.   (3.19)\n\nThe set of predictor variables selected by the Lasso estimator is denoted by\n\nM_\u03bb = {1 \u2264 k \u2264 m | \u03b2_k \u2260 0}.   (3.20)\n\nThe \u2113_1-penalty for the LASSO (3.19) has two effects: model selection and shrinkage estimation. On the one hand, a certain set of coefficients is set to zero and hence excluded from the selected model. 
On the other hand, for all variables in the selected model M_\u03bb, coefficients are shrunk towards zero compared to the least-squares solution (\u03bb = 0).

In the Relaxed LASSO, the model selection and the shrinkage estimation are controlled by two separate parameters \u03bb and \u03c6, and the estimator is given by

\u03b2\u0302 = argmin (1\/N) \u2211_{i=1}^{N} (Y_i - X_i' {\u03b2 \u00b7 1_{M_\u03bb}})^2 + \u03c6 \u03bb \u2016\u03b2\u2016_1, (3.21)

where 1_{M_\u03bb} is the indicator function on the set of variables M_\u03bb \u2282 {1, 2, ..., m}.

Chapter 4
Interactive Advertisement Scheduling in Personalized Live Social Media

In this chapter, we consider the problem of interactive advertisement (ad) scheduling in personalized live social media. The popularity of live video streaming has grown sharply due to improved bandwidth for streaming and the ease of sharing User-Generated Content (UGC) on internet platforms. In addition, with the advent of high-quality smartphone cameras and the widespread deployment of 4G LTE based data services, personal online video streaming is growing in popularity and includes mobile applications such as Periscope and Meerkat. One of the primary motivations for users to generate content is that platforms like YouTube and Twitch allow users to generate revenue through advertising and royalties. A strong motivation to consider the problem of interactive ad scheduling in live online videos stems from the fact that ads are currently scheduled using passive techniques, namely periodic [11] and manual methods, and yet advertisement revenues are significant for social media companies (footnote 40).

In this chapter, we model the interest in the live media using a Markov chain. Viewers are more likely to engage with an ad if they are interested in the content of the video in which the ad is inserted. Hence, advertisers aim to schedule ads when the interest is high, so as to maximize advertisement revenue through viewer engagement (clicks on the ads).
The interest in the content is not observed directly; however, noisy observations are obtained from the comments and likes of the viewers. Hence, the problem of computing the optimal policy for scheduling ads on a live channel can be formulated as a multiple stopping time problem, where a decision maker wishes to stop at most L times to maximize the cumulative reward.

Footnote 40. The revenue of Twitch, which deals with live video gaming, playthroughs of video games, and e-sport competitions, is around 3.8 billion for the year 2015, out of which 77% was generated from advertisements.

Main results and Organization

This chapter is organized as follows: Section 4.1 formulates the multiple stopping time problem as a partially observed Markov decision process (POMDP); the POMDP formulation is natural in this context since we are dealing with a partially observed multi-state Markov chain with multiple actions (L stops, continue). For POMDPs in general, and for the multiple stopping time problem in particular, it is intractable (PSPACE-complete) to numerically compute the optimal policy. Hence, we provide structural results on the optimal multiple stopping policy. Structural results impose sufficient conditions on the model to determine the structure of the optimal policy without brute-force computations - the main tools used are submodularity and stochastic dominance on the belief space of posterior distributions.

This chapter has the following main results:

1. Optimality of threshold policies: Section 4.2.3 provides the main structural result. Specifically, Theorem 4.2.1 asserts that the optimal policy is characterized by up to L threshold curves, \u0393_l, on the unit simplex of Bayesian posteriors (belief states). To prove this result we use the monotone likelihood ratio (MLR) stochastic order since it is preserved under conditional expectations.
However, determining the optimal policy is non-trivial since the policy can only be characterized on a partially ordered set (more generally a lattice) within the unit simplex. We modify the MLR stochastic order to operate on line segments within the unit simplex of posterior distributions. Such line segments form chains (totally ordered subsets of a partially ordered set) and permit us to prove that the optimal decision policy has a threshold structure. In addition, similar to [39], we show that the stopping sets (sets of belief states at which the decision maker stops) have a nested structure, i.e. S^{l-1} \u2282 S^l.

2. Optimal Linear Thresholds and their Estimation: For the threshold curves \u0393_l, l = 1, ..., L, Theorem 4.3.1 and Theorem 4.3.2 give necessary and sufficient conditions for the optimal linear hyperplane approximation (linear threshold policies) that preserves the structure of the optimal multiple stopping policy. Section 4.3 presents a simulation-based stochastic gradient algorithm (Algorithm 3) to compute the best linear threshold policies. The advantage of the simulation-based algorithm is that it is very easy to implement and is computationally efficient.

3. Application to Interactive Advertising in live social media: Figure 4.1 shows the schematic setup of the ad scheduling problem considered in this chapter. The problem of optimal scheduling of ads has been studied in the context of advertising in television; see [33], [34] and the references therein. However, scheduling ads on live online social media is different from scheduling ads on television in two significant ways [36]: i) real-time measurement of viewer engagement (comments and likes on the content). The viewer engagement provides a noisy measurement of the underlying interest in the content.
ii) revenue is based on viewer engagement with the ads rather than a pre-negotiated contract.

In Section 4.4, we use a real dataset from Periscope, a popular personalized live streaming application owned by Twitter, to optimally schedule multiple ads (L > 1) in a sequential manner so as to maximize the advertising revenue. The numerical results show that the policy obtained through the multiple stopping framework outperforms conventional scheduling techniques.

[Figure 4.1, block diagram. Blocks: Broadcaster (Stochastic Scheduler), Live Session, Live Viewers. Signals: Live Video; Schedule Ads (Continue, Stop); Integrated Live Video (Interest ~ P, \u03c0_0); Viewer Engagement.]

Figure 4.1: Block diagram showing the stochastic scheduling problem faced by the decision maker (broadcaster) in advertisement scheduling on live media. The setup is detailed in Section 4.4. The broadcaster wishes to schedule at most L ads during the live session. To maximize advertisement revenue, the ads need to be scheduled when the interest in the content is high. The interest in the content cannot be measured directly, but noisy observations of the interest are obtained from the viewer engagement (viewer comments and likes) during the live session.

4.1 Sequential multiple stopping and stochastic dynamic programming

In this section, we formulate the optimal multiple stopping time problem as a POMDP. In Section 4.1.3, we present a solution to the POMDP using stochastic dynamic programming. This sets the stage for Section 4.2, where we analyze the structure of the optimal policy.

4.1.1 Optimal multiple stopping: POMDP formulation

Consider a discrete-time Markov chain X_t with state space S = {1, 2, ..., S}. Here, t = 0, 1, ... denotes discrete time. The decision maker receives a noisy observation Y_t of the state X_t at each time t. The decision maker wishes to stop at most L times over an infinite horizon. The positive integer L is chosen a priori.
At each time the decision maker either stops or continues, and obtains a reward that depends on the current state of the Markov chain. The objective of the decision maker is to opportunistically select the best time instants to stop so as to maximize the cumulative reward. This problem of stopping at most L times sequentially so as to maximize the cumulative reward corresponds to a multiple stopping time problem with L stops.

The multiple stopping time problem consists of the following components:

1. State Dynamics: The Markov chain has transition matrix P and initial probability vector \u03c0_0, so

P(i, j) = P(X_{t+1} = j | X_t = i), \u03c0_0(i) = P(X_0 = i). (4.1)

2. Observations: At each time instant t, the decision maker receives a noisy observation Y_t of the state X_t. Denote the conditional probability of receiving observation j \u2208 Y (Y_t = j) in state i (X_t = i) by B(i, j). Then,

B(i, j) = P(Y_t = j | X_t = i) \u2200 i \u2208 S, j \u2208 Y. (4.2)

3. Actions: At each time instant t, the decision maker chooses an action u_t \u2208 A = {1 (Stop), 2 (Continue)} to either stop or continue.

4. Reward: Choosing the stop action at time t, when there are l additional stops remaining, the decision maker accrues a reward r_l(X_t, a = 1), where X_t is the state of the Markov chain at time t. Similarly, if the decision maker chooses to continue, it accrues r_l(X_t, a = 2).

5. Scheduling Policy: The history available to the decision maker at time t is

Z_t = {\u03c0_0, Y_1, ..., Y_t}.

The scheduling policy \u00b5, at each time t, maps Z_t to the action u_t, i.e. the action chosen at time t is u_t = \u00b5(Z_t).

Objective: For l \u2208 {1, 2, ..., L}, let \u03c4_l denote the stopping time when there are l stops remaining, i.e.

\u03c4_l = inf {t : t > \u03c4_{l+1}, u_t = 1}, with \u03c4_{L+1} = 0. (4.3)

For policy \u00b5 and initial belief \u03c0_0, the cumulative reward is

J_\u00b5(\u03c0_0) = E_\u00b5{ \u2211_{t=0}^{\u03c4_L - 1} \u03c1^t r_L(X_t, 2) + \u03c1^{\u03c4_L} r_L(X_{\u03c4_L}, 1)
    + \u2211_{t=\u03c4_L + 1}^{\u03c4_{L-1} - 1} \u03c1^t r_{L-1}(X_t, 2) + ... + \u03c1^{\u03c4_1} r_1(X_{\u03c4_1}, 1) | \u03c0_0 }, (4.4)

where the expectation is over the state dynamics and the observation distribution. In (4.4), \u03c1 \u2208 [0, 1] denotes a user-defined economic discount factor (footnote 41). Choosing \u03c1 < 1 de-emphasizes the effect of decisions taken at later time instants on the cumulative reward.

Footnote 41. In the multiple stopping time problem considered here, \u03c1 = 1 is allowed. For the undiscounted problem (\u03c1 = 1), the stopping times may not be finite and the objective in (4.4) becomes unbounded. However, the multiple stopping time problem will terminate in finite time: Assume R\u0304 = max_{i,l} r_l(i, 1) > 0, i.e. the maximum stop reward is positive, and R\u0332 = min_{i,l} r_l(i, 2) < 0, i.e. the minimum reward for continue is negative. Then, it is clear that any optimal policy will stop in less than L R\u0304 \/ |R\u0332| time steps.

The decision maker aims to compute the optimal strategy \u00b5* to maximize (4.4), i.e.

\u00b5* = argmax_{\u00b5 \u2208 U} J_\u00b5(\u03c0_0). (4.5)

Remark 1. The above formulation is an instance of a special type of POMDP called the stopping time POMDP. This is seen as follows: the objective in (4.4) can be expressed as an infinite horizon criterion by augmenting a fictitious absorbing state 0 that has zero reward, i.e. r_0(0, u) = 0 for all u \u2208 A. When L stop actions are taken, the system transitions to state 0 and remains there indefinitely. Then (4.4) is equivalent to the following discounted infinite horizon criterion:

J_\u00b5(\u03c0_0) = E_\u00b5{ \u2211_{t=0}^{\u03c4_L - 1} \u03c1^t r_L(X_t, 2) + \u03c1^{\u03c4_L} r_L(X_{\u03c4_L}, 1)
    + ... + \u03c1^{\u03c4_1} r_1(X_{\u03c4_1}, 1) + \u2211_{t=\u03c4_1 + 1}^{\u221e} \u03c1^t r_0(0, 2) | \u03c0_0 },

where the last summation is zero.

Remark 2 (Finite horizon constraint). This work considers the problem of at most L stops with no constraints on the stopping times. Our results also hold straightforwardly for the case where L stops need to be made within a pre-specified finite time horizon. Then, the optimal policy will be non-stationary and the structural results presented in subsequent sections apply at each time instant.

4.1.2 Belief state formulation of the objective

As is customary for partially observed control problems, we reformulate the dynamics and cumulative objective in terms of the belief state. Let \u03a0 denote the belief space of S-dimensional probability vectors. The belief space is the unit (S - 1)-dimensional simplex:

\u03a0 = {\u03c0 : 0 \u2264 \u03c0(i) \u2264 1, \u2211_{i=1}^{S} \u03c0(i) = 1}. (4.6)

The belief state at time t, denoted by \u03c0_t \u2208 \u03a0, is the posterior probability of X_t given the history Z_t. The belief state is a sufficient statistic of Z_t [98], and evolves according to the following Hidden Markov Bayesian filter update [99]:

\u03c0_{t+1} = T(\u03c0_t, Y_{t+1}), where
T(\u03c0, y) = B_y P' \u03c0 \/ \u03c3(\u03c0, y), \u03c3(\u03c0, y) = 1_S' B_y P' \u03c0,
B_y = diag(B(1, y), ..., B(S, y)). (4.7)

Here 1_S represents the S-dimensional vector of ones.

Using the smoothing property of conditional expectations, the objective in (4.4) can be reformulated in terms of the belief state as:

J_\u00b5(\u03c0_0) = E_\u00b5{ \u2211_{t=0}^{\u03c4_L - 1} \u03c1^t r_{2,L}' \u03c0_t + \u03c1^{\u03c4_L} r_{1,L}' \u03c0_{\u03c4_L}
    + \u2211_{t=\u03c4_L + 1}^{\u03c4_{L-1} - 1} \u03c1^t r_{2,L-1}' \u03c0_t + ... + \u03c1^{\u03c4_1} r_{1,1}' \u03c0_{\u03c4_1} + \u2211_{t=\u03c4_1 + 1}^{\u221e} \u03c1^t r_{2,0}' \u03c0_t | \u03c0_0 }, (4.8)

where r_{u,l} = [r_l(1, u), ..., r_l(S, u)]'. For the stopping time problem (4.8), there exists a stationary optimal policy [98].
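The filter update in (4.7) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name hmm_filter_update and the two-state matrices P and B below are hypothetical values chosen only to exercise the recursion.

```python
def hmm_filter_update(belief, y, P, B):
    # One step of the Bayesian filter (4.7); returns (T(pi, y), sigma(pi, y)).
    # P[i][j] = P(X_{t+1} = j | X_t = i), B[i][y] = P(Y_t = y | X_t = i).
    S = len(belief)
    # Prediction step: (P' pi)(j) = sum_i P(i, j) pi(i)
    predicted = [sum(P[i][j] * belief[i] for i in range(S)) for j in range(S)]
    # Correction step: multiply by B_y = diag(B(1, y), ..., B(S, y))
    unnormalized = [B[j][y] * predicted[j] for j in range(S)]
    # Normalization constant sigma(pi, y) = 1' B_y P' pi
    sigma = sum(unnormalized)
    if sigma <= 0.0:
        raise ValueError('observation has zero probability under the current belief')
    return [u / sigma for u in unnormalized], sigma


# Hypothetical two-state model, for illustration only.
P = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.3], [0.4, 0.6]]

belief = [0.5, 0.5]
for y in (0, 1, 1):
    belief, sigma = hmm_filter_update(belief, y, P, B)
    print(y, [round(b, 4) for b in belief], round(sigma, 4))
```

The returned sigma is the probability of observing y given the current belief, which is the weight that multiplies the value function inside the sums over y in the dynamic programming recursion.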
Since the belief state is a sufficient statistic of Z_t, (4.5) is equivalent to computing the optimal stationary policy \u00b5* : \u03a0 \u00d7 [L] \u2192 A, where [L] = {1, 2, ..., L}, as a function of the belief and the number of stops remaining, to maximize (4.8).

4.1.3 Stochastic dynamic programming

Computing the optimal policy \u00b5* in (4.5), or equivalently maximizing (4.8), involves solving the multiple stopping Bellman\u2019s dynamic programming equation [98]:

\u00b5*(\u03c0, l) = argmax_{u \u2208 A} Q(\u03c0, l, u), V(\u03c0, l) = max_{u \u2208 A} Q(\u03c0, l, u),
Q(\u03c0, l, 1) = r_{1,l}' \u03c0 + \u03c1 \u2211_{y \u2208 Y} V(T(\u03c0, y), l - 1) \u03c3(\u03c0, y),
Q(\u03c0, l, 2) = r_{2,l}' \u03c0 + \u03c1 \u2211_{y \u2208 Y} V(T(\u03c0, y), l) \u03c3(\u03c0, y). (4.9)

Since the state space \u03a0 is a continuum, Bellman\u2019s equation (4.9) does not translate into a practical solution methodology, as V(\u03c0, l) needs to be evaluated at each \u03c0 \u2208 \u03a0. This, in turn, renders the calculation of the optimal policy \u00b5*(\u03c0, l) computationally intractable (footnote 42).

Footnote 42. It is well known that a finite horizon POMDP with finite observation space can be solved exactly; indeed, the value function is piecewise linear and convex [99]. However, the problem is PSPACE-complete [100]; the worst-case computational cost increases exponentially with the number of actions and doubly exponentially with the time index.

4.2 Optimal multiple stopping: Structural results

In this section, we derive structural results for the optimal policy (4.9) of the multiple stopping time problem. In Section 4.2.3, we show that under reasonable conditions on the POMDP parameters, the optimal policy is a monotone policy.

4.2.1 Definitions

Define the stopping set S^l (the set of belief states where Stop is the optimal action), when l stops are remaining, as:

S^l = {\u03c0 : \u00b5*(\u03c0, l) = 1}. (4.10)

Correspondingly, the continue set (the set of belief states where Continue is the optimal action) is defined as

C^l = {\u03c0 : \u00b5*(\u03c0, l) = 2}. (4.11)

Let W(\u03c0, l) be defined as

W(\u03c0, l) = V(\u03c0, l) - V(\u03c0, l - 1). (4.12)

The stopping and continue sets in terms of W defined in (4.12) are as follows:

S^l = {\u03c0 | r_l' \u03c0 \u2265 \u03c1 \u2211_y W(T(\u03c0, y), l) \u03c3(\u03c0, y)},
C^l = {\u03c0 | r_l' \u03c0 < \u03c1 \u2211_y W(T(\u03c0, y), l) \u03c3(\u03c0, y)}, (4.13)

where r_l := r_{1,l} - r_{2,l}.

Remark 3. For notational convenience, without loss of generality, assume r_{1,l} = r_l and r_{2,l} = 0. So, the decision maker accrues no reward for the continue action.

Remark 4. We consider r_1 = r_2 = ... = r_L = r, i.e. the rewards are not dependent on l. It should be noted, however, that the structural results continue to hold for the case where the instantaneous rewards r_l are dependent on l.

The stopping and continue sets can be arbitrary partitions of the simplex \u03a0. However, in Section 4.2.3, we show that these sets can be characterized by threshold curves. The question of computing the optimal policy then reduces to estimating the threshold curve.

It is worth pointing out that in the classical stopping POMDPs in [99], the stopping and continue sets are characterized in terms of a convex value function. The key difficulty of the multiple stopping problem is that W, being the difference of two convex value functions, does not share the convexity properties of the value function.

4.2.2 Assumptions

The main result below, namely Theorem 4.2.1, requires the following assumptions on the reward vector r, the transition matrix P and the observation distribution B.

(A1) P is totally positive of order 2 (TP2), i.e. all second order minors are non-negative (footnote 43).

(A2) B satisfies the following: If the observation space is discrete or countably infinite, then B(j, x) B(i, y) \u2264 B(j, y) B(i, x), j < i, x \u2264 y. If the observation space is continuous, let B^{(i)} denote the observation probability density while the Markov chain is in state i. Then B^{(j)}(x) \/ B^{(i)}(x), for j < i, should be a non-decreasing function of x.

(A3) The vector (I - \u03c1P) r has decreasing elements.

Footnote 43. A stochastic matrix A is TP2 if the determinant of the 2 \u00d7 2 matrix [A_{i1,j1} A_{i1,j2}; A_{i2,j1} A_{i2,j2}] is non-negative, i.e. A_{i1,j1} A_{i2,j2} - A_{i1,j2} A_{i2,j1} \u2265 0, \u2200 i2 \u2265 i1, j2 \u2265 j1. Equivalently, A_{i,j} \/ A_{i+1,j} is increasing in j.

Discussion of Assumptions:

When S = 2, (A1) is valid when P(1, 1) \u2265 P(2, 1). When S > 2, consider the tridiagonal transition matrix (footnote 44) with P(i, j) = 0, i > j + 2 and i < j - 2. (A1) is valid if P(i, i) P(i + 1, i + 1) \u2265 P(i + 1, i) P(i, i + 1).

(A2) holds for numerous examples, including the binomial, Poisson, geometric, Gaussian and exponential distributions. When the observation space is discrete, Assumption (A2) is equivalent to the TP2 definition. In the numerical results in Section 4.4, we use the Poisson distribution, where B(i, j) = g_i^j exp(-g_i) \/ j! and g_i is the mean of the Poisson distribution. (A2) is satisfied if g_i decreases monotonically with i (see Proposition 4.6.1). For a continuous observation distribution such as a Gaussian whose mean is dependent on the state of the Markov chain (variance is fixed), (A2) is satisfied when the mean monotonically decreases with i.

(A3) is a joint condition on the reward vector and the transition matrix. (A3) and (A1) jointly imply that the reward vector r has decreasing elements (footnote 45). When S = 2, it can be verified that r having decreasing elements is sufficient for (I - \u03c1P) r to have decreasing elements.
For S > 2, (A3) is a stronger condition than having the elements of r decreasing.

(A3) is easy to interpret when P has additional structure. For example, consider a slowly varying Markov chain with P = I + \u03b5Q, where Q(i, j) > 0, i \u2260 j, \u2211_j Q(i, j) = 0, and \u03b5 > 0. Here 1\/\u03b5 > max_i \u2211_j |Q(i, j)| for P to be a valid transition matrix. Then (A3) is equivalent to r having decreasing elements. Such slowly varying matrices arise in many applications, such as manufacturing systems, internet packet transmission and wireless communication (see Section 1.3 in [101]). Also, the user interest in online social media typically evolves slowly [102]. The reward vector r captures the preference of the decision maker - the highest reward is accrued in State 1.

Footnote 44. The transition matrices computed on the real dataset in Section 4.4 follow a tridiagonal structure; refer to (4.23).

Footnote 45. \u03c1 < 1: Let v = (I - \u03c1P) r. (A3) implies that v has decreasing elements. When \u03c1 < 1, (I - \u03c1P) is invertible. Hence, r = (I - \u03c1P)^{-1} v = \u2211_{k=0}^{\u221e} \u03c1^k P^k v. Since the product of TP2 matrices is TP2, each P^k is TP2. The result follows from Theorem 9.2.2 in [99]. For \u03c1 = 1, g = lim_{\u03c1 \u2191 1} (1 - \u03c1)(I - \u03c1P)^{-1} v is the solution of (I - P) r = v. This limit exists [103, Cor. 8.2.5] and hence r has decreasing elements.

4.2.3 Main result: Optimality of threshold policies

The main result below (Theorem 4.2.1) states that the optimal policy is monotone with respect to the belief state \u03c0. However, for a monotone policy to be well defined, we need to first define the ordering between two belief states. For S = 2, the belief \u03c0 = [1 - \u03c0(2)  \u03c0(2)] can be completely ordered with respect to \u03c0(2) \u2208 [0, 1]. However, for S > 2, comparing belief states requires using stochastic orders, which are partial orders. We will use the monotone likelihood ratio (MLR) order (see Def. 4.6.1 in Sec. 4.6.1); it is ideal for partially observed control problems since it is preserved under conditional expectation (Bayesian update).

Under reasonable conditions, Theorem 4.2.1 asserts that the optimal policy \u00b5*(\u03c0) is monotonically decreasing in \u03c0 with respect to the MLR order. However, despite this monotonicity, determining the optimal policy is nontrivial since the policy can only be characterized on a partially ordered set. The main innovation in Theorem 4.2.1 is to modify the MLR stochastic order to operate on lines L(e_1, \u03c0\u0304) and L(e_S, \u03c0\u0304) (see 4.6.1) within the belief space. Such line segments form chains (totally ordered subsets of a partially ordered set) and permit us to prove that the optimal decision policy has a threshold structure.

[Figure 4.2, belief simplex with vertices e_1, e_2, e_3; the set H containing the points \u03c0\u0304_1, \u03c0\u0304_2, \u03c0\u0304_3; the line L(e_1, \u03c0\u0304_1); the continue set C^l; and the stopping sets S^l and S^{l-1} bounded by the threshold curves \u0393_l and \u0393_{l-1}. Note on the figure: H is defined in 4.6.1.]

Figure 4.2: Visual illustration of Theorem 4.2.1. Each of the stopping sets S^l is characterized by a threshold curve \u0393_l. Each threshold curve \u0393_l intersects the line L(e_1, \u03c0\u0304) at most once.

Theorem 4.2.1. Assume (A1), (A2) and (A3). Then,

A. There exists an optimal policy \u00b5*(\u03c0, l) that is decreasing on lines L(e_1, \u03c0\u0304) and L(e_S, \u03c0\u0304) in the belief space \u03a0 for each l (footnote 46).

B. There exists an optimal switching curve \u0393_l, for each l, that partitions the belief space \u03a0 into two individually connected sets S^l and C^l, such that the optimal policy is

\u00b5*(\u03c0, l) = 1 if \u03c0 \u2208 S^l, and 2 if \u03c0 \u2208 C^l. (4.14)

C. S^{l-1} \u2282 S^l, l = 1, 2, ..., L.

Theorem 4.2.1A asserts that the optimal policy is monotonically decreasing on the line L(e_1, \u03c0\u0304), as shown in Figure 4.2.
Hence, on each line L(e_1, \u03c0\u0304) there exists a threshold above which (in the MLR sense) it is optimal to Stop and below which it is optimal to Continue. Theorem 4.2.1B asserts that, for each l, the stopping and continue sets are connected. Hence, there exists a threshold curve \u0393_l, as shown in Figure 4.2, obtained by joining the thresholds from Theorem 4.2.1A on each of the lines L(e_1, \u03c0\u0304). Furthermore, the stopping set enclosed by the threshold curve is a union of convex sets (footnote 47) and hence the threshold curve is continuous and differentiable almost everywhere. Theorem 4.2.1C proves the nested structure of the stopping sets: the stopping set when l - 1 stops are remaining is a subset of the stopping set when there are l stops remaining.

Footnote 46. In general, the optimal policy is not unique. The theorem asserts that there exists a version of the optimal policy that is monotone.

Footnote 47. This is due to the finite stopping time property of the multiple stopping time problem; see Footnote 41. A finite horizon POMDP with a finite state and observation space has a value function that is piecewise linear and convex; see Theorem 7.4.1 in [99]. For l = 1, V(\u03c0, 1) = max_{\u03b3 \u2208 \u0393} \u03b3' \u03c0, where \u0393 is a finite set due to the finite stopping time property. For l = 2, the dynamic programming equation in (4.9) can be written as: V(\u03c0, 2) = max{ r' \u03c0 + max_{\u03b3 \u2208 \u0393} \u03b3' P' \u03c0, \u03c1 \u2211_{y \u2208 Y} V(T(\u03c0, y), 2) \u03c3(\u03c0, y) }. For each \u03b3 \u2208 \u0393, the stopping set is convex; see the proof of Theorem 12.2.1 in [99]. Hence, the stopping set for l = 2 is a union of convex sets. A similar argument holds for any value of l.

4.3 Stochastic gradient algorithm for estimating optimal linear threshold policies

In light of Theorem 4.2.1, computing the optimal policy reduces to estimating L threshold curves in the unit simplex (belief space), one for each of the L stops. The threshold curves can be approximated by any of the standard basis functions. In this chapter, we restrict the approximation to linear threshold policies, i.e. policies of the form given in (4.15). However, any such approximation needs to capture the essence of Theorem 4.2.1, i.e. the policy is MLR decreasing on lines, and the stopping sets are connected and satisfy the nested property. We call such linear threshold policies (that capture the essence of Theorem 4.2.1) the optimal linear threshold policies.

Section 4.3.1 derives necessary and sufficient conditions to characterize such linear threshold policies. Algorithm 3 in Section 4.3.2 is a simulation-based algorithm to compute the optimal linear threshold policies. The simulation-based algorithm is computationally efficient (see comments at the end of Section 4.3.2).

4.3.1 Structure of optimal linear threshold policies for multiple stopping

We define a linear parametrized policy on the belief space \u03a0 as follows. Let \u03b8_l \u2208 \u211d^{S-1} denote the parameters of a linear hyperplane. Then, linear threshold policies, as a function of the belief \u03c0 and the number of stops remaining l, are defined as

\u00b5_\u03b8(\u03c0, l) = 1 if [0 1 \u03b8_l] [\u03c0; -1] \u2264 0, and 2 otherwise, (4.15)

where [\u03c0; -1] denotes the column vector obtained by stacking \u03c0 on top of the scalar -1.

The linear policy \u00b5_\u03b8(\u03c0, l) is indexed by \u03b8 to show the explicit dependence of the policy on the parameters. In (4.15), \u03b8 = (\u03b8_1, \u03b8_2, ..., \u03b8_L) \u2208 \u211d^{L \u00d7 (S-1)} is the concatenation of the \u03b8_l vectors, one for each of the L stops.

In Theorem 4.2.1A, it was established that the optimal multiple stopping policy is MLR decreasing on specific lines within the belief space, i.e. for \u03c0_1 \u2265_{L_i} \u03c0_2, \u00b5(\u03c0_1, l) \u2264 \u00b5(\u03c0_2, l); i = 1, S.
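As a concrete reading of the linear threshold policy in (4.15), the following Python sketch evaluates the inner product [0 1 \u03b8_l][\u03c0; -1] for a given belief. It is only an illustration under stated assumptions: the function name linear_threshold_action and the parameter values for S = 3 are hypothetical and are not taken from the numerical results.

```python
def linear_threshold_action(belief, theta_l):
    # Linear threshold policy (4.15) for a fixed number of remaining stops l.
    # belief: list of S probabilities; theta_l: list of S - 1 hyperplane parameters.
    # Returns 1 (Stop) or 2 (Continue).
    S = len(belief)
    if len(theta_l) != S - 1:
        raise ValueError('theta_l must have S - 1 entries')
    weights = [0.0, 1.0] + list(theta_l)   # row vector [0 1 theta_l], length S + 1
    augmented = list(belief) + [-1.0]      # column vector [pi; -1], length S + 1
    score = sum(w * x for w, x in zip(weights, augmented))
    return 1 if score <= 0.0 else 2


# Hypothetical parameters for S = 3: stop iff pi(2) + 1.5 * pi(3) - 0.4 <= 0.
theta_l = [1.5, 0.4]
for belief in ([1.0, 0.0, 0.0], [0.8, 0.1, 0.1], [0.0, 0.0, 1.0]):
    print(belief, linear_threshold_action(belief, theta_l))
```

With these assumed parameters the policy stops when the belief is concentrated on State 1 and continues when it is concentrated on State 3, so the stopping region is the part of the simplex cut off by a single hyperplane.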
Theorem 4.3.1 gives necessary and sufficient conditions on the coefficient vector \u03b8_l such that \u03c0_1 \u2265_{L_i} \u03c0_2 implies \u00b5_\u03b8(\u03c0_1, l) \u2264 \u00b5_\u03b8(\u03c0_2, l); i = 1, S.

Theorem 4.3.1. The linear threshold policies \u00b5_\u03b8(\u03c0, l) are
1. MLR decreasing on line L(e_1) iff \u03b8_l(S - 1) \u2265 0 and \u03b8_l(i) \u2265 0, i \u2264 S - 2.
2. MLR decreasing on line L(e_S) iff \u03b8_l(S - 1) \u2265 0, \u03b8_l(S - 2) \u2265 1 and \u03b8_l(i) \u2264 \u03b8_l(S - 2), i < S - 2.

The proof is in Sec. 4.6.3. As a consequence of Theorem 4.3.1, the constraints on the parameters \u03b8 ensure that only MLR decreasing linear threshold policies are considered; the necessity and sufficiency imply that non-monotone policies are not considered, and monotone policies are not left out.

In Theorem 4.2.1B it was established that the optimal stopping sets are connected, which is satisfied trivially since we approximate the threshold curve using a linear hyperplane. Theorem 4.3.2 below provides sufficient conditions such that the parametrized linear threshold curves satisfy the nested property established in Theorem 4.2.1C. A proof is provided in Sec. 4.6.3.

Theorem 4.3.2. A sufficient condition for the linear threshold policies in (4.15) to satisfy the nested structure in Theorem 4.2.1C is given by

\u03b8_{l-1}(S - 1) \u2264 \u03b8_l(S - 1),
\u03b8_{l-1}(i) \u2265 \u03b8_l(i), i < S - 1, (4.16)

for each l.

4.3.2 Simulation-based stochastic gradient algorithm for estimating linear threshold policies

We now estimate the optimal linear threshold policies using the simulation-based stochastic gradient algorithm in Algorithm 3.
The algorithm is designed so that the estimated policies satisfy the conditions in Theorem 4.3.1 and Theorem 4.3.2.

The optimal policy of a multiple stopping time problem maximizes the expected cumulative reward J_\u00b5 in (4.4). In Algorithm 3, we approximate J_\u00b5 over a finite time horizon (N) as J_N, which is computed as:

J_N(\u03b8) = E_{\u00b5_\u03b8}{ \u2211_{l=1}^{L} \u03c1^{\u03c4_l} r' \u03c0_{\u03c4_l} | \u03c4_l \u2264 N; \u2200 l }. (4.17)

J_N is an asymptotic estimate of J_\u00b5 as N tends to infinity.

Algorithm 3 is a stochastic gradient algorithm that generates a sequence of estimates \u03b8_n that converges to a local maximum. It requires the computation of the gradient \u2207_\u03b8 J_N(\u00b7). Evaluating the gradient in closed form is intractable due to the non-linear dependence of J_N(\u03b8) on \u03b8. We can instead compute an estimate \u2207\u0302_\u03b8 J_N(\u00b7) using a simulation-based gradient estimator. There are several such simulation-based gradient estimators available in the literature, including infinitesimal perturbation analysis, weak derivatives and likelihood ratio (score function) methods [104]. In this work, for simplicity, we use the SPSA algorithm [105], which estimates the gradient using a finite difference method.

To make use of the SPSA algorithm, we convert the constrained optimization problem in \u03b8 (constraints imposed by Theorem 4.3.1 and Theorem 4.3.2) into an unconstrained problem using spherical coordinates as follows:

\u03b8^\u03c6_l(i) =
    \u03c6_1^2(S - 1) \u220f_{\u2113=l}^{L-1} sin^2(\u03c6_\u2113(S - 1)),      i = S - 1,
    1 + \u03c6_1^2(S - 2) \u220f_{\u2113=2}^{l} sin^2(\u03c6_\u2113(S - 2)),    i = S - 2,
    \u03b8_l(S - 2) \u220f_{\u2113=1}^{L} sin^2(\u03c6_\u2113(i)),               i < S - 2. (4.18)

It can be verified that the parametrization \u03b8^\u03c6 in (4.18) satisfies the conditions in Theorem 4.3.1 and Theorem 4.3.2.
For example, consider i = S - 1; then the product term involving sin(\u00b7) ensures that \u03b8_{l-1}(S - 1) \u2264 \u03b8_l(S - 1) (the first part of Theorem 4.3.2).

Algorithm 3 Stochastic gradient algorithm for optimal multiple stopping
Require: POMDP parameters satisfy (A1), (A2), (A3).
1: Choose initial parameters \u03c6_0 and initial linear threshold policies \u00b5_{\u03b8_0} using (4.15).
2: for iterations n = 0, 1, 2, ... do
3:   Evaluate J_N(\u03b8^{\u03c6_n + c_n \u03c9_n}) and J_N(\u03b8^{\u03c6_n - c_n \u03c9_n}) using (4.17).
4:   SPSA: Compute the gradient estimate \u2207\u0302_\u03c6 J_N(\u03b8^{\u03c6_n}) using (4.19).
5:   Update the parameter vector \u03c6_n to \u03c6_{n+1} using (4.20).
6: end for

Following [105], the gradient estimate using SPSA is obtained by picking a random direction \u03c9_n at each iteration n. The estimate of the gradient is then given by

\u2207\u0302_\u03c6 J_N(\u03b8^{\u03c6_n}) = [J_N(\u03b8^{\u03c6_n + c_n \u03c9_n}) - J_N(\u03b8^{\u03c6_n - c_n \u03c9_n})] \/ (2 c_n \u03c9_n), (4.19)

where

\u03c9_n(i) = -1 with probability 0.5, and +1 with probability 0.5.

The two J_N(\u00b7) terms in the numerator of (4.19) are estimated using the finite time approximation (4.17). Using the gradient estimate in (4.19), the parameter update is as follows [105]:

\u03c6_{n+1} = \u03c6_n + a_n \u2207\u0302_\u03c6 J_N(\u03b8^{\u03c6_n}). (4.20)

The parameters a_n and c_n are typically chosen as follows [105]:

a_n = \u03b5 (n + 1 + \u03c2)^{-\u03ba}, 0.5 < \u03ba \u2264 1, and \u03b5, \u03c2 > 0,
c_n = \u00b5 (n + 1)^{-\u03c5}, 0.5 < \u03c5 \u2264 1, \u00b5 > 0. (4.21)

At each iteration of Algorithm 3, evaluating the gradient estimate in (4.19) requires two POMDP simulations. However, this is independent of the number of states, the number of observations or the number of stops. The decreasing step size stochastic gradient algorithm, Algorithm 3, converges to a local optimum with probability one.
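The SPSA recursion in (4.19) to (4.21) can be sketched as follows. This is a minimal illustration, not the code used for the numerical results: the function name spsa_maximize and the default gain constants are assumptions chosen within the ranges stated in (4.21), and the deterministic toy_objective is a hypothetical stand-in for the simulated reward J_N(\u03b8^\u03c6), which in Algorithm 3 would be estimated from POMDP sample paths.

```python
import random


def spsa_maximize(objective, phi0, iterations, eps=1.0, varsigma=10.0,
                  kappa=1.0, mu=0.1, upsilon=0.6, seed=0):
    # Maximize objective(phi) using only two evaluations per iteration.
    rng = random.Random(seed)
    phi = list(phi0)
    for n in range(iterations):
        a_n = eps * (n + 1 + varsigma) ** (-kappa)   # step size, as in (4.21)
        c_n = mu * (n + 1) ** (-upsilon)             # perturbation size, as in (4.21)
        # Random direction with independent +1 / -1 components
        omega = [rng.choice((-1.0, 1.0)) for _ in phi]
        plus = objective([p + c_n * w for p, w in zip(phi, omega)])
        minus = objective([p - c_n * w for p, w in zip(phi, omega)])
        # Gradient estimate (4.19), one component per parameter
        grad = [(plus - minus) / (2.0 * c_n * w) for w in omega]
        # Ascent update (4.20)
        phi = [p + a_n * g for p, g in zip(phi, grad)]
    return phi


def toy_objective(phi):
    # Hypothetical concave objective with its maximum at (1, -2).
    return -((phi[0] - 1.0) ** 2 + (phi[1] + 2.0) ** 2)


estimate = spsa_maximize(toy_objective, [0.0, 0.0], iterations=2000)
print([round(p, 3) for p in estimate])
```

Only two objective evaluations are needed per iteration regardless of the dimension of phi, which is the property that makes the per-iteration cost independent of the number of states, observations and stops. With a noisy simulated objective, the estimate would fluctuate more than in this deterministic toy case.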
Hence, it is necessary to try several initial conditions and estimate the optimal threshold.
To summarize, we have used a stochastic gradient algorithm to estimate the optimal linear threshold policies for the multiple stopping time problem. Instead of linear threshold policies, one could also use piecewise linear or other basis function approximations, provided that the resulting parameterized policy is still MLR decreasing (i.e. a characterization similar to Theorem 4.3.1 holds).
4.4 Numerical examples: Interactive advertising in live social media
This section has three parts. First, in Section 4.4.1, we visually illustrate the main result for S = 3 (footnote 48). We consider multiple examples to illustrate how the assumptions in Sec. 4.2.3 affect the optimal multiple stopping time policy. In addition, we benchmark the performance of linear threshold policies (obtained using Algorithm 3) against the optimal multiple stopping policy. Second, using a real dataset, we study how the multiple stopping problem can be used to schedule advertisements in live social media. We show numerically that the linear threshold scheduling policies outperform conventional techniques for scheduling ads in live social media. Finally, we illustrate the performance of the linear threshold policies for a large size POMDP (25 states) by comparing with the popular SARSOP algorithm.
Footnote 48: For S = 3, the unit simplex is an equilateral triangle.
4.4.1 Synthetic data
In this section, we visually illustrate the optimal multiple stopping policy using numerical examples. The objective is to illustrate how the assumptions in Sec. 4.2.3 affect the optimal multiple stopping time policy. The optimal policy can be obtained by solving the dynamic programming equations in (4.9) and can be computed approximately by discretizing the belief space. 
The belief space \u03a0, for all examples below, was uniformly quantized into 100 states using the finite grid approximation method in [106].
Example 1: POMDP parameters: Consider a Markov chain with 3 states, with the transition matrix P and the reward vector r specified in (4.22). The observation distribution is given by B(i, j) = g_i^j exp(\u2212g_i) / j!, i.e. the observation distribution is Poisson with state dependent mean vector g given in (4.22). It is easily verified that the transition matrix, the observation distribution and the reward vector satisfy the conditions (A1) to (A3).
P =
[0.2 0.1 0.7]
[0.1 0.1 0.8]
[0   0.1 0.9],
g = [12 7 2], r = [9 3 1]. (4.22)
We choose L = 5 (footnote 49), i.e. the decision maker wishes to stop at most 5 times. Figure 4.3a shows the stopping sets S^5 and S^1. It is evident from Figure 4.3a that the optimal policy is monotone on lines, and that the stopping sets are connected and satisfy the nested property, thereby illustrating Theorem 4.2.1.
Footnote 49: This is motivated by the real dataset example in Section 4.4.2.
Figure 4.3 (a) Example 1: S^1 (shown in black) and S^5 (shown in red) obtained by solving the dynamic programming equations (4.9). The figure illustrates the monotone, connected and nested structure of the stopping sets (S^{l\u22121} \u2282 S^l) in Theorem 4.2.1.
Figure 4.3 (b) Example 2: Optimal policy when (A3) is violated. S^1 is shown in black and S^5 is shown in red. The monotone property of Theorem 4.2.1A is violated.
Figure 4.3 (c) Example 3: Optimal policy when (A3) is violated. S^1 is shown in black and S^2 is shown in red. The stopping sets are not nested.
Example 2: Consider the same parameters as in Example 1, except the reward r = [1 2 1], which violates (A3). Figure 4.3b shows the optimal multiple stopping policy in terms of the stopping sets. As can be seen from Figure 4.3b, the optimal policy does not satisfy the monotone property (Theorem 4.2.1A). 
However, the nested property continues to hold.
Example 3: Consider the same parameters as in Example 1, except L = 2, r_1 = [9 3 1] and r_2 = [3 9 1]. Assumption (A3) is violated for l = 2. Figure 4.3c shows the optimal multiple stopping policy in terms of the stopping sets. As can be seen from Figure 4.3c, the optimal policy satisfies neither the monotone property nor the nested property.
Thus, the conditions (A1) to (A3) of Theorem 4.2.1 are useful in the sense that when they are violated, there are examples where the optimal policy does not have the monotone or nested property.
Performance of linear threshold policies: In order to benchmark the performance of optimal linear threshold policies (that satisfy the constraints in Theorem 4.3.1 and Theorem 4.3.2), we ran Algorithm 3 for Example 1 (parameters in (4.22)). The optimal policy and the linear threshold policies were compared in terms of the expected cumulative reward over 1000 independent runs. The following parameters were chosen for the SPSA algorithm: \u00b5 = 2, \u03c5 = 0.2, \u03c2 = 0.5, \u03ba = 0.602 and \u03b5 = 0.1667; these values are as suggested in [105]. It was observed that there is a 12% drop in performance of the linear threshold policies compared to the optimal multiple stopping policy.
4.4.2 Real dataset: Interactive ad scheduling on Periscope using viewer engagement
We now formulate the problem of interactive ad scheduling on live online social media as a multiple stopping problem and illustrate the performance of linear threshold policies using a Periscope dataset (footnote 50). Periscope is a popular live personalized video streaming application where a broadcaster interacts with the viewers via live videos. Each such interaction lasts between 10 and 20 minutes and consists of: (i) A broadcaster who starts a live video using a handheld device. 
(ii) Video viewers who engage with the live video through comments and likes.
The technique of interactive scheduling, where viewer engagement is utilized to schedule ads, has not been addressed in the literature. It will be seen in this section that interactive scheduling of ads has significant performance improvements over the existing passive methods.
Footnote 50: We use the dataset in [107], which can be downloaded from http:\/\/sandlab.cs.ucsb.edu\/periscope\/. In [107], the authors deal with the performance of the Periscope application in terms of delay and scalability.
Dataset: The dataset in [107] contains details of all public broadcasts on the Periscope application from May 15, 2015 to August 20, 2015. The dataset consists of timestamped events: time instants at which the live video started\/ended; time instants at which viewers join; and time instants at which the viewers engage using likes and comments. In this work, we consider viewer engagement through likes, since comments are restricted to the first 100 viewers in the Periscope application.
Ad scheduling Model
Here we briefly describe how the model in Section 4.1 can be adapted to the problem of interactive ad scheduling in live video streaming; see Figure 4.1 for the schematic setup.
1. Interest Dynamics: In live online social media, it is well known that the viewer engagement is correlated with the interest of the content being streamed or broadcast. Markov models have been used to model interest in online games [108], web [12] and in online social networks [109]. We therefore model the interest in live video as a Markov chain, X_t, where the different states denote the level of interest in the live content. The states are ordered in the decreasing order of interest.
Homogeneous Assumption: Periscope utilizes the Twitter network to link broadcasters with the viewers and hence shares many of the properties of the Twitter social network. 
Different sessions of a broadcaster, therefore, tend to follow similar statistics due to the effects of social selection and peer influence [110]. It was shown in [111] that live sessions on live online gaming platforms can be viewed as communities, and communities in online social media have similar information consumption patterns [112]. We therefore model the interest dynamics as a time homogeneous Markov chain with transition matrix P.
2. Engagement Dynamics: The interest in the video, X_t, cannot be measured directly by the broadcaster and has to be inferred from the viewer engagement, denoted by Y_t. Since the viewer engagement measures the number of likes in a given time interval, we model it using a Markov modulated Poisson distribution. Denote the rate of the Poisson observation process when the interest is in state i by g_i. The observation probability in (4.2) can be obtained using B(i, j) = g_i^j exp(\u2212g_i) / j!.
3. Broadcaster Revenue: The ad revenue in online social media depends on the click rate (the probability that the ad will be clicked). In a recent study, Adobe Research (footnote 51) concluded that video viewers are more likely to engage with an ad if they are interested in the content of the video that the ad is inserted into. The reward vector in Section 4.1.1 should capture the positive correlation that exists between interest in the videos and the click rate [113]. Since the information regarding the click rate and actual number of viewers is not available in the dataset, we choose the reward vector r to be a vector of decreasing elements, each being proportional to the reward in that state, such that (A3) is satisfied.
Footnote 51: https:\/\/gigaom.com\/2012\/04\/16\/adobe-ad-research\/
4. Broadcaster operation: The broadcaster wishes to schedule at most L ads at instants when the interest is high. 
Here, we choose the number of stops L = 5 (footnote 52). At each discrete time, after receiving the observation Y_t, the broadcaster either stops and schedules an ad or continues with the live stream; see Figure 4.1.
5. Broadcaster objective: The objective of the broadcaster is given by (4.4). It aims to schedule ads when the content is interesting, so as to elicit the maximum number of clicks, thereby maximizing the expected revenue. In personalized live streaming applications like Periscope, the discount factor in (4.4) captures the \u201cimpatience\u201d of the live broadcaster in scheduling ads.
The above model and formulation correspond to a multiple stopping problem with L stops, as discussed in Section 4.1. In the next section, we describe how to estimate the model parameters from the data (viewer engagement Y_t) for computing the linear threshold policies using Algorithm 3 in Section 4.3.
Estimation of parameters: The live video sessions in Periscope have a range of 10 to 20 minutes [107]. The viewer engagement information consists of a time series of likes obtained by sampling the timestamped likes at a 2-second interval. Sampling at a 2-second interval, each session provides 1000 data points. The model parameters P and B are computed using maximum likelihood estimation. Since the interest dynamics are time homogeneous, we utilize data from multiple sessions to estimate the parameters P and B. The model was validated using the QQ-plot (see Figure 4.4) of normal pseudo-residuals [114, Section 6.1]. The estimated value of the transition matrix P and the state dependent mean g of a popular live session are given as:
P =
[0.733 0.266 0.000 0.000]
[0.081 0.718 0.201 0.000]
[0.000 0.214 0.670 0.116]
[0.000 0.000 0.222 0.778],
g = [38 21 10 1]. (4.23)
Footnote 52: Most of the popular Periscope sessions last 15 to 30 mins. Broadcast television usually averages 13.5 mins of advertisement per hour, or approximately one ad every 5 mins. Hence, we choose the number of advertisements L = 5.
The model order dimension was estimated using the penalized likelihood criterion; specifically, Table 4.1 shows the model order selection using the Bayesian information criterion (BIC). The likelihood values in Table 4.1 were obtained using an Expectation-Maximization (EM) algorithm [114]. Table 4.1 shows that S = 4 has the lowest BIC value.
Table 4.1: BIC model order selection for the popular live session. The maximum likelihood estimated parameters are given in (4.23). The BIC criterion was evaluated for S varying from 2 to 12 (only values for 2 to 6 are shown below). It can be seen that S = 4 has the lowest BIC value.
S    \u2212 log(L)    BIC = \u22122 log(L) + n log(N)
2    -4707.254    9535.053
3    -4190.652    8601.122
4    -3969.955    8287.364
5    -3951.155    8405.764
6    -3887.453    8462.725
\u2022 L denotes the likelihood value.
\u2022 n denotes the number of parameters: n = S^2 + S \u2212 1.
\u2022 N denotes the number of observations. Here, N = 10^4.
The reward vector was chosen as r = [4 3 2 1], and satisfies (A3) for \u03c1 \u2208 [0, 1].
Figure 4.4 (a) Plot of likes (horizontal axis: Time Index; vertical axis: Rate of likes). (b) QQ-plot (horizontal axis: Standard Normal Quantiles; vertical axis: Quantiles of Input Sample).
Figure 4.4: The plot of the likes obtained for a popular session is shown in Fig. 4.4a. The maximum likelihood estimated parameters are given in (4.23). The QQ-plot is used for validating the goodness of fit. The linearity of the points suggests that the estimated parameters in (4.23) are a good fit.
Multiple ad scheduling: Performance results
We now compare the linear threshold scheduling policies (obtained from Algorithm 3) with two existing schemes:
1. Periodic: Here, the broadcaster stops periodically to advertise. Twitch (footnote 53), for example, uses periodic ad scheduling [11]. 
Periodic advertisement scheduling is also widely used for pre-recorded videos on social media platforms like YouTube.
2. Heuristic: Here, the broadcaster solves a classical single stopping problem at each stop. The scheduler re-initializes and solves for L = 1 in Section 4.1 at each stop.
Performance Results: It was seen that the optimal linear threshold policies outperform conventional periodic scheduling by 25% and the heuristic scheduling by 10%. The periodic scheme performs poorly because it does not take into account the viewer engagement or the interest in the content while scheduling ads. The multiple stopping policy, in comparison to the heuristic scheme, takes into account the fact that L ads need to be scheduled and hence is optimal.
4.4.3 Large state space models & Comparison with SARSOP
To illustrate the application on large state space models, we present a numerical example using synthetic data.
POMDP parameters: We consider a Markov chain with 25 states. The transition matrix and observation distribution are generated as discussed in [115]. In order that the transition matrix P satisfies the TP2 assumption in (A1), we use the following approach: First construct a 5-state transition matrix A = exp(Qt), where Q is a tridiagonal generator matrix (off-diagonal entries are non-negative and each row sums to 0) and t > 0. Since the Kronecker product preserves the TP2 structure, we let P = A \u2297 A. The observation distribution B, containing 25 observations and satisfying (A2), is similarly generated. The reward vector is chosen as follows: r = [25, 24, \u00b7 \u00b7 \u00b7 , 1]. The number of stops is L = 5.
Because of the large state space dimension, computing the optimal policy using dynamic programming is intractable. We compare linear threshold policies (obtained through Algorithm 3), the heuristic policy and the periodic policy (described in Section 4.4.2) in terms of the expected cumulative reward of each policy. 
Also, we compare the linear threshold policy against the state-of-the-art solver for POMDPs: SARSOP (an approximate POMDP planning algorithm) [116].
Footnote 53: Twitch is a video platform that focuses primarily on video gaming. In 2015, Twitch had more than 1.5 million broadcasters and 100 million visitors per month.
Table 4.2 shows the normalized cumulative reward of each of the policies. The expected reward was calculated using 1000 independent Monte Carlo simulations. From Table 4.2 we observe the following:
1. The linear threshold policy and the heuristic policy outperform periodic scheduling by a factor of 2.
2. The linear threshold policy outperforms the heuristic policy by 12%.
3. The linear threshold policy has a performance drop of 9% compared to the solution obtained using SARSOP. This can be attributed to the linear hyperplane approximation to the threshold curve, compared to the SARSOP solution where the number of linear segments is exponential in the number of states and observations.
Although the linear threshold policies have a slight performance drop compared to SARSOP, they have two significant advantages:
1. The policy (the linear threshold vectors corresponding to each stop) is easy to implement (footnote 54).
2. Computing the linear threshold approximation is computationally cheaper compared to the SARSOP algorithm. It can be noted from Table 4.2 that Algorithm 3 is computationally cheaper by more than two orders of magnitude (18e12 versus 1.25e11 computations).
Algorithm           Cumulative Reward (Normalized w.r.t SARSOP)    #Computations
SARSOP              1                                              18e12
Linear Threshold    0.91                                           1.25e11
Heuristic           0.79                                           1.25e11
Periodic            0.35                                           0
Table 4.2: Comparison of the expected reward and number of computations of the various algorithms. The linear threshold policies have a performance drop of 9% compared to the solution obtained using SARSOP and outperform the heuristic policy by 12%. The SARSOP solution was computed using a 2.5 GHz CPU running for 2 hours. The calculation assumes a floating point operation every CPU cycle. 
Algorithm 3, for obtaining linear threshold policies, was run with finite horizon N = 1000.
Footnote 54: The SARSOP policy has approximately 4e4 linear segments.
4.5 Closing remarks
This chapter presented three main results regarding the multiple stopping time problem.
(i) The optimal policy was shown to be monotone with respect to a specialized monotone likelihood ratio order on lines (under reasonable conditions). Therefore the optimal policy was characterized by multiple threshold curves on the belief space and the optimal stopping sets satisfied a nested property (Theorem 4.2.1).
(ii) Necessary and sufficient conditions were given for linear threshold policies to satisfy the MLR increasing condition for the optimal policy (Theorem 4.3.1 and Theorem 4.3.2). We then gave a stochastic gradient algorithm (Algorithm 3) to estimate the linear threshold policies.
(iii) Finally, the linear scheduling policy was illustrated on a real data set involving interactive advertising in live social media video.
4.6 Proof of theorems
4.6.1 Preliminaries and Definitions
The proof of Theorem 4.2.1 requires concepts in stochastic dominance [117] and submodularity [118].
First-order and MLR stochastic dominance
In order to compare belief states, we will use the monotone likelihood ratio (MLR) stochastic ordering and a specialized version of the MLR order restricted to lines in the simplex. The MLR stochastic order is useful since it is preserved under conditional expectations.
Definition 4.6.1 (MLR ordering). Let \u03c0_1, \u03c0_2 \u2208 \u03a0 be two belief state vectors. Then, \u03c0_1 is greater than \u03c0_2 with respect to Monotone Likelihood Ratio (MLR) ordering, denoted as \u03c0_1 \u2265_r \u03c0_2, if
\u03c0_1(j)\u03c0_2(i) \u2264 \u03c0_2(j)\u03c0_1(i), i < j, i, j \u2208 {1, . . . , S}. (4.24)
Proposition 4.6.1. If the observation distribution is Poisson, i.e. B(i, j) = g_i^j exp(\u2212g_i) / j!, where g_i is the mean of the Poisson distribution, then (A2) is satisfied when g_i decreases monotonically with i.
Proof. 
Recall, (A2) is given byB(j, x)B(i, y) \u2264 B(j, y)B(i, x), j < i, x \u2264 y.Substituting,gxj exp (\u2212gj)x!gyi exp (\u2212gi)y!\u2264 gyj exp (\u2212gj)y!gxi exp (\u2212gi)x!(gjgi)x\u2264(gjgi)y,which implies that gi \u2264 gj.Definition 4.6.2 (First order stochastic dominance). Let pi1, pi2 \u2208 \u03a0 be two be-lief state vectors. Then, pi1 is greater than pi2 with respect to first-order stochasticdominance\u2013denoted as pi1 \u2265s pi2, ifS\u2211i=jpi1(i) \u2264S\u2211i=jpi2(i) \u2200j \u2208 {1, 2, \u00b7 \u00b7 \u00b7 , S} . (4.25)Result [99]:i) pi1, pi2 \u2208 \u03a0. Then, pi1 \u2265r pi2 implies pi1 \u2265s pi2.ii) pi1 \u2265s pi2 if and only if for any increasing function \u03c6(\u00b7), Epi1 {\u03c6(x)} \u2265 Epi2 {\u03c6(x)}.For state-space dimension S = 2, MLR is a complete order and coincides with first-order stochastic dominance. For state-space dimension S > 2 MLR is a partial orderi.e. [\u03a0,\u2265r] is a partially ordered set55since it is not always possible to order any twobelief states. However, on line segments in the simplex defined below, MLR is a totalordering.Define the sub simplex Hi; i = 1, S as:Hi = {p\u00afi : p\u00afi \u2208 \u03a0 and p\u00afi(i) = 0} . (4.26)Figure 4.2 illustrates H1 for S = 3. Consider two types of lines, L (ei, p\u00afi) ; i = 1, S, asfollows: For any p\u00afi \u2208 Hi, construct the line L(ei, p\u00afi) that connects p\u00afi to ei as below:L (ei, p\u00afi) = {pi \u2208 \u03a0 : pi = (1\u2212 \u03b3) p\u00afi + \u03b3e1, 0 \u2264 \u03b3 \u2264 1} , p\u00afi \u2208 H1 (4.27)With an abuse of notation, we denote L(ei, p\u00afi) by L(ei). Figure 4.2 illustrates the55A partially ordered set is a set X on which there is a binary relation 4 that is reflexive,antisymmetric, and transitive.914.6. Proof of theoremsdefinition of L(e1).Definition 4.6.3 (MLR ordering on lines). 
pi1 is greater than pi2 with respect to MLRordering on the lines L(ei), denoted as pi1 \u2265Li pi2, if pi1, pi2 \u2208 L(ei), for some p\u00afi \u2208 Hiand pi1 \u2265r pi2.Remark 5 ([99]). For i = 1, S, pi1 \u2265Li pi2 is equivalent to pij = \u03b5jei + (1 \u2212 \u03b5j)p\u00afi, forsome p\u00afi \u2208 Hi and \u03b51 \u2265 \u03b52.The MLR ordering on lines is a complete order, i.e. it forms a chain, i.e. allelements pi, p\u00afi \u2208 L(ei) are comparable, i.e. either pi \u2265Li p\u00afi or p\u00afi \u2265Li pi. The MLRon lines allows us to give a threshold characterization of the optimal policy on thebelief space. An important consequence of assumption (A1) and (A2) is the followingtheorem, which state that the filter T (pi, y) in (4.7) preserves MLR dominance.Theorem 4.6.1 ([99]). If the transition matrix, P , and the observation matrix, B,satisfies the condition in (A1) and (A2), then\u2022 For pi1 \u2265r pi2, the filter satisfies T (pi1, \u00b7) \u2265r T (pi2, \u00b7).\u2022 For pi1 \u2265r pi2, \u03c3(pi1, \u00b7) \u2265s \u03c3(pi2, \u00b7)To prove the structural result, we show that the Q(pi, l, u) in (4.9) is submodularon the lines L(ei); i = 1, S with respect to the MLR order \u2265Li .Definition 4.6.4 (Submodular function). A function f : L(ei) \u00d7 {1, 2} \u2192 IR issubmodular if :f(pi, u)\u2212 f(pi, u\u00af) \u2264 f(p\u00afi, u)\u2212 f(p\u00afi, u\u00af);u \u2264 u\u00af, pi \u2265Li p\u00afi (4.28)Theorem 4.6.2 ([118]). If f(pi, u) is submodular, then there exists a u\u2217(pi) = argmaxu\u2208Uf(pi, u)that is decreasing in pi.4.6.2 Value iterationThe value iteration algorithm is a successive approximation approach for solvingBellman\u2019s equation (4.9). For iterations k = 0, 1, . . . ,Vk+1(pi, l) = maxu\u2208{1,2}Qk+1(pi, l, u), (4.29)924.6. 
Proof of theorems\u00b5k+1(pi, l) = argmaxu\u2208{1,2}Qk+1(pi, l, u), (4.30)whereQk+1(pi, l, 1) = r\u2032pi + \u03c1\u2211yVk(T (pi, y), l \u2212 1)\u03c3(pi, y), (4.31)Qk+1(pi, l, 2) = \u03c1\u2211yVk(T (pi, y), l)\u03c3(pi, y), (4.32)with V0(pi, l) initialized arbitrarily. Define Wk(pi, l) asWk(pi, l) , Vk(pi, l)\u2212 Vk(pi, l \u2212 1). (4.33)The stopping and continue sets (at each iteration k) when l stops are remaining isdefined as follows:Slk+1 = {pi|r\u2032pi \u2265 \u03c1\u2211yWk(T (pi, y), l)\u03c3(pi, y)},C lk+1 = {pi|r\u2032pi < \u03c1\u2211yWk(T (pi, y), l)\u03c3(pi, y)}.(4.34)The optimal stationary policy \u00b5\u2217(pi, l) is given by\u00b5\u2217(pi, l) = limk\u2192\u221e\u00b5k(pi, l). (4.35)Correspondingly, the stationary stopping and continue sets in (4.10) and (4.11) aregiven bySl = limk\u2192\u221eSlk, Cl = limk\u2192\u221eC lk. (4.36)The value function, Vk(pi, l) in (4.29), can be rewritten, using (4.34), as follows:Vk(pi, l) =(r\u2032pi + \u03c1\u2211yVk\u22121(T (pi, y), l \u2212 1)\u03c3(pi, y))ISlk+(\u03c1\u2211yVk\u22121(T (pi, y), l)\u03c3(pi, y))IClk , (4.37)where IClk and ISlk are indicator functions on the continue and stopping sets respec-tively, for each iteration k.Assume Sl\u22121k \u2282 Slk (see Theorem 4.6.5) and substituting (4.37) in the definition934.6. Proof of theoremsof Wk(pi, l) in (4.33),Wk(pi, l) =(\u03c1\u2211yWk\u22121(T (pi, y), l)\u03c3(pi, y))IClk(pi)+ r\u2032piICl\u22121k \u2229Slk(pi) (4.38)+(\u03c1\u2211yWk\u22121(T (pi, y), l \u2212 1)\u03c3(pi, y))ISl\u22121k (pi).In order to prove the main theorem (Theorem 4.2.1), we require the followingresults, proofs of which are provided in 4.6.3.Theorem 4.6.3. Vk(pi, l) is increasing in pi.Theorem 4.6.4. Wk(pi, l) is decreasing in l.Theorem 4.6.5. 
Slk+1 \u2283 Sl\u22121k+14.6.3 ProofsTo prove Theorem 4.6.3, Theorem 4.6.4 and Theorem 4.6.5, we assume that theproposition hold for all values less than k.Proof of Theorem 4.6.3Recall from (4.29),Vk(pi, l) = maxu\u2208{1,2}Qk(pi, l, u),To prove Theorem 4.6.3, we show Qk(pi, l, u) is MLR increasing in pi for u = {1, 2}.Recall from (4.31),Qk(pi, l, 1) = r\u2032pi + \u03c1\u2211y Vk\u22121(T (pi, y), l \u2212 1)\u03c3(pi, y),Using Theorem 4.6.1 and the induction hypothesis, the term\u2211y Vk\u22121(T (pi, y), l \u22121)\u03c3(pi, y) is MLR increasing in pi. From Assumption (A3), r\u2032pi is MLR increasing inpi. The proof for Qk(pi, l, 2) MLR increasing in pi is similar and is omitted. Hence,Vk(pi, l) is MLR increasing in pi.944.6. Proof of theoremsProof of Theorem 4.6.4The proof follows by induction. Recall from (4.38), we haveWk(pi, l \u2212 1) =\u2211yWk\u22121(T (pi, y), l \u2212 1)\u03c3(pi, y)ICl\u22121k (pi)+r\u2032piICl\u22122k \u2229Sl\u22121k (pi)+ (4.39)\u2211yWk\u22121(T (pi, y), l \u2212 2)\u03c3(pi, y)ISl\u22122k (pi)Hence, we compare Wk(pi, l) and Wk(pi, l \u2212 1) in the following 4 regions:a.) Sl\u22122k : Wk(pi, l)\u2212Wk(pi, l \u2212 1) =\u2211y(Wk\u22121(T (pi, y), l \u2212 1)\u2212Wk\u22121(T (pi, y), l \u2212 2))\u03c3(pi, y),which is non-negative by the induction assumption.b.) C l\u22122k \u2229 Sl\u22121k : Wk(pi, l)\u2212Wk(pi, l \u2212 1) =\u2211yWk\u22121(T (pi, y), l \u2212 1)\u03c3(pi, y)\u2212 r\u2032pi,which is non-negative since pi \u2208 Sl\u22121k .c.) C l\u22121k \u2229 Slk : Wk(pi, l)\u2212Wk(pi, l \u2212 1) =r\u2032pi \u2212\u2211yWk\u22121(T (pi, y), l \u2212 1)\u03c3(pi, y),which is non-negative since pi \u2208 C l\u22121k .d.) C lk : Wk(pi, l)\u2212Wk(pi, l \u2212 1) =\u2211y(Wk\u22121(T (pi, y), l)\u2212Wk\u22121(T (pi, y), l \u2212 1))\u03c3(pi, y),which is non-negative by the induction assumption.Proof of Theorem 4.6.5If pi \u2208 Sl\u22121k , then r\u2032pi \u2265\u2211yWk\u22121(T (pi, y), l \u2212 1)\u03c3(pi, y). 
By Theorem 4.6.4, r\u2032pi \u2265\u2211yWk\u22121(T (pi, y), l)\u03c3(pi, y). Hence pi \u2208 Slk.954.6. Proof of theoremsProof of Theorem 4.2.1Existence of optimal policy: In order to show the existence of a threshold policyof L1, we need to show that Qk+1(pi, l, 2)\u2212Qk+1(pi, l, 1) is submodular in pi \u2208 L(e1).Since,Qk+1(pi, l, 2)\u2212Qk+1(pi, l, 1) = \u03c1\u2211yWk(T (pi, y), l)\u03c3(pi, y)\u2212 r\u2032pi.We need to show that \u03c1\u2211yWk(T (pi, y), l)\u03c3(pi, y)\u2212 r\u2032pi is MLR decreasing in pi.\u03c1\u2211yWk(T (pi, y), l)\u03c3(pi, y)\u2212 r\u2032pi (4.40)=\u2211y(\u03c1Wk(T (pi, y), l)\u2212 r\u2032pi)\u03c3(pi, y)=\u2211y((\u03c1Wk(T (pi, y), l)\u2212 \u03c1r\u2032T (pi, y))\u2212 (r\u2032pi \u2212 \u03c1r\u2032T (pi, y)))\u03c3(pi, y)= \u03c1\u2211y(Wk(T (pi, y), l)\u2212 r\u2032T (pi, y))\u03c3(pi, y)\u2212 r\u2032(I \u2212 \u03c1P \u2032)pi (4.41)The term \u2212r\u2032(I \u2212 \u03c1P \u2032)pi in (4.41) is MLR decreasing in pi due to our assumption.Hence, to show that \u03c1\u2211yWk(T (pi, y), l)\u03c3(pi, y) \u2212 r\u2032pi is MLR decreasing in pi it issufficient to show that Wk(pi, l)\u2212 r\u2032pi is MLR decreasing in pi. Define,W\u00afk(pi, l) , Wk(pi, l)\u2212 r\u2032pi (4.42)Now, W\u00afk(pi, l) =(\u2211y \u03c1((W\u00afk\u22121(T (pi, y), l) + r\u2032T (pi, y))\u2212 r\u2032pi)\u03c3(pi, y)) IClk(pi)+(\u2211y \u03c1((W\u00afk\u22121(T (pi, y), l \u2212 1) + r\u2032T (pi, y))\u2212 r\u2032pi)\u03c3(pi, y)) ISlk(pi)=(\u2211y(\u03c1W\u00afk\u22121(T (pi, y), l)\u03c3(pi, y))\u2212 r\u2032(I \u2212 \u03c1P )\u2032pi) IClk(pi)+(\u2211y(\u03c1W\u00afk\u22121(T (pi, y), l \u2212 1)\u03c3(pi, y))\u2212 r\u2032(I \u2212 \u03c1P )\u2032pi) ISlk(pi) (4.43)We prove using induction that W\u00afk(pi, l) is MLR decreasing in pi, using the recursiverelation over k in (4.43).964.6. 
Proof of theoremsFor k = 0,W\u00af0(pi, l) = W0(pi, l)\u2212 r\u2032pi = V0(pi, l)\u2212 V0(pi, l \u2212 1)\u2212 r\u2032pi (4.44)The initial conditions of the value iteration algorithm can be chosen such that W\u00af0(pi, l)in (4.44) is decreasing in pi. A suitable choice of the initial conditions is given below:V0(pi, l) = r\u2032(l\u22121\u2211j=0\u03c1jP j)\u2032pi. (4.45)The intuition behind the initial conditions in (4.45) is that the value function, V0(pi, l)gives the expected total reward if we stop l times successively starting at belief pi.Next, we show that W\u00afk(pi, l) is MLR decreasing in pi, if W\u00afk\u22121(pi, l) is MLR decreas-ing in pi. For pi1 \u2265r pi2, consider the following cases: (a) pi1, pi2 \u2208 Sl\u22121k , (b) pi1 \u2208 Sl\u22121k ,pi2 \u2208 C l\u22121k \u2229 Slk, (c) pi1, pi2 \u2208 C l\u22121k \u2229 Slk, (d) pi1 \u2208 C l\u22121k \u2229 Slk, pi2 \u2208 C lk, (e) pi1, pi2 \u2208 C lk,(f) pi1 \u2208 Sl\u22121k , pi2 \u2208 C lk. For cases (a), (c), (e), W\u00afk(pi1, l) \u2264 W\u00afk(pi2, l) by the inductionassumption. For case (b) W\u00afk(pi1, l) \u2264 W\u00afk(pi2, l), since pi1 \u2208 Sl\u22121k . Case (d) is similarto case (b). For case (f),W\u00afk(pi1, l)\u2212 W\u00afk(pi2, l)=(\u2211y(\u03c1W\u00afk\u22121(T (pi1, y), l \u2212 1)\u03c3(pi1, y))\u2212 r\u2032(I \u2212 \u03c1P )\u2032pi1)\u2212(\u2211y(\u03c1W\u00afk\u22121(T (pi2, y), l)\u03c3(pi2, y))\u2212 r\u2032(I \u2212 \u03c1P )\u2032pi2)\u2264 \u03c1(\u2211y((W\u00afk\u22121(T (pi1, y), l \u2212 1)\u2212 W\u00afk\u22121(T (pi1, y), l))\u03c3(pi1, y)))\u2264 0,where the first inequality is due to induction hypothesis and the second inequality isdue to Theorem 4.6.4. Hence, it is clear that W\u00afk(pi, l) is decreasing in pi, if W\u00afk\u22121(pi, l)is decreasing in pi, finishing the induction step.Characterization of the switching curve \u0393l: For each p\u00afi \u2208 H construct theline segment L(e1, p\u00afi). 
The line segment can be described as (1\u2212\u03b5)p\u00afi+\u03b5e1. On the linesegment L(e1, p\u00afi) all the belief states are MLR orderable. Since \u00b5\u2217(pi, l) is monotonedecreasing in pi, for each l, we pick the largest \u03b5 such that \u00b5\u2217(pi, l) = 1. The beliefstate, pi\u03b5\u2217,p\u00afi is the threshold belief state, where \u03b5\u2217 = inf {\u03b5 \u2208 [0, 1] : \u00b5\u2217(pi\u03b5,p\u00afi) = 1}.Denote by \u0393(p\u00afi) = pi\u03b5\u2217,p\u00afi. The above construction implies that there is a uniquethreshold \u0393(p\u00afi) on L(e1, p\u00afi). The entire simplex can be covered by considering allpairs of lines L(e1, p\u00afi), for p\u00afi \u2208 H1, i.e. \u03a0 = \u222ap\u00afi\u2208HL(e1, p\u00afi). Combining, all points yield974.6. Proof of theoremsa unique threshold curve in \u03a0 given by \u0393 = \u222ap\u00afi\u2208H1\u0393(p\u00afi).Connectedness of Sl: Since e1 \u2208 Sl for all l, call Sla, the subset of Sl thatcontains e1. Suppose Slb is the subset that was disconnected from Sla. Since everypoint on \u03a0 lies on the line segment L(e1, p\u00afi), for some p\u00afi, there exists a line segmentstarting from e1 \u2208 Sla that would leave the set Sla, pass through the set where action2 is optimal and then intersect set Slb, where action 1 is optimal. But, this violatesthe requirement that the policy \u00b5\u2217(pi, l) is monotone on L(e1, p\u00afi). Hence, Sla and Slbare connected.Connectedness of C l: Assume eS \u2208 C l, otherwise C l is empty and there isnothing to prove. Call the set that contains eS as Cla. Suppose Clb \u2282 C l is disconnectedfrom C la. Since every point in \u03a0 lies on the line segment L(eS, p\u00afi), for some p\u00afi, thereexists a line starting from eS \u2208 C la would leave set C la, pass through the set whereaction 1 is optimal and then intersect the set C lb (where action 2 is optimal). 
But this violates the monotone property of $\\mu^*(\\pi, l)$. Nested structure: The proof is straightforward from Theorem 4.6.5. Proof of Theorem 4.3.1 The proof of Theorem 4.3.1 is similar to the proof of Theorem 12.4.1 in [99]. Recall that the linear threshold policy is given by: $$\\mu_\\theta(\\pi, l) = \\begin{cases} 1 & \\text{if } \\begin{bmatrix} 0 & 1 & \\theta_l \\end{bmatrix} \\begin{bmatrix} \\pi \\\\ -1 \\end{bmatrix} \\le 0, \\\\ 2 & \\text{else.} \\end{cases}$$ For any number of stops remaining, $e_1$ (the belief that the state is 1) belongs to the stopping set $S^l$, which gives the first condition $\\theta_l(S-1) \\ge 0$. Consider $\\pi_1 \\ge_{L_1} \\pi_2$. Then $\\pi_1 = \\varepsilon_1 e_1 + (1-\\varepsilon_1)\\bar{\\pi}$ and $\\pi_2 = \\varepsilon_2 e_1 + (1-\\varepsilon_2)\\bar{\\pi}$, for some $\\bar{\\pi} \\in H$ and $\\varepsilon_1 \\ge \\varepsilon_2$ (refer to Remark 5). For the linear policy to be MLR decreasing on lines, $\\mu_\\theta(\\pi_1, l) \\le \\mu_\\theta(\\pi_2, l)$. Hence, $$\\begin{bmatrix} 0 & 1 & \\theta_l \\end{bmatrix} \\begin{bmatrix} \\pi_1 \\\\ -1 \\end{bmatrix} \\le \\begin{bmatrix} 0 & 1 & \\theta_l \\end{bmatrix} \\begin{bmatrix} \\pi_2 \\\\ -1 \\end{bmatrix},$$ $$\\begin{bmatrix} 0 & 1 & \\theta_l \\end{bmatrix} \\begin{bmatrix} \\pi_1 - \\pi_2 \\\\ 0 \\end{bmatrix} \\le 0,$$ $$\\begin{bmatrix} 0 & 1 & \\theta_l \\end{bmatrix} \\begin{bmatrix} (\\varepsilon_1 - \\varepsilon_2) e_1 - (\\varepsilon_1 - \\varepsilon_2)\\bar{\\pi} \\\\ 0 \\end{bmatrix} \\le 0,$$ $$-(\\varepsilon_1 - \\varepsilon_2)\\big[\\bar{\\pi}(2) + \\theta_l(1)\\bar{\\pi}(3) + \\cdots + \\theta_l(S-2)\\bar{\\pi}(S)\\big] \\le 0,$$ giving the second set of conditions $\\theta_l(i) \\ge 0$, $i \\le S-2$. The proof of the second part is similar and hence is omitted. Proof of Theorem 4.3.2 For $l_1 > l_2$, due to the nested structure in Theorem 4.2.1, $S^{l_2} \\subset S^{l_1}$. 
This implies the following: $$\\mu_\\theta(\\pi, l_2) \\ge \\mu_\\theta(\\pi, l_1),$$ $$\\begin{bmatrix} 0 & 1 & \\theta_{l_2} \\end{bmatrix} \\begin{bmatrix} \\pi \\\\ -1 \\end{bmatrix} \\ge \\begin{bmatrix} 0 & 1 & \\theta_{l_1} \\end{bmatrix} \\begin{bmatrix} \\pi \\\\ -1 \\end{bmatrix},$$ $$\\begin{bmatrix} 0 & 0 & \\theta_{l_2} - \\theta_{l_1} \\end{bmatrix} \\begin{bmatrix} \\pi \\\\ -1 \\end{bmatrix} \\ge 0. \\tag{4.46}$$ It is straightforward to check that the conditions in (4.16) in Theorem 4.3.2 satisfy the conditions in (4.46). Chapter 5 Conclusion 5.1 Summary of findings The unifying theme of the thesis was to devise a set of theories and methods for detection, estimation and control in online social media. 
This chapter concludes the work and presents a summary of findings along with some directions for future research and development. \u2022 Chapter 2 considered the problem of detecting a change in utility caused by a linear perturbation. Necessary and sufficient conditions for the detection of the change point were derived. In addition, in the presence of noise, we provided a procedure for detecting the unknown change point and a hypothesis test for detecting dynamic utility maximization. The results were illustrated on a dataset from Yahoo! Tech Buzz. Chapter 2 also considered the practical problem of detecting utility maximizing behaviour in high dimensional datasets. The problem of computational complexity associated with high dimensional datasets was solved using a dimensionality reduction algorithm based on the Johnson-Lindenstrauss transform. \u2022 Chapter 3 conducted a data-driven study of YouTube. The main result is the sensitivity of the view counts of YouTube videos to the meta-level features. We also found that optimizing the meta-data after a video is posted improves the popularity of the video. This is useful for multi-platform networks like BBTV to generate more view count using existing content. Chapter 3 also discusses the social dynamics (the interaction of the channel) that affect the popularity of the channel. Using the Granger causality test, we showed that the view count has a causal effect on the subscriber count of the channel. \u2022 Chapter 4 considered the problem of optimal scheduling of ads on live personalized online social broadcasting channels. First, we cast the problem as an optimal multiple stopping problem in the POMDP framework. Second, we characterized the structural results of the optimal multiple stopping policy. By exploiting the structural results of the optimal multiple stopping policy we computed optimal linear threshold policies using a stochastic gradient algorithm. Finally, we validated the results on real datasets. 
On a real dataset from Periscope, the linear threshold policies outperformed conventional periodic scheduling by 25%. 5.2 Directions for future research The work presented in this thesis can be extended in various directions. \u2022 The change detection problem in Chapter 2 considered the utility change of a single agent. An extension of the change detection framework could include multiple agents (possibly over a social network). A challenging problem is how to model the interaction of the agents and the utility maximization framework for multiple agents. Some work in this direction has been done by considering concave potential games in [119] and [120]. Another possible research direction is to consider multiple change points or higher order perturbation functions. However, there is a trade-off between identifiability and the generality of the model. As shown in [52], if change points are allowed at all time instants then any dataset satisfies the model. \u2022 Chapter 3 conducted a data-driven study of YouTube. However, the conclusions in Chapter 3 are based on the BBTV dataset. Extrapolating these results to other YouTube datasets is an important problem worth addressing in future work. Another extension of the current work could involve studying the effect of video characteristics on different traffic sources, for example the effect of tweets or posts of videos on Twitter or Facebook. \u2022 Chapter 4 considered the problem of optimal scheduling of ads on live online social broadcasting channels. The optimal linear threshold policies were obtained through a stochastic gradient algorithm. However, it is of interest to develop upper and lower myopic bounds to the optimal policy as in [121]. The upper and lower myopic bound policies are computationally easy to implement and can be constructed to be close to the optimal policy. Chapter 4 assumes all ads have the same length and revenue is obtained only through advertising. However, these assumptions rarely hold in practice. 
Some of the ads may be \u201csponsored\u201d (the ads are already paid for) and the ads may be of varying length. Optimizing for ad length and external sources of revenue is an interesting problem to consider. These issues promise to offer interesting avenues for future work. Bibliography [1] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant, \u201cDetecting influenza epidemics using search engine query data,\u201d Nature, vol. 457, no. 7232, pp. 1012\u20131014, 2009. [2] J. Strebel, T. Erdem, and J. Swait, \u201cConsumer search in high technology markets: exploring the use of traditional information channels,\u201d Journal of Consumer Psychology, vol. 14, no. 1, pp. 96\u2013104, 2004. [3] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant, \u201cDetecting influenza epidemics using search engine query data,\u201d Nature, vol. 457, no. 7232, pp. 1012\u20131014, 2009. [4] H. A. Carneiro and E. Mylonakis, \u201cGoogle trends: A web-based tool for real-time surveillance of disease outbreaks,\u201d Clinical Infectious Diseases, vol. 49, no. 10, pp. 1557\u20131564, 2009. [5] X. Zhou, J. Ye, and Y. Feng, \u201cTuberculosis surveillance by analyzing google trends,\u201d IEEE Trans. on Biomedical Engineering, vol. 58, no. 8, pp. 2247\u20132254, Aug 2011. [6] A. Seifter, A. Schwarzwalder, K. Geis, and J. Aucott, \u201cThe utility of google trends for epidemiological research: Lyme disease as an example,\u201d Geospatial Health, vol. 4, no. 2, pp. 135\u2013137, 2010. [7] J. A. Doornik, \u201cImproving the timeliness of data on influenza-like illnesses using google trends,\u201d Tech. Rep., 2010. [8] L. Wu and E. Brynjolfsson, The Future of Prediction: How Google Searches Foreshadow Housing Prices and Sales. University of Chicago Press, 2009, p. 147. [9] A. Tumasjan, T. Sprenger, P. Sandner, and I. 
Welpe, \u201cPredicting elections with twitter: What 140 characters reveal about political sentiment,\u201d in Int. AAAI Conference on Web and Social Media, 2010, pp. 178\u2013185. [10] B. K. Kaye and T. J. Johnson, \u201cOnline and in the know: Uses and gratifications of the web for political information,\u201d Journal of Broadcasting & Electronic Media, vol. 46, no. 1, pp. 54\u201371, 2002. [11] T. Smith, M. Obrist, and P. Wright, \u201cLive-streaming changes the (video) game,\u201d in Proc. of the 11th European Conference on Interactive TV and Video. ACM, 2013, pp. 131\u2013138. [12] N. Archak, V. Mirrokni, and S. Muthukrishnan, \u201cBudget optimization for online campaigns with positive carryover effects,\u201d in Proc. of the 8th International Conference on Internet and Network Economics. Springer-Verlag, 2012, pp. 86\u201399. [13] N. Archak, V. S. Mirrokni, and S. Muthukrishnan, \u201cMining advertiser-specific user behavior using adfactors,\u201d in Proceedings of the 19th International Conference on World Wide Web, ser. WWW \u201910. ACM, 2010, pp. 31\u201340. [14] P. Samuelson, \u201cA note on the pure theory of consumer\u2019s behaviour,\u201d Economica, vol. 5, no. 17, pp. 61\u201371, 1938. [15] S. Afriat, \u201cThe construction of utility functions from expenditure data,\u201d International economic review, vol. 8, no. 1, pp. 67\u201377, 1967. [16] H. Varian, \u201cThe nonparametric approach to demand analysis,\u201d Econometrica, vol. 50, no. 1, pp. 945\u2013973, 1982. [17] W. Diewert and C. Parkan, \u201cTests for the consistency of consumer data,\u201d Journal of Econometrics, vol. 30, no. 1-2, pp. 127\u2013147, 1985. [18] H. Varian, \u201cRevealed preference,\u201d Samuelsonian economics and the twenty-first century, pp. 99\u2013115, 2006. [19] V. Krishnamurthy and W. Hoiles, \u201cAfriat\u2019s test for detecting malicious agents,\u201d IEEE Signal Processing Letters, vol. 19, no. 12, pp. 801\u2013804, 2012. [20] M. Barni and F. 
P\u00e9rez-Gonz\u00e1lez, \u201cCoping with the enemy: Advances in adversary-aware signal processing,\u201d in IEEE Conf. on Acoustics, Speech and Signal Processing, 2013, pp. 8682\u20138686. [21] W. Hoiles and V. Krishnamurthy, \u201cNonparametric demand forecasting and detection of energy aware consumers,\u201d IEEE Transactions on Smart Grid, vol. 6, no. 2, pp. 695\u2013704, March 2015. [22] S. Currarini, M. O. Jackson, and P. Pin, \u201cIdentifying the roles of race-based choice and chance in high school friendship network formation,\u201d Proc. of the National Academy of Sciences, vol. 107, no. 11, pp. 4857\u20134861, 2010. [23] R. K.-X. Jin, D. C. Parkes, and P. J. Wolfe, \u201cAnalysis of bidding networks in eBay: aggregate preference identification through community detection,\u201d in Proc. of AAAI workshop on PAIR, 2007. [24] G. G\u00fcrsun, M. Crovella, and I. Matta, \u201cDescribing and forecasting video access patterns,\u201d in 2011 Proc. of INFOCOM. IEEE, 2011, pp. 16\u201320. [25] H. Pinto, J. Almeida, and M. Gon\u00e7alves, \u201cUsing early view patterns to predict the popularity of YouTube videos,\u201d in Proc. of the sixth ACM Int. Conf. on Web search and Data mining. ACM, 2013, pp. 365\u2013374. [26] C. Richier, E. Altman, R. Elazouzi, T. Jimenez, G. Linares, and Y. Portilla, \u201cBio-inspired models for characterizing YouTube viewcount,\u201d in 2014 IEEE\/ACM Int. Conf. on Advances in Social Networks Analysis and Mining. IEEE, 2014, pp. 297\u2013305. [27] C. Richier, R. Elazouzi, T. Jimenez, E. Altman, and G. Linares, \u201cForecasting online contents\u2019 popularity,\u201d arXiv preprint arXiv:1506.00178, 2015. [28] A. Zhang, \u201cJudging YouTube by its covers,\u201d Department of Computer Science and Engineering, University of California, San Diego, Tech. Rep., 2015. [Online]. Available: http:\/\/cseweb.ucsd.edu\/\u223cjmcauley\/cse255\/reports\/wi15\/Angel%20Zhang.pdf [29] T. Yamasaki, S. Sano, and K. 
Aizawa, \u201cSocial popularity score: Predicting numbers of views, comments, and favorites of social photos using only annotations,\u201d in Proc. of the First Int. Workshop on Internet-Scale Multimedia Management. ACM, 2014, pp. 3\u20138. [30] T. Yamasaki, J. Hu, K. Aizawa, and T. Mei, \u201cPower of tags: Predicting popularity of social media in geo-spatial and temporal contexts,\u201d in Advances in Multimedia Information Processing. Springer, 2015, pp. 149\u2013158. [31] T. Trzcinski and P. Rokita, \u201cPredicting popularity of online videos using support vector regression,\u201d arXiv preprint arXiv:1510.06223, 2015. [32] Y. Ding, Y. Du, Y. Hu, Z. Liu, L. Wang, K. Ross, and A. Ghose, \u201cBroadcast yourself: Understanding YouTube uploaders,\u201d in Proc. of the ACM SIGCOMM Conf. on Internet Measurement. New York, NY, USA: ACM, 2011, pp. 361\u2013370. [33] S. Bollapragada, M. R. Bussieck, and S. Mallik, \u201cScheduling commercial videotapes in broadcast television,\u201d Oper. Res., vol. 52, no. 5, pp. 679\u2013689, Oct. 2004. [34] D. G. Popescu and P. Crama, \u201cAd revenue optimization in live broadcasting,\u201d Management Science, vol. 62, no. 4, pp. 1145\u20131164, 2015. [35] S. Seshadri, S. Subramanian, and S. Souyris, \u201cScheduling spots on television,\u201d 2015. [36] H. Kang and M. P. McAllister, \u201cSelling you and your clicks: examining the audience commodification of google,\u201d Journal for a Global Sustainable Information Society, vol. 9, no. 2, pp. 141\u2013153, 2011. [37] R. Terlutter and M. L. Capella, \u201cThe gamification of advertising: analysis and research directions of in-game advertising, advergames, and advertising in social network games,\u201d Journal of Advertising, vol. 42, no. 2-3, pp. 95\u2013112, 2013. [38] J. Turner, A. Scheller-Wolf, and S. Tayur, \u201cScheduling of dynamic in-game advertising,\u201d Operations Research, vol. 59, no. 1, pp. 1\u201316, 2011. [39] T. 
Nakai, \u201cThe problem of optimal stopping in a partially observable Markov chain,\u201d Journal of Optimization Theory and Applications, vol. 45, no. 3, pp. 425\u2013442, 1985. [40] W. Stadje, \u201cAn optimal k-stopping problem for the poisson process,\u201d in Mathematical Statistics and Probability Theory. Springer, 1987, pp. 231\u2013244. [41] M. Nikolaev, \u201cOn optimal multiple stopping of Markov sequences,\u201d Theory of Probability & Its Applications, vol. 43, no. 2, pp. 298\u2013306, 1999. [42] A. Krasnosielska-Kobos, \u201cMultiple-stopping problems with random horizon,\u201d Optimization, vol. 64, no. 7, pp. 1625\u20131645, 2015. [43] E. Bayraktar and R. Kravitz, \u201cQuickest detection with discretely controlled observations,\u201d Sequential Analysis, vol. 34, no. 1, pp. 77\u2013133, 2015. [44] J. Geng, E. Bayraktar, and L. Lai, \u201cBayesian quickest change-point detection with sampling right constraints,\u201d IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 6474\u20136490, 2014. [45] T. L. Lai, \u201cOn optimal stopping problems in sequential hypothesis testing,\u201d Statistica Sinica, vol. 7, no. 1, pp. 33\u201351, 1997. [46] T. L. Lai, Sequential analysis. Wiley Online Library, 2001. [47] S. H. J. Alexander G. Nikolaev, \u201cStochastic sequential decision-making with a random number of jobs,\u201d Operations Research, vol. 58, no. 4, pp. 1023\u20131027, 2010. [48] S. Savin and C. Terwiesch, \u201cOptimal product launch times in a duopoly: Balancing life-cycle revenues with product cost,\u201d Operations Research, vol. 53, no. 1, pp. 26\u201347, 2005. [49] I. Lobel, J. Patel, G. Vulcano, and J. Zhang, \u201cOptimizing product launches in the presence of strategic consumers,\u201d Management Science, vol. 62, no. 6, pp. 1778\u20131799, 2015. [50] K. E. Wilson, R. Szechtman, and M. P. Atkinson, \u201cA sequential perspective on searching for static targets,\u201d European Journal of Operational Research, vol. 215, no. 1, pp. 
218\u2013226, 2011. [51] M. Atkinson, M. Kress, and R.-J. Lange, \u201cWhen is information sufficient for action? search with unreliable yet informative intelligence,\u201d Operations Research, vol. 64, no. 2, pp. 315\u2013328, 2016. [52] A. Adams, R. Blundell, M. Browning, and I. Crawford, \u201cPrices versus preferences: taste change and revealed preference,\u201d Mar 2015. [Online]. Available: \/uploads\/publications\/wps\/WP201511.pdf [53] D. L. McFadden and M. Fosgerau, \u201cA theory of the perturbed consumer with general budgets,\u201d National Bureau of Economic Research, Tech. Rep., 2012. [54] D. J. Brown and R. L. Matzkin, \u201cEstimation of nonparametric functions in simultaneous equations models, with an application to consumer demand,\u201d 1998. [55] D. Fudenberg, R. Iijima, and T. Strzalecki, \u201cStochastic choice and revealed perturbed utility,\u201d Econometrica, vol. 83, no. 6, pp. 2371\u20132409, 2015. [56] G. C. Chasparis and J. Shamma, \u201cControl of preferences in social networks,\u201d in Decision and Control (CDC), IEEE Conf. on, Dec 2010, pp. 6651\u20136656. [57] M. Basseville and I. Nikiforov, Detection of Abrupt Changes: Theory and Applications, ser. Information and System Sciences Series. New Jersey, USA: Prentice Hall, 1993. [58] M.-F. Balcan, A. Daniely, R. Mehta, R. Urner, and V. V. Vazirani, \u201cLearning economic parameters from revealed preferences,\u201d in Int. Conf. on Web and Internet Economics. Springer, 2014, pp. 338\u2013353. [59] M. Zadimoghaddam and A. Roth, \u201cEfficiently learning from revealed preference,\u201d in Internet and Network Economics. Springer, 2012, pp. 114\u2013127. [60] E. Beigman and R. Vohra, \u201cLearning from revealed preference,\u201d in Proc. of the 7th ACM Conference on Electronic Commerce, ser. EC \u201906, 2006, pp. 36\u201342. [61] O. Chapelle, B. Sch\u00f6lkopf, and A. Zien, Semi-Supervised Learning, 1st ed. The MIT Press, 2010. [62] W. B. Johnson and J. 
Lindenstrauss, \u201cExtensions of Lipschitz mappings into a Hilbert space,\u201d Contemporary mathematics, vol. 26, pp. 189\u2013206, 1984. [63] D. Achlioptas, \u201cDatabase-friendly random projections: Johnson-Lindenstrauss with binary coins,\u201d Journal of Computer and System Sciences, vol. 66, pp. 671\u2013687, 2003. [64] S. S. Vempala, The random projection method. American Mathematical Soc., 2005, vol. 65. [65] B. Mangold, M. Dooley, G. Flake, H. Hoffman, T. Kasturi, D. Pennock, and R. Dornfest, \u201cThe Tech Buzz Game,\u201d Computer, vol. 38, no. 7, pp. 94\u201397, July 2005. [66] Y. Chen, D. M. Pennock, and T. Kasturi, \u201cAn empirical study of dynamic pari-mutuel markets: Evidence from the tech buzz game,\u201d in Proc. of WebKDD, 2008. [67] P. Milgrom and C. Shannon, \u201cMonotone comparative statics,\u201d Econometrica, vol. 62, no. 1, pp. 157\u2013180, 1994. [68] M. Zeni, D. Miorandi, and F. D. Pellegrini, \u201cYoustatanalyzer: A tool for analysing the dynamics of YouTube content popularity,\u201d in Proc. of the 7th Intn. Conf. on Performance Evaluation Methodologies and Tools, 2013, pp. 286\u2013289. [69] X. Cheng, M. Fatourechi, X. Ma, C. Zhang, L. Zhang, and J. Liu, \u201cInsight data of YouTube from a partner\u2019s view,\u201d in Proc. of NOSSDAV Workshop. ACM, 2014, pp. 73\u201378. [70] A. Brodersen, S. Scellato, and M. Wattenhofer, \u201cYouTube around the world: Geographic popularity of videos,\u201d in Proc. of the 21st Int. Conf. on World Wide Web, ser. WWW \u201912. ACM, 2012, pp. 241\u2013250. [71] Y. Ding, Y. Du, Y. Hu, Z. Liu, L. Wang, K. Ross, and A. Ghose, \u201cBroadcast yourself: Understanding YouTube uploaders,\u201d in Proc. of SIGCOMM Conf. on Internet Measurement Conference, 2011, pp. 361\u2013370. [72] M. Gerasimov, V. Kruglov, and A. Volodin, \u201cOn negatively associated random variables,\u201d Lobachevskii Journal of Mathematics, vol. 33, no. 1, pp. 47\u201355, 2012. [73] K. Joag-Dev and F. 
Proschan, \u201cNegative association of random variables with applications,\u201d The Annals of Statistics, pp. 286\u2013295, 1983. [74] G. Huang, Q. Zhu, and C. Siew, \u201cExtreme learning machine: theory and applications,\u201d Neurocomputing, vol. 70, no. 1, pp. 489\u2013501, 2006. [75] G. Huang and L. Chen, \u201cConvex incremental extreme learning machine,\u201d Neurocomputing, vol. 70, no. 16, pp. 3056\u20133062, 2007. [76] G. H. Golub and C. F. Van Loan, Matrix computations. JHU Press, 2012, vol. 3. [77] H. Liu and H. Motoda, Feature selection for knowledge discovery and data mining. Springer Science & Business Media, 2012, vol. 454. [78] U. Sta\u0144czyk and L. Jain, Feature Selection for Data and Pattern Recognition. Springer, 2015. [79] M. Gevrey, I. Dimopoulos, and S. Lek, \u201cReview and comparison of methods to study the contribution of variables in artificial neural network models,\u201d Ecological modelling, vol. 160, no. 3, pp. 249\u2013264, 2003. [80] M. Yamada, W. Jitkrittum, L. Sigal, E. Xing, and M. Sugiyama, \u201cHigh-dimensional feature selection by feature-wise kernelized lasso,\u201d Neural computation, vol. 26, no. 1, pp. 185\u2013207, 2014. [81] C. Hutto and E. Gilbert, \u201cVader: A parsimonious rule-based model for sentiment analysis of social media text,\u201d in Eighth International Conference on Weblogs and Social Media, 2014. [82] M. A. Hall, \u201cCorrelation-based feature selection for machine learning,\u201d Ph.D. dissertation, The University of Waikato, 1999. [83] H. Drucker, \u201cImproving regressors using boosting techniques,\u201d in ICML, vol. 97, 1997, pp. 107\u2013115. [84] T. Hothorn, K. Hornik, and A. Zeileis, \u201cUnbiased recursive partitioning: A conditional inference framework,\u201d Journal of Computational and Graphical statistics, vol. 15, no. 3, pp. 651\u2013674, 2006. [85] C. Bishop, Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006. [86] N. 
Meinshausen, \u201cRelaxed lasso,\u201d Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 374\u2013393, 2007. [87] W. Venables and B. Ripley, Modern applied statistics with S-PLUS. Springer Science & Business Media, 2013. [88] T. Hothorn, P. B\u00fchlmann, T. Kneib, M. Schmid, and B. Hofner, \u201cModel-based boosting 2.0,\u201d Journal of Machine Learning Research, vol. 11, pp. 2109\u20132113, 2010. [89] C. W. Granger, \u201cInvestigating causal relations by econometric models and cross-spectral methods,\u201d Econometrica: Journal of the Econometric Society, pp. 424\u2013438, 1969. [90] G. M. Ljung and G. E. P. Box, \u201cOn a measure of lack of fit in time series models,\u201d Biometrika, vol. 65, no. 2, pp. 297\u2013303, 1978. [91] A. Wald, Sequential analysis. Dover, 1973. [92] L. Jiang, Y. Miao, Y. Yang, Z. Lan, and A. Hauptmann, \u201cViral video style: a closer look at viral videos on YouTube,\u201d in Proceedings of International Conference on Multimedia Retrieval. ACM, 2014, p. 193. [93] F. Figueiredo, F. Benevenuto, and J. Almeida, \u201cThe tube over time: characterizing popularity growth of youtube videos,\u201d in Proceedings of the fourth ACM international conference on Web search and data mining. ACM, 2011, pp. 745\u2013754. [94] A. Tartakovsky, I. Nikiforov, and M. Basseville, Sequential analysis: Hypothesis testing and changepoint detection. CRC Press, 2014. [95] S. Burer and A. Letchford, \u201cNon-convex mixed-integer nonlinear programming: a survey,\u201d Surveys in Operations Research and Management Science, vol. 17, no. 2, pp. 97\u2013106, 2012. [96] D. P. Bertsekas, Nonlinear programming. Athena scientific, 1999. [97] T. Hastie, R. Tibshirani, and J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction, 2nd Edition., ser. Springer series in statistics. Springer, 2009. [98] D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 2017, vol. 1, no. 4. [99] V. 
Krishnamurthy, Partially Observed Markov Decision Processes. Cambridge University Press, 2016. [100] C. H. Papadimitriou and J. Tsitsiklis, \u201cThe complexity of Markov decision processes,\u201d Mathematics of Operations Research, vol. 12, no. 3, pp. 441\u2013450, 1987. [101] G. Yin and Q. Zhang, Discrete-time Markov chains: two-time-scale methods and applications. Springer Science & Business Media, 2006, vol. 55. [102] G. Piao and J. G. Breslin, \u201cExploring dynamics and semantics of user interests for user modeling on twitter for link recommendations,\u201d in Proceedings of the 12th International Conference on Semantic Systems. ACM, 2016, pp. 81\u201388. [103] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2005. [104] G. C. Pflug, Optimization of stochastic models: the interface between simulation and optimization. Springer Science & Business Media, 2012, vol. 373. [105] J. C. Spall, Introduction to stochastic search and optimization: estimation, simulation, and control. John Wiley & Sons, 2005, vol. 65. [106] W. Lovejoy, \u201cA survey of algorithmic methods for partially observed Markov decision processes,\u201d Annals of Operations Research, vol. 28, pp. 47\u201366, 1991. [107] B. Wang, X. Zhang, G. Wang, H. Zheng, and B. Y. Zhao, \u201cAnatomy of a personalized livestreaming system,\u201d in Proceedings of the 2016 Internet Measurement Conference, ser. IMC \u201916. New York, NY, USA: ACM, 2016, pp. 485\u2013498. [108] A. Baldominos G\u00f3mez, E. Albacete Garc\u00eda, I. Marrero, and Y. Saez Achaerandio, \u201cReal-time prediction of gamers behavior using variable order Markov and big data technology: a case of study,\u201d 2016. [109] F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida, \u201cCharacterizing user behavior in online social networks,\u201d in Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, ser. IMC \u201909, 2009, pp. 49\u201362. [110] K. Lewis, M. 
Gonzalez, and J. Kaufman, \u201cSocial selection and peer influence in an online social network,\u201d Proceedings of the National Academy of Sciences, vol. 109, no. 1, pp. 68\u201372, 2012. [111] W. A. Hamilton, O. Garretson, and A. Kerne, \u201cStreaming on twitch: Fostering participatory communities of play within live mixed media,\u201d in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems, 2014, pp. 1315\u20131324. [112] M. Del Vicario, A. Bessi, F. Zollo, F. Petroni, A. Scala, G. Caldarelli, H. E. Stanley, and W. Quattrociocchi, \u201cThe spreading of misinformation online,\u201d Proceedings of the National Academy of Sciences, vol. 113, no. 3, pp. 554\u2013559, 2016. [113] J. Lehmann, M. Lalmas, E. Yom-Tov, and G. Dupret, \u201cModels of user engagement,\u201d in International Conference on User Modeling, Adaptation, and Personalization. Springer, 2012, pp. 164\u2013175. [114] W. Zucchini and I. L. MacDonald, Hidden Markov models for time series: an introduction using R. CRC press, 2009. [115] V. Krishnamurthy and C. R. Rojas, \u201cReduced complexity hmm filtering with stochastic dominance bounds: A convex optimization approach,\u201d IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6309\u20136322, Dec 2014. [116] H. Kurniawati, D. Hsu, and W. S. Lee, \u201cSARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces,\u201d in Robotics: Science and Systems, 2008. [117] S. Karlin and Y. Rinott, \u201cClasses of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions,\u201d Journal of Multivariate Analysis, vol. 10, no. 4, pp. 467\u2013498, December 1980. [118] D. M. Topkis, Supermodularity and complementarity. Princeton university press, 2011. [119] R. Deb, \u201cA testable model of consumption with externalities,\u201d Journal of Economic Theory, vol. 144, no. 4, pp. 1804\u20131816, 2009. [120] W. Hoiles, V. Krishnamurthy, and A. 
Aprem, \u201cPac algorithms for detecting nash equilibrium play in social networks: From twitter to energy markets,\u201d IEEE Access, Special Section: Socially Enabled Networking and Computing, vol. 4, pp. 8147\u20138161, 2016. [121] V. Krishnamurthy and U. Pareek, \u201cMyopic bounds for optimal policy of POMDPs: An extension of Lovejoy\u2019s structural results,\u201d Operations Research, vol. 62, no. 2, pp. 428\u2013434, 2015.","attrs":{"lang":"en","ns":"http:\/\/www.w3.org\/2009\/08\/skos-reference\/skos.html#note","classmap":"oc:AnnotationContainer"},"iri":"http:\/\/www.w3.org\/2009\/08\/skos-reference\/skos.html#note","explain":"Simple Knowledge Organisation System; Notes are used to provide information relating to SKOS concepts. There is no restriction on the nature of this information, e.g., it could be plain text, hypertext, or an image; it could be a definition, information about the scope of a concept, editorial information, or any other type of information."}],"Genre":[{"label":"Genre","value":"Thesis\/Dissertation","attrs":{"lang":"en","ns":"http:\/\/www.europeana.eu\/schemas\/edm\/hasType","classmap":"dpla:SourceResource","property":"edm:hasType"},"iri":"http:\/\/www.europeana.eu\/schemas\/edm\/hasType","explain":"A Europeana Data Model Property; This property relates a resource with the concepts it belongs to in a suitable type system such as MIME or any thesaurus that captures categories of objects in a given field. 
It does NOT capture aboutness"}],"GraduationDate":[{"label":"GraduationDate","value":"2018-02","attrs":{"lang":"en","ns":"http:\/\/vivoweb.org\/ontology\/core#dateIssued","classmap":"vivo:DateTimeValue","property":"vivo:dateIssued"},"iri":"http:\/\/vivoweb.org\/ontology\/core#dateIssued","explain":"VIVO-ISF Ontology V1.6 Property; Date Optional Time Value, DateTime+Timezone Preferred "}],"IsShownAt":[{"label":"IsShownAt","value":"10.14288\/1.0361160","attrs":{"lang":"en","ns":"http:\/\/www.europeana.eu\/schemas\/edm\/isShownAt","classmap":"edm:WebResource","property":"edm:isShownAt"},"iri":"http:\/\/www.europeana.eu\/schemas\/edm\/isShownAt","explain":"A Europeana Data Model Property; An unambiguous URL reference to the digital object on the provider\u2019s website in its full information context."}],"Language":[{"label":"Language","value":"eng","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/language","classmap":"dpla:SourceResource","property":"dcterms:language"},"iri":"http:\/\/purl.org\/dc\/terms\/language","explain":"A Dublin Core Terms Property; A language of the resource.; Recommended best practice is to use a controlled vocabulary such as RFC 4646 [RFC4646]."}],"Program":[{"label":"Program","value":"Electrical and Computer Engineering","attrs":{"lang":"en","ns":"https:\/\/open.library.ubc.ca\/terms#degreeDiscipline","classmap":"oc:ThesisDescription","property":"oc:degreeDiscipline"},"iri":"https:\/\/open.library.ubc.ca\/terms#degreeDiscipline","explain":"UBC Open Collections Metadata Components; Local Field; Indicates the program for which the degree was granted."}],"Provider":[{"label":"Provider","value":"Vancouver : University of British Columbia Library","attrs":{"lang":"en","ns":"http:\/\/www.europeana.eu\/schemas\/edm\/provider","classmap":"ore:Aggregation","property":"edm:provider"},"iri":"http:\/\/www.europeana.eu\/schemas\/edm\/provider","explain":"A Europeana Data Model Property; The name or identifier of the organization who delivers data 
directly to an aggregation service (e.g. Europeana)"}],"Publisher":[{"label":"Publisher","value":"University of British Columbia","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/publisher","classmap":"dpla:SourceResource","property":"dcterms:publisher"},"iri":"http:\/\/purl.org\/dc\/terms\/publisher","explain":"A Dublin Core Terms Property; An entity responsible for making the resource available.; Examples of a Publisher include a person, an organization, or a service."}],"Rights":[{"label":"Rights","value":"Attribution-NonCommercial-NoDerivatives 4.0 International","attrs":{"lang":"*","ns":"http:\/\/purl.org\/dc\/terms\/rights","classmap":"edm:WebResource","property":"dcterms:rights"},"iri":"http:\/\/purl.org\/dc\/terms\/rights","explain":"A Dublin Core Terms Property; Information about rights held in and over the resource.; Typically, rights information includes a statement about various property rights associated with the resource, including intellectual property rights."}],"RightsURI":[{"label":"RightsURI","value":"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/","attrs":{"lang":"*","ns":"https:\/\/open.library.ubc.ca\/terms#rightsURI","classmap":"oc:PublicationDescription","property":"oc:rightsURI"},"iri":"https:\/\/open.library.ubc.ca\/terms#rightsURI","explain":"UBC Open Collections Metadata Components; Local Field; Indicates the Creative Commons license url."}],"ScholarlyLevel":[{"label":"ScholarlyLevel","value":"Graduate","attrs":{"lang":"en","ns":"https:\/\/open.library.ubc.ca\/terms#scholarLevel","classmap":"oc:PublicationDescription","property":"oc:scholarLevel"},"iri":"https:\/\/open.library.ubc.ca\/terms#scholarLevel","explain":"UBC Open Collections Metadata Components; Local Field; Identifies the scholarly level of the author(s)\/creator(s)."}],"Title":[{"label":"Title","value":"Detection, estimation and control in online social 
media","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/title","classmap":"dpla:SourceResource","property":"dcterms:title"},"iri":"http:\/\/purl.org\/dc\/terms\/title","explain":"A Dublin Core Terms Property; The name given to the resource."}],"Type":[{"label":"Type","value":"Text","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/type","classmap":"dpla:SourceResource","property":"dcterms:type"},"iri":"http:\/\/purl.org\/dc\/terms\/type","explain":"A Dublin Core Terms Property; The nature or genre of the resource.; Recommended best practice is to use a controlled vocabulary such as the DCMI Type Vocabulary [DCMITYPE]. To describe the file format, physical medium, or dimensions of the resource, use the Format element."}],"URI":[{"label":"URI","value":"http:\/\/hdl.handle.net\/2429\/63811","attrs":{"lang":"en","ns":"https:\/\/open.library.ubc.ca\/terms#identifierURI","classmap":"oc:PublicationDescription","property":"oc:identifierURI"},"iri":"https:\/\/open.library.ubc.ca\/terms#identifierURI","explain":"UBC Open Collections Metadata Components; Local Field; Indicates the handle for item record."}],"SortDate":[{"label":"Sort Date","value":"2017-12-31 AD","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/date","classmap":"oc:InternalResource","property":"dcterms:date"},"iri":"http:\/\/purl.org\/dc\/terms\/date","explain":"A Dublin Core Elements Property; A point or period of time associated with an event in the lifecycle of the resource.; Date may be used to express temporal information at any level of granularity. Recommended best practice is to use an encoding scheme, such as the W3CDTF profile of ISO 8601 [W3CDTF].; A point or period of time associated with an event in the lifecycle of the resource.; Date may be used to express temporal information at any level of granularity. Recommended best practice is to use an encoding scheme, such as the W3CDTF profile of ISO 8601 [W3CDTF]."}]}