Applied Science, Faculty of
Electrical and Computer Engineering, Department of
DSpace
UBCV
Aprem, Anup
2017-12-04T23:13:56Z
2017
Doctor of Philosophy - PhD
University of British Columbia
Due to the large-scale use of online social media, there has been growing interest in the modeling and analysis of data from online social media. The unifying theme of this thesis is to develop a set of mathematical tools for detection, estimation and control in online social media. The following are the main contributions of this thesis: Chapter 2 deals with nonparametric change detection for dynamic utility maximization agents. Using the revealed preference framework, necessary and sufficient conditions for detecting the change point are derived. In the presence of noisy measurements, we construct a decision test to check for dynamic utility maximization behaviour and the change point. Experiments on the Yahoo! Tech Buzz dataset show that the framework can be used to detect changes in ground truth using online search data. Chapter 3 studies engagement dynamics and sensitivity analysis of YouTube videos. Using machine learning and sensitivity analysis techniques, it is shown that the video view count is sensitive to 5 meta-level features. In addition, changing the meta-level features after the video has been posted increases the popularity of the video. We also examine how the social dynamics of a YouTube channel affect its popularity. The results are empirically validated on a real-world dataset consisting of about 6 million videos spread over 25 thousand channels. Chapter 4 considers the problem of scheduling advertisements in live personalized online social media. Broadcasters aim to opportunistically schedule advertisements (ads) so as to generate maximum revenue. The problem is formulated as a multiple stopping problem and is addressed in a partially observed Markov decision process (POMDP) framework. Structural results are provided on the optimal ad scheduling policy. By exploiting the structure of the optimal policy, optimal linear threshold policies are computed using a stochastic gradient algorithm. The proposed model and framework are validated on a Periscope dataset, and it is found that the revenue can be improved by 25% in comparison to currently employed periodic scheduling.
https://circle.library.ubc.ca/rest/handle/2429/63811?expand=metadata
Detection, Estimation and Control in Online Social Media

by

Anup Aprem

M.E., Indian Institute of Science, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies
(Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

December 2017

© Anup Aprem 2017

Abstract

Due to the large-scale use of online social media, there has been growing interest in the modeling and analysis of data from online social media. The unifying theme of this thesis is to develop a set of mathematical tools for detection, estimation and control in online social media. The following are the main contributions of this thesis:

• Chapter 2 deals with nonparametric change detection for dynamic utility maximization agents. Using the revealed preference framework, necessary and sufficient conditions for detecting the change point are derived. In the presence of noisy measurements, we construct a decision test to check for dynamic utility maximization behaviour and the change point. Experiments on the Yahoo! Tech Buzz dataset show that the framework can be used to detect changes in ground truth using online search data.

• Chapter 3 studies engagement dynamics and sensitivity analysis of YouTube videos. Using machine learning and sensitivity analysis techniques, it is shown that the video view count is sensitive to 5 meta-level features. In addition, changing the meta-level features after the video has been posted increases the popularity of the video. We also examine how the social dynamics of a YouTube channel affect its popularity. The results are empirically validated on a real-world dataset consisting of about 6 million videos spread over 25 thousand channels.

• Chapter 4 considers the problem of scheduling advertisements in live personalized online social media. Broadcasters aim to opportunistically schedule advertisements (ads) so as to generate maximum revenue. The problem is formulated as a multiple stopping problem and is addressed in a partially observed Markov decision process (POMDP) framework. Structural results are provided on the optimal ad scheduling policy. By exploiting the structure of the optimal policy, optimal linear threshold policies are computed using a stochastic gradient algorithm. The proposed model and framework are validated on a Periscope dataset, and it is found that the revenue can be improved by 25% in comparison to currently employed periodic scheduling.

Lay Summary

This thesis studies three problems in online social media:

• A framework for detecting changes in ground truth, such as the onset of epidemic diseases, using online search data. The proposed framework is quite general; the only assumption is that the data is generated by users with rational behaviour.

• Estimating the parameters that make YouTube videos popular. In addition, we examine how the social dynamics of a YouTube channel affect the popularity of its videos.

• A framework for scheduling advertisements in personalized live social media applications like Periscope so as to maximize the advertisement revenue. Numerical results show that the proposed framework performs 25% better than existing methods.

Preface

The work presented in this thesis is based on the research and development conducted in the Statistical Signal Processing Laboratory at the University of British Columbia (Vancouver). The research work presented in the chapters of this dissertation was performed by the author with feedback and assistance provided by Prof. Vikram Krishnamurthy. The author is responsible for the writeup, problem formulation, research development, data analyses and numerical studies presented in this dissertation, with frequent suggestions and technical and editorial feedback from Prof. Vikram Krishnamurthy. The ELM formulation in Chapter 3 is due in part to Dr. William Hoiles. The dataset used in Chapter 3 is from BroadBandTV Corp.
The mathematical proofs in Chapter 4 were by Prof. Krishnamurthy. For Chapter 4, Sujay Bhatt provided valuable insights into the application and formulation of the problem.

The work presented in different chapters of the thesis has appeared in several publications, which are listed below. In these publications, all co-authors contributed to the editing of the manuscript.

• The work of Chapter 2 has been presented in the following publications:
  – [Journal Paper] A. Aprem and V. Krishnamurthy, Utility Change Point Detection in Online Social Media: A Revealed Preference Framework, IEEE Transactions on Signal Processing, vol. 64, no. 7, pp. 1869-1880
  – [Conference Paper] A. Aprem and V. Krishnamurthy, A Data Centric Approach to Utility Change Detection in Online Social Media, IEEE Conf. on Acoustics, Speech, and Signal Processing, New Orleans, LA, USA, March 5-9, 2017

• The work of Chapter 3 has been presented in the following publication:
  – [Journal Paper] W. Hoiles, A. Aprem and V. Krishnamurthy, Engagement dynamics and sensitivity analysis of YouTube videos, IEEE Transactions on Knowledge and Data Engineering, vol. 7, pp. 1426-1437

• Materials in Chapter 4 have appeared in the following publications and pre-prints for possible publication:
  – [Conference Paper] V. Krishnamurthy, A. Aprem and S. Bhatt, Multiple Stopping Time Problems: Structural Results, 54th Annual Allerton Conference on Communication, Control, and Computing, 2016
  – [Journal Paper] V. Krishnamurthy, A. Aprem and S. Bhatt, Opportunistic Advertisement Scheduling in Live Social Media: A Multiple Stopping Time POMDP Approach, https://arxiv.org/abs/1611.00291

• Although not presented in this thesis, the discussion and results in Chapter 2 were inspired by the work presented in the following publication:
  – [Journal Paper] W. Hoiles, V. Krishnamurthy and A. Aprem, PAC Algorithms for Detecting Nash Equilibrium Play in Social Networks: From Twitter to Energy Markets, IEEE Access, Special Section: Socially Enabled Networking and Computing.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Overview
    1.1.1 Utility change detection in online social media
    1.1.2 Engagement dynamics and sensitivity analysis of YouTube videos
    1.1.3 Interactive advertisement scheduling in personalized live social media
  1.2 Main Contributions
    1.2.1 Utility change detection in online social media
    1.2.2 Engagement dynamics and sensitivity analysis of YouTube videos
    1.2.3 Interactive advertisement scheduling in personalized live social media
  1.3 Related Works
    1.3.1 Utility change detection in online social media
    1.3.2 Engagement dynamics and sensitivity analysis of YouTube videos
    1.3.3 Interactive advertisement scheduling in live personalized social media
  1.4 Thesis outline
2 Utility Change Detection in Online Social Media
  2.1 Introduction
  2.2 Background: Utility maximization and Revealed preference
    2.2.1 Revealed preference in a noisy setting
  2.3 Revealed preference: Utility change point detection (deterministic case)
    2.3.1 Utility change model
    2.3.2 Recovery of minimum perturbation and base utility function
    2.3.3 Comparison with classical change detection
  2.4 Revealed preference: Utility change point detection in noise
    2.4.1 Estimation of unknown change point
    2.4.2 Recovering the linear perturbation coefficients for minimum false alarm probability
  2.5 Dimensionality reduction: Revealed preference for big data
  2.6 Numerical results
    2.6.1 Detection of unknown change point in the presence of noise
    2.6.2 Yahoo! Buzz Game
    2.6.3 Youstatanalyzer database
  2.7 Closing remarks
  2.8 Proof of theorems
    2.8.1 Proof of Theorem 2.3.1
    2.8.2 Negative dependence of random variables
    2.8.3 Proof of Theorem 2.2.2
    2.8.4 Proof of Theorem 2.4.1
    2.8.5 Proof of Lemma 2.8.2
    2.8.6 Proof of Lemma 2.8.3
    2.8.7 CUSUM algorithm for utility change point detection
3 Engagement Dynamics and Sensitivity Analysis of YouTube videos
  3.1 Introduction
  3.2 Sensitivity analysis of YouTube meta-level features
    3.2.1 Extreme learning machine (ELM)
    3.2.2 Sensitivity analysis (Background)
    3.2.3 Sensitivity of YouTube meta-level features and predicting view count
    3.2.4 Sensitivity to meta-level optimization
  3.3 Social interaction of the channel with YouTube users
    3.3.1 Causality between subscribers and view count in YouTube
    3.3.2 Scheduling dynamics in YouTube
    3.3.3 Modeling the view count dynamics of videos with exogenous events
    3.3.4 Video playthrough dynamics
  3.4 Closing remarks
  3.5 Supplementary material
    3.5.1 Description of YouTube dataset
    3.5.2 Background: Statistical learning algorithms
4 Interactive Advertisement Scheduling in Personalized Live Social Media
  4.1 Sequential multiple stopping and stochastic dynamic programming
    4.1.1 Optimal multiple stopping: POMDP formulation
    4.1.2 Belief state formulation of the objective
    4.1.3 Stochastic dynamic programming
  4.2 Optimal multiple stopping: Structural results
    4.2.1 Definitions
    4.2.2 Assumptions
    4.2.3 Main result: Optimality of threshold policies
  4.3 Stochastic gradient algorithm for estimating optimal linear threshold policies
    4.3.1 Structure of optimal linear threshold policies for multiple stopping
    4.3.2 Simulation-based stochastic gradient algorithm for estimating linear threshold policies
  4.4 Numerical examples: Interactive advertising in live social media
    4.4.1 Synthetic data
    4.4.2 Real dataset: Interactive ad scheduling on Periscope using viewer engagement
    4.4.3 Large state space models & Comparison with SARSOP
  4.5 Closing remarks
  4.6 Proof of theorems
    4.6.1 Preliminaries and Definitions
    4.6.2 Value iteration
    4.6.3 Proofs
5 Conclusion
  5.1 Summary of findings
  5.2 Directions for future research
Bibliography

List of Tables

2.1 Comparison of revealed preference with classical change detection algorithms
3.1 Performance and feature sensitivity
3.2 Sensitivity to meta-level optimization
3.3 Sensitivity of various traffic sources to meta-level optimization
3.4 Fraction of channels satisfying the Granger causality hypothesis
3.5 Metadata of YouTube channel and video
3.6 Dataset summary
3.7 YouTube dataset categories (out of 6 million videos)
3.8 Popularity distribution of videos in the dataset
3.9 Optimization summary statistics
4.1 BIC model order selection
4.2 Comparison of expected reward and the computational complexity

List of Figures

1.1 Meta-level features of a YouTube video
2.1 Upper bound on the CDF of M
2.2 Estimated change point detection and comparison to CUSUM algorithm
2.3 Buzz scores and trading price for WIFI and WIMAX
2.4 Recovered utility function
3.1 View count and subscriber count of YouTube videos
3.2 Sensitivity of meta-level features
3.3 ELM predictive accuracy
3.4 Granger causality test
3.5 View count due to virality, migration and exogenous events
3.6 Video playthrough dynamics
3.7 Fraction of YouTube videos in the dataset as a function of the age of the videos
4.1 Advertisement scheduling in personalized live social media
4.2 Visual illustration of Theorem 4.2.1
4.4 Periscope: Plot of likes and QQ-plot

Acknowledgements

I owe my sincere gratitude to Prof. Vikram Krishnamurthy. I am deeply indebted to him for the guidance, encouragement and support that I received during the completion of this work. This feat would not have been possible without his consistent support, fruitful discussions, valuable feedback, and constructive suggestions. It has been a pleasure working under his supervision.

The Statistical Signal Processing Lab in the ECE department provided an engaging, productive, and stimulating environment in which to conduct research. Special thanks to Dr. William Hoiles for his time, effort and helpful discussions during our collaborations.

Last but certainly not least, I am indebted to my parents for their unfailing confidence, encouragement and love. The opportunities that they have given me and their unlimited sacrifices are the reasons for where I am and what I have accomplished so far.

Above all, I thank the Almighty for the blessings that He has showered on me, which gave me the strength and power to sail through difficult times.

Dedication

To my parents. . .

Chapter 1
Introduction

1.1 Overview

In recent years, social media has become ubiquitous for networking and content sharing and has become the preferred means by which humans interact online. For example, close to 1.2 billion users log onto Facebook daily to share content with other people on the social network, and Twitter, a micro-blogging website, gets over 50 million tweets a day. Facebook posts and tweets on Twitter typically reflect events as they happen in real life.
With the large-scale usage of social media and the availability of data, online social media has become a new sensor for social behaviour.

Another paradigm shift in online social media is that more and more users are creating, sharing and distributing content on the Web. One of the most popular sites for user-generated video content is YouTube. YouTube has grown so popular that around 300 hours of video are uploaded every minute. YouTube earns revenue from user-generated content through advertising. In 2015, YouTube grossed 8.5 billion dollars in advertisement revenue. Through the Partner program, YouTube shares the advertising revenue with the content creators. In recent years, due to improved bandwidth and the ability to create content in real time, live streaming has become a popular means to generate and share content online. Live streaming was made popular by Twitch.tv, which deals with live video gaming, playthroughs of video games, and e-sport competitions. The revenue of Twitch.tv was around 3.8 billion for the year 2015, out of which 77% was generated from advertisement. New applications like Periscope and Meerkat brought personalized live streaming to smartphones and made it even more popular.

As a result of the enormous importance of, and potential for generating revenue in, online social media, there has been growing interest in the modeling and analysis of online social media. Statistical inference from social media is an area that has witnessed tremendous progress recently. For example, [1] shows that, using online search data, the onset of influenza-type disease can be detected with a lag of 1-2 days, which outperforms comparable methods from the Centers for Disease Control and Prevention (CDC), which usually have a lag of 1-2 weeks. However, online behaviour in social media is not influenced by economic incentives but rather by intrinsic utility like “curiosity”, “fun” and “fame”. Hence, these problems are non-standard, and traditional model-based methods are not suitable for analyzing them.

Advertising revenue in online social media, for example from YouTube videos, is related to the “popularity” of the content, i.e. the number of views gathered by the video. Among the important factors that affect the popularity of a YouTube video are its meta-level features, such as the title, tags, keywords and description of the video. The study of the popularity of YouTube videos based on meta-level features is a challenging problem given the diversity of users and content providers.

In addition to traditional offline videos on YouTube, live streaming on YouTube and Twitch.tv and personalized live streaming in applications like Periscope also offer opportunities to generate revenue through advertisement. The revenue obtained in live videos depends on the click rate and completion rate of the advertisements. It is well known that the click rate and completion rate depend on the interest in the content of the video at any time. Viewers click and watch the advertisement if it is inserted when the content is interesting. Hence, advertisements need to be scheduled when the interest is high, so as to obtain maximum revenue.

With the above applications of online social media, there is a strong motivation to develop a set of mathematical models and algorithmic tools to study online social media. The generation and consumption of social media is driven by humans, and hence research in social media requires borrowing techniques from multiple disciplines such as micro-economics, decision theory and machine learning. The thesis is devoted to the development of algorithms and procedures aimed at detection, estimation and decision making in online social media.
The rest of this section is devoted to an overview of these topics, along with the motivation and research goals that have been addressed in this thesis.

1.1.1 Utility change detection in online social media

Utility maximization is the fundamental problem humans face, wherein humans maximize utility given their limited resources of money or attention. However, the majority of content in social media is user-generated with limited or no economic incentives. As explained above, incentives such as “fun” and “fame” are some of the major attributes of the utility function of online social behaviour. It is therefore difficult to analytically characterize the utility function, and hence any detection of utility maximization behaviour in online social media needs to be nonparametric in nature. The problem of nonparametric detection of utility maximizing behaviour is the central theme in the area of revealed preferences in microeconomics. This is fundamentally different from the theme used widely in the signal processing literature, where one postulates an objective function (typically convex) and then develops optimization algorithms. In contrast, the revealed preference framework considered in this thesis is data centric: given a dataset, revealed preference answers the question of whether the dataset is consistent with utility-maximization behaviour.

Chapter 2 of this thesis considers the extension of the classical revealed preference framework to agents with a “dynamic utility function”. The utility function undergoes a jump change at an unknown time instant (the change point) through a linear perturbation. Such changes in the utility function arise in online search in social media. Online search is currently the most popular method for information retrieval [2]. The online search process can be seen as an example of an agent maximizing the information utility, i.e. the amount of information consumed by an online agent given the limited resources of time and attention.
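The data-centric question above (is a given dataset consistent with utility maximization?) can be illustrated with one standard check from the revealed preference literature, the Generalized Axiom of Revealed Preference (GARP). The following is a minimal sketch and is not taken from this thesis: it assumes each observation is a pair of a price-like probe vector and the chosen response bundle, and the function name `satisfies_garp`, the tolerance and the toy data are hypothetical. Chapter 2 may well use a different, equivalent formulation.

```python
import numpy as np

def satisfies_garp(prices, bundles, tol=1e-9):
    """Return True if the observations (p_t, x_t) satisfy GARP (illustrative sketch)."""
    p = np.asarray(prices, dtype=float)
    x = np.asarray(bundles, dtype=float)
    # cost[t, s] = p_t . x_s: cost of bundle s at the prices of observation t
    cost = p @ x.T
    own = np.diag(cost)  # own[t] = p_t . x_t, the budget actually spent at t
    # Direct revealed preference: x_t R0 x_s if x_s was affordable when x_t was chosen
    pref = own[:, None] >= cost - tol
    # Transitive closure of R0 (Warshall's algorithm)
    for k in range(len(own)):
        pref |= pref[:, [k]] & pref[[k], :]
    # Strict direct preference: strict[s, t] is True if p_s . x_s > p_s . x_t
    strict = own[:, None] > cost + tol
    # GARP is violated if x_t R x_s while x_s is strictly directly preferred to x_t
    return not np.any(pref & strict.T)

# Hypothetical toy data: two observations over two goods.
consistent = satisfies_garp([[1, 2], [2, 1]], [[2, 1], [1, 2]])
inconsistent = satisfies_garp([[1, 2], [2, 1]], [[1, 2], [2, 1]])
```

In the first toy dataset the agent buys more of whichever good is cheaper, so no preference cycle arises; in the second, each chosen bundle is strictly more expensive than the other bundle at the prevailing prices, which produces a cycle and a GARP violation. The check makes no parametric assumption about the utility function, which is the sense in which the framework is nonparametric.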
There has been a gamut of research which links internet search behaviour to ground truths such as symptoms of illness, political elections, or major sporting events [3–8]. Detection of utility change in online search, therefore, helps to identify changes in ground truth and is useful, for example, for early containment of diseases [4] or for predicting changes in political opinion [9, 10].

Research Goals

Motivated by the detection of changes in ground truth using data from online search, our aim is to develop a nonparametric test to detect the change point and the utility functions before and after the change, which is henceforth referred to as the change point detection problem. In practical settings, the data from online social media is measured in noise. Hence, we need a decision test such that dynamic utility maximization can be detected under noisy measurements. A related important practical issue that we also consider in this thesis is the high dimensionality of data arising in online social media. As an example of high dimensional data arising in online social media, we investigate the detection of the utility maximization process inherent in video sharing via YouTube. This motivates developing computationally efficient algorithms for detecting utility maximization.

1.1.2 Engagement dynamics and sensitivity analysis of YouTube videos

YouTube generates billions in revenue through advertising and, through the Partner program, shares the revenue with the content creators. The video view count is a key measure of the popularity or “user engagement” of a video and the metric by which YouTube pays content providers. Hence, it is of significant importance for content creators to know which meta-level features are most sensitive for promoting video popularity. Figure 1.1 shows the meta-level features of a YouTube video.
The meta-level features of a YouTube video are the title, tags, keywords and the description of the video.

Figure 1.1: The meta-level features of a video include the title, tags, keywords and the description of the video.

Meta-level features, apart from the video content, play an important role in driving traffic to a YouTube video, thereby increasing the popularity of the video. The title and the description of the video are the critical features through which a video is discovered via YouTube/Google search. In addition to the title and the description of the video, the tags are used by YouTube to recommend similar videos to users. The thumbnail of the video should be eye-catching and of good quality to attract users to click and watch the video.

Research Goals

The key question is: How do meta-level features of a posted video drive user engagement in the YouTube social network? Specifically, what are the most critical features that affect the popularity of the video? However, the content alone does not influence the popularity of a video. YouTube also has a social network layer on top of its media content. The main social component is how the content creators (also called “channels”) interact with the users. So another key question is: How does the interaction of the YouTube channel with the users affect the popularity of videos? Chapter 3 studies both of the above questions. In particular, our aim is to examine how the individual video features (through the meta-level data) and the social dynamics contribute to the popularity of a video.

1.1.3 Interactive advertisement scheduling in personalized live social media

The popularity of live video streaming has seen sharp growth due to improved bandwidth for streaming and the ease of sharing live content on internet platforms. In addition, with the advent of smartphones with high quality cameras and the widespread deployment of 4G LTE based data services, personalized online video streaming is growing in popularity and includes mobile applications such as Periscope and Meerkat. One of the primary motivations for users to generate live content is that platforms like YouTube, Twitch, etc. allow users to generate revenue through advertising and royalties.

Some of the common ways advertisements (ads) are scheduled on pre-recorded video content on social media like YouTube are pre-roll, mid-roll and post-roll, where the names indicate the time at which the ads are displayed. In a recent study, Adobe Research¹ concluded that mid-roll video ads constitute the most engaging ad type for pre-recorded video content, outperforming pre-roll and post-roll ads when it comes to completion rate (the probability that the ad will not be skipped). Viewers are more likely to engage with an ad if they are interested in the content of the video that the ad is inserted into.

¹ https://gigaom.com/2012/04/16/adobe-ad-research/

When a channel is streaming a live video, the mid-roll ads need to be scheduled manually. Twitch allows only periodic ad scheduling [11], and YouTube and other personalized live services currently offer no automated method of scheduling ads for live channels. The ad revenue in a live channel depends on the click rate (the probability that the ad will be clicked), which in turn depends on the interest in the channel content. Hence, ads need to be scheduled when the interest in the content is high.

Research Goals

Chapter 4 deals with optimal scheduling of ads on personalized live channels in social media by considering viewer engagement, termed active scheduling, to maximize the revenue generated from the advertisements. We model the interest of the live content using a Markov chain [12, 13].
The viewer engagement of the content is notobserved directly, however, noisy observation of the viewer engagement is obtainedby comments and likes by the viewers. Hence, the problem of computing the optimalpolicy of scheduling ads on live channel can then be formulated as an instance ofa stochastic control problem called the partially observable Markov decision process(POMDP). Typically, computing the optimal policy of a POMDP is computationallyintractable. However, by introducing assumptions on the POMDP model, importantstructural properties of the optimal policy can be determined without brute-forcecomputations. These structural properties can then be exploited to compute theoptimal policy.1.2 Main ContributionsIn this section, a brief summary of major novel contributions of the chapters thatconstitute this thesis is provided in order that they appear in the thesis. More detaileddescription of the contributions and findings of each chapter is provided in individualchapters.1.2.1 Utility change detection in online social mediaChapter 2 considers the problem of utility change detection under the revealed pref-erence framework.The main contributions of Chapter 2 are summarized below:1. In Sec 2.3, we derive necessary and sufficient conditions for change point de-tection, for dynamic utility maximizing agents under the revealed preferenceframework.2. Sec 2.4, studies the change point detection problem in the presence of noise. Inthe presence of noisy measurements, we propose a method to detect the changepoint and construct a decision test.3. To reduce the computational cost associated with high dimensional data arisingin the context of revealed preference, a dimensionality reduction algorithm using61.2. Main ContributionsJohnson-Lindenstrauss transform is presented.4. The results developed are illustrated on the Yahoo! Tech Buzz dataset. Byusing the results developed in the paper, several useful insights can be gleanedfrom these data sets. 
First, the changes in ground truth affecting the utility of the agent can be detected by utility maximization behaviour in online search. Second, the recovered utility functions satisfy the single crossing property, indicating strategic substitute behaviour in online search.

1.2.2 Engagement dynamics and sensitivity analysis of YouTube videos
Chapter 3 investigates how the meta-level features and the interaction of the YouTube channel with the users affect the popularity of videos.
The main contributions of Chapter 3 are summarized below:
1. The five dominant meta-level features that affect the popularity of a video are: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, and number of keywords. Sec. 3.2 discusses this further.
2. Optimizing the meta-level features (e.g. thumbnail, title, tags, description) after a video has been posted increases the popularity of the video. In addition, optimizing the title increases the traffic due to YouTube search, optimizing the thumbnail increases the traffic from related videos, and optimizing the keywords increases the traffic from related and promoted videos. Sec. 3.2.4 provides details on this analysis.
3. Insight into the causal relationship between the subscribers and view count for YouTube channels is also explored. For popular YouTube channels, we found that the channel view count affects the subscriber count; see Sec. 3.3.1.
4. New insights into the scheduling dynamics in YouTube gaming channels are also found. For channels with a dominant periodic uploading schedule, going “off the schedule” increases the popularity of the channel; see Sec. 3.3.2.
5. The generalized Gompertz model can be used to distinguish views due to virality (views from subscribers), migration (views from non-subscribers) and exogenous events; see Sec. 3.3.3.
6. New insights into playlist dynamics.
The early view count dynamics of a YouTube video are highly correlated with the long term “migration” of viewers to the video. Also, early videos in a game playthrough playlist typically have higher view counts than later videos in the same playlist; see Sec. 3.3.4.
7. The number of subscribers of a channel only affects the early view count dynamics of videos in a playthrough; see Sec. 3.3.4.

1.2.3 Interactive advertisement scheduling in personalized live social media
Chapter 4 considers the problem of optimal scheduling of ads in live social media.
The main contributions of Chapter 4 are summarized below:
1. We provide a POMDP framework for the optimal ad-scheduling problem on live personalized channels and show that it is an instance of the optimal multiple stopping problem.
2. Structural results are derived for the optimal multiple stopping policy. It is shown in Sec. 4.2 that the optimal multiple stopping policy is a threshold policy on the space of Bayesian posteriors. In addition, it is shown that the optimal multiple stopping policy satisfies a nesting property.
3. A stochastic gradient algorithm is presented to compute a linear threshold approximation to the threshold policy. Numerical simulations show that the linear threshold policies have a performance close to that of a brute-force POMDP solver. However, the linear threshold policies are computationally cheaper to estimate and implement.
4. Numerical results on a real Periscope dataset show a significant improvement in the expected revenue when the multiple stopping framework is used to schedule the advertisements. The revenue can be improved by 25% in comparison to currently employed periodic scheduling and by 10% against heuristic scheduling using the single stopping approach.

1.3 Related Works
This section is devoted to the literature review of topics and advances in the fields related to this thesis.
1.3.1 Utility change detection in online social media
Chapter 2 of this thesis considers the problem of utility change detection in online social media under the revealed preference framework. Revealed preference deals with the problem of nonparametric detection of utility maximizing behaviour. Major contributions to the area of revealed preferences are due to Samuelson [14], Afriat [15], Varian [16], and Diewert [17] in the microeconomics literature. Afriat [15] devised a nonparametric test (called Afriat’s theorem), which provides necessary and sufficient conditions to detect utility maximizing behaviour for a dataset. For an agent satisfying utility maximization, Afriat’s theorem [15] provides a method to reconstruct a utility function consistent with the data. The utility function so obtained can be used to predict the future response of the agent. Varian [18] provides a comprehensive survey of the revealed preference literature.

Despite being originally developed in economics, there has been some recent work on the application of revealed preference to social networks and signal processing. In the signal processing literature, the revealed preference framework was used for detection of malicious nodes in a social network in [19, 20] and for demand estimation in smart grids in [21]. [22] analyzes social behaviour and friendship formation using revealed preference among high school friends. In online social networks, [23] uses revealed preference to obtain information about products from bidding behaviour in eBay or similar bidding networks.

1.3.2 Engagement dynamics and sensitivity analysis of YouTube videos
Chapter 3 investigates how the meta-level features and the interaction of the YouTube channel with the users affect the popularity of videos. The study of the popularity of YouTube videos based on meta-level features is a challenging problem given the diversity of users and content providers.
Several models for characterizing the popularity of YouTube videos are parametric in form, where the view count time series is used to estimate the model parameters. For example, ARMA time series models [24], multivariate linear regression models [25], and modified Gompertz models [26, 27] have been utilized to estimate the future video view counts given past view count time series. Using only the title of the video (one of the meta-level features), [28] considers the problem of predicting whether the view count will be high or low. In a related context, [29, 30] studied the importance of tags for Flickr data. Aside from text-based meta-level features (title and tags), in [31] Support Vector Regression (SVR) is proposed to predict the popularity using features of the video frames (e.g. face present, rigidity, color, clutter). It is illustrated in [31] that using the combination of visual features and temporal dynamics results in improved performance of the SVR for predicting view count compared to using only visual features or temporal dynamics alone. In the social context, the uploading behaviour of YouTube content creators was studied in [32]. Specifically, the paper finds that YouTube users within a social network are more popular compared to other users.

1.3.3 Interactive advertisement scheduling in live personalized social media
Chapter 4 considers the problem of optimal scheduling of ads in personalized live social media. The problem of optimal scheduling of ads has been well studied in the context of advertising in television; see [33], [34], [35] and the references therein. However, scheduling ads on live online social media is different from scheduling ads on television in two significant ways [36]: (i) real-time measurement of viewer engagement; (ii) revenue is based on ads rather than a negotiated contract.
Prior literature on scheduling ads on social media is limited to ad scheduling in real time for social network games, where the ads are served either to video game consoles in real time over the Internet [37], or in digital games that are played via major social networks [38].

In Chapter 4 we formulate the problem of optimal scheduling of ads as an optimal multiple stopping problem. The problem of optimal multiple stopping has been well studied in the literature; see [39], [40], [41], [42] and the references therein. The optimal multiple stopping problem generalizes the classical (single) stopping problem, where the objective is to stop once to obtain maximum reward. Nakai [39] considers optimal L-stopping over a finite horizon of length N in a partially observed Markov chain. More recently, [42] considers L-stopping over a random horizon.

However, due to the spontaneous² nature of personalized live channels, the duration (or the horizon) is not known a priori. Therefore, we extend the results in Nakai [39] to the infinite horizon case. The extension is both important and non-trivial.

² https://medium.com/@mchang/periscope-and-spontaneous-attention-seeking-43831eefac16

The optimal multiple stopping problem can be contrasted with the recent work on sampling with “causality constraints”. In sampling with causality constraints, not all the observations are observable. [43] considers the case where an agent is limited to a finite number of observations (sampling constraints) and must adaptively decide the observation strategy so as to perform quickest detection on a data stream. The extension to the case where the sampling constraints are replenished randomly is considered in [44]. In the multiple stopping problem considered in this thesis, there is no constraint on the observations and the objective is to stop L times at states that correspond to maximum reward.

The optimal multiple stopping problem considered in Chap. 4 is similar to sequential hypothesis testing [45, 46], the sequential scheduling problem with uncertainty [47] and the optimal search problem considered in the literature. [48] and [49] consider the problem of finding the optimal launch times for a firm under strategic consumers and competition from other firms to maximize profit. [50], [51] consider an optimal search problem where the searcher receives imperfect information on a (static) target location and decides optimally to search or interdict by solving a classical optimal stopping problem (L = 1). However, the multiple stopping problem considered in this thesis is equivalent to a search problem where the underlying process is evolving (Markovian) and the searcher needs to optimally stop L > 1 times to achieve a specific objective.

1.4 Thesis outline
In this section, we present the organization of the thesis. The rest of the thesis is divided into four chapters as outlined below:
• Chapter 2 considers non-parametric detection of dynamic utility maximizing behaviour. Necessary and sufficient conditions for detecting a linear perturbation in the utility function are derived. In addition, when the dataset is measured in noise, Chapter 2 proposes a procedure to estimate the change point and construct a decision test. Chapter 2 also deals with reducing the computational complexity of detecting utility maximizing behaviour in high dimensional data. Finally, the results are illustrated on real datasets.
• Motivated by increasing the popularity of YouTube videos and thereby increasing advertising revenue, Chapter 3 studies the sensitivity of meta-level features of YouTube videos. It is found that the popularity is dependent on five dominant meta-level features. In addition, changing the meta-level features improves the popularity of a YouTube video. We characterize how changing the various meta-level features affects the various major traffic sources.
The popularity of the video also depends on how content creators interact with YouTube users. Chapter 3 shows novel insights into the causality between the view count and subscriber count, the scheduling dynamics in gaming channels and playthrough dynamics.
• Chapter 4 considers the problem of scheduling advertisements in personalized live social media. The interest of the live content is modeled as a Markov chain. The viewer engagement is not observed directly, but noisy observations can be obtained from the comments and likes of the viewers. Hence, we formulate the ad scheduling problem as a POMDP. We derive structural results on the optimal multiple stopping policy. Using the structural results, we compute optimal linear threshold policies using a stochastic gradient algorithm. It is shown using real datasets that the linear threshold policies so obtained outperform current periodic scheduling by 25%.
• Chapter 5 outlines a summary of findings and provides a direction for future research and development in the fields related to this thesis.

Chapter 2
Utility Change Detection in Online Social Media

2.1 Introduction
Can an observer detect if a dataset (time series) is generated by optimizing a utility function? More generally, given a dataset, can an observer detect if there is a sudden change in the utility function? Such data-driven detection of utility maximization is studied in microeconomics under the framework of revealed preference³.

The answer to the first question is addressed by Afriat’s Theorem in the revealed preference literature. In Section 2.2, we provide a brief background on the revealed preference framework and Afriat’s Theorem. The novel contribution of this chapter is to address the second question.
In Section 2.3, we extend the classical revealed preference framework of Afriat to agents with a “dynamic utility function”: the utility function jump changes at an unknown time instant by a linear perturbation. Given the dataset of probes and responses of an agent, the objective is to develop a nonparametric test for the change point detection problem, i.e. to detect the change point (jump time) and the utility functions before and after the change.

Application: Such change point detection problems arise in online search in social media. Online search is currently the most popular method for information retrieval [2] and can be viewed as an agent maximizing the information utility, i.e. the amount of information consumed by an online agent given the limited resources of time and attention. There has been a gamut of research which links internet search behaviour to ground truths such as symptoms of illness, political elections, or major sporting events [3–8]. Hence, a change in the utility in online search corresponds to a change in ground truth or to exogenous events affecting the utility of the agent, such as the onset of a disease or the announcement of a major political decision. Detection of utility change in online search, therefore, is helpful to identify changes in ground truth and useful, for example, for early containment of diseases [4] or predicting changes in political opinion [9, 10].

³ In signal processing terminology, such problems can be viewed as set-valued system identification of an argmax system.

A related important practical issue that we also consider in this chapter is the application of the revealed preference framework to high dimensional data (“big-data”).
As an example of high dimensional data arising in online social media, we investigate the detection of the utility maximization process inherent in video sharing via YouTube. Detecting utility maximization behaviour with such high dimensional data is computationally demanding. In this chapter, we use dimensionality reduction through the Johnson-Lindenstrauss lemma to overcome the computational cost associated with high dimensional data.

Remark: The problem we consider is fundamentally different from the theme used widely in the signal processing literature, where one postulates an objective function (typically convex) and then develops optimization algorithms. In contrast, the revealed preference framework considered in this chapter is data centric: given a dataset, we wish to determine if it is consistent with utility maximization, and then detect changes in the utility function based on the observed behaviour.

This chapter is organized as follows: Sec. 2.2 provides a brief background on the revealed preference framework. Sec. 2.3 derives necessary and sufficient conditions for change point detection for dynamic utility maximizing agents under the revealed preference framework. In Sec. 2.4, we study the change point detection problem in the presence of noise. Section 2.5 addresses the problem of high dimensional data arising in the context of revealed preference. Section 2.6 presents numerical results. First, we compare the proposed approach with the popular CUSUM test and the corresponding ROC curves are presented. Second, we illustrate the results developed on two real world datasets: the Yahoo! Tech Buzz dataset and the Youstatanalyzer dataset.

2.2 Background: Utility maximization and Revealed preference
Utility Maximization: A utility-maximization behaviour (or utility maximizer) is defined as follows:

Definition 2.2.1. An agent is a utility maximizer if, at each time $t$, for input probe $p_t$, the output response $x_t$ satisfies
$$x_t = x(p_t) \in \arg\max_{\{p_t' x \le I_t\}} u(x). \tag{2.1}$$
Here, $u(x)$ denotes a locally non-satiated⁴ utility function⁵. Also, $I_t \in \mathbb{R}_+$ is the budget of the agent. The linear constraint $p_t' x \le I_t$ imposes a budget constraint on the agent, where $p_t' x$ denotes the inner product between $p_t$ and $x$.

Given a dataset, $D$, consisting of the probe, $p_t \in \mathbb{R}^m_+$, and response, $x_t \in \mathbb{R}^m_+$, of an agent for $T$ time instants:
$$D = \{(p_t, x_t),\ t = 1, 2, \dots, T\}. \tag{2.2}$$
Revealed preference aims to answer the following question: Is the dataset $D$ in (2.2) consistent with utility-maximization behaviour of an agent? Afriat’s Theorem answers the above question.

Theorem 2.2.1 (Afriat’s Theorem [15]). Given the dataset $D$ in (2.2), the following statements are equivalent:
1. The agent is a utility maximizer and there exists a monotonically increasing⁶ and concave⁷ utility function that satisfies (2.1).
2. For $u_t$ and $\lambda_t > 0$ the following set of inequalities has a feasible solution:
$$u_s - u_t - \lambda_t p_t'(x_s - x_t) \le 0 \quad \forall t, s \in \{1, 2, \dots, T\}. \tag{2.3}$$
3. A monotone and concave utility function that satisfies (2.1) is given by:
$$u(x) = \min_{t \in \{1, 2, \dots, T\}} \{u_t + \lambda_t p_t'(x - x_t)\}. \tag{2.4}$$
4. The dataset $D$ satisfies the Generalized Axiom of Revealed Preference (GARP), namely for any $k \le T$, $p_t' x_t \ge p_t' x_{t+1}\ \forall t \le k - 1 \implies p_k' x_k \le p_k' x_1$.

The remarkable property of Afriat’s Theorem is that it gives necessary and sufficient conditions for the dataset to satisfy utility maximization (2.1). The feasibility of the set of inequalities can be checked using a linear programming solver or by using Warshall’s algorithm with $O(T^3)$ computations [16, 18]. A utility function consistent with the data can be constructed using (2.4). The recovered utility is not unique since any positive monotone transformation of (2.4) also satisfies Afriat’s Theorem.

⁴ Local non-satiation means that for any point, $x$, there exists another point, $y$, within an $\varepsilon$ distance (i.e. $\|x - y\| \le \varepsilon$), such that the point $y$ provides a higher utility than $x$ (i.e. $u(x) < u(y)$). Local non-satiation models the human preference: more is preferred to less.
⁵ The utility function is a function that captures the preference of the agent. For example, if $x$ is preferred to $y$, $u(x) \ge u(y)$.
⁶ In this thesis, we use monotone and local non-satiation interchangeably. Afriat’s theorem was originally stated for a non-satiated utility function.
⁷ Concavity of the utility function models the human preference: averages are better than the extremes. It is also related to the law of diminishing marginal utility, i.e. the rate of utility decreases with $x$.

Figure 2.1: Upper bound on the CDF of $M$ in (2.10), which constitutes a lower bound on the false alarm probability. The analytical expression in (2.11) is compared with a Monte Carlo evaluation of $M$. [Plot: x-axis $\Phi^*(y)$, y-axis noise CDF; curves “M.C. Sim. of Noise Dist.” and “Analytical Exp.”.]

Remark: In signal processing terminology, Afriat’s Theorem can be viewed as a set-valued system identification method for an argmax nonlinear system with a constraint on the inner product of the input and output of a system. Afriat’s theorem has several interesting consequences, including the fact that if a dataset is consistent with utility maximization, then it is rationalizable by a concave, monotone and continuous utility function. Hence, the preference of the agent represented by a concave utility function can never be refuted based on a finite dataset; see [18]. Further, one can impose monotone and concave restrictions on the utility function with no loss of generality.

2.2.1 Revealed preference in a noisy setting
Afriat’s theorem (Theorem 2.2.1) assumes perfect observation of the probe and response. However, when the responses of the agents are measured in noise, violation of the inequalities in Afriat’s Theorem could be either due to measurement noise or absence of utility maximization.
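
The noise-free check of Theorem 2.2.1 is the baseline that the noisy test relaxes. A minimal sketch of that check via Warshall's algorithm is given below. It uses the standard transitive-closure formulation of GARP (an assumption relative to the chain form of statement 4), and the function names and the tolerance `tol` are illustrative rather than taken from the thesis.

```python
def dot(u, v):
    """Inner product p'x of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_garp(probes, responses, tol=1e-12):
    """Return True if the dataset {(p_t, x_t)} passes GARP.

    probes[t] and responses[t] are the probe p_t and response x_t at time t.
    """
    T = len(probes)
    # cost[t][s] = p_t' x_s: cost of response x_s under probe p_t.
    cost = [[dot(probes[t], responses[s]) for s in range(T)] for t in range(T)]
    # Direct revealed preference: x_t R0 x_s if x_s was affordable at time t.
    pref = [[cost[t][t] >= cost[t][s] - tol for s in range(T)] for t in range(T)]
    # Warshall's algorithm: transitive closure of R0 in O(T^3) operations.
    for k in range(T):
        for t in range(T):
            if pref[t][k]:
                for s in range(T):
                    if pref[k][s]:
                        pref[t][s] = True
    # GARP violation: x_t is revealed preferred to x_s, yet x_t was strictly
    # cheaper than x_s under probe p_s.
    for t in range(T):
        for s in range(T):
            if pref[t][s] and cost[s][s] > cost[s][t] + tol:
                return False
    return True
```

For instance, responses generated by a Cobb-Douglas maximizer with unit budget, such as probes (1, 2) and (2, 1) with responses (0.5, 0.25) and (0.25, 0.5), pass the check. The pair of probes (2, 1) and (1, 2) with responses (2, 0) and (0, 2) fails it.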
In this section, we construct a decision test to detect utility maximization in the presence of noise.

We assume the additive noise model for measurement errors given by:
$$y_t = x_t + w_t, \tag{2.5}$$
where $y_t$ is the noisy measurement of the response $x_t$ and $w_t \in \mathbb{R}^m$ is independent and identically distributed (i.i.d.) noise. Given the noisy dataset
$$D_{\mathrm{obs}} = \{(p_t, y_t) : t \in \{1, \dots, T\}\}, \tag{2.6}$$
[19] proposes the following statistical test for testing utility maximization (2.1) in a dataset subject to measurement errors. Let $H_0$ denote the null hypothesis that the dataset $D_{\mathrm{obs}}$ in (2.6) satisfies utility maximization. Similarly, let $H_1$ denote the alternative hypothesis that the dataset does not satisfy utility maximization. There are two possible sources of error:
Type-I errors: Reject $H_0$ when $H_0$ is valid.
Type-II errors: Accept $H_0$ when $H_0$ is invalid. (2.7)
The following statistical test can be used to detect if an agent is seeking to maximize a utility function:
$$\int_{\Phi^*(y)}^{+\infty} f_M(\psi)\, d\psi \ \underset{H_1}{\overset{H_0}{\gtrless}}\ \gamma. \tag{2.8}$$
In the statistical test (2.8):
(i) $\gamma$ is the “significance level” of the test.
(ii) The “test statistic” $\Phi^*(y)$, with $y = [y_1, y_2, \dots, y_T]$, is the solution of the following constrained optimization problem:
$$\begin{aligned} \min\ & \Phi \\ \text{s.t.}\ & u_s - u_t - \lambda_t p_t'(y_s - y_t) - \lambda_t \Phi \le 0, \\ & \lambda_t > 0,\ \Phi \ge 0 \quad \text{for } t, s \in \{1, 2, \dots, T\}. \end{aligned} \tag{2.9}$$
(iii) $f_M$ is the pdf of the random variable $M$, where
$$M \triangleq \max_{t, s;\ t \ne s} \left[ p_t'(w_t - w_s) \right]. \tag{2.10}$$
The probability of false alarm or Type-I error, i.e. the probability of rejecting $H_0$ when it is true, is given by $P\{M \ge \Phi^*(y)\}$.

Below, we derive an analytical expression for a lower bound on the false alarm probability of the statistical test in (2.8). The motivation stems from the following fact: Given the significance level of the statistical test in (2.8), a Monte Carlo simulation is required to compute the threshold. However, from an analytical expression for the lower bound on false alarm probability, we can obtain an upper bound of the test statistic, denoted by $\Phi^*(y)$. Hence, given a dataset $D_{\mathrm{obs}}$ in (2.6), if the solution to the optimization problem (2.9) is such that $\Phi > \Phi^*(y)$, then the conclusion is that the dataset does not satisfy utility maximization, for the desired false alarm probability.

Theorem 2.2.2 provides a lower bound on the false alarm probability and the proof is provided in Sec. 2.8.3.

Theorem 2.2.2. If $\{w_1, w_2, \dots, w_T\}$ in (2.5) are i.i.d. zero mean unit variance Gaussian vectors, then the probability of false alarm in (2.7) is lower bounded by
$$1 - \prod_t \left\{ 1 - \frac{\sqrt{\tfrac{2}{\pi}}\, \sqrt{2\|p_t\|^2}\, \exp\!\left(-\Phi^*(y)^2 / 4\|p_t\|^2\right)}{\Phi^*(y) + \sqrt{\Phi^*(y)^2 + 8\|p_t\|^2}} \right\}. \tag{2.11}$$

The key idea in the proof of Theorem 2.2.2 is to bound $M$ in (2.10) by the highest order statistic of a carefully chosen set of random variables which are negatively dependent; see Sec. 2.8.2.

Figure 2.1 compares the upper bound of the cdf (and correspondingly the lower bound on the false alarm probability (2.11)) and the Monte Carlo simulation of the actual distribution of $M$. As can be seen from Fig. 2.1, the upper bound of the cdf (lower bound on false alarm probability) is tight in all regimes. The upper bound of the test statistic, $\Phi^*(y)$, can be obtained by setting the analytical expression in (2.11) to be equal to the desired false alarm probability.

2.3 Revealed preference: Utility change point detection (deterministic case)
In this section, we consider agents with a dynamic utility function.

2.3.1 Utility change model
Consider an agent that maximizes a utility function that jump changes by a linear perturbation at a time that is unknown to the observer. The aim is to estimate the jump time (change point) and the utility before and after the change point.

The agent selects a response $x$ at time $t$ to maximize the utility function given by:
$$u(x, \alpha; t) = v(x) + \alpha' x\, \mathbb{1}\{t \ge \tau\}, \tag{2.12}$$
subject to the following budget constraint $p_t' x \le I_t$. Here, $\mathbb{1}\{\cdot\}$ denotes the indicator function and $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_m)'$ denotes the $m$-dimensional linear perturbation vector. The utility function $u(x, \alpha; t)$ in (2.12) consists of two components: a base utility function, $v(x)$, and a linear perturbation, $\alpha' x$, which occurs at an unknown time $\tau$. The base utility function $v(x)$ is assumed to be monotone and concave. We will restrict the components of the vector $\alpha$ to be (strictly) greater than 0, so that the utility function, $u$, conditioned on $\alpha$ is monotone and concave. The objective is to derive necessary and sufficient conditions to detect the time $\tau$ at which the linear perturbation is introduced to the base utility function.

Motivation: The utility change model in (2.12) is motivated by several reasons. First, the linear perturbation assumption provides sufficient selectivity such that the non-parametric test is not trivially satisfied by all datasets but still provides enough degrees of freedom. Second, the linear perturbation can be interpreted as the change in the marginal rate of utility relative to a “base” utility function. In online social media, the linear perturbation coefficients measure the impact of marketing or the severity of the change in ground truth on the utility of the agent. This is similar to the linear perturbation models used to model taste changes [52–55] in microeconomics. Finally, in social networks, a linear change in the utility is often used to model the change in utility of an agent based on the interaction with the agent’s neighbours [56]. Compared to the taste change model, our model is unique in that we allow the linear perturbation to be introduced at an unknown time.

Theorem 2.3.1 provides necessary and sufficient conditions to detect the change in utility function according to the model in (2.12) and the proof is in Sec. 2.8.1.

Theorem 2.3.1.
The dataset $D$ in (2.2) is consistent with a utility function satisfying the model in (2.12) if there exist scalars $\{v_t\}_{t=1,\dots,T}$, $\{\lambda_t > 0\}_{t=1,\dots,T}$, $\{\alpha_k\}_{k=1,\dots,m}$ that form a feasible solution to the following inequalities:
$$v_t + \lambda_t p_t'(x_s - x_t) \ge v_s \quad (t < \tau) \tag{2.13}$$
$$v_t + \lambda_t p_t'(x_s - x_t) - \alpha'(x_s - x_t) \ge v_s \quad (t \ge \tau) \tag{2.14}$$
$$\alpha_i \le \lambda_t p_t^i \quad (\forall i,\ t \ge \tau), \tag{2.15}$$
where $p_t^i$ is the $i$th component of the probe $p_t$.

The inequalities in (2.13) to (2.15) resemble the Afriat inequalities (2.3). The time instant $\tau$ at which the inequalities are satisfied is the time at which the linear perturbation is introduced.

2.3.2 Recovery of minimum perturbation and base utility function
Computing the linear perturbation coefficients in (2.12) gives an indication of the severity of the change in ground truth or the effect of marketing and advertising in social media. The solution to the following convex optimization problem provides the minimum value of the perturbation coefficients:
$$\min\ \|\alpha\|_2^2 \tag{2.16}$$
$$\text{s.t.}\quad v_t + \lambda_t p_t'(x_s - x_t) \ge v_s \quad (t < \tau) \tag{2.17}$$
$$v_t + \lambda_t p_t'(x_s - x_t) - \alpha'(x_s - x_t) \ge v_s \quad (t \ge \tau) \tag{2.18}$$
$$\alpha_i \le \lambda_t p_t^i \quad (\forall i,\ t \ge \tau) \tag{2.19}$$
$$\lambda_t > 0, \qquad v_1 = \beta,\ \lambda_1 = \delta, \tag{2.20}$$
where $\beta$ and $\delta$ are arbitrary constants.

The inequalities (2.17) to (2.19) correspond to the revealed preference inequalities (2.13) to (2.15). The normalization conditions (2.20) are required because of the ordinality⁸ of the utility function. The ordinality of the utility function implies that given any set of feasible values $\{\bar{v}_t\}$, $\{\bar{\lambda}_t\}$ and $\{\bar{\alpha}_k\}$ satisfying, for example, (2.18), the following set of inequalities (a scaled and translated version of (2.18)) also holds:
$$\beta(\bar{v}_s + \delta) - \beta(\bar{v}_t + \delta) - \beta \bar{\lambda}_t p_t'(x_s - x_t) + \beta \bar{\alpha}'(x_s - x_t) \le 0.$$
This can be avoided by the normalization conditions in (2.20).

⁸ Clearly, for a given probe vector, any positive monotone transformation of $u(x)$ in (2.1) gives the same response vector.

Recall that the base utility function $v(x)$ is the utility function before the linear change.

Corollary 2.3.1.
The recovered base utility function is
$$\hat{v}(x) = \min_t \{v_t + \lambda_t \tilde{p}_t'(x - x_t)\}, \tag{2.21}$$
where
$$\tilde{p}_t^i = \begin{cases} p_t^i & t < \tau, \\ p_t^i - \alpha_i / \lambda_t & t \ge \tau. \end{cases} \tag{2.22}$$
In (2.21) and (2.22), $\{v_t\}$, $\{\lambda_t\}$, $\{\alpha_k\}$ are the solution of (2.16) to (2.20).

2.3.3 Comparison with classical change detection
Table 2.1 compares the revealed preference framework (this work) with classical change detection algorithms. The key difference is that revealed preference considers a system which maximizes an unknown utility function subject to linear constraints (budget constraint). In comparison, a classical CUSUM type change detection algorithm requires knowledge of a parametrized utility function (see Sec. 2.6.1 for a numerical example when $v(x)$ is a Cobb-Douglas⁹ utility function).

Method | Data model | Change model | Reference
CUSUM | $x_t \overset{\text{i.i.d.}}{\sim} p_\theta$ | $\theta = \theta_0$ for $t < \tau$; $\theta = \theta_1$ for $t \ge \tau$ | [57]
Semi-supervised / Supervised Learning | $D = \{((p_1, I_1), U(p_1, I_1)), \dots, ((p_T, I_T), U(p_T, I_T))\}$, $(p_i, I_i) \sim P$, $U(p_i, x_i)$: optimal response for utility $U$ | Not Applicable | [58–60], [61]
Revealed Preference | $x_t = \arg\max_{p_t' x_t \le I_t} u(x_t)$ | $u(x) = v(x)$ for $t < \tau$; $u(x) = v(x) + \alpha' x$ for $t \ge \tau$ | This work
Table 2.1: Comparison of revealed preference with classical change detection algorithms; see the discussion in Sec. 2.3.3.

The revealed preference problem is related to supervised learning when the parametric class of functions for empirical risk minimization (ERM) is limited to concave and monotone functions [58–60]. The change detection problem can be thought of as a multi-class learning problem with the first class being the utility function before the change and the second class being the utility function after the change [58]. However, this chapter provides an algorithmic approach to detect change points by deriving necessary and sufficient conditions.

⁹ The Cobb-Douglas function is a widely used utility function in economics. When $m = 2$, i.e. the dimension of the probe and response is 2, the utility function can be expressed as $u(x) = x_1^a x_2^b$. The utility function is parameterized by $a$ and $b$.

2.4 Revealed preference: Utility change point detection in noise
Sec. 2.3 dealt with utility change point detection in the deterministic case. In this section, we consider a dynamic utility maximizing agent (the utility function undergoes a sudden linear perturbation at an unknown time) whose response is measured in noise according to (2.5). The organization below is as follows. Sec. 2.4.1 proposes a procedure to detect the unknown change point in the presence of noise. In Sec. 2.4.2, we formulate a hypothesis test to check whether the dataset is rationalizable by a utility function satisfying the model in (2.12), given the estimated change point from Sec. 2.4.1. Once the unknown change point is estimated and the dataset satisfies the hypothesis test in Sec. 2.4.2, we need to recover the base utility function and the linear perturbation. We will recover the linear perturbation coefficients corresponding to minimum false alarm probability.

2.4.1 Estimation of unknown change point
In the presence of noise, the inequalities in (2.13) to (2.15) may not be satisfied for any value of $\tau$. Hence, we consider the following linear programming problem to find the minimum error or “adjustment” such that the inequalities in (2.13) to (2.15) are satisfied:
$$\begin{aligned} \Phi_\tau = \min\ & \Phi \\ \text{s.t.}\ & v_s - v_t - \lambda_t p_t'(y_s - y_t) - \Phi \le 0 \quad (t < \tau) \\ & v_s - v_t - \lambda_t p_t'(y_s - y_t) + \alpha'(y_s - y_t) - \Phi \le 0 \quad (t \ge \tau) \\ & \alpha_i - \lambda_t p_t^i \le 0 \quad (\forall i,\ t \ge \tau) \\ & \Phi \ge 0,\ \lambda_t > 0. \end{aligned} \tag{2.23}$$
The solution of the linear program (2.23) depends on the choice of the change point variable $\tau$. When the data is measured without noise, the inequalities are satisfied with zero error at the correct change point.
The estimated change point, $\hat{\tau}$, corresponds to the time point with minimum adjustment:
\[ \hat{\tau} = \arg\min_{1 \leq \tau \leq T} \Phi_\tau. \tag{2.24} \]
The intuition for (2.24) is that if $\tau$ is the true change point, then the perturbation $\Phi$ needs to compensate only for the noise.

2.4.2 Recovering the linear perturbation coefficients for minimum false alarm probability

As in (2.7), define the null hypothesis $H_0$ that the dataset satisfies utility maximization under the model in (2.12), and the alternative hypothesis $H_1$ that the dataset does not satisfy utility maximization under the model in (2.12). Type-I and Type-II errors are defined as in Sec. 2.2.1.

Consider the following statistical test:
\[ \int_{\Phi^*(y)}^{+\infty} f_M(\psi)\, d\psi \ \underset{H_1}{\overset{H_0}{\gtrless}}\ \gamma. \tag{2.25} \]
In the statistical test (2.25):

1. $\gamma$ is the significance level of the test.

2. The test statistic $\Phi^*(y)$ is the solution of the following constrained optimization problem with $\tau = \hat{\tau}$ (from Sec. 2.4.1):
\[
\begin{aligned}
\min\ & \Phi \\
\text{s.t. } & v_s - v_t - \lambda_t p_t'(y_s - y_t) - \lambda_t \Phi \leq 0 && (t < \tau) \\
& v_s - v_t - \lambda_t p_t'(y_s - y_t) + \alpha'(y_s - y_t) - \lambda_t \Phi \leq 0 && (t \geq \tau) \\
& \alpha^i - \lambda_t p_t^i \leq 0 && (\forall i,\ t \geq \tau) \\
& \Phi \geq 0,\ \lambda_t > 0
\end{aligned}
\tag{2.26}
\]
The optimization problem (2.26) is similar to the optimization problem (2.23). The inequalities in (2.26) enable us to compute a bound on the test statistic in (2.30).

3. $f_M$ is the pdf of the random variable $M$, where
\[ M \triangleq M_1 + M_2, \tag{2.27} \]
and $M_1$ and $M_2$ are defined as
\[ M_1 \triangleq \max_{s,t;\ s \neq t} \left[ p_t'(w_t - w_s) \right], \tag{2.28} \]
\[ M_2 \triangleq \max_{s,t;\ s \neq t,\ t \geq \tau} \left[ \alpha'(w_t - w_s)/\lambda_t \right], \tag{2.29} \]
where $\alpha$ and $\lambda$ are the solution of (2.26). The set of inequalities in (2.26) can be
re-written using (2.5) as:
\[
\begin{aligned}
(v_s - v_t)/\lambda_t - p_t'(x_s - x_t) &\leq p_t'(w_t - w_s) && (t < \tau), \\
(v_s - v_t)/\lambda_t - p_t'(x_s - x_t) + \alpha'(x_s - x_t)/\lambda_t &\leq p_t'(w_t - w_s) - \alpha'(w_t - w_s)/\lambda_t && (t \geq \tau).
\end{aligned}
\tag{2.30}
\]
Therefore, for a dataset satisfying utility maximization under the model in (2.12), it should be the case that the test statistic satisfies $\Phi^*(y) \leq M$.

The pdf of the random variable $M$ can be estimated via Monte Carlo simulation. Below, we bound the probability of false alarm and recover the linear perturbation coefficients corresponding to minimum false alarm probability.

Theorem 2.4.1. Assume $\{w_1, w_2, \ldots, w_T\}$ in (2.5) are i.i.d. zero mean unit variance Gaussian vectors. Given $\varepsilon > 0$, let the change point $\tau$ and the length of the dataset $T$ be $\tau = O(\log 1/\varepsilon)$ and $T - \tau = O(\log 1/\varepsilon)$. Then, the optimization criterion to recover the linear perturbation coefficients with minimum probability of Type-I error is to minimize the Euclidean norm (i.e., $\|\alpha\|_2$) subject to the constraints in (2.26).

The proof is in Sec. 2.8.4. Theorem 2.4.1 provides the motivation for minimizing the Euclidean norm of the linear perturbation coefficients in the optimization problem (2.16) to (2.20). The recovery of the base utility function is similar to that in Sec. 2.3.2.

2.5 Dimensionality reduction: Revealed preference for big data

Classical revealed preference deals with the case $m < T$ (recall $m$ is the dimension of the probe vector and $T$ is the number of observations). Below, we consider the "big data" domain: $m \gg T$. Checking whether a dataset, $D$, satisfies utility maximization (2.1) can be done by verifying whether GARP (statement 4 of Theorem 2.2.1) is satisfied. For $m \gg T$, the computational cost for checking GARP is dominated by the number of computations required to evaluate the inner products in GARP, given by $mT^2$.
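To make the inner-product cost concrete, the following is a minimal Python sketch of a GARP check. It assumes the standard textbook form of GARP (if $x_t$ is revealed preferred to $x_s$ through a chain of observations, then $p_s'x_s \leq p_s'x_t$), which may be presented differently in statement 4 of Theorem 2.2.1. The function name `garp_satisfied` is illustrative and not taken from the thesis.

```python
def garp_satisfied(probes, responses):
    """Check GARP on T observations (p_t, x_t), each a sequence of length m.

    Assumption: standard GARP, i.e. no observation s is strictly directly
    preferred to an observation t that is (transitively) revealed preferred to s.
    """
    T = len(probes)
    # cost[t][s] = p_t' x_s. These are the T^2 inner products, each of cost m,
    # that dominate the computation when m is much larger than T.
    cost = [[sum(p * x for p, x in zip(probes[t], responses[s]))
             for s in range(T)] for t in range(T)]
    # Direct revealed preference: x_t R0 x_s if x_s was affordable at probe p_t.
    pref = [[cost[t][t] >= cost[t][s] for s in range(T)] for t in range(T)]
    # Transitive closure of R0 (Warshall's algorithm).
    for k in range(T):
        for t in range(T):
            if pref[t][k]:
                for s in range(T):
                    if pref[k][s]:
                        pref[t][s] = True
    # Violation: x_t R x_s, yet x_t was strictly cheaper than x_s at probe p_s.
    for t in range(T):
        for s in range(T):
            if pref[t][s] and cost[s][s] > cost[s][t]:
                return False
    return True
```

Only the construction of `cost` touches the $m$-dimensional vectors, which is why reducing the dimension of the probes and responses reduces the overall cost of the check.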
The computational cost for evaluating the inner product can be reduced by embedding the $m$-dimensional probe and observation vectors into a lower dimensional subspace, of dimension $k$, and checking GARP on the lower dimensional subspace. We use the Johnson-Lindenstrauss Lemma (JL Lemma) to achieve this$^{10}$.

Lemma 2.5.1 (Johnson-Lindenstrauss (JL) [62]). Suppose $x_1, x_2, \ldots, x_n \in \mathbb{R}^m$ are $n$ arbitrary vectors. Then, for any $\varepsilon \in (0, \tfrac{1}{2})$, there exists a mapping $f : \mathbb{R}^m \to \mathbb{R}^k$, $k = O(\log n/\varepsilon^2)$, such that the following conditions are satisfied:
\[ (1 - \varepsilon)\|x_i\|^2 \leq \|f(x_i)\|^2 \leq (1 + \varepsilon)\|x_i\|^2 \quad \forall i, \tag{2.31} \]
\[ (1 - \varepsilon)\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1 + \varepsilon)\|x_i - x_j\|^2 \quad \forall i, j. \tag{2.32} \]

To implement JL efficiently, one possible method, due to [63], is summarized in Theorem 2.5.1. This method utilizes a linear map for $f$ and hence can be represented by a projection matrix $R$. The key idea in [63] is to construct the projection matrix $R$ with elements $+1$ or $-1$ so that computing the projection involves no multiplications (only additions).

Theorem 2.5.1 ([63]). Let $A = [x_1, x_2, \ldots, x_n]'$ denote the $n \times m$ matrix containing the $n$ vectors each of dimension $m$. Given $\varepsilon, \beta > 0$, let $R$ be an $m \times k$ random binary matrix, with independent and equiprobable elements $+1$ and $-1$, where
\[ k > \frac{4 + 2\beta}{\varepsilon^2/2 - \varepsilon^3/3} \log n. \tag{2.33} \]
Then, $B = \frac{1}{\sqrt{k}} A R$, with dimension $n \times k$, contains the $n$ projected vectors of dimension $k$, and with probability at least $1 - \delta$, where $\delta = 1/n^{\beta}$, the inequalities (2.31) and (2.32) hold.

The linear map $f : \mathbb{R}^m \to \mathbb{R}^k$ in Theorem 2.5.1 maps the $i$th row of $A$ to the $i$th row of $B$. The inequalities in (2.31) and (2.32) hold in a probabilistic sense, i.e. with probability at least $1 - \delta$, the following inequality holds for all $i, j$:
\[ (1 - \varepsilon)\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1 + \varepsilon)\|x_i - x_j\|^2. \]

Checking the GARP conditions (statement 4 of Theorem 2.2.1) depends only on the relative value of the inner product between the probe and response vectors.
Hence, we can scale both the probe and response vectors such that their norms are less than one. In this case, as a consequence of preservation of the norms of the vectors, the Johnson-Lindenstrauss embedding also preserves the inner product.

$^{10}$ Other dimensionality reduction techniques such as Principal Component Analysis (PCA) are not compatible with GARP.

Corollary 2.5.1. Let $x_i, x_j \in \mathbb{R}^m$ with $\|x_i\| \leq 1$, $\|x_j\| \leq 1$ be such that (2.32) is satisfied with probability at least $1 - \delta$. Then,
\[ P\left( (x_i' x_j - f(x_i)' f(x_j)) \geq \varepsilon \right) \leq \delta. \]
The proof is available, for example, in [64]. The JL embedding of the vectors preserves the inner product to within an $\varepsilon$ fraction of the original value.

Therefore, to check for utility maximization behaviour, we first project the high dimensional probe and response vectors to a lower dimension using JL (using Theorem 2.5.1). The inner products in the lower dimensional space are then used for checking the GARP condition for detecting utility maximization, giving a factor of $m/k$ savings in computation.

2.6 Numerical results

The aim of this section is threefold. First, we illustrate the change point detection algorithm in Sec. 2.4 and show how the revealed preference framework is fundamentally different from classical change detection algorithms. Second, we show that the theory developed in Sec. 2.3 and Sec. 2.4, for utility change point detection, can detect changes in ground truths from online search behaviour. Also, the recovered utility functions satisfy the single crossing condition, indicating strategic substitution behaviour$^{11}$ in online search. Third, we show that user behaviour in YouTube satisfies utility maximization. To reduce the computational cost associated with checking the utility maximization behaviour, we use the dimensionality reduction techniques discussed in Sec. 2.5.

2.6.1 Detection of unknown change point in the presence of noise

In this section, we present numerical results on change point detection in the presence of noise.
Assume that the probe and response vectors are of dimension 2 (i.e. $m = 2$), and the utility function follows the model in (2.34). The base utility function $v(x)$ is a Cobb-Douglas utility function$^{12}$ with parameters $a_1$ and $a_2$:
\[ v(x) = x_1^{a_1} x_2^{a_2}, \qquad u(x) = \begin{cases} v(x), & t < \tau, \\ v(x) + \alpha' x, & t \geq \tau. \end{cases} \tag{2.34} \]
The response is measured in noise as specified in (2.5).

$^{11}$ The substitution behaviour in economics says that consumers, constrained by a budget, substitute more expensive items with less costly alternatives.

Fig. 2.2a shows the estimate for $\Phi_\tau$ in (2.23) as a function of $\tau$. The estimated change point ($\hat{\tau}$) is the point at which $\Phi_\tau$ attains its minimum. Fig. 2.2b compares the ROC curve for the revealed preference framework and the CUSUM algorithm for change detection. The CUSUM algorithm for change detection is provided in Sec. 2.8.7. The CUSUM algorithm is used as a reference for comparing the performance of the revealed preference framework presented in this chapter. The CUSUM algorithm in Sec. 2.8.7 makes two critical assumptions: (i) knowledge of the utility function before the change, and (ii) knowledge of the linear perturbation coefficients, and hence the utility function after the change. The only unknown is the change point at which the utility changed. However, if the linear perturbation coefficients are also unknown, then the CUSUM algorithm in Sec. 2.8.7 can be modified to search over $\mathbb{R}^m_+$ and select the parameter with the highest likelihood. The critical assumption in CUSUM is the knowledge of the utility function before the change point. One heuristic solution is to estimate the utility function using some initial data, assuming no change point, utilizing Afriat's Theorem, and then applying the CUSUM algorithm. Such a procedure is clearly suboptimal. In comparison, the revealed preference procedure in Sec. 2.4 makes no assumption about the base utility function or the linear perturbation coefficients. As can be gleaned from Fig.
2.2b, the performance of the revealed preference algorithm is comparable to the CUSUM algorithm, given the non-parametric assumptions.

$^{12}$ The Cobb-Douglas utility function is one of the most widely used utility functions in the micro-economics literature. One of the main reasons for the popularity of the Cobb-Douglas utility function is its simplicity. The utility maximization problem with the Cobb-Douglas utility function can be solved in closed form. Alternatives to the Cobb-Douglas utility include the linear utility, the min utility function, etc.

Figure 2.2: (a) Plot of $\log \Phi_\tau$ against the time index, with $\Phi_\tau$ obtained by solving the optimization problem in (2.23). The estimated change point ($\hat{\tau}$) is the point at which $\Phi_\tau$ attains its minimum. (b) ROC curve (true positive rate against false alarm) for Revealed Preference and CUSUM. Estimating the utility change point using the revealed preference framework (Fig. 2.2a), where no knowledge of the utility function is assumed. Fig. 2.2b compares the ROC curve of the revealed preference framework with the CUSUM algorithm. The CUSUM algorithm in Sec. 2.8.7 assumes knowledge of the utility function before and after the change point. However, the revealed preference framework considered in this chapter assumes no parametric knowledge of the utility function. The plots were generated with 1000 independent simulations. The parameters of the Cobb-Douglas utility $v(x)$ are $(a_1, a_2) = (0.6, 0.4)$ and $\alpha = (1, 1)$, with the change point set as 26. The budget is set to 5. The noise variance is 0.5.

2.6.2 Yahoo! Buzz Game

Here we present an example of a real dataset of the online search process. The objective is to investigate the utility maximization of the online search process and to detect time points at which the utility has changed.

The dataset that we use in our study is the Yahoo! Buzz Game Transactions from the Webscope datasets$^{13}$ available from Yahoo! Labs.
In 2005, Yahoo! along with O'Reilly Media started a fantasy market where the trending technologies at that point were pitted against each other. For example, in the browser market there were "Internet Explorer", "Firefox", "Opera", "Mozilla", "Camino", "Konqueror", and "Safari". The players in the game have access to the "buzz", which is the online search index, measured by the number of people searching on the Yahoo! search engine for the technology. The objective of the game is to use the buzz and trade stocks accordingly. The interested reader is referred to [65] for an overview of the Buzz game. An empirical study of the dataset [66] reveals that most traders in the Buzz game follow utility maximization behaviour. Hence, the dataset falls within the revealed preference framework, if we consider the buzz as the probe and the "trading price"$^{14}$ as the response to the utility maximizing behaviour.

$^{13}$ Yahoo! Webscope dataset: A2 - Yahoo! Buzz Game Transactions with Buzz Scores, version 1.0 http://research.yahoo.com/Academic Relations

$^{14}$ The trading price is indicative of the value of the stock.

Figure 2.3: (a) Buzz scores for WIFI and WIMAX. (b) Trading price of WIFI and WIMAX. Buzz scores and trading price for WIFI and WIMAX in the WIRELESS market from April 1 to April 29, 2005. The change point was estimated as April 18. This corresponds to a new WIFI product announcement. The change is visible in the sudden peak of interest in WIFI around April 18.

We consider a subset of the dataset containing the "WIRELESS" market, which contained two main competing technologies: "WIFI" and "WIMAX". Figure 2.3 shows the buzz and the "trading price" of the technologies from April 1 to April 29, 2005. The buzz is published by Yahoo!
at the start of each day and the "trading price" in Figure 2.3 was computed by averaging over the corresponding stock trading prices for the day.

Choose the probe and response vector for this dataset as
\[
\begin{aligned}
p_t &= [\text{Buzz(WIFI)}_t \quad \text{Buzz(WIMAX)}_t], \\
x_t &= [\text{Trading price(WIFI)}_t \quad \text{Trading price(WIMAX)}_t].
\end{aligned}
\tag{2.35}
\]
The inner product of the probe and response vector in (2.35) provides the total valuation of the "WIRELESS" market, which forms the budget constraint.

Checking the Afriat inequalities (2.3), we find that the dataset does not satisfy utility maximization for the entire duration from April 1 to April 29. However, the dataset does satisfy utility maximization from April 1 to April 17 (using the Afriat inequalities). Using the inequalities (2.13) to (2.15), we detected a change in utility at ($\tau =$) April 18. This change point corresponds to Intel's announcement of a WIMAX chip$^{15}$.

Also, by minimizing the 2-norm of the linear perturbation, i.e. solving the optimization problem (2.16) to (2.20), we find that the recovered linear coefficients which correspond to minimum perturbation are $\alpha = [0 \quad 5.9]$. This is intuitive: a positive change in the WIMAX utility, due to the change in ground truth. The recovered utility function, $v(x)$, is shown in Fig. 2.4a and the indifference curve (contour plot) of the base utility is shown in Fig. 2.4b.

$^{15}$ http://www.dailywireless.org/2005/04/17/intel-shipping-wimax-silicon/

Figure 2.4: (a) Recovered utility function. (b) Indifference curves, with axes Price (Valuation) WIFI and Price (Valuation) WIMAX. Fig. 2.4a shows the recovered utility function $v(x)$ using (2.21). The indifference curve of the recovered utility function is shown in Fig. 2.4b. The indifference curve indicates the strategic substitution behaviour in online search; see the discussion in Sec. 2.6.2.

The recovered base utility function in Fig. 2.4a satisfies the single crossing condition$^{16}$, indicating strategic substitute behaviour in online search.
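As an illustration of how a recovered utility surface such as the one in Fig. 2.4a can be evaluated, the following Python sketch computes the piecewise linear function in (2.21) and (2.22) at a point $x$. It assumes the quantities $\{v_t\}$, $\{\lambda_t\}$ and $\alpha$ have already been obtained from (2.16) to (2.20). The function name, argument layout and the 0-based change index `tau` are illustrative choices, not the thesis's implementation.

```python
def recovered_utility(x, v, lam, probes, responses, alpha, tau):
    """Evaluate v_hat(x) = min_t { v_t + lam_t * p_tilde_t' (x - x_t) }.

    Assumption: observations are indexed 0..T-1 and `tau` is the index of the
    first observation after the change (the thesis uses 1-based time indices).
    """
    values = []
    for t in range(len(v)):
        if t < tau:
            # Before the change, p_tilde_t is the probe itself.
            p_tilde = list(probes[t])
        else:
            # After the change, subtract alpha / lam_t component-wise, as in (2.22).
            p_tilde = [p - a / lam[t] for p, a in zip(probes[t], alpha)]
        slope_term = sum(pt * (xi - xti)
                         for pt, xi, xti in zip(p_tilde, x, responses[t]))
        values.append(v[t] + lam[t] * slope_term)
    # Lower envelope of the T affine overestimates.
    return min(values)
```

Evaluating this function on a grid of $x$ values would give a surface and contour lines of the kind shown in Fig. 2.4, under the stated assumptions.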
The substitute behaviour in online search can also be noticed from the indifference curve in Fig. 2.4b. This is due to the fact that WIFI and WIMAX were competing technologies for providing wireless local area networks.

$^{16}$ A utility function, $U(x_1, x_2)$, satisfies the single crossing condition if $\forall\, x_1' > x_1,\ x_2' > x_2$, we have $U(x_1', x_2) - U(x_1, x_2) \geq 0 \implies U(x_1', x_2') - U(x_1, x_2') \geq 0$; see [67]. The single crossing condition implies that $\arg\max_{x_1} U(x_1, x_2)$ is non-decreasing in $x_2$; this defines substitution behaviour. The single crossing condition is an ordinal condition and therefore compatible with Afriat's Theorem.

2.6.3 Youstatanalyzer database

We now analyze the utility maximizing behaviour of YouTube users. YouTube is an example of content-aware utility maximization, where the utility depends on the quality of the video present at any point. We measure the quality of the video using two measurable metrics: the number of subscribers and the number of views. We used the Youstatanalyzer database$^{17}$, which is built using the Youstatanalyzer tool [68]. The database is particularly suited for our study of dimensionality reduction in revealed preference (Sec. 2.5), since the database contains statistics of 1 million videos.

$^{17}$ The Youstatanalyzer dataset can be downloaded from http://www.congas-project.eu/sites/www.congas-project.eu/files/Dataset/youstatanalyzer1000k.json.gz. The dataset contains meta-level data for over 1 million videos from the period of July 2012 to September 2013. The complete list of collected parameters is given in Table 1 in [68].

From the database, we aggregated the statistics of all popular videos existing in the time interval from 08 July, 2012 to the end of 07 Sept, 2013, having at least 2 subscribers. This time interval was divided into 15 time periods, corresponding to each month of the duration, giving us a total of 15 observations ($T = 15$).
For the revealed preference analysis, the probe is the number of subscribers and the response is the number of views during the time period. The aim is to detect utility maximizing behaviour between the number of people subscribed to a particular video and the number of views that the video received. Formally, the probe and the response are chosen as follows:
\[ p_t = [1/\#\text{Subscriber}(\text{Video}_1), \ldots, 1/\#\text{Subscriber}(\text{Video}_N)], \tag{2.36} \]
\[ x_t = [\#\text{Views}(\text{Video}_1), \ldots, \#\text{Views}(\text{Video}_N)]. \tag{2.37} \]
The motivation for this definition is that as the number of subscribers to a video increases, the number of views also increases [69]. The inner product of the probe and the response vectors gives the sum of the "view focus" of all videos [70]. Also, a recent study shows that 90% of the YouTube views are due to 20% of the videos uploaded [71]. Hence, if we restrict attention to popular videos, the view focus tends to remain constant during a time period, which corresponds to the linear constraint in the revealed preference setting.

The number of videos satisfying the above requirements is 7605, and therefore the dimension of the probe and response is $m = 7605$. The number of computations required to evaluate the inner products for checking the GARP conditions (statement 4 of Theorem 2.2.1) is $mT^2$. Hence, we apply the Johnson-Lindenstrauss Lemma (Lemma 2.5.1) to the dataset using the "database friendly" transform (Theorem 2.5.1) in Sec. 2.5. We choose $\varepsilon = 0.1$, so that the inner products are within 90% of the accuracy. Also, we choose $\beta = 0.65$, such that the above condition on the inner products holds with probability at least 0.9. Substituting the values of $T$, $\varepsilon$ and $\beta$ in (2.33), we find that the dimension of the embedded subspace is $k = 3800$. We see that the GARP condition is satisfied with probability 0.9, which is in line with what we expect.
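The two steps used above, choosing $k$ from (2.33) and projecting with a random $\pm 1$ matrix as in Theorem 2.5.1, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the logarithm in (2.33) is taken to be the natural logarithm, $n$ is the number of vectors being embedded, and the function names `jl_min_dim` and `sign_project` are hypothetical.

```python
import math
import random


def jl_min_dim(n, eps, beta):
    """Smallest integer k that strictly exceeds the bound in (2.33).

    Assumption: `log` in (2.33) is the natural logarithm.
    """
    bound = (4 + 2 * beta) / (eps ** 2 / 2 - eps ** 3 / 3) * math.log(n)
    return math.floor(bound) + 1


def sign_project(vectors, k, seed=0):
    """Project m-dimensional vectors to k dimensions with a random +1/-1 matrix."""
    rng = random.Random(seed)
    m = len(vectors[0])
    # R is m x k with independent, equiprobable entries +1 and -1.
    R = [[rng.choice((-1, 1)) for _ in range(k)] for _ in range(m)]
    scale = 1.0 / math.sqrt(k)
    projected = []
    for x in vectors:
        # Because R holds only +1/-1, each coordinate is a signed sum of the
        # entries of x, followed by one common scaling by 1/sqrt(k).
        projected.append([scale * sum(x[i] * R[i][j] for i in range(m))
                          for j in range(k)])
    return projected
```

For instance, `jl_min_dim(n, 0.1, 0.65)` gives the smallest admissible $k$ for a chosen $n$, and stacking the probe and response vectors into `vectors` before calling `sign_project` applies the same matrix $R$ to both, so that inner products between projected probes and responses can be compared.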
For this example, the number of computations required to check the GARP condition in the lower dimensional subspace is given by $kT^2$, which is less than the number of computations required to check GARP in the original space by a factor of 2. The lower dimensional utility function obtained in Sec. 2.5 is useful for visualization purposes, and gives a sparse representation of the original utility function.

2.7 Closing remarks

The revealed preference framework is a nonparametric approach to detection of utility maximization. This chapter extended the classical revealed preference framework to dynamic utility maximizing agents. The main result (Theorem 2.3.1) provided necessary and sufficient conditions on the dataset for the existence of the change point at which the utility function changes abruptly by a linear perturbation. In addition, we proposed convex programs to recover the minimum linear perturbation and the base utility function. In the presence of noise, Theorem 2.3.1 may not be satisfied for any value of the change point. Hence, we provided a procedure for detecting the unknown change point and a hypothesis test for detecting dynamic utility maximization. We then considered the problem of detecting utility maximization behaviour in the big data domain. In order to reduce the computational cost, we proposed dimensionality reduction through the Johnson-Lindenstrauss transform.

The results were illustrated on a real dataset from Yahoo! Tech Buzz. The Yahoo dataset is an example of an online search dataset. The application of the results provided novel insights into the utility maximizing behaviour of agents in online search. The change point reflects the ground truth and the recovered utility function shows the strategic substitution behaviour in online search.

2.8 Proof of theorems

2.8.1 Proof of Theorem 2.3.1

Necessary Condition: Assume that the data has been generated by a utility function satisfying the model in (2.12).
An optimal interior point solution to the problem must satisfy the first order optimality conditions:
\[ \nabla_{x_t^i} v(x_t) = \lambda_t p_t^i \quad (t < \tau), \tag{2.38} \]
\[ \nabla_{x_t^i} v(x_t) + \alpha^i = \lambda_t p_t^i \quad (t \geq \tau). \tag{2.39} \]
At time $t$, the concavity of the utility function implies:
\[ u(x_t, \alpha, t) + \nabla_{x_t} u(x_t, \alpha, t)'(x_s - x_t) \geq u(x_s, \alpha, t) \quad \forall s. \tag{2.40} \]
Substituting the first order conditions (2.38) and (2.39) into (2.40) yields
\[ v(x_t) + \lambda_t p_t'(x_s - x_t) \geq v(x_s) \quad (t < \tau), \tag{2.41} \]
\[ v(x_t) + \lambda_t p_t'(x_s - x_t) - \alpha'(x_s - x_t) \geq v(x_s) \quad (t \geq \tau). \tag{2.42} \]
Denoting $v(x_t) = v_t$ yields the set of inequalities (2.13), (2.14). Inequality (2.15) holds since the utility function $v(x)$ is monotone increasing.

Sufficient Condition: We first construct a piecewise linear utility function $V(x)$ from the lower envelope of the $T$ overestimates, to approximate the function $v(x)$ defined in (2.12),
\[ V(x) = \min_{t} \{ v_t + \lambda_t \tilde{p}_t'(x - x_t) \}, \tag{2.43} \]
where each element of $\tilde{p}_t$ is defined as
\[ \tilde{p}_t^i = \begin{cases} p_t^i, & t < \tau, \\ p_t^i - \alpha^i/\lambda_t, & t \geq \tau. \end{cases} \tag{2.44} \]
To verify that the construction in (2.43) is indeed correct, consider an arbitrary response, $\hat{x}$, such that $p_t'\hat{x} \leq p_t' x_t$ $^{18}$. We need to show $V(\hat{x}) + \alpha'\hat{x} \leq V(x_t) + \alpha' x_t$.

$^{18}$ In economics, $x_t$ is said to be "revealed preferred" to $\hat{x}$. Since $x_t$ was chosen as the response to the probe $p_t$, the utility at $x_t$ should be higher than the utility at $\hat{x}$.

First, we show that $V(x_t) = v_t$, $t = 1, \ldots, T$, as follows. From (2.43),
\[ V(x_t) = v_m + \lambda_m \tilde{p}_m'(x_t - x_m), \]
for some $m$. In particular, if $m \geq \tau$,
\[
\begin{aligned}
V(x_t) &= v_m + \lambda_m \tilde{p}_m'(x_t - x_m) \\
&= v_m + \lambda_m p_m'(x_t - x_m) - \alpha'(x_t - x_m) \\
&\leq v_t + \lambda_t p_t'(x_t - x_t) \qquad (2.45) \\
&= v_t.
\end{aligned}
\]
If the inequality in (2.45) were strict, it would violate (2.42). Similarly, it can be shown that if $m < \tau$, $V(x_t) = v_t$. Hence, $V(x_t) = v_t$.

Next, we show $V(\hat{x}) + \alpha'\hat{x} \leq V(x_t) + \alpha' x_t$. If $t \geq \tau$,
\[
\begin{aligned}
V(\hat{x}) + \alpha'\hat{x} &\leq v_t + \lambda_t \tilde{p}_t'(\hat{x} - x_t) + \alpha'\hat{x} \\
&= v_t + \lambda_t p_t'(\hat{x} - x_t) - \alpha'(\hat{x} - x_t) + \alpha'\hat{x} \\
&= v_t + \lambda_t p_t'(\hat{x} - x_t) + \alpha' x_t \\
&\leq v_t + \alpha' x_t \\
&= V(x_t) + \alpha' x_t.
\end{aligned}
\]
The inequality holds, similarly, for the case $t < \tau$.
Therefore, we can construct a utility function consistent with the model in (2.12).

2.8.2 Negative dependence of random variables

Definition 2.8.1 ([72]). Random variables $X_1, \ldots, X_n$, $n \geq 2$, are said to be negatively dependent if
\[ P\left\{ \cap_{k=1}^{n} \{X_k \leq x_k\} \right\} \leq \prod_{k=1}^{n} P\{X_k \leq x_k\}, \]
and
\[ P\left\{ \cap_{k=1}^{n} \{X_k > x_k\} \right\} \leq \prod_{k=1}^{n} P\{X_k > x_k\}. \]
Negative dependence allows us to bound the joint distribution of the random variables in terms of the marginals.

The variable $M$ in (2.10) is the highest order statistic of the set of random variables $\mathcal{M}$ defined as:
\[ \mathcal{M} \triangleq \{ (p_t'(w_t - w_s)) : s, t \in \{1, 2, \ldots, T\},\ s \neq t \}. \tag{2.46} \]
Define $\xi \subset \mathcal{M}$ as
\[ \xi = \{ p_1'(w_1 - w_2),\ p_2'(w_2 - w_3),\ \ldots,\ p_T'(w_T - w_1) \}. \tag{2.47} \]

Lemma 2.8.1. If $\{w_1, w_2, \ldots, w_T\}$ in (2.47) are i.i.d. zero mean unit variance Gaussian vectors, then the set of random variables in $\xi$ are negatively dependent.

Proof. Each of the random variables in the set $\xi$ (defined in (2.47)) is Gaussian. Hence, to show negative dependence of the random variables in $\xi$, it is sufficient to show that these variables are negatively correlated [72, 73]. Any element in $\xi$, $p_i'(w_i - w_{i+1})$, is correlated with either:

1. An element of the form $p_{i+1}'(w_{i+1} - w_{i+2})$:
\[ E\{ (p_i'(w_i - w_{i+1}))(p_{i+1}'(w_{i+1} - w_{i+2})) \} = -p_i' p_{i+1} < 0. \]

2. An element of the form $p_k'(w_k - w_{k+1})$, $k \notin \{i, i+1\}$:
\[ E\{ (p_i'(w_i - w_{i+1}))(p_k'(w_k - w_{k+1})) \} = 0. \]

So the random variables in $\xi$ (2.47) are negatively correlated and hence negatively dependent, as defined in Def. 2.8.1.

2.8.3 Proof of Theorem 2.2.2

For any subset of the random variables, $\xi$, with
\[ \xi \subset \mathcal{M} = \{ (p_t'(w_t - w_s)) : s, t \in \{1, 2, \ldots, T\},\ s \neq t \}, \]
we have
\[ P\{M \leq x\} \leq P\{\max_i \xi_i \leq x\} = P\{\xi_1 \leq x, \ldots, \xi_T \leq x\}. \]
Choose the set $\xi$ to be the set defined in (2.47). From Lemma 2.8.1, the random variables in $\xi$ are negatively dependent, as defined in Def. 2.8.1. Hence,
\[ P\{M \leq x\} \leq \prod_i P\{\xi_i \leq x\}. \]
Each of the terms in $\xi$, $p_t'(w_t - w_{t+1})$, is distributed as $N(0, 2\|p_t\|^2)$.
Using the standard lower bound for the tail of the Gaussian distribution, we have
\[ P\{M \leq x\} \leq \prod_t \left( 1 - \sqrt{\frac{2}{\pi}}\, \frac{\sqrt{2\|p_t\|^2}}{x + \sqrt{x^2 + 8\|p_t\|^2}} \exp\left(-x^2/4\|p_t\|^2\right) \right). \]
The false alarm probability is given by $1 - P\{M \leq \Phi^*(y)\}$. Substituting the upper bound for $P\{M \leq \Phi^*(y)\}$, we get a lower bound for the false alarm probability.

2.8.4 Proof of Theorem 2.4.1

The proof of Theorem 2.4.1 relies on two lemmas, Lemma 2.8.2 and Lemma 2.8.3, which are stated below. Lemma 2.8.2 states that for a "sufficient" number of observations the random variables are "almost" positive.

Lemma 2.8.2. Assume $\{w_1, w_2, \ldots, w_T\}$ in (2.5) are i.i.d. zero mean unit variance Gaussian random vectors. For $\varepsilon > 0$, $T = O(\log 1/\varepsilon)$ and $T - \tau = O(\log 1/\varepsilon)$, we have:
\[ P\{M_1 \leq 0\} < \varepsilon, \qquad P\{M_2 \leq 0\} < \varepsilon. \]

Define auxiliary random variables, $\hat{M}_1$ and $\hat{M}_2$, which correspond to the truncated distributions of $M_1$ and $M_2$ as shown below:
\[ f_{\hat{M}_i}(x) = f_{M_i}(x)\,\mathbf{1}\{x \geq 0\} + P(M_i < 0)\,\delta(x); \quad i = 1, 2, \tag{2.48} \]
where $\delta(x)$ is the delta function. Then, Lemma 2.8.3 states that the expectations of the auxiliary random variables $\hat{M}_i$, $i = 1, 2$, are close to the expectations of the original random variables $M_i$, $i = 1, 2$.

Lemma 2.8.3. Assume $\{w_1, w_2, \ldots, w_T\}$ in (2.5) are i.i.d. zero mean unit variance Gaussian random vectors. For $\varepsilon > 0$, $T = O(\log 1/\varepsilon)$ and $T - \tau = O(\log 1/\varepsilon)$, we have:
\[ |E\hat{M}_1 - EM_1| < 2\varepsilon, \qquad |E\hat{M}_2 - EM_2| < 2\varepsilon. \]

The proofs of Lemma 2.8.2 and Lemma 2.8.3 are provided in Sec. 2.8.5 and Sec. 2.8.6, respectively.

Proof (Theorem 2.4.1). For $\Phi^*(y) > 0$, the probability of Type-I error is given by $P\{M \geq \Phi^*(y)\}$:
\[ P\{M \geq \Phi^*(y)\} = P\{M_1 + M_2 \geq \Phi^*(y)\}. \]
If $\tau = O(1/\varepsilon)$, by Lemma 2.8.2 and Lemma 2.8.3, the original random variables have a small probability of being less than 0, and the expectation of the truncated distribution is close to that of the original distribution. Hence,
\[ P\{M \geq \Phi^*(y)\} = P\{\hat{M}_1 + \hat{M}_2 \geq \Phi^*(y)\}. \]
By the Markov inequality,
\[ P\{M \geq \Phi^*(y)\} \leq \frac{E\{\hat{M}_1 + \hat{M}_2\}}{\Phi^*(y)} = \frac{E\{\hat{M}_1\}}{\Phi^*(y)} + \frac{E\{\hat{M}_2\}}{\Phi^*(y)}. \]
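Since $\hat{M}_2$ is a positive random variable,
\[
\begin{aligned}
\frac{E\{\hat{M}_1\}}{\Phi^*(y)} + \frac{E\{\hat{M}_2\}}{\Phi^*(y)}
&= \frac{E\{\hat{M}_1\}}{\Phi^*(y)} + \frac{\int_0^{\infty} P(\hat{M}_2 > z)\, dz}{\Phi^*(y)} \\
&\leq \frac{E\{\hat{M}_1\}}{\Phi^*(y)} + \frac{\int_0^{\infty} \sum_{s,t;\ t \geq \tau,\ s \neq t} P\left(\alpha'(w_t - w_s)/\lambda_t > z\right) dz}{\Phi^*(y)} \\
&= \frac{E\{\hat{M}_1\}}{\Phi^*(y)} + \frac{\int_0^{\infty} \sum_{s,t;\ t \geq \tau,\ s \neq t} \exp\left(-z^2 \lambda_t^2 / 4\|\alpha\|^2\right) dz}{\Phi^*(y)}.
\end{aligned}
\]
Hence, the probability of Type-I error is minimized by minimizing $\|\alpha\|_2$.

2.8.5 Proof of Lemma 2.8.2

\[ P\{M_1 \leq 0\} = P\left\{ \max_{s,t;\ s \neq t} \left( p_t'(w_t - w_s) \right) \leq 0 \right\}. \]
Choosing the set $\xi \subset \mathcal{M}$ as defined in (2.47), and since the set $\xi$ is negatively dependent by Lemma 2.8.1,
\[ P\{M_1 \leq 0\} \leq P\{\max_i \xi_i \leq 0\} \leq \prod_i P\{\xi_i \leq 0\}. \]
Each term $\xi_i = p_t'(w_t - w_{t+1})$ is distributed as $N(0, 2\|p_t\|^2)$. Let $F_{N(\mu,\sigma^2)}$ denote the cdf of a Gaussian random variable with mean $\mu$ and variance $\sigma^2$. Noting that $F_{N(0,\sigma^2)}(0) = 1/2$, we have
\[ \prod_i P\{\xi_i \leq 0\} = \prod_t F_{N(0, 2\|p_t\|^2)}(0) = \prod_t \frac{1}{2} = \frac{1}{2^T} < \varepsilon. \]
The proof for the second part is similar, by an appropriate choice of a negatively dependent set $\xi$, and is hence omitted.

2.8.6 Proof of Lemma 2.8.3

From the definition of the random variable $\hat{M}_1$ in (2.48),
\[ E\{\hat{M}_1\} = \int_0^{+\infty} x f_{M_1}(x)\, dx + P(M_1 < 0) < \int_0^{+\infty} x f_{M_1}(x)\, dx + \varepsilon, \tag{2.49} \]
where the inequality in (2.49) follows from Lemma 2.8.2. The expectation of $M_1$ is given by
\[ E\{M_1\} = E\{M_1 \mathbf{1}\{x \geq 0\}\} + E\{M_1 \mathbf{1}\{x \leq 0\}\}. \tag{2.50} \]
To continue with the proof, we derive a lower bound on $E\{M_1 \mathbf{1}\{x \leq 0\}\}$, the second term in (2.50). For computing the lower bound, we proceed by integration by parts:
\[ E\{M_1 \mathbf{1}\{x \leq 0\}\} = \int_{-\infty}^{0} x f_{M_1}(x)\, dx = -\int_{-\infty}^{0} P\{M_1 \leq x\}\, dx. \]
Choosing the negatively dependent subset $\xi \subset \mathcal{M}$ defined in (2.47), noting that each $\xi_i$ is distributed as $N(0, 2\|p_i\|^2)$, and using the analytical expression for bounds of the cdf of the Gaussian density, we obtain
\[ E\{M_1 \mathbf{1}\{x \leq 0\}\} \geq -\int_{-\infty}^{0} \prod_i P\{\xi_i \leq x\}\, dx \geq -\varepsilon. \tag{2.51} \]
From (2.49) and (2.51) we get the first part of Lemma 2.8.3. The proof for the second part is similar and hence omitted.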
2.8.7 CUSUM algorithm for utility change point detection

Algorithm 1 CUSUM algorithm for utility change point detection
Initialize: set threshold $\rho > 0$; set cumulative sum $S(0) = 0$; set decision function $G(0) = 0$.
for $t = 1$ to $T$ do
    For probe $p_t$ and observed response $y_t$:
    $x_t(0) = \arg\max_{\{p_t' x \leq I_t\}} v(x)$, with $v(x)$ as in (2.34).
    $x_t(1) = \arg\max_{\{p_t' x \leq I_t\}} v(x) + \alpha' x$.
    Likelihood $\ell(y_t, i) = P(y_t \mid x_t(i))$ $^{19}$; $i = 0, 1$.
    Instantaneous log likelihood $s(t) = \log\left( \ell(y_t, 1)/\ell(y_t, 0) \right)$.
    $S(t) = S(t-1) + s(t)$.
    $G(t) = \{G(t-1) + s(t)\}^{+}$, where $\{x\}^{+} = \max\{x, 0\}$.
    if $G(t) > \rho$ then
        Change point estimate $\hat{\tau} = \arg\min_{1 \leq \tau \leq t} S(\tau - 1)$
        break
    end if
end for

$^{19}$ In our example, the probability is given by the Gaussian distribution.

Chapter 3

Engagement Dynamics and Sensitivity Analysis of YouTube videos

3.1 Introduction

The YouTube social network contains over 1 billion users who collectively watch millions of hours of YouTube videos and generate billions of views every day. Additionally, users upload over 300 hours of video content every minute. YouTube generates billions in revenue through advertising and, through the Partner program, shares the revenue with the content creators.

The video view count is a key measure of the popularity of a video and the metric by which YouTube pays the content providers$^{20}$. A key question is: How do meta-level features of a posted video (e.g. thumbnail, title, tags, description) drive user engagement in the YouTube social network? However, the content alone does not influence the popularity of a video. YouTube also has a social network layer on top of its media content. The main social component is how the content creators (also called "channels") interact with the users. So another key question is: How does the interaction of the YouTube channel with the user affect the popularity of videos? In this chapter, we study both the above questions.
In particular, our aim is to examine how the individual video features (through the meta-level data) and the social dynamics contribute to the popularity of a video.

$^{20}$ However, recently, view time is gaining more prominence than view count.

Main results: In this chapter, we investigate how the meta-level features and the interaction of the YouTube channel with the users affect the popularity of videos. The main empirical conclusions of this chapter are:

1. The five dominant meta-level features that affect the popularity of a video are: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, and number of keywords. Sec. 3.2 discusses this further.

2. Optimizing the meta-level features (e.g. thumbnail, title, tags, description) after a video has been posted increases the popularity of the video. In addition, optimizing the title increases the traffic due to YouTube search, optimizing the thumbnail increases the traffic from related videos, and optimizing the keywords increases the traffic from related and promoted videos. Sec. 3.2.4 provides details on this analysis.

3. Insight into the causal relationship between the subscribers and view count for YouTube channels is also explored. For popular YouTube channels, we found that the channel view count affects the subscriber count; see Sec. 3.3.1.

4. New insights into the scheduling dynamics in YouTube gaming channels are also found. For channels with a dominant periodic uploading schedule, going "off the schedule" increases the popularity of the channel; see Sec. 3.3.2.

5. The generalized Gompertz model can be used to distinguish views due to virality (views from subscribers), migration (views from non-subscribers) and exogenous events; see Sec. 3.3.3.

6. New insights into playlist dynamics. The early view count dynamics of a YouTube video are highly correlated with the long term "migration" of viewers to the video.
Also, early videos in a game playthrough typically contain higher views compared with later videos in a game playthrough playlist; see Sec. 3.3.4.

7. The number of subscribers of a channel only affects the early view count dynamics of videos in a playthrough; see Sec. 3.3.4.

All the above results$^{21}$ are validated on a YouTube dataset consisting of over 6 million videos across 25 thousand channels. This dataset$^{22}$ was provided to us by BroadbandTV Corp. (BBTV). The dataset consists of daily samples of metadata of the YouTube videos on the BBTV platform from April, 2007 to May, 2015. BBTV is one of the largest Multi-channel networks (MCN) in the world$^{23}$. The results of the chapter allow YouTube partners such as BBTV to adapt their user engagement strategies to generate more views and hence increase revenue.

$^{21}$ Caveat: It is important to note that the above empirical conclusions are based on the BBTV dataset. These videos cover the YouTube categories of gaming, entertainment, food, music, and sports as described in Table 3.7 of Sec. 3.5.1. Whether the above conclusions hold for other types of YouTube videos is an open issue that is beyond the scope of this thesis.

$^{22}$ Sec. 3.5.1 summarizes the key features of the YouTube dataset that we have used.

$^{23}$ http://variety.com/2016/digital/news/broadbandtv-mcn-disney-maker-comscore-1201696857/

The organization of the chapter is as follows. The chapter contains two main sections, each of which addresses one of the above questions. In Sec. 3.2, we use several machine learning methods to characterize the sensitivity of meta-level features on the popularity of YouTube videos. In Sec. 3.3, we use time series methods to analyze how the interaction of the channel affects the popularity of the content.

3.2 Sensitivity analysis of YouTube meta-level features

In this section we apply machine learning methods to study how meta-level features of a YouTube video impact the view count of the video.
The main machine learning method that we use is the Extreme Learning Machine (ELM). Section 3.2.1 provides a brief background on the ELM. Given a trained ELM, Section 3.2.2 provides a brief background on the various sensitivity analysis methods. Section 3.2.3 provides the sensitivity analysis results on the BBTV dataset. In Section 3.2.3, we also compare the ELM algorithm with several state of the art machine learning algorithms. Fig. 3.1 illustrates a trace of the subscribers (one of the meta-level features) when the video was posted, and the associated view count 14 days after the video has been posted. The machine learning algorithms must be able to address the challenging problem of mapping from such noisy meta-level features (as shown in Fig. 3.1) to the associated view count of a video. Among the machine learning methods considered, Section 3.2.3 shows that the ELM performs well enough both to identify the meta-level features that contribute significantly to the view count of a video and to predict the view count of videos.

[Figure 3.1: two log-log panels, view count versus video index (left) and subscribers versus video index (right).]
Figure 3.1: The left figure shows the view count of all videos (arranged according to decreasing order of view count) after 14 days of the video being posted. The right figure shows the associated subscriber count when the video was posted.

3.2.1 Extreme learning machine (ELM)

The dataset of features (described in Sec. 3.2.3) and view count is denoted as $D = \{(x_i, v_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^m$ is the feature vector, of dimension $m$, for video $i$, and $v_i$ is the total view count for video $i$. Here, $N$ is the number of videos in the training dataset (the ELM was trained for three categories of videos; for details see Sec. 3.2.3).
The ELM is a single hidden-layer feed-forward neural network; that is, the ELM consists of an input layer, a single hidden layer of $L$ neurons and an output layer. Each hidden-layer neuron can have a unique transfer function. Popular transfer functions include the sigmoid, hyperbolic tangent, and Gaussian. However, any non-linear piecewise continuous function can be utilized. The output layer is obtained by a weighted linear combination of the outputs of the $L$ hidden neurons.

The ELM model presented in [74, 75] is given by:

$$v_i = \sum_{k=1}^{L} \beta_k h_k(x_i; \theta_k), \tag{3.1}$$

where $\beta_k$ is the weight of neuron $k$, $h_k(\cdot\,; \theta_k)$ is the hidden-layer neuron transfer function with parameter $\theta_k$, and $L$ is the total number of hidden-layer neurons in the ELM. Given $D$, how can the ELM model parameters $\beta_k$, $\theta_k$, and $L$ in (3.1) be selected? Given $L$, the ELM trains $\beta_k$ and $\theta_k$ in two steps. First, the hidden layer parameters $\theta_k$ are randomly initialized. Any continuous probability distribution can be used to initialize the parameters $\theta_k$. Second, the parameters $\beta_k$ are selected to minimize the square error between the model output and the measured output from $D$. Formally,

$$\beta^* \in \arg\min_{\beta \in \mathbb{R}^L} \left\{ \| H\beta - V \|_2^2 \right\}, \tag{3.2}$$

where $H$ denotes the $N \times L$ hidden-layer output matrix with entries $H_{jk} = h_k(x_j; \theta_k)$ for $j \in \{1, 2, \dots, N\}$ and $k \in \{1, 2, \dots, L\}$, and $V = [v_1, v_2, \dots, v_N]'$ is the target output. The solution to (3.2) is given by $\beta^* = H^{\dagger} V$, where $H^{\dagger}$ denotes the Moore-Penrose generalized inverse of $H$. The major benefit of using the ELM, compared to other single hidden-layer feed-forward neural networks, is that the training only requires the random generation of the parameters $\theta_k$, and the parameters $\beta_k$ can be computed as the solution of a set of linear equations.
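The two-step training procedure for (3.1) and (3.2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this chapter: the sigmoid transfer function, the Gaussian initialization of the hidden-layer parameters, and the names `train_elm` and `predict_elm` are assumptions made only for this example.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid transfer function (one of the popular choices named in the text)."""
    return 1.0 / (1.0 + np.exp(-z))


def train_elm(X, v, L=100, seed=0):
    """Fit the ELM (3.1) on features X (N x m) and targets v (length N).

    Step 1: draw the hidden-layer parameters theta_k = (w_k, b_k) at random.
            A Gaussian is assumed here; any continuous distribution is allowed.
    Step 2: solve the least-squares problem (3.2) via the Moore-Penrose inverse.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    W = rng.normal(size=(m, L))      # hidden-layer weights, one column per neuron
    b = rng.normal(size=L)           # hidden-layer biases
    H = sigmoid(X @ W + b)           # N x L hidden-layer output matrix
    beta = np.linalg.pinv(H) @ v     # beta* = H^dagger V
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Evaluate (3.1): a weighted linear combination of the hidden-neuron outputs."""
    return sigmoid(X @ W + b) @ beta
```

Only the output weights are fitted; the hidden layer stays fixed after the random draw, which is why training reduces to a single linear least-squares solve.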
The computational cost of training the ELM is $O(N^3)$ for constructing the Moore-Penrose inverse [76].

3.2.2 Sensitivity analysis (Background)

There are several sensitivity analysis techniques available in the literature [77, 78], which can be classified into two groups: filter methods and wrapper methods. The filter methods consider only the meta-level features and the view count, without the information available from a machine learning algorithm. The wrapper methods, on the other hand, utilize the information from the machine learning algorithm. Typically, wrapper methods give a more accurate measure of the sensitivity compared to filter methods [77, 78]. However, filter methods are computationally less expensive than wrapper methods and do not require the training and evaluation of the machine learning algorithm. Given the noise present in the meta-level features (Fig. 3.1) and the non-linearity between the meta-level features and view count, filter methods are not suitable for the sensitivity analysis of the meta-level features. Hence, in this section we focus on two wrapper methods suitable for estimating the sensitivity of the view count of YouTube videos to the meta-level features.

For the first method we focus on the ELM (3.1) for evaluating the sensitivity of the meta-level features; however, the method can be used for any machine learning method. Given that the ELM (3.1) is a single hidden-layer feed-forward neural network, it is possible to evaluate the sensitivity of the meta-level features by taking the partial derivative of (3.1) for the trained ELM. Note that this method is utilized to estimate the sensitivity of input features in neural networks [79]. The sum of squares derivatives, denoted by $\mathrm{SSD}_k$ for meta-level feature $x^{(k)}$, is given by:

$$\mathrm{SSD}_k = \sum_{i=1}^{N} \left( \frac{\partial v_i}{\partial x^{(k)}} \right)^2 = \sum_{i=1}^{N} \left( \sum_{l=1}^{L} \beta_l \frac{\partial h_l(x_i; \theta_l)}{\partial x^{(k)}} \right)^2. \tag{3.3}$$

The variable with the largest $\mathrm{SSD}_k$ is most influential to the prediction of the view count $v$ in (3.1) using the ELM.
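For a concrete reading of (3.3), the sketch below evaluates the sum of squares derivatives for an ELM whose hidden neurons are sigmoids, $h_l(x) = \sigma(w_l' x + b_l)$, so that $\partial h_l / \partial x^{(k)} = \sigma'(\cdot)\, w_{kl}$. The sigmoid form, the parameter layout (`W`, `b`, `beta`) and the function names are assumptions made for this illustration; they are not taken from the thesis.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ssd(X, W, b, beta):
    """Sum of squares derivatives (3.3) for a sigmoid ELM.

    X: N x m feature matrix, W: m x L hidden weights, b: L biases,
    beta: L output weights. Returns a length-m vector, one SSD_k per feature.
    """
    H = sigmoid(X @ W + b)              # N x L hidden outputs
    G = H * (1.0 - H) * beta            # beta_l * sigma'(z_il), shape N x L
    grads = G @ W.T                     # N x m matrix of dv_i / dx^(k)
    return np.sum(grads ** 2, axis=0)   # sum over the N videos


def normalized_ssd(X, W, b, beta):
    """Scale the SSD values by the largest one, so the top feature equals 1."""
    s = ssd(X, W, b, beta)
    return s / s.max()
```

The same quantity can be approximated by finite differences for models where an analytic derivative is not available.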
Note that since the ELM is trained using all the meta-level features, the $\mathrm{SSD}_k$ evaluates the average sensitivity to changes in a single meta-level feature with all other features held constant.

A state of the art filter method when there are significant interdependency relationships is the Hilbert-Schmidt Independence Criterion Lasso (HSIC-Lasso) [80]. The main idea of this method is to combine the benefits of the least absolute shrinkage and selection operator (Lasso) with a feature-wise kernel to capture the non-linear input-output dependency. The HSIC-Lasso is given by the solution to the following convex optimization problem:

$$\min_{\alpha \in \mathbb{R}^m} \; \frac{1}{2} \left\| \bar{L} - \sum_{k=1}^{m} \alpha_k \bar{K}^{(k)} \right\|_F^2 + \lambda \|\alpha\|_1, \tag{3.4}$$

where $\lambda$ is the regularization parameter, $\bar{L} = \Gamma L \Gamma$ and $\bar{K}^{(k)} = \Gamma K^{(k)} \Gamma$ are centered Gram matrices, $K^{(k)}_{i,j} = K(x_{k,i}, x_{k,j})$ and $L_{i,j} = L(v_i, v_j)$ are Gram matrices, $K(\cdot, \cdot)$ and $L(\cdot, \cdot)$ are kernel functions²⁴, $\Gamma = I - \frac{1}{N} 1_N 1_N'$ is the centering matrix, $I$ is the identity matrix and $1_N$ is the vector of ones. The importance of meta-level feature $k$ is then measured by the corresponding entry $\alpha_k$ of the vector $\alpha$.

Both of these methods will be applied to the YouTube dataset to study the sensitivity of a video's view count to its meta-level features.

3.2.3 Sensitivity of YouTube meta-level features and predicting view count

In this section, the ELM (3.1) and other state-of-the-art machine learning methods are applied to the YouTube dataset to compute the sensitivity of the view count of a video to the video's meta-level features, based on the feature importance measure $\mathrm{SSD}_k$ (3.3). Videos of different popularity (i.e. highly popular, popular, and unpopular as defined in Table 3.8 in Sec. 3.5.1) may have different sensitivities to the meta-level features. Hence, we independently perform the sensitivity analysis on the three popularity categories.
First we define the meta-level features for each video, then evaluate the meta-level feature sensitivities on the associated view count, and finally provide methods to predict the view count of YouTube videos using various machine learning techniques. The analysis provides insight into which meta-level features are useful for optimizing the view count of a YouTube video.

²⁴ In Section 3.2.3 we used the Gaussian kernel.

Meta-level feature construction

Each YouTube video contains four primary components: the Thumbnail of the video, the Title of the video, the Keywords (also known as tags), and the description of the video. However, in typical user searches only a subset of the description is provided to the user. Therefore, we do not consider the contents of the description to significantly affect the view count of the video. The meta-level features are constructed²⁵ using the Thumbnail, Title, and Keywords. For the Thumbnail, 19 meta-level features are computed which include: the blurriness (e.g. CannyEdge, Laplace Frequency), brightness, contrast (e.g. tone), overexposure, and entropy of the thumbnail. For the Title, 23 meta-level features are computed which include: word count, punctuation count, character count, Google hits (i.e. the number of results found when the title is entered into the Google search engine), and the Sentiment/Subjectivity of the title computed using Vader [81] and TextBlob²⁶. For the Keywords, 7 meta-level features are computed which include: the number of keywords, and keyword length.

²⁵ The meta-level features were constructed manually. The features were constructed based on existing literature and features that can be extracted using off-the-shelf software. In addition, an analysis by experts at BroadbandTV Corp. confirmed that the 54 features are comprehensive for the sensitivity analysis.
In addition to the above 49 meta-level features, we also include auxiliary user meta-level features including: the number of subscribers, resolution of the thumbnail used, category of the video, the length of the video, and the first day view count of the video. Note that our analysis does not consider the video or audio quality of the YouTube video. Our analysis is focused on the sensitivity of the view count based on the Thumbnail, Title, Keywords, and auxiliary channel information of the user that uploaded the video. In total 54 meta-level features are computed for each video. The complete dataset used for the sensitivity analysis is given by $D = \{(x_i, v_i)\}_{i=1}^{N}$, with $x_i \in \mathbb{R}^{54}$ the computed meta-level features for video $i \in \{1, \dots, N\}$, $v_i$ the view count 14 days after the video is published, and $N = 10^4$ the total number of videos used for the sensitivity analysis. Note that the view count $v_i$ is on the log scale (i.e. if a video has $10^6$ views then $v_i = 6$). This is a necessary step as the view counts range from $10^2$ to above $10^7$.

Prior to performing any analysis, we pre-process the meta-level features in the dataset $D$. First, all the meta-level features are scaled to satisfy $x^{(k)} \in [0, 1]$. Note that the meta-level features were not whitened (i.e. the meta-level data was not transformed to have an identity covariance matrix). The second pre-processing step involves removing redundant features in $D$. Feature selection is a popular method for eliminating redundant meta-level features. In this work, we employ a correlation based feature selection using the Pearson correlation coefficient (which was used for feature selection in [82]) to eliminate the redundant meta-level features. Of the original 54 meta-level features, $m = 29$ meta-level features remain after the removal of the correlated meta-level features.
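The pre-processing steps described above (log-scale view count, scaling to $[0, 1]$, and removal of correlated features) can be sketched as follows. The text does not state the correlation cut-off or the order in which redundant features are discarded, so the threshold of 0.9 and the greedy keep-first rule below are assumptions made for illustration, as are the function names.

```python
import numpy as np


def log_views(views):
    """Map raw view counts to the log scale, e.g. 10**6 views -> 6."""
    return np.log10(views)


def minmax_scale(X):
    """Scale every feature (column) of X to the interval [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # constant columns are mapped to 0
    return (X - lo) / span


def drop_correlated(X, threshold=0.9):
    """Return indices of features kept by a greedy Pearson-correlation filter.

    A feature is kept only if its absolute Pearson correlation with every
    previously kept feature is below `threshold` (an assumed value).
    """
    corr = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))
    keep = []
    for k in range(X.shape[1]):
        if all(corr[k, j] < threshold for j in keep):
            keep.append(k)
    return keep
```

A typical use would be `X_sel = minmax_scale(X)[:, drop_correlated(minmax_scale(X))]`, after which the reduced feature matrix is passed to the learning algorithms.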
Note that removal of these features does not significantly impact the performance of the machine learning algorithms or the sensitivity analysis results.

²⁶ http://textblob.readthedocs.io/en/dev/

Meta-level feature sensitivity

Given the dataset $D = \{(x_i, v_i)\}_{i=1}^{N}$ constructed in Sec. 3.2.3, the goal is to estimate which features significantly contribute to the view count of a video. To perform this sensitivity analysis we use five machine learning algorithms: the ELM, Bagged MARS using gCV Pruning [83]²⁷, Conditional Inference Random Forest (CIRF) [84]²⁸, Feed-Forward Neural Network (FFNN) [85], and the feature selection method Hilbert-Schmidt Independence Criterion Lasso (HSIC-Lasso) [80]. Each of these models is trained using 10-fold cross validation, and the design parameters of each were optimized via extensive empirical evaluation. We selected the ELM (3.1) to contain $L = 100$ neurons, which ensures that we have sufficient accuracy on the predicted view count given the features $x_i$, while reducing the effects of over-fitting. For the CIRF the design parameter for randomly selected predictors was set to 6, and for the FFNN we use 10 neurons in the hidden layer. The HSIC-Lasso regularization parameter was set to 100. Given the trained models, the sensitivity of the view count to the meta-level features of a video is computed by evaluating the sum of squares derivatives, $\mathrm{SSD}_k$ (3.3). Fig. 3.2 shows the normalized²⁹ $\mathrm{SSD}_k$ for the five highest sensitivity meta-level features of these five machine learning methods. Note that for the HSIC-Lasso we do not use the $\mathrm{SSD}_k$ but instead the values of the coefficient $\alpha$ in (3.4), which provide an estimate of the feature sensitivity. Recall, from Sec. 3.2.2, that the larger the $\mathrm{SSD}_k$ value or the higher the value of $\alpha_k$, the more sensitive the view count is to variations in the meta-level feature. From Fig. 3.2, the meta-level features with the highest sensitivities are: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, video category, title length, and number of upper-case letters in the title, respectively. Notice that all these methods have the first day view count and number of subscribers as the most sensitive meta-level features, as expected. The FFNN and Bagged MARS, however, do not have the contrast of the video thumbnail as the third most sensitive meta-level feature, unlike the other algorithms. This arises because the learning method and learning rate of each algorithm differ, which leads to differences in the estimated meta-level feature sensitivity. However, as we can see from Fig. 3.2, the view count of a video is dependent on these eight meta-level features, with the first day view count and number of subscribers being the most sensitive features.

As expected, Fig. 3.2 shows that if the first day view count is high then the associated view count 14 days after the video is posted will be high. Additionally, if there is a large number of subscribers to the channel that posted the video, then the associated view count after 14 days is also expected to be large. As expected, the properties of the title and keywords also contribute to the view count of the video, however with less sensitivity than the thumbnail of the video. Therefore, to increase the view count of a video it is vital to increase the number of subscribers, and focus on the quality of the Thumbnail used. A surprising result is that the differences in the sensitivity of the view count to changes in these meta-level features are negligible across the three popularity classes of videos (i.e. highly popular, popular, and unpopular as defined in Table 3.8). Therefore, regardless of the expected popularity of a video, a channel owner should focus on maximizing the number of subscribers and the quality of the thumbnail to increase the associated view count of a video.

²⁷ Refer to Sec. 3.5.2.
²⁸ Refer to Sec. 3.5.2.
²⁹ The normalization is with respect to the highest value among the computed $\mathrm{SSD}_k$.

[Figure 3.2: bar chart of normalized feature sensitivity (0 to 1) for meta-level features $x^{(1)}$ to $x^{(8)}$; series: ELM, Bagged MARS, CIRF, HSIC-Lasso, FFNN.]
Figure 3.2: Sensitivity of the meta-level features computed using the sum of squares derivatives $\mathrm{SSD}_k$ (3.3) for the ELM, Bagged MARS, CIRF, and FFNN, and the associated coefficient of the centered Gram matrix for the HSIC-Lasso using the dataset $D$ defined in Sec. 3.2.3. The meta-level features $k = 1$ to $k = 8$ are associated with: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, video category, title length, and number of upper-case letters in the title, respectively. Similar results are obtained for highly popular, popular, and unpopular videos as defined in Table 3.8.

Predicting the view count of YouTube videos

In this section we illustrate how machine learning methods can be used to predict the view count of a YouTube video.
The machine learning methods used for prediction include: the Extreme Learning Machine (3.1), Feed-Forward Neural Network [85], Lasso, Relaxed Lasso [86], Conditional Inference Random Forest [84], Boosted Generalized Additive Model [87, 88], Bagged MARS using gCV Pruning [83], and Generalized Linear Model with Stepwise Feature Selection using the Akaike information criterion. For each method, the predictive performance and the top five highest sensitivity meta-level features are provided.

To perform the analysis we train each model using an identical 10-fold cross validation method with the dataset $D = \{(x_i, v_i)\}_{i=1}^{N}$ with all the meta-level features included. The predictive performance of the machine learning methods is evaluated using the root-mean-square error (RMSE) and the $R^2$ (i.e. coefficient of determination). Note that for both training and evaluation the view count is pre-processed to be on the log scale (i.e. if the view count is $10^6$, the associated label is $v_i = 6$).

The predictive performance and the top five highest sensitivity meta-level features of the machine learning methods are provided in Table 3.1. In Table 3.1 the meta-level feature numbers are identical to those defined in Fig. 3.2. As seen from Table 3.1, the ELM has the lowest RMSE of 0.44, which is comparable to the RMSE of the Conditional Inference Random Forest and Feed-Forward Neural Network, which have 0.47 and 0.48 respectively. The $R^2$ ³⁰ values of the ELM, Feed-Forward Neural Network and Conditional Inference Random Forest are also comparable, with values of 0.77, 0.79, and 0.80. Therefore any of these methods could be used to estimate the view count of a YouTube video. A key question is which of the meta-level features $x^{(k)}$ are most sensitive across these machine learning methods.
As seen from the results in Table 3.1, the top two most important features are the first day view count and the number of subscribers, and the majority of methods suggest that the number of Google hits is also an important meta-level feature. Interestingly the Conditional Inference Random Forest, Boosted Generalized Additive Model, and the Bagged MARS using gCV Pruning do not consider the number of Google hits in the top five most sensitive features and instead use the video category. This is consistent with the result that videos in the “Music” category are the most viewed on YouTube, followed by “Entertainment” and “People and Blogs”. Only the Bagged MARS using gCV Pruning considers the meta-level features of title length and number of upper-case letters in the title to be in the five most sensitive features. This result suggests that the number of Google hits associated with the title significantly contributes to the video’s popularity; however, the view count is not very sensitive to the specific length and number of upper-case letters in the title. Therefore, when performing meta-level feature optimization for a video a user should focus on the meta-level features of: first day view count, number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, and video category.

To estimate the view count of an unpublished video (a video that is about to be posted for the first time) we cannot utilize the most sensitive meta-level feature of the machine learning algorithms, which is the first day view count. Is it still possible to estimate the view count with the remaining meta-level features? To answer this question we compare the performance of the ELM using the 28 meta-level features with the view count on the first day removed. Fig. 3.3a shows the predicted view count of the ELM trained using 29 meta-level features, and Fig. 3.3b shows the predicted view count using the 28 meta-level features. As expected, Fig. 3.3 illustrates that the predictive accuracy of the ELM decreases if the view count on the first day is removed. Though there is a drop in the predictive accuracy of the ELM trained using the 28 meta-level features, it still retains sufficient predictive accuracy to aid in the selection of the meta-level features to increase the view count of a video. Note that similar performance results are obtained for the Feed-Forward Neural Network and Conditional Inference Random Forest when performing the prediction with the first day view count removed. Therefore, these prediction methods can be used to optimize the meta-level features of unpublished videos, where the optimization can focus on the meta-level features of: number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, and video category.

³⁰ The $R^2$ is a popular measure of the goodness of fit. It is given by the ratio of the variation (measured using sum of squares) explained using a model to the total variation in the data. The important property of $R^2$ is that it is bounded between 0 and 1. A high value of $R^2$ implies that the variation in the data can be explained using the model in question.

Table 3.1: Performance and feature sensitivity

Method                         RMSE   R^2    Features x^(k)
Extreme Learning Machine       0.44   0.77   1 2 3 4 5
Feed-Forward Neural Network    0.48   0.79   2 1 5 3 6
Lasso                          0.53   0.66   1 2 3 4 5
Relaxed Lasso                  1.14   0.64   1 2 3 4 5
CI Random Forest               0.47   0.80   1 2 6 4 5
Boosted GAM                    0.50   0.77   1 2 6 4 5
Bagged MARS                    0.50   0.77   1 2 6 7 8
GLM with Feature Selection     0.53   0.67   1 2 3 4 5

3.2.4 Sensitivity to meta-level optimization

Sec. 3.2.3 described how meta-level features (e.g. number of subscribers) can be used to estimate the popularity of a video.
In this section, we analyze how changing meta-level features, after a video is posted, impacts the user engagement of the video. Meta-level data plays a significant role in the discovery of content, through YouTube search, and in video recommendation, through the YouTube related videos. Hence, “optimizing” the meta-level data to enhance the discoverability and user engagement of videos is of significant importance to content providers. Therefore, in this section, we study how optimizing the title, thumbnail or keywords affects the view count of YouTube videos.

[Figure 3.3: two panels, (a) and (b).]
Figure 3.3: Predicted view count using an ELM, with the actual view count indicated by black dots and the predicted view count indicated by gray dots. Fig. 3.3a illustrates the results for a trained ELM (3.1) using all 29 meta-level features defined in Sec. 3.2.3. Fig. 3.3b illustrates the results for a trained ELM (3.1) using the 28 meta-level features (first day view count removed from the 29 meta-level features defined in Sec. 3.2.3).

To perform the analysis, we utilize the dataset (see Table 3.9 in Sec. 3.5.1), and remove any time-sensitive videos. Time-sensitive videos are those videos that are relevant for a short period of time, and the popularity of such videos cannot be improved by optimization. We removed the following two time-sensitive categories of videos: “politics” and “movies and trailers”. In addition, we removed videos (from other categories) which contained the following keywords in their video meta-data: “holiday”, “movie”, or “trailers”. For example, holiday videos are not watched frequently during off-holiday times.

Let $\hat{\tau}_i$ be the time at which the meta-level optimization was performed on video $i$ and let $s_i$ denote the corresponding sensitivity. We characterize the sensitivity to meta-level optimization as follows:

$$s_i = \frac{\left( \sum_{t=\hat{\tau}_i}^{\hat{\tau}_i + 6} v_i(t) \right) / 7}{\left( \sum_{t=\hat{\tau}_i - 6}^{\hat{\tau}_i} v_i(t) \right) / 7}. \tag{3.5}$$

The numerator of (3.5) is the mean value of the view count over the 7 days after optimization.
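As a worked illustration of the ratio in (3.5), the short Python sketch below computes $s_i$ from a list of daily view counts. The function name and the convention that `views[t]` holds $v_i(t)$ with `tau` the index of the optimization day are assumptions made for this example. As in (3.5), both 7-day windows include the optimization day itself.

```python
def optimization_sensitivity(views, tau):
    """Sensitivity s_i of (3.5) for one video.

    views: sequence of daily view counts v_i(t), indexed by day t.
    tau:   index of the day on which the meta-level optimization was made.
    """
    if tau - 6 < 0 or tau + 6 >= len(views):
        raise ValueError("need 6 days of data on each side of tau")
    after = sum(views[tau:tau + 7]) / 7        # days tau, ..., tau + 6
    before = sum(views[tau - 6:tau + 1]) / 7   # days tau - 6, ..., tau
    if before == 0:
        raise ValueError("no views in the window before optimization")
    return after / before


# Example: 10 views per day up to and including the optimization day,
# then 20 views per day afterwards.
daily = [10] * 7 + [20] * 6
print(optimization_sensitivity(daily, tau=6))  # 13/7, roughly 1.857
```

A value above one means the mean daily view count in the window after the change exceeds that in the window before it.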
Similarly, the denominator of (3.5) is the mean value of the view count over the 7 days before optimization. The results are provided in Table 3.2 for optimization of the title, thumbnail, and keywords. As shown in Table 3.2, at least half of the optimizations resulted in an increase in the popularity of the video. In addition, compared to videos with no optimization, the meta-level optimization improves the probability of increased popularity by roughly 45% in relative terms. This is consistent with the YouTube and BBTV recommendation to optimize meta-level features to increase user engagement. However, some classes of videos benefit from optimizing meta-data much more than others. One possible explanation is that small user channels, which have a limited number of videos and subscribers, gain more from optimizing the meta-level data of a video than hugely popular channels such as Sony or CNN. Highly popular channels (e.g. Sony or CNN) upload videos frequently (even multiple times daily), so video content becomes irrelevant quickly. The question of which class of users gains by optimizing the meta-level features of the video is part of our ongoing research.

Table 3.2: Sensitivity to meta-level optimization. The table shows that in at least 50% of the videos, meta-level optimization resulted in an increase in the popularity of the video.

Optimization        Fraction of videos with increased popularity
Title change        0.52
Thumbnail change    0.533
Keyword change      0.50
No change³¹         0.35

³¹ “No change” was obtained by randomly selecting $10^4$ videos which performed no optimization and evaluating $s_i$ 3 months from the date of posting the video.

Table 3.3 summarizes the impact of various meta-level changes on the three major sources of YouTube traffic, i.e. YouTube search³², YouTube promoted³³ and traffic from related videos³⁴. For those videos where meta-level optimization increased the popularity (the ratio of the mean value of the views after and before optimization is higher than one), we computed the sensitivity for various traffic sources as in (3.5). Table 3.3 summarizes the median statistics of the ratio of the traffic sources before and after optimization. The title optimization resulted in significant improvement (approximately 25%) from the YouTube search. Similarly, thumbnail optimization improved traffic from the related videos and keyword optimization resulted in increased traffic from related and promoted videos.

³² Video views from YouTube search results.
³³ Video views from an unpaid YouTube promotion.
³⁴ Video views from a related video listing on another video watch page.

Table 3.3: Sensitivity of various traffic sources to meta-level optimization, for videos with increased popularity. The title optimization resulted in significant improvement (approximately 25%) from the YouTube search. Similarly, thumbnail optimization improved traffic from the related videos and keyword optimization resulted in increased traffic from related and promoted videos.

Optimization        Related   Promoted   Search
Title change        1.13      NAᵃ        1.24
Thumbnail change    1.20      NAᵃ        1.125
Keyword change      1.10      1.16       1

ᵃ Not enough data available: a binomial test to check for the true hypothesis with a 95% confidence interval requires that the sample size, $n$, should be at least $\left( \frac{1.96}{0.04} \right)^2 p(1-p)$. With $p = 0.5$, $n > 600$.

Summary: This section studied the sensitivity of view count with respect to meta-level optimization. The main finding is that meta-level optimization increased the popularity of videos in the majority of cases. In addition, we found that optimizing the title improved traffic from YouTube search.
Similarly, thumbnail optimization improved traffic from the related videos and keyword optimization resulted in increased traffic from related and promoted videos.

3.3 Social interaction of the channel with YouTube users

In this section, we use time series analysis methods to determine how the social interaction of a YouTube channel with its viewers affects the view count dynamics. This section is organized as follows. Sec. 3.3.1 characterizes the causal relationship between the subscribers and view count of a channel using the Granger causality test. In Sec. 3.3.2, we investigate how the popularity of the channel is affected by the scheduling dynamics of the channel. When channels deviate from a regular upload schedule, the view count and the comment count of the channel increase. In Sec. 3.3.3, we address the problem of separating the view count dynamics due to virality (view count resulting from subscribers), migration (views from non-subscribers) and exogenous events using a generalized Gompertz model. Finally, in Sec. 3.3.4, we study the effect of video playlists on the view count. The main conclusion outlined in Sec. 3.3.4 is that the dynamics of the view count in a playlist are highly correlated and that the effect of “migration” causes the view count of videos to decrease even with an increase in the subscriber count.

3.3.1 Causality between subscribers and view count in YouTube

In this section the goal is to detect the causal relationship between subscriber and view counts and how it can be used to estimate the next day subscriber count of a channel. The results are of interest for measuring the popularity of a YouTube channel. Fig. 3.4 displays the subscriber and view count dynamics of a popular movie trailer channel in YouTube. It is clear from Fig. 3.4 that the subscribers “spike” with a corresponding “spike” in the view count.
In this section we model this causal relationship of the subscribers and view count using the Granger causality test from the econometric literature [89].

The main idea of Granger causality is that if the value(s) of a lagged time-series can be used to predict another time-series, then the lagged time-series is said to “Granger cause” the predicted time-series. To formalize the Granger causality model, let $s^j(t)$ denote the number of subscribers to a channel $j$ on day $t$, and $v_i^j(t)$ the corresponding view count for a video $i$ on channel $j$ on day $t$. The total number of videos in a channel on day $t$ is denoted by $I(t)$. Define

$$\hat{v}^j(t) = \sum_{i=1}^{I(t)} v_i^j(t), \tag{3.6}$$

as the total view count of channel $j$ at time $t$. The Granger causality test involves testing if the coefficients $b_k^j$ are non-zero in the following equation, which models the relationship between subscribers and view counts:

$$s^j(t) = \sum_{k=1}^{n_s} a_k^j s^j(t-k) + \sum_{k=1}^{n_v} b_k^j \hat{v}^j(t-k) + \varepsilon^j(t), \tag{3.7}$$

where $\varepsilon^j(t)$ represents normal white noise for channel $j$ at time $t$. The parameters $\{a_k^j\}_{k=1,\dots,n_s}$ and $\{b_k^j\}_{k=1,\dots,n_v}$ are the coefficients of the AR model in (3.7) for channel $j$, with $n_s$ and $n_v$ denoting the lags for the subscriber and view count time series respectively. If the time-series $D^j = \{s^j(t), \hat{v}^j(t)\}_{t \in \{1,\dots,T\}}$ of a channel $j$ fits the model (3.7), then we can test for a causal relationship between subscribers and view count. In equation (3.7), it is assumed that $|a_k^j| < 1$, $|b_k^j| < 1$ for stationarity. The causal relationship can be formulated as a hypothesis testing problem as follows:

$$H_0 : b_1^j = \cdots = b_{n_v}^j = 0 \quad \text{vs.} \quad H_1 : \text{at least one } b_k^j \neq 0. \tag{3.8}$$

The rejection of the null hypothesis, $H_0$, implies that there is a causal relationship between subscriber and view counts.

First, the Box-Ljung test [90] is used to evaluate the quality of the model (3.7) for the given dataset $D^j$. If satisfied, then the Granger causality hypothesis (3.8) is evaluated using the Wald test [91].
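The Wald test of hypothesis (3.8) for the AR model (3.7) can be illustrated with ordinary least squares in NumPy. The sketch below is an assumption-laden illustration, not the code used for the results in this chapter: it fits (3.7) without an intercept, uses the asymptotic chi-square form of the Wald statistic with hard-coded 5% critical values, and omits the Box-Ljung residual check. The function name `granger_wald` is hypothetical.

```python
import numpy as np

# 5% critical values of the chi-square distribution for 1 to 5 degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def granger_wald(s, v, ns=3, nv=3):
    """Wald test of H0: b_1 = ... = b_nv = 0 in the AR model (3.7).

    s: subscriber time series s(t); v: channel view count series v_hat(t).
    Returns (wald_statistic, reject_H0_at_5_percent).
    """
    if nv not in CHI2_CRIT_05:
        raise ValueError("critical value only tabulated for nv = 1..5")
    s = np.asarray(s, dtype=float)
    v = np.asarray(v, dtype=float)
    p = max(ns, nv)
    # Regressors: ns lags of s followed by nv lags of v, for t = p, ..., T-1.
    X = np.array([[s[t - k] for k in range(1, ns + 1)] +
                  [v[t - k] for k in range(1, nv + 1)]
                  for t in range(p, len(s))])
    y = s[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # covariance of the estimates
    b = coef[ns:]                                    # estimated b_1, ..., b_nv
    wald = float(b @ np.linalg.solve(cov[ns:, ns:], b))
    return wald, wald > CHI2_CRIT_05[nv]
```

Rejecting $H_0$ in this sketch corresponds to the statement that the lagged view counts help predict the subscriber count beyond what the lagged subscriber counts already explain.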
If both hypothesis tests pass then we can conclude that the time series $D^j$ satisfies Granger causality; that is, the previous day subscriber and view count have a causal relationship with the current subscriber count.

A key question prior to performing the Granger causality test is what percentage of channels in the YouTube dataset in Sec. 3.5.1 satisfy the AR model in (3.7). To perform this analysis we apply the Box-Ljung test at a confidence level of 0.95 (significance level 0.05). First, we need to select $n_s$ and $n_v$, the number of lags for the subscribers and view count time series. For $n_s = n_v = 1$, we found that only 20% of the channels satisfy the model (3.7). When $n_s$ and $n_v$ are increased to 2, the fraction of channels satisfying the model increases to 63%. For $n_s = n_v = 3$, we found that 91% of the channels satisfy the model (3.7) at a confidence level of 0.95 (significance level 0.05). Hence, in the analysis below we select $n_s = n_v = 3$. It is interesting to note that the mean value of the coefficients $b_k^j$ decreases as $k$ increases, indicating that older view counts have less influence on the subscriber count. Similar results also hold for the coefficients $a_k^j$. Hence, as expected, the previous day subscriber count and the previous day view count most influence the current subscriber count.

The next key question is whether there exists a causal relationship between the subscriber dynamics and the view count dynamics. This is modeled using the hypothesis in (3.8). To test (3.8) we use the Wald test at a confidence level of 0.95 (significance level 0.05) and found that for approximately 55% of the channels that satisfy the AR model (3.7), the view count “Granger causes” the current subscriber count. Interestingly, if different channel categories are accounted for then the percentage of channels that satisfy Granger causality varies widely, as illustrated in Table 3.4.
For example, 80% of the Entertainment channels satisfy Granger causality while only 40% of the Food channels do. These results illustrate that channel owners should not only maximize their subscriber count, but also upload new videos or increase the views of old videos in order to increase their channel's popularity (i.e. by increasing their subscriber count). Additionally, since the analysis in Sec. 3.2 shows that the view count of a posted video is sensitive to the number of subscribers of the channel, increasing the number of subscribers will also increase the view count of the videos uploaded by the channel owner.

Category (a)     Fraction
Gaming           0.60
Entertainment    0.80
Food             0.40
Sports           0.67

(a) YouTube assigns a category to videos, rather than channels. The category of the channel was obtained as the majority category of all the videos uploaded by the channel.

Table 3.4: Fraction of channels satisfying the hypothesis: View count “Granger causes” subscriber count, split according to category.

[Figure 3.4: two panels plotting view count and subscribers against time in days.]
Figure 3.4: View count and subscribers for the popular movie trailer channel VISOTrailers. The Granger causality test indicates that view count “Granger causes” subscriber count, with a p-value of $5 \times 10^{-8}$.

3.3.2 Scheduling dynamics in YouTube

In this section, we investigate the scheduling dynamics of YouTube channels. We find the interesting property that for popular gaming YouTube channels with a dominant upload schedule, deviating from the schedule increases the views and the comment counts of the channel.

Creator Academy (footnote 35), in its best practice section, recommends uploading videos on a regular schedule to get repeat views.
The reason for a regular upload schedule is to increase user engagement and to rank higher in the YouTube recommendation list. However, we show in this section that going “off the schedule” can be beneficial for a gaming YouTube channel with a regular upload schedule, in terms of the number of views and the number of comments.

Footnote 35: YouTube website for helping with channels.

From the dataset, we selected the channels with a dominant upload schedule as follows. The dominant upload schedule was identified by taking the periodogram of the upload times of the channel and then comparing its highest value to its next highest value. If this ratio is greater than 2, we say that the channel has a dominant upload schedule. From the dataset containing 25 thousand channels, only 6500 channels have a dominant upload schedule. Some channels, particularly those that contain large amounts of copied videos such as trailers and movie/TV snippets, upload videos on a daily basis. These have been removed from the above analysis. The expectation is that by doing so we concentrate on those channels that contain only user-generated content.

We found that channels with gaming content account for 75% of the 6500 channels with a dominant upload schedule (footnote 36), and the main tags associated with the videos were “game”, “gameplay” and “videogame” (footnote 37). We computed the average views when the channel goes off the schedule and found that, on average, when the channel goes off schedule it gains views 97% of the time and gains comments 68% of the time. This suggests that channels with “gameplay” content have a periodic upload schedule and benefit from going off the schedule.

3.3.3 Modeling the view count dynamics of videos with exogenous events

Several time-series analysis methods have been employed in the literature to model the view count dynamics of YouTube videos.
These include ARMA time series models [24], multivariate linear regression models [25], hidden Markov models [92], normal distribution fitting [93], and parametric model fitting [26, 27]. Though all these models provide an estimate of the view count dynamics of videos, we are interested in segmenting the view count dynamics of a video into the parts resulting from subscribers, non-subscribers and exogenous events. Exogenous events are due to video promotion on other social networking platforms such as Facebook, or the video being referenced by a popular news organization or celebrity on Twitter. This segmentation is motivated by two reasons. First, removing the view count dynamics due to exogenous events provides an accurate estimate of the sensitivity of the meta-level features in Sec. 3.2. Second, extracting the view count resulting from exogenous events gives an estimate of the efficiency of video promotion.

The view count dynamics of popular videos in YouTube typically show an initial viral behaviour, due to subscribers watching the content, and then a linear growth resulting from non-subscribers. The linear growth is due to new users migrating from other channels or due to interested users discovering the content either through search or recommendations (we call this phenomenon migration, similar to [26]). Hence, without exogenous events, the view count dynamics of a video due to subscribers and non-subscribers can be estimated using piecewise linear and non-linear segments. In [26], it is shown that a Gompertz time series model can model the view count dynamics from subscribers and non-subscribers if no exogenous events are present. In this chapter, we generalize the model in [26] to account for views from exogenous events.

Footnote 36: This could also be due to the fact that gaming videos account for 70% of the videos in the dataset.
Footnote 37: We used a topic model to obtain the main tags.
It should be noted that classical change-point detection methods [94] cannot be used here, as the underlying distribution generating the view count is unknown.

To account for the view count dynamics introduced by exogenous events we use the generalized Gompertz model given by:

$$\bar{v}_i(t) = \sum_{k=0}^{K_{\max}} w_i^k(t)\, u(t - t_k), \qquad w_i^k(t) = M_k\left(1 - e^{-\eta_k\left(e^{b_k (t - t_k)} - 1\right)}\right) + c_k (t - t_k), \tag{3.9}$$

where $\bar{v}_i(t)$ is the total view count for video $i$ at time $t$, $u(\cdot)$ is the unit step function, $t_0$ is the time the video was uploaded, $t_k$ with $k \in \{1, \dots, K_{\max}\}$ are the times associated with the $K_{\max}$ exogenous events, and $w_i^k(t)$ are Gompertz models which account for the view count dynamics from uploading the video and from the exogenous events. In total there are $K_{\max} + 1$ Gompertz models, each having parameters $t_k, M_k, \eta_k, b_k$. $M_k$ is the maximum number of requests, not including migration, for an exogenous event at $t_k$; $\eta_k$ and $b_k$ model the initial growth dynamics from event $t_k$; and $c_k$ accounts for the migration of other users to the video. In (3.9) the parameters $\{M_k, \eta_k, b_k\}_{k=0}$ are associated with the subscriber views when the video is initially posted, the parameters $\{t_k, M_k, \eta_k, b_k\}_{k=1}^{K_{\max}}$ are associated with views introduced by exogenous events, and the views introduced by migration are given by $\{c_k\}_{k=0}^{K_{\max}}$. Each Gompertz model in (3.9) captures the initial viral growth when the video first becomes available to users, followed by a linearly increasing growth resulting from user migration to the video.

The parameters $\theta_i = \{a_k, t_k, M_k, \eta_k, b_k, c_k\}_{k=0}^{K_{\max}}$ in (3.9) can be estimated by solving the following mixed-integer non-linear program:

$$\theta_i \in \arg\min \left\{ \sum_{t=0}^{T_i} \left(\bar{v}_i(t) - v_i(t)\right)^2 + \lambda K \right\}, \qquad K = \sum_{k=0}^{K_{\max}} a_k, \quad a_k \in \{0, 1\}, \quad k \in \{0, \dots, K_{\max}\}, \tag{3.10}$$

where $v_i(t)$ denotes the measured view count of video $i$, $T_i$ is the time index of the last recorded view count of video $i$, and $a_k$ is a binary variable equal to 1 if an exogenous event is present at $t_k$.
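To make the roles of the parameters in (3.9) and the objective in (3.10) concrete, the sketch below evaluates both for a fixed set of active events. It is only an illustration under stated assumptions: the parameter values are hypothetical (they are not fitted values from the dataset), the dictionary keys and function names are invented for this example, and $K$ is taken to be the number of terms passed in rather than being optimized over the binary variables $a_k$.

```python
import math

def gompertz_views(t, events):
    """Evaluate (3.9): sum of Gompertz-plus-linear terms that are active at time t.

    events: list of dicts with keys t_k, M, eta, b, c (hypothetical layout).
    """
    total = 0.0
    for e in events:
        dt = t - e["t_k"]
        if dt >= 0:  # unit step u(t - t_k)
            # cap the exponent so math.exp cannot overflow for large b * dt
            inner = math.exp(min(e["b"] * dt, 700.0)) - 1.0
            growth = 1.0 - math.exp(-e["eta"] * inner)
            total += e["M"] * growth + e["c"] * dt
    return total

def objective(measured, events, lam):
    """Squared error plus penalty of (3.10), with K = number of active terms."""
    err = sum((gompertz_views(t, events) - v) ** 2 for t, v in enumerate(measured))
    return err + lam * len(events)

# Hypothetical parameters: an upload term at t_0 = 0 and one exogenous event at t_1 = 41.
events = [
    {"t_k": 0, "M": 2910.0, "eta": 1.0, "b": 0.5, "c": 13.0},
    {"t_k": 41, "M": 7174.0, "eta": 1.0, "b": 2.0, "c": -11.0},
]
curve = [gompertz_views(t, events) for t in range(100)]
print(objective(curve, events, lam=5.0))  # zero fitting error, so only the penalty remains
```

Because the terms add, the slope of the total view count after an event is the sum of the active $c_k$; the negative $c_1$ above is chosen purely so that the combined slope drops after the event.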
Note that (3.10) is a difficult optimization problem due to the presence of the binary variables $a_k$ [95]. In the YouTube social network, an exogenous event causes a large and sudden increase in the number of views; however, as seen in Fig. 3.5, a few days after the exogenous event occurs the views result only from migration (i.e. a linear increase in total views). Assuming that each exogenous event is followed by a linear increase in views, we can estimate the total number of exogenous events $K_{\max}$ present in a given time series by first using a segmented linear regression method, and then counting the number of connected linear segments with a slope less than $c_{\max}$. The parameter $c_{\max}$ is the maximum slope for the views to be considered to result from viewer migration. Plugging $K_{\max}$ into (3.10) results in a non-linear program for the unknowns $\{t_k, M_k, \eta_k, b_k, c_k\}_{k=0}^{K_{\max}}$. This optimization problem can be solved using sequential quadratic programming techniques [96].

To illustrate how the Gompertz model (3.9) can be used to detect exogenous events, we apply (3.9) to the view count dynamics of a video that contains only a single exogenous event. Fig. 3.5 displays the total view count of a video where an exogenous event occurs $t = 41$ days after the video is posted (i.e. $t_1 = 41$ in (3.9)) (footnote 38). The initial increase in views for $t \le 7$ days results from the 2910 subscribers of the channel viewing the video. For $7 \le t \le 41$, other users that are not subscribed to the channel migrate to view the video at an approximately constant rate of 13 views/day. At $t = 41$, an exogenous event occurs, causing an increase in the views per day. The difference in viewers resulting from the exogenous event is 7174. For $t \ge 43$, the views result primarily from the migration of users, at approximately 2 views/day. Hence, using the generalized Gompertz model (3.9) we can differentiate between subscriber views, views caused by exogenous events, and views caused by migration.

Footnote 38: Due to privacy reasons, we cannot detail the specific event. Some of the reasons for a sudden increase in the popularity of a video include another user on YouTube mentioning the video, which encourages viewers from that channel to view the video and results in a sudden increase in the number of views. Another possibility is that the channel owner or a YouTube partner like BBTV carried out significant promotional initiatives on other social media sites such as Twitter, Facebook, etc. to promote the channel or video.

[Figure 3.5: total views against day (0 to 200), showing the measured view count and the fitted Gompertz model.]
Figure 3.5: Due to an exogenous event on day 41, there is a sudden increase in the number of views. The total view count fitted by the Gompertz model $\bar{v}_i(t)$ in (3.9) is shown in black, with the virality (exponential) and migration (linear) components illustrated by the dotted red lines.

3.3.4 Video playthrough dynamics

One of the most popular types of YouTube video sequences is the video game “playthrough”. A video game playthrough is a sequence of videos in which each video has a relaxed and casual focus on the game being played and typically contains commentary from the user presenting the playthrough. Unlike YouTube channels such as CNN, BBC, and CBC, in which each new video can be considered independent of the others, in a video playthrough the future view counts of videos are influenced by the previously posted videos in the playthrough. To illustrate this effect we consider a video playthrough for the game “BioShock Infinite”, a popular video game released in 2013. The channel, popular for hosting such video playthroughs, contains close to 4500 videos and 180 video playthroughs. The channel is highly popular and has garnered a combined view count close to 100 million views, with 150 thousand subscribers, over a period of 3 years. Fig.
3.6 illustrates that the early view count dynamics are highly correlated with the view count dynamics of future videos. Both the short-term view count and the long-term migration of future videos in the playthrough decrease after the initial video in the playthrough is posted. Two explanations are possible: either the viewers purchase the game, or the viewers leave as the subsequent videos become repetitive as a result of game quality or video commentary quality. A unique effect with video playthroughs is that, though the number of subscribers to the channel hosting the videos in Fig. 3.6 increases over the 600-day period, the linear migration is still maintained after the first 50 days following the publication of the playthrough. Additionally, the slope of the migration is related to the early total view count, as illustrated in Fig. 3.6b.

[Figure 3.6, panel (a): view count (log scale) against time in days (0 to 600) for video indices 1, 5, 10, 15, 20 and 25; panel (b): migration rate and virality rate (log scale) against video part number.]

(a) Actual and predicted view count of the playthrough. We plot the 1st, 5th, 10th, 15th, 20th and 25th videos from the playlist containing 25 videos. In the legend, Exp and Pred correspond to the actual value and the value predicted using (3.9), respectively. The figure shows that the view count decreases for subsequent videos in the playlist.

(b) The virality rate specifies the early views due to subscribers, and the migration rate (in units of views/1000 days) specifies the subsequent linear growth due to non-subscribers.

Figure 3.6: Actual and predicted view count of a playthrough containing 25 YouTube videos for the game “BioShock Infinite”. The predictions are computed by fitting a modified Gompertz model (3.9) to the measured view count for each video in the playthrough.

3.4 Closing remarks

In this chapter, we conducted a data-driven study of YouTube based on a large dataset (see Sec. 3.5.1 for details).
First, by using several machine learning methods, we investigated the sensitivity of the view counts of videos to the videos' meta-level features. It was found that the most important meta-level features are, in order: first-day view count, number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, video category, title length, and number of upper-case letters in the title. Additionally, optimizing the meta-data after the video is posted improves the popularity of the video. The social dynamics (the interaction of the channel with YouTube users) also affect the popularity of the channel. Using the Granger causality test, we showed that the view count has a causal effect on the subscriber count of the channel. A generalized Gompertz model was also presented, which allows the view count dynamics of a video to be classified as resulting from subscribers, migration, or exogenous events. This is an important model, as it allows the views to be categorized as resulting from the video itself or from exogenous events which bring viewers to the video. The final result was a study of the upload scheduling dynamics of gaming channels in YouTube. It was found that going “off schedule” can actually increase the popularity of a channel.

3.5 Supplementary material

3.5.1 Description of YouTube dataset

This chapter uses the dataset provided by BBTV. The dataset contains daily samples of metadata of YouTube videos on the BBTV platform from April 2007 to May 2015, and has a size of around 200 gigabytes. Table 3.5 contains the details of the metadata collected for each YouTube channel and video. The dataset contains around 6 million videos spread over 25 thousand channels.
Table 3.6 shows the summary statistics of the videos present in the dataset.

Table 3.5: Metadata of YouTube channel and video

Channel             Video
Id                  Id
Name                Name
Start Date          Published Date
Title               Title
Video Count         Duration
View Count          View Count
Subscriber Count    Like Count
Comment Count       Dislike Count
Description         Comment Count
Topic Id            Tags
Thumbnail           Thumbnail
Sampling Time       Sampling Time
Banner              Category Id
Language            Average View Duration
                    Average View Time
                    Click Through Rate

Table 3.6: Dataset summary

Videos                                    6 million
Channels                                  26 thousand
Average number of videos (per channel)    250
Average age of videos                     275 days
Average number of views (per video)       10 thousand

Table 3.7 shows a summary of the categories of the videos present in the dataset. The dataset contains a large percentage of gaming videos. Fig. 3.7 shows the fraction of videos as a function of the age of the videos. There is a large fraction of videos uploaded within a year. Also, the dataset captures the exponential growth in the number of videos uploaded to YouTube. Similar to [26], we define three categories of videos based on their popularity: highly popular, popular, and unpopular. Table 3.8 gives a summary of the fraction of videos in the dataset belonging to each category. As can be seen from Table 3.8, the majority of the videos in the dataset belong to the popular category.

Category         Fraction
Gaming           0.69
Entertainment    0.07
Food             0.07
Music            0.035
Sports           0.017

Table 3.7: YouTube dataset categories (out of 6 million videos)

[Figure 3.7: density (log scale, $10^{-6}$ to $10^{-2}$) against age of videos (0 to 3000).]
Figure 3.7: The fraction of videos in the dataset as a function of the age of the videos. There is a significant percentage of newer videos (videos with lower age) compared to older videos. Hence, the dataset captures the exponential growth of the number of videos uploaded to YouTube.

A unique feature of the dataset is that it contains information about the “meta-level optimization”
for videos. The meta-level optimization is a change in the title, tags or thumbnail of an existing video in order to increase its popularity. BBTV markets a product that intelligently automates the meta-level optimization. Table 3.9 gives a summary of the statistics of the various meta-level optimizations present in the dataset.

Criteria                                     Fraction
Highly Popular (Total Views > $10^4$)        0.12
Popular (150 < Total Views < $10^4$)         0.67
Unpopular (Total Views < 150)                0.21

Table 3.8: Popularity distribution of videos in the dataset

Optimization        # Videos
Title change        21 thousand
Thumbnail change    13 thousand
Keyword change      21 thousand

Table 3.9: Optimization summary statistics

3.5.2 Background: Statistical learning algorithms

Multivariate Adaptive Regression Splines (MARS)

MARS is an adaptive method for regression. It uses two types of linear basis functions of the form:

$$(x - t)_+ = \begin{cases} x - t & \text{if } x > t, \\ 0 & \text{else} \end{cases} \qquad \text{and} \qquad (t - x)_+ = \begin{cases} t - x & \text{if } t > x, \\ 0 & \text{else.} \end{cases} \tag{3.11}$$

The functions are piecewise linear, with a knot at the value $t$. The two functions are called a reflected pair. The regression uses a collection of functions $C$ which contains the above functions at each of the observed values $x_{i,j}$:

$$C = \{(x_j - t)_+, (t - x_j)_+\}; \quad j = 1, \dots, N, \quad t \in \{x_{1,j}, \dots, x_{m,j}\}. \tag{3.12}$$

The MARS model is given by

$$f(x) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(x), \tag{3.13}$$

where $h_m(x)$ is a function from the library $C$ or a product of two or more functions from $C$. Given the $h_m$, the coefficients $\beta_m$ can be computed by minimizing the least-squares criterion. The functions $h_m$ are constructed as follows. Start with $h_0(x) = 1$. Given a model $\mathcal{M}$ (with $M$ functions), we add to the model the term of the form

$$\beta_{M+1} h_l(x) (x_j - t)_+ + \beta_{M+2} h_l(x) (t - x_j)_+, \quad h_l \in \mathcal{M},$$

that produces the largest decrease in the training error. The coefficients $\beta_{M+1}$ and $\beta_{M+2}$ can be obtained by least squares.

At the end of this forward pass (stopped using an appropriate error criterion), we have a large model which typically overfits the data, and hence we apply a backward deletion procedure.
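The forward step described above can be illustrated for a single feature. The sketch below is a simplified, hypothetical example rather than a full MARS implementation: starting from $h_0(x) = 1$, it tries each interior observed value as a knot, fits the intercept and the reflected pair (3.11) by least squares, and keeps the knot with the smallest training error. Products of basis functions and the backward deletion pass are not included, and all function names are invented for this illustration.

```python
def hinge_pair(x, t):
    """Reflected pair (x - t)+ and (t - x)+ from (3.11)."""
    return max(x - t, 0.0), max(t - x, 0.0)

def least_squares_rss(rows, y):
    """Fit y on the columns of rows via the normal equations; return the residual sum of squares."""
    p = len(rows[0])
    A = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
         + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(p)]
    for c in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p + 1):
                A[r][k] -= f * A[c][k]
    beta = [0.0] * p
    for r in reversed(range(p)):
        beta[r] = (A[r][p] - sum(A[r][k] * beta[k] for k in range(r + 1, p))) / A[r][r]
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(rows, y))

def best_knot(xs, ys):
    """One forward step from h0(x) = 1: choose the knot with the lowest training error."""
    lo, hi = min(xs), max(xs)
    best = None
    for t in sorted(set(xs)):
        if t == lo or t == hi:
            continue  # one member of the pair would be identically zero
        rows = [[1.0, *hinge_pair(x, t)] for x in xs]
        err = least_squares_rss(rows, ys)
        if best is None or err < best[1]:
            best = (t, err)
    return best

# Hypothetical data with a kink at x = 3.
xs = [float(x) for x in range(7)]
ys = [1.0 + 2.0 * max(x - 3.0, 0.0) for x in xs]
knot, err = best_knot(xs, ys)
print(knot, err)
```

Restricting candidate knots to interior observed values keeps the three columns linearly independent, so the normal equations remain solvable.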
For computational reasons, we use the generalized cross-validation criterion given by:

$$\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} \left(v_i - f_\lambda(x_i)\right)^2}{\left(1 - M(\lambda)/N\right)^2}. \tag{3.14}$$

The value $M(\lambda)$ is the effective number of parameters in the model. If there are $r$ independent basis functions and $K$ knots, then $M(\lambda) = r + 3K$.

Conditional Inference and Random Forest

Random forests, an extension of the tree learning algorithm, build a large collection of de-correlated trees and then average them. Algorithm 2 provides a brief description of how to construct a random forest. For more details, refer to [97].

Algorithm 2 Random forest
for b = 1 to B do
    Draw a bootstrap sample of size N from the training data.
    Grow a tree $T_b$ on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_{\min}$ is reached:
        • Select p features at random from the m features.
        • Pick the best variable among the p.
        • Split the node into two daughter nodes.
end for

The output of Algorithm 2 is a set of trees $\{T_b\}_{b=1}^{B}$. The output of a random forest is given by

$$f(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x). \tag{3.15}$$

The basic random forest construction tends to select variables that have many possible splits or many missing values.

The conditional inference random forest (CIRF) [84] uses a significance test procedure in order to select variables. In this work, we used the ctree function (footnote 39) of the partykit R package to implement the CIRF.

Regression

A regression model assumes that the output is a linear function of the features $X_1, X_2, \dots, X_m$, i.e.

$$Y = \alpha + \sum_{i=1}^{m} \beta_i X_i. \tag{3.16}$$

A Generalized Linear Model (GLM) is a generalization of the regression model in (3.16) given by

$$g(y) = \alpha + \sum_{i=1}^{m} \beta_i X_i, \tag{3.17}$$

where $g(\cdot)$ is called the link function. Common link functions include the logit and the log (whose inverses are the logistic and exponential functions). A Generalized Additive Model (GAM) is a generalization of the GLM and is given by

$$g(y) = \alpha + \sum_{i=1}^{m} f_i(X_i), \tag{3.18}$$

where the $f_i(\cdot)$ are unknown smooth functions.
The GLM and GAM can be estimated using the backfitting algorithm (Algorithm 9.1 in [97]). In the boosted GAM [87, 88], we estimate the model iteratively by adding the function most similar to the gradient of the likelihood with respect to the link function (refer to gradient boosting in [97]). The optimal number of functions for the boosted GAM is obtained through cross-validation.

Footnote 39: Please refer to https://cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf on the usage of ctree. Given the dataset D, as a dataframe in R, the following command trains a CIRF: ctree(viewcount ~ ., data = D)

Feature Selection: There are a number of approaches to variable selection in the regression models discussed above. Forward step-wise feature selection is a greedy strategy, wherein we start with a null model and sequentially add features that improve the prediction accuracy. In contrast, in backward step-wise feature selection, we start with the full model and delete the features that have the least effect on the fit. Refer to [97] for more details on feature selection for regression models. In the Generalized Linear Model with Stepwise Feature Selection using the Akaike information criterion (AIC), the final model selection, using either forward or backward feature selection, is done using the AIC.

Another popular method for feature selection, especially for high-dimensional data, is the Least Absolute Shrinkage and Selection Operator (LASSO). LASSO was originally introduced by Robert Tibshirani in 1996 [97]. LASSO is the solution to the following convex optimization problem:

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{N} \sum_{i=1}^{N} \left(Y_i - X_i' \beta\right)^2 + \lambda \|\beta\|_1. \tag{3.19}$$

The set of predictor variables selected by the LASSO estimator is denoted by

$$\mathcal{M}_\lambda = \{1 \le k \le m \mid \hat{\beta}_k \neq 0\}. \tag{3.20}$$

The $\ell_1$-penalty of the LASSO (3.19) has two effects: model selection and shrinkage estimation. On the one hand, a certain set of coefficients is set to zero and hence excluded from the selected model.
On the other hand, for all variables in the selected model $\mathcal{M}_\lambda$, the coefficients are shrunk towards zero compared to the least-squares solution ($\lambda = 0$).

In the Relaxed LASSO, the model selection and the shrinkage estimation are controlled by two separate parameters $\lambda$ and $\phi$, and the estimator is given by

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{N} \sum_{i=1}^{N} \left(Y_i - X_i' \{\beta \cdot 1_{\mathcal{M}_\lambda}\}\right)^2 + \phi \lambda \|\beta\|_1, \tag{3.21}$$

where $1_{\mathcal{M}_\lambda}$ is the indicator function on the set of variables $\mathcal{M}_\lambda \subset \{1, 2, \dots, m\}$.

Chapter 4
Interactive Advertisement Scheduling in Personalized Live Social Media

In this chapter, we consider the problem of interactive advertisement (ad) scheduling in personalized live social media. The popularity of live video streaming has grown sharply due to improved bandwidth for streaming and the ease of sharing User-Generated-Content (UGC) on internet platforms. In addition, with the advent of smartphones with high-quality cameras and the widespread deployment of 4G LTE based data services, personal online video streaming is growing in popularity and includes mobile applications such as Periscope and Meerkat. One of the primary motivations for users to generate content is that platforms like YouTube, Twitch, etc. allow users to generate revenue through advertising and royalties. A strong motivation to consider the problem of interactive ad scheduling in live online videos stems from the fact that ads are currently scheduled using passive techniques, namely periodic [11] and manual methods, and yet advertisement revenues are significant for social media companies (footnote 40).

In this chapter, we model the interest in the live media using a Markov chain. Viewers are more likely to engage with an ad if they are interested in the content of the video into which the ad is inserted. Hence, advertisers aim to schedule ads when the interest is high, so as to maximize advertisement revenue through viewer engagement (clicks on the ads). The interest in the content is not observed directly; however, noisy observations are obtained from the comments and likes of the viewers.
Hence, the problem of computing the optimal policy for scheduling ads on a live channel can be formulated as a multiple stopping time problem, where a decision maker wishes to stop at most $L$ times to maximize the cumulative reward.

Footnote 40: The revenue of Twitch, which deals with live video gaming, playthroughs of video games, and e-sport competitions, was around 3.8 billion for the year 2015, of which 77% was generated from advertisements.

Main results and Organization

This chapter is organized as follows. Section 4.1 formulates the multiple stopping time problem as a partially observed Markov decision process (POMDP); the POMDP formulation is natural in this context since we are dealing with a partially observed multi-state Markov chain with multiple actions ($L$ stops, continue). For a POMDP in general, and for the multiple stopping time problem in particular, it is intractable (PSPACE-complete) to numerically compute the optimal policy. Hence, we provide structural results on the optimal multiple stopping policy. Structural results impose sufficient conditions on the model to determine the structure of the optimal policy without brute-force computations; the main tools used are submodularity and stochastic dominance on the belief space of posterior distributions.

This chapter has the following main results:

1. Optimality of threshold policies: Section 4.2.3 provides the main structural result. Specifically, Theorem 4.2.1 asserts that the optimal policy is characterized by up to $L$ threshold curves $\Gamma_l$ on the unit simplex of Bayesian posteriors (belief states). To prove this result we use the monotone likelihood ratio (MLR) stochastic order, since it is preserved under conditional expectations. However, determining the optimal policy is non-trivial since the policy can only be characterized on a partially ordered set (more generally a lattice) within the unit simplex.
We modify the MLR stochastic order to operate on line segments within the unit simplex of posterior distributions. Such line segments form chains (totally ordered subsets of a partially ordered set) and permit us to prove that the optimal decision policy has a threshold structure. In addition, similar to [39], we show that the stopping sets (sets of belief states at which the decision maker stops) have a nested structure, i.e. $S^{l-1} \subset S^{l}$.

2. Optimal Linear Thresholds and their Estimation: For the threshold curves $\Gamma_l$, $l = 1, \dots, L$, Theorem 4.3.1 and Theorem 4.3.2 give necessary and sufficient conditions for the optimal linear hyperplane approximation (linear threshold policies) that preserves the structure of the optimal multiple stopping policy. Section 4.3 presents a simulation-based stochastic gradient algorithm (Algorithm 3) to compute the best linear threshold policies. The advantage of the simulation-based algorithm is that it is very easy to implement and is computationally efficient.

3. Application to Interactive Advertising in live social media: Figure 4.1 shows the schematic setup of the ad scheduling problem considered in this chapter. The problem of optimal scheduling of ads has been studied in the context of advertising on television; see [33], [34] and the references therein. However, scheduling ads on live online social media is different from scheduling ads on television in two significant ways [36]: i) there is a real-time measurement of viewer engagement (comments and likes on the content), and this viewer engagement provides a noisy measurement of the underlying interest in the content; ii) revenue is based on viewer engagement with the ads rather than on a pre-negotiated contract.

In Section 4.4, we use a real dataset from Periscope, a popular personalized live streaming application owned by Twitter, to optimally schedule multiple ads ($L > 1$) in a sequential manner so as to maximize the advertising revenue.
The numerical results show that the policy obtained through the multiple stopping framework outperforms conventional scheduling techniques.

[Figure 4.1: block diagram with the blocks Broadcaster (Stochastic Scheduler), Live Session (Live Video, Schedule Ads; actions Continue and Stop), Integrated Live Video (Interest ∼ $P, \pi_0$), Live Viewers, and Viewer Engagement.]
Figure 4.1: Block diagram showing the stochastic scheduling problem faced by the decision maker (broadcaster) in advertisement scheduling on live media. The setup is detailed in Section 4.4. The broadcaster wishes to schedule at most $L$ ads during the live session. To maximize advertisement revenue, the ads need to be scheduled when the interest in the content is high. The interest in the content cannot be measured directly, but noisy observations of the interest are obtained from the viewer engagement (viewer comments and likes) during the live session.

4.1 Sequential multiple stopping and stochastic dynamic programming

In this section, we formulate the optimal multiple stopping time problem as a POMDP. In Section 4.1.3, we present a solution to the POMDP using stochastic dynamic programming. This sets the stage for Section 4.2, where we analyze the structure of the optimal policy.

4.1.1 Optimal multiple stopping: POMDP formulation

Consider a discrete time Markov chain $X_t$ with state space $\mathcal{S} = \{1, 2, \dots, S\}$. Here, $t = 0, 1, \dots$ denotes discrete time. The decision maker receives a noisy observation $Y_t$ of the state $X_t$ at each time $t$. The decision maker wishes to stop at most $L$ times over an infinite horizon. The positive integer $L$ is chosen a priori. At each time the decision maker either stops or continues, and obtains a reward that depends on the current state of the Markov chain. The objective of the decision maker is to opportunistically select the best time instants to stop so as to maximize the cumulative reward.
This problem of stopping at most $L$ times sequentially so as to maximize the cumulative reward corresponds to a multiple stopping time problem with $L$ stops.

The multiple stopping time problem consists of the following components:

1. State Dynamics: The Markov chain has transition matrix $P$ and initial probability vector $\pi_0$, so

$$P(i, j) = \mathbb{P}(X_{t+1} = j \mid X_t = i), \qquad \pi_0(i) = \mathbb{P}(X_0 = i). \tag{4.1}$$

2. Observations: At each time instant $t$, the decision maker receives a noisy observation $Y_t$ of the state $X_t$. Denote the conditional probability of receiving observation $j \in \mathcal{Y}$ ($Y_t = j$) in state $i$ ($X_t = i$) by $B(i, j)$. Then,

$$B(i, j) = \mathbb{P}(Y_t = j \mid X_t = i) \quad \forall i \in \mathcal{S},\ j \in \mathcal{Y}. \tag{4.2}$$

3. Actions: At each time instant $t$, the decision maker chooses an action $u_t \in \mathcal{A} = \{1 \text{ (Stop)}, 2 \text{ (Continue)}\}$ to either stop or continue.

4. Reward: On choosing the stop action at time $t$ when there are $l$ stops remaining, the decision maker accrues a reward $r_l(X_t, a = 1)$, where $X_t$ is the state of the Markov chain at time $t$. Similarly, if the decision maker chooses to continue, it accrues $r_l(X_t, a = 2)$.

5. Scheduling Policy: The history available to the decision maker at time $t$ is

$$Z_t = \{\pi_0, Y_1, \dots, Y_t\}.$$

The scheduling policy $\mu$, at each time $t$, maps $Z_t$ to the action $u_t$, i.e. the action chosen at time $t$ is $u_t = \mu(Z_t)$.

Objective: For $l \in \{1, 2, \dots, L\}$, let $\tau_l$ denote the stopping time when there are $l$ stops remaining, i.e.

$$\tau_l = \inf\{t : t > \tau_{l+1},\ u_t = 1\}, \quad \text{with } \tau_{L+1} = 0. \tag{4.3}$$

For policy $\mu$ and initial belief $\pi_0$, the cumulative reward is:

$$J_\mu(\pi_0) = \mathbb{E}_\mu\left\{ \sum_{t=0}^{\tau_L - 1} \rho^t r_L(X_t, 2) + \rho^{\tau_L} r_L(X_{\tau_L}, 1) + \sum_{t=\tau_L + 1}^{\tau_{L-1} - 1} \rho^t r_{L-1}(X_t, 2) + \cdots + \rho^{\tau_1} r_1(X_{\tau_1}, 1) \,\middle|\, \pi_0 \right\}, \tag{4.4}$$

where the expectation is over the state dynamics and the observation distribution. In (4.4), $\rho \in [0, 1]$ denotes a user-defined economic discount factor (footnote 41).
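As a small worked illustration of the term inside the expectation in (4.4), the sketch below accumulates the discounted reward along one given trajectory of states and actions. It is a hypothetical example only: the reward function, the trajectory, and the names `cumulative_reward` and `reward` are invented here, states are indexed from 0, and the sketch does not enforce the convention $\tau_L > 0$ from (4.3).

```python
def cumulative_reward(states, actions, reward, L, rho):
    """Discounted reward of one trajectory, following the bracketed sum in (4.4).

    reward(l, x, u): reward with l stops remaining, state x, action u
    (u = 1 is stop, u = 2 is continue).
    """
    total, stops_left = 0.0, L
    for t, (x, u) in enumerate(zip(states, actions)):
        if stops_left == 0:
            break  # all L stops used: no further reward is accrued
        total += rho ** t * reward(stops_left, x, u)
        if u == 1:
            stops_left -= 1
    return total

def reward(l, x, u):
    """Hypothetical rewards: stopping pays more in higher states, continuing costs 0.5."""
    return float(l * (x + 1)) if u == 1 else -0.5

# Two stops (L = 2) taken at t = 1 and t = 3 along a four-step trajectory.
total = cumulative_reward(states=[0, 1, 1, 0], actions=[2, 1, 2, 1],
                          reward=reward, L=2, rho=0.5)
print(total)  # -0.5 + 0.5*4 - 0.25*0.5 + 0.125*1 = 1.5
```

Averaging this quantity over many simulated trajectories under a fixed policy gives a Monte Carlo estimate of $J_\mu(\pi_0)$.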
Choosing $\rho < 1$ de-emphasizes the effect of decisions taken at later time instants on the cumulative reward.

^41 In the multiple stopping time problem considered here, $\rho = 1$ is allowed. For the undiscounted problem ($\rho = 1$), the stopping times may not be finite and the objective in (4.4) becomes unbounded. However, the multiple stopping time problem will terminate in finite time: Assume $\bar{R} = \max_{i,l} r_l(i, 1) > 0$, i.e. the maximum stop reward is positive, and $\underline{R} = \min_{i,l} r_l(i, 2) < 0$, i.e. the minimum reward for continue is negative. Then, it is clear that any optimal policy will stop in less than $L\bar{R}/|\underline{R}|$ time steps.

The decision maker aims to compute the optimal strategy $\mu^*$ to maximize (4.4), i.e.
\[ \mu^* = \arg\max_{\mu \in \mathcal{U}} J_\mu(\pi_0). \tag{4.5} \]

Remark 1. The above formulation is an instance of a special type of POMDP called the stopping time POMDP. This is seen as follows: the objective in (4.4) can be expressed as an infinite horizon criterion by augmenting a fictitious absorbing state 0 that has zero reward, i.e. $r_0(0, u) = 0$ for all $u \in \mathcal{A}$. When $L$ stop actions are taken, the system transitions to state 0 and remains there indefinitely. Then (4.4) is equivalent to the following discounted infinite horizon criterion:
\[ J_\mu(\pi_0) = \mathbb{E}_\mu\Big\{ \sum_{t=0}^{\tau_L - 1} \rho^t r_L(X_t, 2) + \rho^{\tau_L} r_L(X_{\tau_L}, 1) + \cdots + \rho^{\tau_1} r_1(X_{\tau_1}, 1) + \sum_{t=\tau_1 + 1}^{\infty} \rho^t r_0(0, 2) \,\Big|\, \pi_0 \Big\}, \]
where the last summation is zero.

Remark 2 (Finite horizon constraint). This work considers the problem of at most $L$ stops with no constraints on the stopping times. Our results also hold straightforwardly for the case where $L$ stops need to be made within a pre-specified finite time horizon. Then, the optimal policy will be non-stationary and the structural results presented in subsequent sections apply at each time instant.

4.1.2 Belief state formulation of the objective

As is customary for partially observed control problems, we reformulate the dynamics and cumulative objective in terms of the belief state.
Let $\Pi$ denote the belief space of $S$-dimensional probability vectors. The belief space is the unit $(S-1)$-dimensional simplex:
\[ \Pi = \Big\{ \pi : 0 \le \pi(i) \le 1,\ \sum_{i=1}^{S} \pi(i) = 1 \Big\}. \tag{4.6} \]
The belief state at time $t$, denoted by $\pi_t \in \Pi$, is the posterior probability of $X_t$ given the history $Z_t$. The belief state is a sufficient statistic of $Z_t$ [98], and evolves according to the following Hidden Markov Bayesian filter update [99]:
\[ \pi_{t+1} = T(\pi_t, Y_{t+1}), \quad \text{where} \quad T(\pi, y) = \frac{B_y P' \pi}{\sigma(\pi, y)}, \quad \sigma(\pi, y) = \mathbf{1}_S' B_y P' \pi, \quad B_y = \mathrm{diag}\big(B(1, y), \cdots, B(S, y)\big). \tag{4.7} \]
Here $\mathbf{1}_S$ represents the $S$-dimensional vector of ones.

Using the smoothing property of conditional expectations, the objective in (4.4) can be reformulated in terms of the belief state as:
\[ J_\mu(\pi_0) = \mathbb{E}_\mu\Big\{ \sum_{t=0}^{\tau_L - 1} \rho^t r_{2,L}' \pi_t + \rho^{\tau_L} r_{1,L}' \pi_{\tau_L} + \sum_{t=\tau_L + 1}^{\tau_{L-1} - 1} \rho^t r_{2,L-1}' \pi_t + \cdots + \rho^{\tau_1} r_{1,1}' \pi_{\tau_1} + \sum_{t=\tau_1 + 1}^{\infty} \rho^t r_{2,0}' \pi_t \,\Big|\, \pi_0 \Big\}, \tag{4.8} \]
where $r_{u,l} = [r_l(1, u), \ldots, r_l(S, u)]'$. For the stopping time problem (4.8), there exists a stationary optimal policy [98]. Since the belief state is a sufficient statistic of $Z_t$, (4.5) is equivalent to computing the optimal stationary policy $\mu^* : \Pi \times [L] \to \mathcal{A}$, where $[L] = \{1, 2, \cdots, L\}$, as a function of the belief and the number of stops remaining, to maximize (4.8).
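The filter update (4.7) is a predict-then-correct computation, which the following minimal Python sketch illustrates. The function name `hmm_filter_update`, the list-of-lists layout of $P$ and $B$, the zero-based indexing, and the two-state numbers are illustrative assumptions and are not taken from the thesis.

```python
def hmm_filter_update(pi, y, P, B):
    """One step of the HMM filter (4.7); returns (T(pi, y), sigma(pi, y)).

    pi : current belief, list of length S
    y  : index of the received observation (zero-based, an assumption)
    P  : S x S transition matrix, P[i][j] = P(X_{t+1} = j | X_t = i)
    B  : S x |Y| observation matrix, B[i][y] = P(Y_t = y | X_t = i)
    """
    S = len(pi)
    # Prediction step: (P' pi)(j) = sum_i P(i, j) pi(i)
    predicted = [sum(P[i][j] * pi[i] for i in range(S)) for j in range(S)]
    # Correction step: multiply by B_y = diag(B(1, y), ..., B(S, y))
    unnormalized = [B[j][y] * predicted[j] for j in range(S)]
    # Normalization constant sigma(pi, y) = 1' B_y P' pi
    sigma = sum(unnormalized)
    return [u / sigma for u in unnormalized], sigma


# Hypothetical two-state example (numbers chosen only for illustration).
P_example = [[0.9, 0.1],
             [0.2, 0.8]]
B_example = [[0.7, 0.3],
             [0.2, 0.8]]
prior = [0.5, 0.5]

posterior0, sigma0 = hmm_filter_update(prior, 0, P_example, B_example)
posterior1, sigma1 = hmm_filter_update(prior, 1, P_example, B_example)
```

In this sketch $\sigma(\pi, y)$ is the probability of observing $y$ at the next step given the current belief, so summing it over all observations gives one. This is the weight that multiplies the value function inside the dynamic programming recursion.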
4.1.3 Stochastic dynamic programming

Computing the optimal policy $\mu^*$ in (4.5), or equivalently the maximizer of (4.8), involves solving Bellman's dynamic programming equation for multiple stopping [98]:
\[
\begin{aligned}
\mu^*(\pi, l) &= \arg\max_{u \in \mathcal{A}} Q(\pi, l, u), \qquad V(\pi, l) = \max_{u \in \mathcal{A}} Q(\pi, l, u), \\
Q(\pi, l, 1) &= r_{1,l}' \pi + \rho \sum_{y \in \mathcal{Y}} V(T(\pi, y), l - 1)\, \sigma(\pi, y), \\
Q(\pi, l, 2) &= r_{2,l}' \pi + \rho \sum_{y \in \mathcal{Y}} V(T(\pi, y), l)\, \sigma(\pi, y).
\end{aligned}
\tag{4.9}
\]
Since the state space $\Pi$ is a continuum, Bellman's equation (4.9) does not translate into a practical solution methodology, as $V(\pi, l)$ needs to be evaluated at each $\pi \in \Pi$. This, in turn, renders the calculation of the optimal policy $\mu^*(\pi, l)$ computationally intractable^42.

^42 It is well known that a finite horizon POMDP with a finite observation space can be solved exactly; indeed, the value function is piecewise linear and convex [99]. However, the problem is PSPACE complete [100]; the worst case computational cost increases exponentially with the number of actions and doubly exponentially with the time index.

4.2 Optimal multiple stopping: Structural results

In this section, we derive structural results for the optimal policy (4.9) of the multiple stopping time problem. In Section 4.2.3, we show that under reasonable conditions on the POMDP parameters, the optimal policy is a monotone policy.

4.2.1 Definitions

Define the stopping set $\mathcal{S}^l$ (the set of belief states where Stop is the optimal action) when $l$ stops are remaining as:
\[ \mathcal{S}^l = \{\pi : \mu^*(\pi, l) = 1\}. \tag{4.10} \]
Correspondingly, the continue set (the set of belief states where Continue is the optimal action) is defined as
\[ \mathcal{C}^l = \{\pi : \mu^*(\pi, l) = 2\}. \tag{4.11} \]
Let $W(\pi, l)$ be defined as
\[ W(\pi, l) = V(\pi, l) - V(\pi, l - 1). \tag{4.12} \]
The stopping and continue sets in terms of $W$ defined in (4.12) are as follows:
\[
\begin{aligned}
\mathcal{S}^l &= \Big\{\pi \,\Big|\, r_l' \pi \ge \rho \sum_{y} W(T(\pi, y), l)\, \sigma(\pi, y)\Big\}, \\
\mathcal{C}^l &= \Big\{\pi \,\Big|\, r_l' \pi < \rho \sum_{y} W(T(\pi, y), l)\, \sigma(\pi, y)\Big\},
\end{aligned}
\tag{4.13}
\]
where $r_l \triangleq r_{1,l} - r_{2,l}$.

Remark 3.
For notational convenience, without loss of generality, assume $r_{1,l} = r_l$ and $r_{2,l} = 0$. So, the decision maker accrues no reward for the continue action.

Remark 4. We consider $r_1 = r_2 = \cdots = r_L = r$, i.e. the rewards are not dependent on $l$. It should be noted, however, that the structural results continue to hold for the case where the instantaneous rewards $r_l$ are dependent on $l$.

The stopping and continue sets can be arbitrary partitions of the simplex $\Pi$. However, in Section 4.2.3, we show that these sets can be characterized by threshold curves. The question of computing the optimal policy then reduces to estimating the threshold curve.

It is worth pointing out that in the classical stopping POMDPs in [99], the stopping and continue sets are characterized in terms of a convex value function. The key difficulty of the multiple stopping problem is that $W$, being the difference of two convex value functions, does not share the convexity properties of the value function.

4.2.2 Assumptions

The main result below, namely Theorem 4.2.1, requires the following assumptions on the reward vector $r$, the transition matrix $P$ and the observation distribution $B$.

(A1) $P$ is totally positive of order 2 (TP2), i.e. all second order minors are non-negative^43.

(A2) $B$ satisfies the following: If the observation space is discrete or countably infinite, then $B(j, x) B(i, y) \le B(j, y) B(i, x)$, $j < i$, $x \le y$. If the observation space is continuous, let $B^{(i)}$ denote the observation probability density while the Markov chain is in state $i$. Then $B^{(j)}(x) / B^{(i)}(x)$ should be a non-decreasing function of $x$.

(A3) The vector $(I - \rho P) r$ has decreasing elements.

^43 A stochastic matrix $A$ is TP2 if
\[ \begin{vmatrix} A_{i_1, j_1} & A_{i_1, j_2} \\ A_{i_2, j_1} & A_{i_2, j_2} \end{vmatrix} \ge 0, \quad \forall\, i_2 \ge i_1,\ j_2 \ge j_1. \]
Equivalently, $A_{i,j} / A_{i+1,j}$ is increasing in $j$.

Discussion of Assumptions:

When $S = 2$, (A1) is valid when $P(1, 1) \ge P(2, 1)$. When $S > 2$, consider the tridiagonal transition matrix^44 with $P(i, j) = 0$ for $i \ge j + 2$ and for $i \le j - 2$.
(A1) is valid if $P(i, i) P(i+1, i+1) \ge P(i+1, i) P(i, i+1)$.

^44 The transition matrices computed on the real dataset in Section 4.4 follow a tridiagonal structure; refer to (4.23).

(A2) holds for numerous examples, including the binomial, Poisson, geometric, Gaussian and exponential distributions. When the observation space is discrete, Assumption (A2) is equivalent to the TP2 definition. In the numerical results in Section 4.4, we use the Poisson distribution where $B(i, j) = \frac{g_i^j \exp(-g_i)}{j!}$, where $g_i$ is the mean of the Poisson distribution. (A2) is satisfied if $g_i$ decreases monotonically with $i$ (see Proposition 4.6.1). For a continuous observation distribution such as a Gaussian whose mean depends on the state of the Markov chain (with fixed variance), (A2) is satisfied when the mean monotonically decreases with $i$.

(A3) is a joint condition on the reward vector and the transition matrix. (A3) and (A1) jointly imply that the reward vector $r$ has decreasing elements^45. When $S = 2$, it can be verified that $r$ having decreasing elements is sufficient for $(I - \rho P) r$ to have decreasing elements. For $S > 2$, (A3) is a stronger condition than having the elements of $r$ decreasing.

^45 $\rho < 1$: Let $v = (I - \rho P) r$. (A3) implies that $v$ has decreasing elements. When $\rho < 1$, $(I - \rho P)$ is invertible. Hence, $r = (I - \rho P)^{-1} v = \sum_{k=0}^{\infty} \rho^k P^k v$. Since the product of TP2 matrices is TP2, each $P^k$ is TP2. The result follows from Theorem 9.2.2 in [99]. For $\rho = 1$, $g = \lim_{\rho \uparrow 1} (1 - \rho)(I - \rho P)^{-1} v$ is the solution of $(I - P) r = v$. This limit exists [103, Cor. 8.2.5] and hence, $r$ has decreasing elements.

(A3) is easy to interpret when $P$ has additional structure. For example, consider a slowly varying Markov chain with $P = I + \epsilon Q$, where $Q(i, j) > 0$ for $i \ne j$, $\sum_j Q(i, j) = 0$, and $\epsilon > 0$. Here $\frac{1}{\epsilon} > \max_i \sum_j |Q(i, j)|$ for $P$ to be a valid transition matrix. Then (A3) is equivalent to $r$ having decreasing elements. Such slowly varying matrices arise in many applications such as manufacturing systems, internet packet transmission and wireless communication (see Section 1.3 in [101]). Also, the user interest in online social media typically evolves slowly [102]. The reward vector $r$ captures the preference of the decision maker: the highest reward is accrued in State 1.

4.2.3 Main result: Optimality of threshold policies

The main result below (Theorem 4.2.1) states that the optimal policy is monotone with respect to the belief state $\pi$. However, for a monotone policy to be well defined, we first need to define the ordering between two belief states. For $S = 2$, the belief $\pi = [1 - \pi(2) \;\; \pi(2)]$ can be completely ordered with respect to $\pi(2) \in [0, 1]$. However, for $S > 2$, comparing belief states requires using stochastic orders, which are partial orders. We will use the monotone likelihood ratio (MLR) order (see Def. 4.6.1 in Sec. 4.6.1); it is ideal for partially observed control problems since it is preserved under conditional expectation (Bayesian update).

Under reasonable conditions, Theorem 4.2.1 asserts that the optimal policy $\mu^*(\pi)$ is monotonically decreasing in $\pi$ with respect to the MLR order. However, despite this monotonicity, determining the optimal policy is nontrivial since the policy can only be characterized on a partially ordered set. The main innovation in Theorem 4.2.1 is to modify the MLR stochastic order to operate on lines $\mathcal{L}(e_1, \bar{\pi})$ and $\mathcal{L}(e_S, \bar{\pi})$ (see Sec. 4.6.1) within the belief space. Such line segments form chains (totally ordered subsets of a partially ordered set) and permit us to prove that the optimal decision policy has a threshold structure.

[Figure 4.2 diagram labels: simplex vertices $e_1$, $e_2$, $e_3$; points $\bar{\pi}_1$, $\bar{\pi}_2$, $\bar{\pi}_3$ on $\mathcal{H}$; line $\mathcal{L}(e_1, \bar{\pi}_1)$; continue set $\mathcal{C}^l$; stopping sets $\mathcal{S}^l$, $\mathcal{S}^{l-1}$; threshold curves $\Gamma^l$, $\Gamma^{l-1}$]

Figure 4.2: Visual illustration of Theorem 4.2.1. Each of the stopping sets $\mathcal{S}^l$ is characterized by a threshold curve $\Gamma^l$. Each threshold curve $\Gamma^l$ intersects the line $\mathcal{L}(e_1, \bar{\pi})$ at most once.

^45 $\mathcal{H}$ is defined in Sec. 4.6.1.

Theorem 4.2.1. Assume (A1), (A2) and (A3). Then,

A. There exists an optimal policy $\mu^*(\pi, l)$ that is decreasing on lines $\mathcal{L}(e_1, \bar{\pi})$ and $\mathcal{L}(e_S, \bar{\pi})$ in the belief space $\Pi$ for each $l$^46.

B. There exists an optimal switching curve $\Gamma^l$, for each $l$, that partitions the belief space $\Pi$ into two individually connected sets $\mathcal{S}^l$ and $\mathcal{C}^l$, such that the optimal policy is
\[ \mu^*(\pi, l) = \begin{cases} 1 & \text{if } \pi \in \mathcal{S}^l, \\ 2 & \text{if } \pi \in \mathcal{C}^l. \end{cases} \tag{4.14} \]

C. $\mathcal{S}^{l-1} \subset \mathcal{S}^l$, $l = 1, 2, \cdots, L$.

^46 In general, the optimal policy is not unique. The theorem asserts that there exists a version of the optimal policy that is monotone.

Theorem 4.2.1A asserts that the optimal policy is monotonically decreasing on the line $\mathcal{L}(e_1, \bar{\pi})$, as shown in Figure 4.2. Hence, on each line $\mathcal{L}(e_1, \bar{\pi})$ there exists a threshold above which (in the MLR sense) it is optimal to Stop and below which it is optimal to Continue. Theorem 4.2.1B asserts that, for each $l$, the stopping and continue sets are connected. Hence, there exists a threshold curve $\Gamma^l$, as shown in Figure 4.2, obtained by joining the thresholds, from Theorem 4.2.1A, on each of the lines $\mathcal{L}(e_1, \bar{\pi})$. Furthermore, the stopping set enclosed by the threshold curve is a union of convex sets^47 and hence, the threshold curve is continuous and differentiable almost everywhere. Theorem 4.2.1C proves the nested structure of the stopping sets: the stopping set when $l - 1$ stops are remaining is a subset of the stopping set when there are $l$ stops remaining.

^47 This is due to the finite stopping time property of the multiple stopping time problem; see Footnote 41. A finite horizon POMDP with a finite state and observation space has a value function that is piecewise linear and convex; see Theorem 7.4.1 in [99]. For $l = 1$, $V(\pi, 1) = \max_{\gamma \in \Gamma} \gamma' \pi$, where $\Gamma$ is a finite set due to the finite stopping time property. For $l = 2$, the dynamic programming equation in (4.9) can be written as: $V(\pi, 2) = \max\{ r' \pi + \max_{\gamma \in \Gamma} \gamma' P' \pi,\ \rho \sum_{y \in \mathcal{Y}} V(T(\pi, y), 2) \sigma(\pi, y) \}$. For each $\gamma \in \Gamma$, the stopping set is convex; see the proof of Theorem 12.2.1 in [99]. Hence, the stopping set for $l = 2$ is a union of convex sets. A similar argument holds for any value of $l$.

4.3 Stochastic gradient algorithm for estimating optimal linear threshold policies

In light of Theorem 4.2.1, computing the optimal policy reduces to estimating $L$ threshold curves in the unit simplex (belief space), one for each of the $L$ stops. The threshold curves can be approximated by any of the standard basis functions. In this paper, we will restrict the approximation to linear threshold policies, i.e. policies of the form given in (4.15). However, any such approximation needs to capture the essence of Theorem 4.2.1, i.e. the optimal policy is MLR decreasing on lines, the stopping sets are connected, and they satisfy the nested property. We call linear threshold policies that capture the essence of Theorem 4.2.1 the optimal linear threshold policies.

Section 4.3.1 derives necessary and sufficient conditions to characterize such linear threshold policies. Algorithm 3 in Section 4.3.2 is a simulation-based algorithm to compute the optimal linear threshold policies. The simulation-based algorithm is computationally efficient (see comments at the end of Section 4.3.2).

4.3.1 Structure of optimal linear threshold policies for multiple stopping

We define a linear parametrized policy on the belief space $\Pi$ as follows. Let $\theta_l \in \mathbb{R}^{S-1}$ denote the parameters of a linear hyperplane. Then, linear threshold policies, as a function of the belief $\pi$ and the number of stops remaining $l$, are defined as
\[ \mu_\theta(\pi, l) = \begin{cases} 1 & \text{if } \begin{bmatrix} 0 & 1 & \theta_l \end{bmatrix} \begin{bmatrix} \pi \\ -1 \end{bmatrix} \le 0, \\ 2 & \text{otherwise.} \end{cases} \tag{4.15} \]
The linear policy $\mu_\theta(\pi, l)$ is indexed by $\theta$ to show the explicit dependence of the policy on the parameters. In (4.15), $\theta = (\theta_1, \theta_2, \ldots, \theta_L) \in \mathbb{R}^{L \times (S-1)}$ is the concatenation of the $\theta_l$ vectors, one for each of the $L$ stops.

In Theorem 4.2.1A, it was established that the optimal multiple stopping policy is MLR decreasing on specific lines within the belief space, i.e.
for $\pi_1 \ge_{L_i} \pi_2$, $\mu(\pi_1, l) \le \mu(\pi_2, l)$; $i = 1, S$. Theorem 4.3.1 gives necessary and sufficient conditions on the coefficient vector $\theta_l$ such that $\pi_1 \ge_{L_i} \pi_2$ implies $\mu_\theta(\pi_1, l) \le \mu_\theta(\pi_2, l)$; $i = 1, S$.

Theorem 4.3.1. The linear threshold policies $\mu_\theta(\pi, l)$ are

1. MLR decreasing on line $\mathcal{L}(e_1)$ if and only if $\theta_l(S - 1) \ge 0$ and $\theta_l(i) \ge 0$, $i \le S - 2$.

2. MLR decreasing on line $\mathcal{L}(e_S)$ if and only if $\theta_l(S - 1) \ge 0$, $\theta_l(S - 2) \ge 1$ and $\theta_l(i) \le \theta_l(S - 2)$, $i < S - 2$.

The proof is in Sec. 4.6.3. As a consequence of Theorem 4.3.1, the constraints on the parameters $\theta$ ensure that only MLR decreasing linear threshold policies are considered; the necessity and sufficiency imply that non-monotone policies are not considered, and monotone policies are not left out.

In Theorem 4.2.1B it was established that the optimal stopping sets are connected, which is satisfied trivially since we approximate the threshold curve using a linear hyperplane. Theorem 4.3.2 below provides sufficient conditions such that the parametrized linear threshold curves satisfy the nested property established in Theorem 4.2.1C. A proof is provided in Sec. 4.6.3.

Theorem 4.3.2. A sufficient condition for the linear threshold policies in (4.15) to satisfy the nested structure in Theorem 4.2.1C is given by
\[
\begin{aligned}
\theta_{l-1}(S - 1) &\le \theta_l(S - 1), \\
\theta_{l-1}(i) &\ge \theta_l(i), \quad i < S - 1,
\end{aligned}
\tag{4.16}
\]
for each $l$.

4.3.2 Simulation-based stochastic gradient algorithm for estimating linear threshold policies

We now estimate the optimal linear threshold policies using the simulation-based stochastic gradient algorithm in Algorithm 3. The algorithm is designed so that the estimated policies satisfy the conditions in Theorem 4.3.1 and Theorem 4.3.2.

The optimal policy of a multiple stopping time problem maximizes the expected cumulative reward $J_\mu$ in (4.4). In Algorithm 3, we approximate $J_\mu$ over a finite time horizon $N$ as $J_N$, which is computed as:
\[ J_N(\theta) = \mathbb{E}_{\mu_\theta}\Big\{ \sum_{l=1}^{L} \rho^{\tau_l} r' \pi_{\tau_l} \,\Big|\, \tau_l \le N;\ \forall l \Big\}. \tag{4.17} \]
$J_N$ is an asymptotic estimate of $J_\mu$ as $N$ tends to infinity.

Algorithm 3 is a stochastic gradient algorithm that generates a sequence of estimates $\theta_n$ that converges to a local maximum. It requires the computation of the gradient $\nabla_\theta J_N(\cdot)$. Evaluating the gradient in closed form is intractable due to the non-linear dependence of $J_N(\theta)$ on $\theta$. Instead, we can compute an estimate $\hat{\nabla}_\theta J_N(\cdot)$ using a simulation-based gradient estimator. There are several such simulation-based gradient estimators available in the literature, including infinitesimal perturbation analysis, weak derivatives and likelihood ratio (score function) methods [104]. In this work, for simplicity, we use the SPSA algorithm [105], which estimates the gradient using a finite difference method.

To make use of the SPSA algorithm, we convert the constrained optimization problem in $\theta$ (constraints imposed by Theorem 4.3.1 and Theorem 4.3.2) into an unconstrained problem using spherical co-ordinates as follows:
\[
\theta_l^\phi(i) =
\begin{cases}
\phi_1^2(S - 1) \prod_{\ell = l}^{L-1} \sin^2(\phi_\ell(S - 1)) & i = S - 1, \\
1 + \phi_1^2(S - 2) \prod_{\ell = 2}^{l} \sin^2(\phi_\ell(S - 2)) & i = S - 2, \\
\theta_l(S - 2) \prod_{\ell = 1}^{L} \sin^2(\phi_\ell(i)) & i < S - 2.
\end{cases}
\tag{4.18}
\]
It can be verified that the parametrization $\theta^\phi$ in (4.18) satisfies the conditions in Theorem 4.3.1 and Theorem 4.3.2. For example, consider $i = S - 1$; then the product term involving $\sin(\cdot)$ ensures that $\theta_{l-1}(S - 1) \le \theta_l(S - 1)$ (the first part of Theorem 4.3.2).

Algorithm 3: Stochastic gradient algorithm for optimal multiple stopping
Require: POMDP parameters satisfy (A1), (A2), (A3).
  Choose initial parameters $\phi_0$ and initial linear threshold policies $\mu_{\theta_0}$ using (4.15).
  for iterations $n = 0, 1, 2, \ldots$ do
    Evaluate $J_N(\theta^{\phi_n + c_n \omega_n})$ and $J_N(\theta^{\phi_n - c_n \omega_n})$ using (4.17).
    SPSA: Compute the gradient estimate $\hat{\nabla}_\phi J_N(\theta^{\phi_n})$ using (4.19).
    Update the parameter vector $\phi_n$ to $\phi_{n+1}$ using (4.20).
  end for

Following [105], the gradient estimate using SPSA is obtained by picking a random direction $\omega_n$ at each iteration $n$.
The estimate of the gradient is then given by
\[ \hat{\nabla}_\phi J_N(\theta^{\phi_n}) = \frac{J_N(\theta^{\phi_n + c_n \omega_n}) - J_N(\theta^{\phi_n - c_n \omega_n})}{2 c_n}\, \omega_n, \tag{4.19} \]
where
\[ \omega_n(i) = \begin{cases} -1 & \text{with probability } 0.5, \\ +1 & \text{with probability } 0.5. \end{cases} \]
The two $J_N(\cdot)$ terms in the numerator of (4.19) are estimated using the finite time approximation (4.17). Using the gradient estimate in (4.19), the parameter update is as follows [105]:
\[ \phi_{n+1} = \phi_n + a_n \hat{\nabla}_\phi J_N(\theta^{\phi_n}). \tag{4.20} \]
The parameters $a_n$ and $c_n$ are typically chosen as follows [105]:
\[
\begin{aligned}
a_n &= \varepsilon (n + 1 + \varsigma)^{-\kappa}, \quad 0.5 < \kappa \le 1, \text{ and } \varepsilon, \varsigma > 0, \\
c_n &= \mu (n + 1)^{-\upsilon}, \quad 0.5 < \upsilon \le 1,\ \mu > 0.
\end{aligned}
\tag{4.21}
\]
At each iteration of Algorithm 3, evaluating the gradient estimate in (4.19) requires two POMDP simulations. However, this is independent of the number of states, the number of observations or the number of stops. The decreasing step size stochastic gradient algorithm, Algorithm 3, converges to a local optimum with probability one. Hence, it is necessary to try several initial conditions when estimating the optimal threshold.

To summarize, we have used a stochastic gradient algorithm to estimate the optimal linear threshold policies for the multiple stopping time problem. Instead of linear threshold policies, one could also use piecewise linear or other basis function approximations, provided that the resulting parameterized policy is still MLR decreasing (i.e. a characterization similar to Theorem 4.3.1 holds).

4.4 Numerical examples: Interactive advertising in live social media

This section has three parts. First, in Section 4.4.1, we visually illustrate the main result for $S = 3$^48. We consider multiple examples to illustrate how the assumptions in Sec. 4.2.2 affect the optimal multiple stopping time policy. In addition, we benchmark the performance of linear threshold policies (obtained using Algorithm 3) against the optimal multiple stopping policy. Second, using a real dataset, we study how the multiple stopping problem can be used to schedule advertisements in live social media.
We show numerically that the linear threshold scheduling policies outperform conventional techniques for scheduling ads in live social media. Finally, we illustrate the performance of the linear threshold policies for a large size POMDP (25 states) by comparing with the popular SARSOP algorithm.

^48 For $S = 3$, the unit simplex is an equilateral triangle.

4.4.1 Synthetic data

In this section, we visually illustrate the optimal multiple stopping policy using numerical examples. The objective is to illustrate how the assumptions in Sec. 4.2.2 affect the optimal multiple stopping time policy. The optimal policy can be obtained by solving the dynamic programming equations in (4.9) and can be computed approximately by discretizing the belief space. The belief space $\Pi$, for all examples below, was uniformly quantized into 100 states, using the finite grid approximation method in [106].

Example 1: POMDP parameters: Consider a Markov chain with 3 states, with the transition matrix $P$ and the reward vector specified in (4.22). The observation distribution is given by $B(i, j) = \frac{g_i^j \exp(-g_i)}{j!}$, i.e. the observation distribution is Poisson with the state dependent mean vector $g$ given in (4.22). It is easily verified that the transition matrix, the observation distribution and the reward vector satisfy the conditions (A1) to (A3).
\[ P = \begin{bmatrix} 0.2 & 0.1 & 0.7 \\ 0.1 & 0.1 & 0.8 \\ 0 & 0.1 & 0.9 \end{bmatrix}, \quad g = \begin{bmatrix} 12 & 7 & 2 \end{bmatrix}, \quad r = \begin{bmatrix} 9 & 3 & 1 \end{bmatrix}. \tag{4.22} \]
We choose^49 $L = 5$, i.e. the decision maker wishes to stop at most 5 times. Figure 4.3a shows the stopping sets $\mathcal{S}^5$ and $\mathcal{S}^1$. It is evident from Figure 4.3a that the optimal policy is monotone on lines, and that the stopping sets are connected and satisfy the nested property, thereby illustrating Theorem 4.2.1.

^49 This is motivated by the real dataset example in Section 4.4.2.

[Figure 4.3: three panels showing stopping sets on the belief simplex.]

Figure 4.3a: Example 1: $\mathcal{S}^1$ (shown in black) and $\mathcal{S}^5$ (shown in red) obtained by solving the dynamic programming equation (4.9). The figure illustrates the monotone, connected and nested structure of the stopping sets ($\mathcal{S}^{l-1} \subset \mathcal{S}^l$) in Theorem 4.2.1.

Figure 4.3b: Example 2: Optimal policy when (A3) is violated. $\mathcal{S}^1$ is shown in black and $\mathcal{S}^5$ is shown in red. The monotone property of Theorem 4.2.1A is violated.

Figure 4.3c: Example 3: Optimal policy when (A3) is violated. $\mathcal{S}^1$ is shown in black and $\mathcal{S}^2$ is shown in red. The stopping sets are not nested.

Example 2: Consider the same parameters as in Example 1, except the reward $r = \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}$, which violates (A3). Figure 4.3b shows the optimal multiple stopping policy in terms of the stopping sets. As can be seen from Figure 4.3b, the optimal policy does not satisfy the monotone property (Theorem 4.2.1A). However, the nested property continues to hold.

Example 3: Consider the same parameters as in Example 1, except $L = 2$, $r_1 = \begin{bmatrix} 9 & 3 & 1 \end{bmatrix}$ and $r_2 = \begin{bmatrix} 3 & 9 & 1 \end{bmatrix}$. Assumption (A3) is violated for $l = 2$. Figure 4.3c shows the optimal multiple stopping policy in terms of the stopping sets. As can be seen from Figure 4.3c, the optimal policy does not satisfy the monotone property or the nested property.

Thus, the conditions (A1) to (A3) of Theorem 4.2.1 are useful in the sense that when they are violated, there are examples where the optimal policy does not have the monotone or nested property.

Performance of linear threshold policies: In order to benchmark the performance of optimal linear threshold policies (that satisfy the constraints in Theorem 4.3.1 and Theorem 4.3.2), we ran Algorithm 3 for Example 1 (parameters in (4.22)). The performance was compared based on the expected cumulative reward between the optimal policy and the linear threshold policies for 1000 independent runs. The following parameters were chosen for the SPSA algorithm: $\mu = 2$, $\upsilon = 0.2$, $\varsigma = 0.5$, $\kappa = 0.602$ and $\varepsilon = 0.1667$; these values are as suggested in [105].
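The SPSA recursion (4.19)-(4.20) with the gain sequences (4.21) and the values quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `spsa_maximize` is hypothetical, and the objective is a deterministic toy quadratic standing in for the simulated reward $J_N(\theta^\phi)$, which in the thesis is obtained from POMDP simulations.

```python
import random


def spsa_maximize(objective, phi0, num_iters, eps=0.1667, varsigma=0.5,
                  kappa=0.602, mu=2.0, upsilon=0.2, seed=0):
    """Gradient ascent with SPSA gradient estimates, following (4.19)-(4.21)."""
    rng = random.Random(seed)
    phi = list(phi0)
    for n in range(num_iters):
        a_n = eps * (n + 1 + varsigma) ** (-kappa)   # step size in (4.21)
        c_n = mu * (n + 1) ** (-upsilon)             # perturbation size in (4.21)
        # Random direction with independent +/-1 entries.
        omega = [rng.choice((-1.0, 1.0)) for _ in phi]
        # Two objective evaluations per iteration, whatever the dimension.
        j_plus = objective([p + c_n * w for p, w in zip(phi, omega)])
        j_minus = objective([p - c_n * w for p, w in zip(phi, omega)])
        scale = (j_plus - j_minus) / (2.0 * c_n)
        # Since omega(i) is +/-1, multiplying or dividing by it is the same.
        grad = [scale * w for w in omega]
        phi = [p + a_n * g for p, g in zip(phi, grad)]  # update (4.20)
    return phi


def toy_objective(phi):
    """Hypothetical concave stand-in for J_N, maximized at (1, -2)."""
    return -((phi[0] - 1.0) ** 2 + (phi[1] + 2.0) ** 2)


phi_hat = spsa_maximize(toy_objective, [0.0, 0.0], num_iters=2000)
```

The sketch makes the cost structure visible: each iteration uses exactly two evaluations of the objective regardless of the number of parameters. With a noisy simulated objective, several random restarts would be used, since only convergence to a local optimum is guaranteed.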
It was observed that there is a 12% drop in performance of the linear threshold policies compared to the optimal multiple stopping policy.

4.4.2 Real dataset: Interactive ad scheduling on Periscope using viewer engagement

We now formulate the problem of interactive ad scheduling on live online social media as a multiple stopping problem and illustrate the performance of linear threshold policies using a Periscope dataset^50. Periscope is a popular live personalized video streaming application where a broadcaster interacts with the viewers via live videos. Each such interaction lasts between 10 and 20 minutes and consists of: (i) A broadcaster who starts a live video using a handheld device. (ii) Video viewers who engage with the live video through comments and likes.

^50 We use the dataset in [107], which can be downloaded from http://sandlab.cs.ucsb.edu/periscope/. In [107], the authors deal with the performance of the Periscope application in terms of delay and scalability.

The technique of interactive scheduling, where viewer engagement is utilized to schedule ads, has not been addressed in the literature. It will be seen in this section that interactive scheduling of ads has significant performance improvements over the existing passive methods.

Dataset: The dataset in [107] contains details of all public broadcasts on the Periscope application from May 15, 2015 to August 20, 2015. The dataset consists of timestamped events: time instants at which the live video started/ended; time instants at which viewers join; and time instants at which the viewers engage using likes and comments. In this work, we consider viewer engagement through likes, since comments are restricted to the first 100 viewers in the Periscope application.

Ad scheduling Model

Here we briefly describe how the model in Section 4.1 can be adapted to the problem of interactive ad scheduling in live video streaming; see Figure 4.1 for the schematic setup.

1. Interest Dynamics: In live online social media, it is well known that the viewer engagement is correlated with the interest of the content being streamed or broadcast. Markov models have been used to model interest in online games [108], the web [12] and online social networks [109]. We therefore model the interest in live video as a Markov chain $X_t$, where the different states denote the level of interest in the live content. The states are ordered in decreasing order of interest.

Homogeneous Assumption: Periscope utilizes the Twitter network to link broadcasters with the viewers and hence shares many of the properties of the Twitter social network. Different sessions of a broadcaster, therefore, tend to follow similar statistics due to the effects of social selection and peer influence [110]. It was shown in [111] that live sessions on live online gaming platforms can be viewed as communities, and communities in online social media have similar information consumption patterns [112]. We therefore model the interest dynamics as a time homogeneous Markov chain having a transition matrix $P$.

2. Engagement Dynamics: The interest in the video, $X_t$, cannot be measured directly by the broadcaster and has to be inferred from the viewer engagement, denoted by $Y_t$. Since the viewer engagement measures the number of likes in a given time interval, we model it using a Markov modulated Poisson distribution. Denote the rate of the Poisson observation process when the interest is in state $i$ by $g_i$. The observation probability in (4.2) can be obtained using $B(i, j) = \frac{g_i^j \exp(-g_i)}{j!}$.

3. Broadcaster Revenue: The ad revenue in online social media depends on the click rate (the probability that the ad will be clicked). In a recent study, Adobe Research^51 concluded that video viewers are more likely to engage with an ad if they are interested in the content of the video that the ad is inserted into. The reward vector in Section 4.1.1 should capture the positive correlation that exists between interest in the videos and the click rate [113]. Since the information regarding the click rate and actual number of viewers is not available in the dataset, we choose the reward vector $r$ to be a vector of decreasing elements, each being proportional to the reward in that state, such that (A3) is satisfied.

^51 https://gigaom.com/2012/04/16/adobe-ad-research/

4. Broadcaster operation: The broadcaster wishes to schedule at most $L$ ads at instants when the interest is high. Here, we choose^52 the number of stops $L = 5$. At each discrete time, after receiving the observation $Y_t$, the broadcaster either stops and schedules an ad or continues with the live stream; see Figure 4.1.

5. Broadcaster objective: The objective of the broadcaster is given by (4.4). It aims to schedule ads when the content is interesting, so as to elicit the maximum number of clicks, thereby maximizing the expected revenue. In personalized live streaming applications like Periscope, the discount factor in (4.4) captures the "impatience" of the live broadcaster in scheduling ads.

The above model and formulation correspond to a multiple stopping problem with $L$ stops, as discussed in Section 4.1. In the next section, we describe how to estimate the model parameters from the data (viewer engagement $Y_t$) for computing the linear threshold policies using Algorithm 3 in Section 4.3.

Estimation of parameters: The live video sessions in Periscope have a range of 10 to 20 minutes [107]. The viewer engagement information consists of a time series of likes obtained by sampling the timestamped likes at a 2-second interval. Sampling at a 2-second interval, each session provides 1000 data points. The model parameters $P$ and $B$ are computed using maximum likelihood estimation. Since the interest dynamics are time homogeneous, we utilize data from multiple sessions to estimate the parameters $P$ and $B$.
The model was validated using the QQ-plot (see Figure 4.4) of normal pseudo-residuals [114, Section 6.1]. The estimated values of the transition matrix $P$ and the state dependent mean $g$ of a popular live session are given as:
\[ P = \begin{bmatrix} 0.733 & 0.266 & 0.000 & 0.000 \\ 0.081 & 0.718 & 0.201 & 0.000 \\ 0.000 & 0.214 & 0.670 & 0.116 \\ 0.000 & 0.000 & 0.222 & 0.778 \end{bmatrix}, \quad g = \begin{bmatrix} 38 & 21 & 10 & 1 \end{bmatrix}. \tag{4.23} \]

^52 Most of the popular Periscope sessions last 15 to 30 mins. Broadcast television usually averages 13.5 mins per hour of advertisement, or approximately one ad every 5 mins. Hence, we choose the number of advertisements $L = 5$.

The model order dimension was estimated using the penalized likelihood criterion; specifically, Table 4.1 shows the model order selection using the Bayesian information criterion (BIC). The likelihood values in Table 4.1 were obtained using an Expectation-Maximization (EM) algorithm [114]. Table 4.1 shows that $S = 4$ has the lowest BIC value.

Table 4.1: BIC model order selection for the popular live session. The maximum likelihood estimated parameters are given in (4.23). The BIC criterion was run for $S$ varying from 2 to 12 (only values for 2 to 6 are shown below). It can be seen that $S = 4$ has the lowest BIC value.

  S    $-\log(\mathcal{L})$    BIC $= -2\log(\mathcal{L}) + n\log(N)$
  2    -4707.254               9535.053
  3    -4190.652               8601.122
  4    -3969.955               8287.364
  5    -3951.155               8405.764
  6    -3887.453               8462.725

• $\mathcal{L}$ denotes the likelihood value.
• $n$ denotes the number of parameters: $n = S^2 + S - 1$.
• $N$ denotes the number of observations. Here, $N = 10^4$.

The reward vector was chosen as $r = \begin{bmatrix} 4 & 3 & 2 & 1 \end{bmatrix}$, and satisfies (A3) for $\rho \in [0, 1]$.

[Figure 4.4, panel (a) Plot of likes: rate of likes versus time index; panel (b) QQ-plot: quantiles of input sample versus standard normal quantiles.]

Figure 4.4: The plot of the likes obtained for a popular session is shown in Fig. 4.4a. The maximum likelihood estimated parameters are given in (4.23). The QQ-plot is used for validating the goodness of fit.
The linearity of the points suggests that the estimated parameters in (4.23) are a good fit.

Multiple ad scheduling: Performance results

We now compare the linear threshold scheduling policies (obtained from Algorithm 3) with two existing schemes:
1. Periodic: Here, the broadcaster stops periodically to advertise. Twitch (see footnote 53), for example, uses periodic ad scheduling [11]. Periodic advertisement scheduling is also widely used for pre-recorded videos on social media platforms like YouTube.
2. Heuristic: Here, the broadcaster solves a classical single stopping problem at each stop. The scheduler re-initializes and solves for $L = 1$ in Section 4.1 at each stop.

Performance Results: The optimal linear threshold policies outperform conventional periodic scheduling by 25% and the heuristic scheduling by 10%. The periodic scheme performs poorly because it does not take into account the viewer engagement or the interest in the content while scheduling ads. The multiple stopping policy, in contrast to the heuristic scheme, takes into account the fact that $L$ ads need to be scheduled and hence is optimal.

4.4.3 Large state space models & Comparison with SARSOP

To illustrate the application to large state space models, we present a numerical example using synthetic data.

POMDP parameters: We consider a Markov chain with 25 states. The transition matrix and observation distribution are generated as discussed in [115]. In order that the transition matrix $P$ satisfies the TP2 assumption in (A1), we use the following approach: First construct a 5-state transition matrix $A = \exp(Qt)$, where $Q$ is a tridiagonal generator matrix (off-diagonal entries are non-negative and each row sums to 0) and $t > 0$. Since the Kronecker product preserves TP2 structure, we let $P = A \otimes A$. The observation distribution $B$, containing 25 observations and satisfying (A2), is similarly generated. The reward vector is chosen as follows: $r = [25, 24, \cdots, 1]$.
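The TP2 requirement in (A1) can be checked numerically for any candidate transition matrix. The sketch below does this for the 4-state estimate in (4.23), and also checks that the estimated Poisson means are decreasing, which is the condition Proposition 4.6.1 gives for (A2). It assumes that TP2 means every second-order minor of the matrix is nonnegative; the helper names `is_tp2` and `is_decreasing` and the tolerance are illustrative choices, not part of the thesis.

```python
from itertools import combinations

def is_tp2(M, tol=1e-12):
    """Return True if every second-order (2x2) minor of M is >= -tol."""
    n_rows, n_cols = len(M), len(M[0])
    for i, k in combinations(range(n_rows), 2):      # row pair with i < k
        for j, m in combinations(range(n_cols), 2):  # column pair with j < m
            if M[i][j] * M[k][m] - M[i][m] * M[k][j] < -tol:
                return False
    return True

def is_decreasing(v):
    """Return True if the sequence v is non-increasing."""
    return all(a >= b for a, b in zip(v, v[1:]))

# Estimated parameters from (4.23).
P_hat = [
    [0.733, 0.266, 0.000, 0.000],
    [0.081, 0.718, 0.201, 0.000],
    [0.000, 0.214, 0.670, 0.116],
    [0.000, 0.000, 0.222, 0.778],
]
g_hat = [38, 21, 10, 1]

if __name__ == "__main__":
    print("P_hat is TP2:", is_tp2(P_hat))
    print("g_hat is decreasing:", is_decreasing(g_hat))
```

Both checks pass for the estimate in (4.23). A matrix such as [[0.1, 0.9], [0.9, 0.1]] fails the TP2 check, since its only second-order minor is negative.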
The number of stops is $L = 5$.

Because of the large state space dimension, computing the optimal policy using dynamic programming is intractable. We compare the linear threshold policies (obtained through Algorithm 3), the heuristic policy and the periodic policy (described in Section 4.4.2) in terms of the expected cumulative reward obtained by each policy. We also compare the linear threshold policy against a state-of-the-art POMDP solver, SARSOP (an approximate POMDP planning algorithm) [116].

Footnote 53: Twitch is a video platform that focuses primarily on video gaming. In 2015, Twitch had more than 1.5 million broadcasters and 100 million visitors per month.

Table 4.2 shows the normalized cumulative reward obtained by each of the policies. The expected reward was calculated using 1000 independent Monte Carlo simulations. From Table 4.2 we observe the following:
1. The linear threshold policy and the heuristic policy outperform periodic scheduling by a factor of more than 2.
2. The linear threshold policy outperforms the heuristic policy by 12%.
3. The linear threshold policy has a performance drop of 9% compared to the solution obtained using SARSOP. This can be attributed to the linear hyperplane approximation to the threshold curve, whereas in the SARSOP solution the number of linear segments is exponential in the number of states and observations.

Although the linear threshold policies have a slight performance drop compared to SARSOP, they have two significant advantages:
1. The policy (the linear threshold vectors corresponding to each stop) is easy to implement (see footnote 54).
2. Computing the linear threshold approximation is computationally cheaper than the SARSOP algorithm.
It can be noted from Table 4.2 that Algorithm 3 is computationally cheaper by more than a factor of 10 (1.25e11 versus 18e12 computations).

Algorithm           Cumulative reward (normalized w.r.t. SARSOP)    #Computations
SARSOP              1                                                18e12
Linear Threshold    0.91                                             1.25e11
Heuristic           0.79                                             1.25e11
Periodic            0.35                                             0

Table 4.2: Comparison of the expected reward and number of computations of the various algorithms. The linear threshold policies have a performance drop of 9% compared to the solution obtained using SARSOP and outperform the heuristic policy by 12%. The SARSOP solution was computed using a 2.5 GHz CPU running for 2 hours. The calculation assumes one floating point operation per CPU cycle. Algorithm 3, for obtaining linear threshold policies, was run with finite horizon N = 1000.

Footnote 54: The SARSOP policy has approximately 4e4 linear segments.

4.5 Closing remarks

This chapter presented three main results regarding the multiple stopping time problem.
(i) The optimal policy was shown to be monotone with respect to a specialized monotone likelihood ratio order on lines (under reasonable conditions). Therefore the optimal policy was characterized by multiple threshold curves on the belief space and the optimal stopping sets satisfied a nested property (Theorem 4.2.1).
(ii) Necessary and sufficient conditions were given for linear threshold policies to satisfy the MLR increasing condition for the optimal policy (Theorem 4.3.1 and Theorem 4.3.2). We then gave a stochastic gradient algorithm (Algorithm 3) to estimate the linear threshold policies.
(iii) Finally, the linear scheduling policy was illustrated on a real data set involving interactive advertising in live social media video.

4.6 Proof of theorems

4.6.1 Preliminaries and Definitions

The proof of Theorem 4.2.1 requires concepts from stochastic dominance [117] and submodularity [118].

First-order and MLR stochastic dominance

In order to compare belief states, we will use the monotone likelihood ratio (MLR) stochastic ordering and a specialized version of the MLR order restricted to lines in the simplex.
The MLR stochastic order is useful since it is preserved under conditional expectations.

Definition 4.6.1 (MLR ordering). Let $\pi_1, \pi_2 \in \Pi$ be two belief state vectors. Then, $\pi_1$ is greater than $\pi_2$ with respect to the Monotone Likelihood Ratio (MLR) ordering, denoted as $\pi_1 \geq_r \pi_2$, if

\[
\pi_1(j)\,\pi_2(i) \leq \pi_2(j)\,\pi_1(i), \qquad i < j, \quad i, j \in \{1, \ldots, S\}. \tag{4.24}
\]

Proposition 4.6.1. If the observation distribution is Poisson, i.e. $B(i, j) = \dfrac{g_i^{\,j} \exp(-g_i)}{j!}$, where $g_i$ is the mean of the Poisson distribution, then (A2) is satisfied when $g_i$ decreases monotonically with $i$.

Proof. Recall that (A2) is given by

\[
B(j, x)\,B(i, y) \leq B(j, y)\,B(i, x), \qquad j < i, \quad x \leq y.
\]

Substituting,

\[
\frac{g_j^{\,x} \exp(-g_j)}{x!}\,\frac{g_i^{\,y} \exp(-g_i)}{y!} \leq \frac{g_j^{\,y} \exp(-g_j)}{y!}\,\frac{g_i^{\,x} \exp(-g_i)}{x!}
\quad\Longleftrightarrow\quad
\left(\frac{g_j}{g_i}\right)^{x} \leq \left(\frac{g_j}{g_i}\right)^{y},
\]

which holds for $x \leq y$ when $g_i \leq g_j$.

Definition 4.6.2 (First order stochastic dominance). Let $\pi_1, \pi_2 \in \Pi$ be two belief state vectors. Then, $\pi_1$ is greater than $\pi_2$ with respect to first-order stochastic dominance, denoted as $\pi_1 \geq_s \pi_2$, if

\[
\sum_{i=j}^{S} \pi_1(i) \leq \sum_{i=j}^{S} \pi_2(i) \qquad \forall j \in \{1, 2, \cdots, S\}. \tag{4.25}
\]

Result [99]:
i) Let $\pi_1, \pi_2 \in \Pi$. Then, $\pi_1 \geq_r \pi_2$ implies $\pi_1 \geq_s \pi_2$.
ii) $\pi_1 \geq_s \pi_2$ if and only if for any increasing function $\phi(\cdot)$, $E_{\pi_1}\{\phi(x)\} \geq E_{\pi_2}\{\phi(x)\}$.

For state-space dimension $S = 2$, MLR is a complete order and coincides with first-order stochastic dominance. For state-space dimension $S > 2$, MLR is a partial order, i.e. $[\Pi, \geq_r]$ is a partially ordered set (see footnote 55), since it is not always possible to order any two belief states. However, on line segments in the simplex defined below, MLR is a total ordering.

Define the sub-simplex $H_i$, $i \in \{1, S\}$, as:

\[
H_i = \{\bar{\pi} : \bar{\pi} \in \Pi \text{ and } \bar{\pi}(i) = 0\}. \tag{4.26}
\]

Figure 4.2 illustrates $H_1$ for $S = 3$. Consider two types of lines, $L(e_i, \bar{\pi})$, $i \in \{1, S\}$, as follows: For any $\bar{\pi} \in H_i$, construct the line $L(e_i, \bar{\pi})$ that connects $\bar{\pi}$ to $e_i$ as below:

\[
L(e_i, \bar{\pi}) = \{\pi \in \Pi : \pi = (1 - \gamma)\,\bar{\pi} + \gamma e_i, \; 0 \leq \gamma \leq 1\}, \qquad \bar{\pi} \in H_i. \tag{4.27}
\]

With an abuse of notation, we denote $L(e_i, \bar{\pi})$ by $L(e_i)$.
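The orderings in (4.24) and (4.25) and the lines in (4.27) can be explored numerically. The following sketch implements the two comparisons exactly as stated in Definitions 4.6.1 and 4.6.2 (with 0-based indices) and builds points on a line $L(e_1, \bar{\pi})$. The function names, the tolerance and the example belief vectors are illustrative assumptions.

```python
def mlr_geq(p1, p2, tol=1e-12):
    """p1 >=_r p2 as in (4.24): p1[j]*p2[i] <= p2[j]*p1[i] for all i < j."""
    S = len(p1)
    return all(p1[j] * p2[i] <= p2[j] * p1[i] + tol
               for i in range(S) for j in range(i + 1, S))

def fsd_geq(p1, p2, tol=1e-12):
    """p1 >=_s p2 as in (4.25): every tail sum of p1 is at most that of p2."""
    return all(sum(p1[j:]) <= sum(p2[j:]) + tol for j in range(len(p1)))

def on_line_e1(eps, pbar):
    """Belief eps*e_1 + (1 - eps)*pbar on the line L(e_1, pbar); needs pbar[0] == 0."""
    return [eps + (1 - eps) * pbar[0]] + [(1 - eps) * x for x in pbar[1:]]

if __name__ == "__main__":
    # An MLR-comparable pair; MLR dominance implies first-order dominance.
    p1, p2 = [0.6, 0.3, 0.1], [0.2, 0.3, 0.5]
    print(mlr_geq(p1, p2), fsd_geq(p1, p2))

    # For S = 3 the MLR order is only partial: this pair is not comparable.
    p, q = [0.4, 0.2, 0.4], [0.3, 0.5, 0.2]
    print(mlr_geq(p, q), mlr_geq(q, p))

    # On a line L(e_1, pbar), a larger weight on e_1 gives the larger belief.
    pbar = [0.0, 0.5, 0.5]
    print(mlr_geq(on_line_e1(0.7, pbar), on_line_e1(0.2, pbar)))
```

The second pair shows why the order is restricted to lines: neither belief dominates the other on the full simplex, whereas any two points on the same line $L(e_1, \bar{\pi})$ are comparable.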
Figure 4.2 illustrates the definition of $L(e_1)$.

Footnote 55: A partially ordered set is a set $X$ on which there is a binary relation $\preceq$ that is reflexive, antisymmetric, and transitive.

Definition 4.6.3 (MLR ordering on lines). $\pi_1$ is greater than $\pi_2$ with respect to the MLR ordering on the lines $L(e_i)$, denoted as $\pi_1 \geq_{L_i} \pi_2$, if $\pi_1, \pi_2 \in L(e_i, \bar{\pi})$ for some $\bar{\pi} \in H_i$ and $\pi_1 \geq_r \pi_2$.

Remark 5 ([99]). For $i \in \{1, S\}$, $\pi_1 \geq_{L_i} \pi_2$ is equivalent to $\pi_j = \varepsilon_j e_i + (1 - \varepsilon_j)\bar{\pi}$, $j = 1, 2$, for some $\bar{\pi} \in H_i$ and $\varepsilon_1 \geq \varepsilon_2$.

The MLR ordering on lines is a complete order, i.e. it forms a chain: all elements $\pi, \bar{\pi} \in L(e_i)$ are comparable, so that either $\pi \geq_{L_i} \bar{\pi}$ or $\bar{\pi} \geq_{L_i} \pi$. The MLR order on lines allows us to give a threshold characterization of the optimal policy on the belief space. An important consequence of assumptions (A1) and (A2) is the following theorem, which states that the filter $T(\pi, y)$ in (4.7) preserves MLR dominance.

Theorem 4.6.1 ([99]). If the transition matrix $P$ and the observation matrix $B$ satisfy the conditions in (A1) and (A2), then
• For $\pi_1 \geq_r \pi_2$, the filter satisfies $T(\pi_1, \cdot) \geq_r T(\pi_2, \cdot)$.
• For $\pi_1 \geq_r \pi_2$, $\sigma(\pi_1, \cdot) \geq_s \sigma(\pi_2, \cdot)$.

To prove the structural result, we show that $Q(\pi, l, u)$ in (4.9) is submodular on the lines $L(e_i)$, $i \in \{1, S\}$, with respect to the MLR order $\geq_{L_i}$.

Definition 4.6.4 (Submodular function). A function $f : L(e_i) \times \{1, 2\} \to \mathbb{R}$ is submodular if

\[
f(\pi, u) - f(\pi, \bar{u}) \leq f(\bar{\pi}, u) - f(\bar{\pi}, \bar{u}), \qquad \bar{u} \leq u, \quad \pi \geq_{L_i} \bar{\pi}. \tag{4.28}
\]

Theorem 4.6.2 ([118]). If $f(\pi, u)$ is submodular, then there exists a $u^*(\pi) = \arg\max_{u \in U} f(\pi, u)$ that is decreasing in $\pi$.

4.6.2 Value iteration

The value iteration algorithm is a successive approximation approach for solving Bellman's equation (4.9). For iterations $k = 0, 1, \ldots$,

\[
V_{k+1}(\pi, l) = \max_{u \in \{1, 2\}} Q_{k+1}(\pi, l, u), \tag{4.29}
\]
\[
\mu_{k+1}(\pi, l) = \arg\max_{u \in \{1, 2\}} Q_{k+1}(\pi, l, u), \tag{4.30}
\]

where

\[
Q_{k+1}(\pi, l, 1) = r'\pi + \rho \sum_y V_k(T(\pi, y), l - 1)\,\sigma(\pi, y), \tag{4.31}
\]
\[
Q_{k+1}(\pi, l, 2) = \rho \sum_y V_k(T(\pi, y), l)\,\sigma(\pi, y), \tag{4.32}
\]

with $V_0(\pi, l)$ initialized arbitrarily. Define $W_k(\pi, l)$ as

\[
W_k(\pi, l) \triangleq V_k(\pi, l) - V_k(\pi, l - 1). \tag{4.33}
\]

The stopping and continue sets (at each iteration $k$) when $l$ stops are remaining are defined as follows:

\[
\begin{aligned}
S^l_{k+1} &= \Big\{\pi \;\Big|\; r'\pi \geq \rho \sum_y W_k(T(\pi, y), l)\,\sigma(\pi, y)\Big\}, \\
C^l_{k+1} &= \Big\{\pi \;\Big|\; r'\pi < \rho \sum_y W_k(T(\pi, y), l)\,\sigma(\pi, y)\Big\}.
\end{aligned}
\tag{4.34}
\]

The optimal stationary policy $\mu^*(\pi, l)$ is given by

\[
\mu^*(\pi, l) = \lim_{k \to \infty} \mu_k(\pi, l). \tag{4.35}
\]

Correspondingly, the stationary stopping and continue sets in (4.10) and (4.11) are given by

\[
S^l = \lim_{k \to \infty} S^l_k, \qquad C^l = \lim_{k \to \infty} C^l_k. \tag{4.36}
\]

The value function $V_k(\pi, l)$ in (4.29) can be rewritten, using (4.34), as follows:

\[
V_k(\pi, l) = \Big(r'\pi + \rho \sum_y V_{k-1}(T(\pi, y), l - 1)\,\sigma(\pi, y)\Big) I_{S^l_k} + \Big(\rho \sum_y V_{k-1}(T(\pi, y), l)\,\sigma(\pi, y)\Big) I_{C^l_k}, \tag{4.37}
\]

where $I_{C^l_k}$ and $I_{S^l_k}$ are indicator functions on the continue and stopping sets respectively, for each iteration $k$.

Assuming $S^{l-1}_k \subset S^l_k$ (see Theorem 4.6.5) and substituting (4.37) in the definition of $W_k(\pi, l)$ in (4.33),

\[
\begin{aligned}
W_k(\pi, l) ={}& \Big(\rho \sum_y W_{k-1}(T(\pi, y), l)\,\sigma(\pi, y)\Big) I_{C^l_k}(\pi) + r'\pi\, I_{C^{l-1}_k \cap S^l_k}(\pi) \\
&+ \Big(\rho \sum_y W_{k-1}(T(\pi, y), l - 1)\,\sigma(\pi, y)\Big) I_{S^{l-1}_k}(\pi).
\end{aligned}
\tag{4.38}
\]

In order to prove the main theorem (Theorem 4.2.1), we require the following results, proofs of which are provided in Section 4.6.3.

Theorem 4.6.3. $V_k(\pi, l)$ is increasing in $\pi$.

Theorem 4.6.4. $W_k(\pi, l)$ is decreasing in $l$.

Theorem 4.6.5. $S^l_{k+1} \supset S^{l-1}_{k+1}$.

4.6.3 Proofs

To prove Theorem 4.6.3, Theorem 4.6.4 and Theorem 4.6.5, we use induction on $k$ and assume that the propositions hold for all values less than $k$.

Proof of Theorem 4.6.3

Recall from (4.29),

\[
V_k(\pi, l) = \max_{u \in \{1, 2\}} Q_k(\pi, l, u).
\]

To prove Theorem 4.6.3, we show that $Q_k(\pi, l, u)$ is MLR increasing in $\pi$ for $u \in \{1, 2\}$. Recall from (4.31),

\[
Q_k(\pi, l, 1) = r'\pi + \rho \sum_y V_{k-1}(T(\pi, y), l - 1)\,\sigma(\pi, y).
\]

Using Theorem 4.6.1 and the induction hypothesis, the term $\sum_y V_{k-1}(T(\pi, y), l - 1)\,\sigma(\pi, y)$ is MLR increasing in $\pi$.
From Assumption (A3), $r'\pi$ is MLR increasing in $\pi$. The proof that $Q_k(\pi, l, 2)$ is MLR increasing in $\pi$ is similar and is omitted. Hence, $V_k(\pi, l)$ is MLR increasing in $\pi$.

Proof of Theorem 4.6.4

The proof follows by induction. From (4.38), we have

\[
\begin{aligned}
W_k(\pi, l - 1) ={}& \Big(\rho \sum_y W_{k-1}(T(\pi, y), l - 1)\,\sigma(\pi, y)\Big) I_{C^{l-1}_k}(\pi) + r'\pi\, I_{C^{l-2}_k \cap S^{l-1}_k}(\pi) \\
&+ \Big(\rho \sum_y W_{k-1}(T(\pi, y), l - 2)\,\sigma(\pi, y)\Big) I_{S^{l-2}_k}(\pi).
\end{aligned}
\tag{4.39}
\]

Hence, we compare $W_k(\pi, l)$ and $W_k(\pi, l - 1)$ in the following 4 regions:

a.) $S^{l-2}_k$: $W_k(\pi, l) - W_k(\pi, l - 1) = \rho \sum_y \big(W_{k-1}(T(\pi, y), l - 1) - W_{k-1}(T(\pi, y), l - 2)\big)\,\sigma(\pi, y)$, which is non-positive by the induction assumption.

b.) $C^{l-2}_k \cap S^{l-1}_k$: $W_k(\pi, l) - W_k(\pi, l - 1) = \rho \sum_y W_{k-1}(T(\pi, y), l - 1)\,\sigma(\pi, y) - r'\pi$, which is non-positive since $\pi \in S^{l-1}_k$.

c.) $C^{l-1}_k \cap S^l_k$: $W_k(\pi, l) - W_k(\pi, l - 1) = r'\pi - \rho \sum_y W_{k-1}(T(\pi, y), l - 1)\,\sigma(\pi, y)$, which is non-positive since $\pi \in C^{l-1}_k$.

d.) $C^l_k$: $W_k(\pi, l) - W_k(\pi, l - 1) = \rho \sum_y \big(W_{k-1}(T(\pi, y), l) - W_{k-1}(T(\pi, y), l - 1)\big)\,\sigma(\pi, y)$, which is non-positive by the induction assumption.

Proof of Theorem 4.6.5

If $\pi \in S^{l-1}_k$, then $r'\pi \geq \rho \sum_y W_{k-1}(T(\pi, y), l - 1)\,\sigma(\pi, y)$. By Theorem 4.6.4, $r'\pi \geq \rho \sum_y W_{k-1}(T(\pi, y), l)\,\sigma(\pi, y)$. Hence $\pi \in S^l_k$.

Proof of Theorem 4.2.1

Existence of optimal policy: In order to show the existence of a threshold policy on $L(e_1)$, we need to show that $Q_{k+1}(\pi, l, u)$ is submodular on $L(e_1)$, that is, that $Q_{k+1}(\pi, l, 2) - Q_{k+1}(\pi, l, 1)$ is MLR decreasing in $\pi \in L(e_1)$. Since

\[
Q_{k+1}(\pi, l, 2) - Q_{k+1}(\pi, l, 1) = \rho \sum_y W_k(T(\pi, y), l)\,\sigma(\pi, y) - r'\pi,
\]

we need to show that $\rho \sum_y W_k(T(\pi, y), l)\,\sigma(\pi, y) - r'\pi$ is MLR decreasing in $\pi$. We have

\[
\begin{aligned}
&\rho \sum_y W_k(T(\pi, y), l)\,\sigma(\pi, y) - r'\pi \qquad\qquad (4.40) \\
&\quad= \sum_y \big(\rho W_k(T(\pi, y), l) - r'\pi\big)\,\sigma(\pi, y) \\
&\quad= \sum_y \Big(\big(\rho W_k(T(\pi, y), l) - \rho r'T(\pi, y)\big) - \big(r'\pi - \rho r'T(\pi, y)\big)\Big)\,\sigma(\pi, y) \\
&\quad= \rho \sum_y \big(W_k(T(\pi, y), l) - r'T(\pi, y)\big)\,\sigma(\pi, y) - r'(I - \rho P')\pi. \qquad\qquad (4.41)
\end{aligned}
\]

The term $-r'(I - \rho P')\pi$ in (4.41) is MLR decreasing in $\pi$ due to our assumption. Hence, to show that $\rho \sum_y W_k(T(\pi, y), l)\,\sigma(\pi, y) - r'\pi$ is MLR decreasing in $\pi$, it is sufficient to show that $W_k(\pi, l) - r'\pi$ is MLR decreasing in $\pi$.
Define

\[
\bar{W}_k(\pi, l) \triangleq W_k(\pi, l) - r'\pi. \tag{4.42}
\]

Now, from (4.38),

\[
\begin{aligned}
\bar{W}_k(\pi, l) ={}& \Big(\sum_y \big(\rho\,(\bar{W}_{k-1}(T(\pi, y), l) + r'T(\pi, y)) - r'\pi\big)\,\sigma(\pi, y)\Big) I_{C^l_k}(\pi) \\
&+ \Big(\sum_y \big(\rho\,(\bar{W}_{k-1}(T(\pi, y), l - 1) + r'T(\pi, y)) - r'\pi\big)\,\sigma(\pi, y)\Big) I_{S^{l-1}_k}(\pi) \\
={}& \Big(\sum_y \rho\,\bar{W}_{k-1}(T(\pi, y), l)\,\sigma(\pi, y) - r'(I - \rho P)'\pi\Big) I_{C^l_k}(\pi) \\
&+ \Big(\sum_y \rho\,\bar{W}_{k-1}(T(\pi, y), l - 1)\,\sigma(\pi, y) - r'(I - \rho P)'\pi\Big) I_{S^{l-1}_k}(\pi).
\end{aligned}
\tag{4.43}
\]

On the remaining region $C^{l-1}_k \cap S^l_k$, (4.38) gives $W_k(\pi, l) = r'\pi$, so that $\bar{W}_k(\pi, l) = 0$ there. We prove by induction that $\bar{W}_k(\pi, l)$ is MLR decreasing in $\pi$, using the recursive relation over $k$ in (4.43).

For $k = 0$,

\[
\bar{W}_0(\pi, l) = W_0(\pi, l) - r'\pi = V_0(\pi, l) - V_0(\pi, l - 1) - r'\pi. \tag{4.44}
\]

The initial conditions of the value iteration algorithm can be chosen such that $\bar{W}_0(\pi, l)$ in (4.44) is decreasing in $\pi$. A suitable choice of the initial conditions is given below:

\[
V_0(\pi, l) = r'\Big(\sum_{j=0}^{l-1} \rho^j P^j\Big)'\pi. \tag{4.45}
\]

The intuition behind the initial conditions in (4.45) is that the value function $V_0(\pi, l)$ gives the expected total reward if we stop $l$ times successively starting at belief $\pi$.

Next, we show that $\bar{W}_k(\pi, l)$ is MLR decreasing in $\pi$ if $\bar{W}_{k-1}(\pi, l)$ is MLR decreasing in $\pi$. For $\pi_1 \geq_r \pi_2$, consider the following cases: (a) $\pi_1, \pi_2 \in S^{l-1}_k$; (b) $\pi_1 \in S^{l-1}_k$, $\pi_2 \in C^{l-1}_k \cap S^l_k$; (c) $\pi_1, \pi_2 \in C^{l-1}_k \cap S^l_k$; (d) $\pi_1 \in C^{l-1}_k \cap S^l_k$, $\pi_2 \in C^l_k$; (e) $\pi_1, \pi_2 \in C^l_k$; (f) $\pi_1 \in S^{l-1}_k$, $\pi_2 \in C^l_k$. For cases (a), (c) and (e), $\bar{W}_k(\pi_1, l) \leq \bar{W}_k(\pi_2, l)$ by the induction assumption. For case (b), $\bar{W}_k(\pi_1, l) \leq \bar{W}_k(\pi_2, l)$ since $\pi_1 \in S^{l-1}_k$. Case (d) is similar to case (b). For case (f),

\[
\begin{aligned}
\bar{W}_k(\pi_1, l) - \bar{W}_k(\pi_2, l)
={}& \Big(\sum_y \rho\,\bar{W}_{k-1}(T(\pi_1, y), l - 1)\,\sigma(\pi_1, y) - r'(I - \rho P)'\pi_1\Big) \\
&- \Big(\sum_y \rho\,\bar{W}_{k-1}(T(\pi_2, y), l)\,\sigma(\pi_2, y) - r'(I - \rho P)'\pi_2\Big) \\
\leq{}& \rho \sum_y \big(\bar{W}_{k-1}(T(\pi_1, y), l - 1) - \bar{W}_{k-1}(T(\pi_1, y), l)\big)\,\sigma(\pi_1, y) \\
\leq{}& 0,
\end{aligned}
\]

where the first inequality is due to the induction hypothesis and the second inequality is due to Theorem 4.6.4. Hence, $\bar{W}_k(\pi, l)$ is decreasing in $\pi$ if $\bar{W}_{k-1}(\pi, l)$ is decreasing in $\pi$, finishing the induction step.

Characterization of the switching curve $\Gamma_l$: For each $\bar{\pi} \in H_1$ construct the line segment $L(e_1, \bar{\pi})$.
The line segment can be described as $(1 - \varepsilon)\bar{\pi} + \varepsilon e_1$. On the line segment $L(e_1, \bar{\pi})$ all the belief states are MLR orderable. Since $\mu^*(\pi, l)$ is monotone decreasing in $\pi$, for each $l$ we pick the smallest $\varepsilon$ such that $\mu^*(\pi, l) = 1$. The belief state $\pi^{\varepsilon^*, \bar{\pi}}$ is the threshold belief state, where $\varepsilon^* = \inf\{\varepsilon \in [0, 1] : \mu^*(\pi^{\varepsilon, \bar{\pi}}, l) = 1\}$. Denote $\Gamma(\bar{\pi}) = \pi^{\varepsilon^*, \bar{\pi}}$. The above construction implies that there is a unique threshold $\Gamma(\bar{\pi})$ on $L(e_1, \bar{\pi})$. The entire simplex can be covered by considering all lines $L(e_1, \bar{\pi})$ for $\bar{\pi} \in H_1$, i.e. $\Pi = \cup_{\bar{\pi} \in H_1} L(e_1, \bar{\pi})$. Combining all points yields a unique threshold curve in $\Pi$ given by $\Gamma = \cup_{\bar{\pi} \in H_1} \Gamma(\bar{\pi})$.

Connectedness of $S^l$: Since $e_1 \in S^l$ for all $l$, call $S^l_a$ the subset of $S^l$ that contains $e_1$. Suppose $S^l_b$ is a subset that is disconnected from $S^l_a$. Since every point in $\Pi$ lies on the line segment $L(e_1, \bar{\pi})$ for some $\bar{\pi}$, there exists a line segment starting from $e_1 \in S^l_a$ that would leave the set $S^l_a$, pass through the set where action 2 is optimal and then intersect the set $S^l_b$, where action 1 is optimal. But this violates the requirement that the policy $\mu^*(\pi, l)$ is monotone on $L(e_1, \bar{\pi})$. Hence, $S^l_a$ and $S^l_b$ are connected.

Connectedness of $C^l$: Assume $e_S \in C^l$; otherwise $C^l$ is empty and there is nothing to prove. Call the set that contains $e_S$ $C^l_a$. Suppose $C^l_b \subset C^l$ is disconnected from $C^l_a$. Since every point in $\Pi$ lies on the line segment $L(e_S, \bar{\pi})$ for some $\bar{\pi}$, there exists a line starting from $e_S \in C^l_a$ that would leave the set $C^l_a$, pass through the set where action 1 is optimal and then intersect the set $C^l_b$ (where action 2 is optimal). But this violates the monotone property of $\mu^*(\pi, l)$.

Nested structure: The proof is straightforward from Theorem 4.6.5.

Proof of Theorem 4.3.1

The proof of Theorem 4.3.1 is similar to the proof of Theorem 12.4.1 in [99].
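Recall that the linear threshold policies are given by:

\[
\mu_\theta(\pi, l) =
\begin{cases}
1 & \text{if } \begin{bmatrix} 0 & 1 & \theta_l \end{bmatrix} \begin{bmatrix} \pi \\ -1 \end{bmatrix} \leq 0, \\
2 & \text{otherwise.}
\end{cases}
\]

For any number of stops remaining, $e_1$ (the belief that the state is 1) belongs to the stopping set $S^l$, which gives the first condition $\theta_l(S - 1) \geq 0$.

Consider $\pi_1 \geq_{L_1} \pi_2$. Then $\pi_1 = \varepsilon_1 e_1 + (1 - \varepsilon_1)\bar{\pi}$ and $\pi_2 = \varepsilon_2 e_1 + (1 - \varepsilon_2)\bar{\pi}$, for some $\bar{\pi} \in H_1$ and $\varepsilon_1 \geq \varepsilon_2$ (see footnote 56). For the linear policy to be MLR decreasing on lines, $\mu_\theta(\pi_1, l) \leq \mu_\theta(\pi_2, l)$. Hence,

\[
\begin{aligned}
\begin{bmatrix} 0 & 1 & \theta_l \end{bmatrix} \begin{bmatrix} \pi_1 \\ -1 \end{bmatrix} &\leq \begin{bmatrix} 0 & 1 & \theta_l \end{bmatrix} \begin{bmatrix} \pi_2 \\ -1 \end{bmatrix}, \\
\begin{bmatrix} 0 & 1 & \theta_l \end{bmatrix} \begin{bmatrix} \pi_1 - \pi_2 \\ 0 \end{bmatrix} &\leq 0, \\
\begin{bmatrix} 0 & 1 & \theta_l \end{bmatrix} \begin{bmatrix} (\varepsilon_1 - \varepsilon_2) e_1 - (\varepsilon_1 - \varepsilon_2)\bar{\pi} \\ 0 \end{bmatrix} &\leq 0, \\
-(\varepsilon_1 - \varepsilon_2)\big[\bar{\pi}(2) + \theta_l(1)\bar{\pi}(3) + \cdots + \theta_l(S - 2)\bar{\pi}(S)\big] &\leq 0,
\end{aligned}
\]

giving the second set of conditions $\theta_l(i) \geq 0$, $i \leq S - 2$.

Footnote 56: Refer to Remark 5.

The proof of the second part is similar and hence is omitted.

Proof of Theorem 4.3.2

For $l_1 > l_2$, due to the nested structure in Theorem 4.2.1, $S^{l_2} \subset S^{l_1}$. This implies the following:

\[
\begin{aligned}
\mu_\theta(\pi, l_2) &\geq \mu_\theta(\pi, l_1), \\
\begin{bmatrix} 0 & 1 & \theta_{l_2} \end{bmatrix} \begin{bmatrix} \pi \\ -1 \end{bmatrix} &\geq \begin{bmatrix} 0 & 1 & \theta_{l_1} \end{bmatrix} \begin{bmatrix} \pi \\ -1 \end{bmatrix}, \\
\begin{bmatrix} 0 & 0 & \theta_{l_2} - \theta_{l_1} \end{bmatrix} \begin{bmatrix} \pi \\ -1 \end{bmatrix} &\geq 0.
\end{aligned}
\tag{4.46}
\]

It is straightforward to check that the conditions in (4.16) in Theorem 4.3.2 satisfy the conditions in (4.46).

Chapter 5

Conclusion

5.1 Summary of findings

The unifying theme of the thesis was to devise a set of theories and methods for detection, estimation and control in online social media. This chapter concludes the work and presents a summary of findings along with some directions for future research and development.

• Chapter 2 considered the problem of detecting a change in utility by a linear perturbation. Necessary and sufficient conditions for the detection of the change point were derived. In addition, in the presence of noise, we provided a procedure for detecting the unknown change point and a hypothesis test for detecting dynamic utility maximization. The results were illustrated on a dataset from Yahoo! Tech Buzz. Chapter 2 also considered the practical problem of detecting utility maximizing behaviour in high dimensional datasets.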
The problem of computational complexity associated with high dimensional datasets was solved using a dimensionality reduction algorithm based on the Johnson-Lindenstrauss transform.

• Chapter 3 conducted a data-driven study of YouTube. The main result is the sensitivity of the view counts of YouTube videos to the meta-level features. Furthermore, optimizing the meta-data after the video is posted improves the popularity of the video. This is useful for multi-platform networks like BBTV to generate more view count using existing content. Chapter 3 also discussed the social dynamics (the interaction of the channel) that affect the popularity of the channel. Using the Granger causality test, we showed that the view count has a causal effect on the subscriber count of the channel.

• Chapter 4 considered the problem of optimal scheduling of ads on live personalized online social broadcasting channels. First, we cast the problem as an optimal multiple stopping problem in the POMDP framework. Second, we characterized the structural results of the optimal multiple stopping policy. By exploiting the structural results of the optimal multiple stopping policy, we computed optimal linear threshold policies using a stochastic gradient algorithm. Finally, we validated the results on real datasets. On a real dataset from Periscope, the linear threshold policies outperformed conventional periodic scheduling by 25%.

5.2 Directions for future research

The work presented in this thesis can be extended in various directions.

• The change detection problem in Chapter 2 considered the utility change of a single agent. An extension of the change detection framework could include multiple agents (possibly over a social network). A challenging problem is how to model the interaction of the agents and the utility maximization framework for multiple agents. Some work in this direction has been done by considering concave potential games in [119] and [120].
Another possible research direction is to consider multiple change points or higher order perturbation functions. However, there is a trade-off between identifiability and the generality of the model. As shown in [52], if change points are allowed at all time instants then any dataset satisfies the model.

• Chapter 3 conducted a data-driven study of YouTube. However, the conclusions in Chapter 3 are based on the BBTV dataset. Extrapolating these results to other YouTube datasets is an important problem worth addressing in future work. Another extension of the current work could involve studying the effect of video characteristics on different traffic sources, for example the effect of tweets or posts of videos on Twitter or Facebook.

• Chapter 4 considered the problem of optimal scheduling of ads on live online social broadcasting channels. The optimal linear threshold policies were obtained through a stochastic gradient algorithm. However, it is of interest to develop upper and lower myopic bounds to the optimal policy as in [121]. The upper and lower myopic bound policies are computationally easy to implement and can be constructed to be close to the optimal policy. Chapter 4 assumes that all ads have the same length and that revenue is obtained only through advertising. However, these assumptions rarely hold. Some of the ads may be “sponsored” (the ads are already paid for) and the ads may be of varying length. Optimizing for ad length and external sources of revenue is an interesting problem to consider.

These issues promise to offer interesting avenues for future work.

Bibliography

[1] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant, “Detecting influenza epidemics using search engine query data,” Nature, vol. 457, no. 7232, pp. 1012–1014, 2009.
[2] J. Strebel, T. Erdem, and J.
Swait, “Consumer search in high technology markets: exploring the use of traditional information channels,” Journal of Consumer Psychology, vol. 14, no. 1, pp. 96–104, 2004.
[3] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant, “Detecting influenza epidemics using search engine query data,” Nature, vol. 457, no. 7232, pp. 1012–1014, 2009.
[4] H. A. Carneiro and E. Mylonakis, “Google trends: A web-based tool for real-time surveillance of disease outbreaks,” Clinical Infectious Diseases, vol. 49, no. 10, pp. 1557–1564, 2009.
[5] X. Zhou, J. Ye, and Y. Feng, “Tuberculosis surveillance by analyzing google trends,” IEEE Trans. on Biomedical Engineering, vol. 58, no. 8, pp. 2247–2254, Aug 2011.
[6] A. Seifter, A. Schwarzwalder, K. Geis, and J. Aucott, “The utility of google trends for epidemiological research: Lyme disease as an example,” Geospatial Health, vol. 4, no. 2, pp. 135–137, 2010.
[7] J. A. Doornik, “Improving the timeliness of data on influenza-like illnesses using google trends,” Tech. Rep., 2010.
[8] L. Wu and E. Brynjolfsson, The Future of Prediction: How Google Searches Foreshadow Housing Prices and Sales. University of Chicago Press, 2009, p. 147.
[9] A. Tumasjan, T. Sprenger, P. Sandner, and I. Welpe, “Predicting elections with twitter: What 140 characters reveal about political sentiment,” in Int. AAAI Conference on Web and Social Media, 2010, pp. 178–185.
[10] B. K. Kaye and T. J. Johnson, “Online and in the know: Uses and gratifications of the web for political information,” Journal of Broadcasting & Electronic Media, vol. 46, no. 1, pp. 54–71, 2002.
[11] T. Smith, M. Obrist, and P. Wright, “Live-streaming changes the (video) game,” in Proc. of the 11th European Conference on Interactive TV and Video. ACM, 2013, pp. 131–138.
[12] N. Archak, V. Mirrokni, and S. Muthukrishnan, “Budget optimization for online campaigns with positive carryover effects,” in Proc. of the 8th International Conference on Internet and Network Economics. Springer-Verlag, 2012, pp. 86–99.
[13] N. Archak, V. S. Mirrokni, and S. Muthukrishnan, “Mining advertiser-specific user behavior using adfactors,” in Proceedings of the 19th International Conference on World Wide Web, ser. WWW ’10. ACM, 2010, pp. 31–40.
[14] P. Samuelson, “A note on the pure theory of consumer’s behaviour,” Economica, vol. 5, no. 17, pp. 61–71, 1938.
[15] S. Afriat, “The construction of utility functions from expenditure data,” International economic review, vol. 8, no. 1, pp. 67–77, 1967.
[16] H. Varian, “The nonparametric approach to demand analysis,” Econometrica, vol. 50, no. 1, pp. 945–973, 1982.
[17] W. Diewert and C. Parkan, “Tests for the consistency of consumer data,” Journal of Econometrics, vol. 30, no. 1-2, pp. 127–147, 1985.
[18] H. Varian, “Revealed preference,” Samuelsonian economics and the twenty-first century, pp. 99–115, 2006.
[19] V. Krishnamurthy and W. Hoiles, “Afriat’s test for detecting malicious agents,” IEEE Signal Processing Letters, vol. 19, no. 12, pp. 801–804, 2012.
[20] M. Barni and F. Pérez-González, “Coping with the enemy: Advances in adversary-aware signal processing,” in IEEE Conf. on Acoustics, Speech and Signal Processing, 2013, pp. 8682–8686.
[21] W. Hoiles and V. Krishnamurthy, “Nonparametric demand forecasting and detection of energy aware consumers,” IEEE Transactions on Smart Grid, vol. 6, no. 2, pp. 695–704, March 2015.
[22] S. Currarini, M. O. Jackson, and P. Pin, “Identifying the roles of race-based choice and chance in high school friendship network formation,” Proc. of the National Academy of Sciences, vol. 107, no. 11, pp. 4857–4861, 2010.
[23] R. K.-X. Jin, D. C. Parkes, and P. J. Wolfe, “Analysis of bidding networks in eBay: aggregate preference identification through community detection,” in Proc. of AAAI workshop on PAIR, 2007.
[24] G. Gürsun, M. Crovella, and I. Matta, “Describing and forecasting video access patterns,” in 2011 Proc. of INFOCOM. IEEE, 2011, pp. 16–20.
[25] H. Pinto, J. Almeida, and M. Gonçalves, “Using early view patterns to predict the popularity of YouTube videos,” in Proc. of the sixth ACM Int. Conf. on Web search and Data mining. ACM, 2013, pp. 365–374.
[26] C. Richier, E. Altman, R. Elazouzi, T. Jimenez, G. Linares, and Y. Portilla, “Bio-inspired models for characterizing YouTube viewcount,” in 2014 IEEE/ACM Int. Conf. on Advances in Social Networks Analysis and Mining. IEEE, 2014, pp. 297–305.
[27] C. Richier, R. Elazouzi, T. Jimenez, E. Altman, and G. Linares, “Forecasting online contents’ popularity,” arXiv preprint arXiv:1506.00178, 2015.
[28] A. Zhang, “Judging YouTube by its covers,” Department of Computer Science and Engineering, University of California, San Diego, Tech. Rep., 2015. [Online]. Available: http://cseweb.ucsd.edu/~jmcauley/cse255/reports/wi15/Angel%20Zhang.pdf
[29] T. Yamasaki, S. Sano, and K. Aizawa, “Social popularity score: Predicting numbers of views, comments, and favorites of social photos using only annotations,” in Proc. of the First Int. Workshop on Internet-Scale Multimedia Management. ACM, 2014, pp. 3–8.
[30] T. Yamasaki, J. Hu, K. Aizawa, and T. Mei, “Power of tags: Predicting popularity of social media in geo-spatial and temporal contexts,” in Advances in Multimedia Information Processing. Springer, 2015, pp. 149–158.
[31] T. Trzcinski and P. Rokita, “Predicting popularity of online videos using support vector regression,” arXiv preprint arXiv:1510.06223, 2015.
[32] Y. Ding, Y. Du, Y. Hu, Z. Liu, L. Wang, K. Ross, and A. Ghose, “Broadcast yourself: Understanding YouTube uploaders,” in Proc. of the ACM SIGCOMM Conf. on Internet Measurement. New York, NY, USA: ACM, 2011, pp. 361–370.
[33] S. Bollapragada, M. R. Bussieck, and S. Mallik, “Scheduling commercial videotapes in broadcast television,” Oper. Res., vol. 52, no. 5, pp. 679–689, Oct. 2004.
[34] D. G. Popescu and P. Crama, “Ad revenue optimization in live broadcasting,” Management Science, vol. 62, no. 4, pp. 1145–1164, 2015.
[35] S. Seshadri, S. Subramanian, and S. Souyris, “Scheduling spots on television,” 2015.
[36] H. Kang and M. P. McAllister, “Selling you and your clicks: examining the audience commodification of google,” Journal for a Global Sustainable Information Society, vol. 9, no. 2, pp. 141–153, 2011.
[37] R. Terlutter and M. L. Capella, “The gamification of advertising: analysis and research directions of in-game advertising, advergames, and advertising in social network games,” Journal of Advertising, vol. 42, no. 2-3, pp. 95–112, 2013.
[38] J. Turner, A. Scheller-Wolf, and S. Tayur, “Scheduling of dynamic in-game advertising,” Operations Research, vol. 59, no. 1, pp. 1–16, 2011.
[39] T. Nakai, “The problem of optimal stopping in a partially observable Markov chain,” Journal of Optimization Theory and Applications, vol. 45, no. 3, pp. 425–442, 1985.
[40] W. Stadje, “An optimal k-stopping problem for the poisson process,” in Mathematical Statistics and Probability Theory. Springer, 1987, pp. 231–244.
[41] M. Nikolaev, “On optimal multiple stopping of Markov sequences,” Theory of Probability & Its Applications, vol. 43, no. 2, pp. 298–306, 1999.
[42] A. Krasnosielska-Kobos, “Multiple-stopping problems with random horizon,” Optimization, vol. 64, no. 7, pp. 1625–1645, 2015.
[43] E. Bayraktar and R. Kravitz, “Quickest detection with discretely controlled observations,” Sequential Analysis, vol. 34, no. 1, pp. 77–133, 2015.
[44] J. Geng, E. Bayraktar, and L. Lai, “Bayesian quickest change-point detection with sampling right constraints,” IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 6474–6490, 2014.
[45] T. L. Lai, “On optimal stopping problems in sequential hypothesis testing,” Statistica Sinica, vol. 7, no. 1, pp. 33–51, 1997.
[46] T. L. Lai, Sequential analysis. Wiley Online Library, 2001.
[47] S. H. J. Alexander G. Nikolaev, “Stochastic sequential decision-making with a random number of jobs,” Operations Research, vol. 58, no. 4, pp. 1023–1027, 2010.
[48] S. Savin and C. Terwiesch, “Optimal product launch times in a duopoly: Balancing life-cycle revenues with product cost,” Operations Research, vol. 53, no. 1, pp. 26–47, 2005.
[49] I. Lobel, J. Patel, G. Vulcano, and J. Zhang, “Optimizing product launches in the presence of strategic consumers,” Management Science, vol. 62, no. 6, pp. 1778–1799, 2015.
[50] K. E. Wilson, R. Szechtman, and M. P. Atkinson, “A sequential perspective on searching for static targets,” European Journal of Operational Research, vol. 215, no. 1, pp. 218–226, 2011.
[51] M. Atkinson, M. Kress, and R.-J. Lange, “When is information sufficient for action? search with unreliable yet informative intelligence,” Operations Research, vol. 64, no. 2, pp. 315–328, 2016.
[52] A. Adams, R. Blundell, M. Browning, and I. Crawford, “Prices versus preferences: taste change and revealed preference,” Mar 2015. [Online]. Available: /uploads/publications/wps/WP201511.pdf
[53] D. L. McFadden and M. Fosgerau, “A theory of the perturbed consumer with general budgets,” National Bureau of Economic Research, Tech. Rep., 2012.
[54] D. J. Brown and R. L. Matzkin, “Estimation of nonparametric functions in simultaneous equations models, with an application to consumer demand,” 1998.
[55] D. Fudenberg, R. Iijima, and T. Strzalecki, “Stochastic choice and revealed perturbed utility,” Econometrica, vol. 83, no. 6, pp. 2371–2409, 2015.
[56] G. C. Chasparis and J. Shamma, “Control of preferences in social networks,” in Decision and Control (CDC), IEEE Conf. on, Dec 2010, pp. 6651–6656.
[57] M. Basseville and I. Nikiforov, Detection of Abrupt Changes: Theory and Applications, ser. Information and System Sciences Series. New Jersey, USA: Prentice Hall, 1993.
[58] M.-F. Balcan, A. Daniely, R. Mehta, R. Urner, and V. V. Vazirani, “Learning economic parameters from revealed preferences,” in Int. Conf. on Web and Internet Economics. Springer, 2014, pp. 338–353.
[59] M. Zadimoghaddam and A. Roth, “Efficiently learning from revealed preference,” in Internet and Network Economics. Springer, 2012, pp. 114–127.
[60] E. Beigman and R. Vohra, “Learning from revealed preference,” in Proc. of the 7th ACM Conference on Electronic Commerce, ser. EC ’06, 2006, pp. 36–42.
[61] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, 1st ed. The MIT Press, 2010.
[62] W. B. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” Contemporary mathematics, vol. 26, pp. 189–206, 1984.
[63] D. Achlioptas, “Database-friendly random projections: Johnson-lindenstrauss with binary coins,” Journal of Computer and System Sciences, vol. 66, pp. 671–687, 2003.
[64] S. S. Vempala, The random projection method. American Mathematical Soc., 2005, vol. 65.
[65] B. Mangold, M. Dooley, G. Flake, H. Hoffman, T. Kasturi, D. Pennock, and R. Dornfest, “The Tech Buzz Game,” Computer, vol. 38, no. 7, pp. 94–97, July 2005.
[66] Y. Chen, D. M. Pennock, and T. Kasturi, “An empirical study of dynamic pari-mutuel markets: Evidence from the tech buzz game,” in Proc. of WebKDD, 2008.
[67] P. Milgrom and C. Shannon, “Monotone comparative statistics,” Econometrica, vol. 62, no. 1, pp. 157–180, 1992.
[68] M. Zeni, D. Miorandi, and F. D. Pellegrini, “Youstatanalyzer: A tool for analysing the dynamics of YouTube content popularity,” in Proc. of the 7th Intn. Conf. on Performance Evaluation Methodologies and Tools, 2013, pp. 286–289.
[69] X. Cheng, M. Fatourechi, X. Ma, C. Zhang, L. Zhang, and J. Liu, “Insight data of YouTube from a partner’s view,” in Proc. of NOSSDAV Workshop. ACM, 2014, pp. 73–78.
[70] A. Brodersen, S. Scellato, and M. Wattenhofer, “YouTube around the world: Geographic popularity of videos,” in Proc. of the 21st Int. Conf. on World Wide Web, ser. WWW ’12. ACM, 2012, pp. 241–250.
[71] Y. Ding, Y. Du, Y. Hu, Z. Liu, L. Wang, K. Ross, and A. Ghose, “Broadcast yourself: Understanding YouTube uploaders,” in Proc. of SIGCOMM Conf. on Internet Measurement Conference, 2011, pp. 361–370.
[72] M. Gerasimov, V. Kruglov, and A. Volodin, “On negatively associated random variables,” Lobachevskii Journal of Mathematics, vol. 33, no. 1, pp. 47–55, 2012.
[73] K. Joag-Dev and F. Proschan, “Negative association of random variables with applications,” The Annals of Statistics, pp. 286–295, 1983.
[74] G. Huang, Q. Zhu, and C. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006.
[75] G. Huang and L. Chen, “Convex incremental extreme learning machine,” Neurocomputing, vol. 70, no. 16, pp. 3056–3062, 2007.
[76] G. H. Golub and C. F. Van Loan, Matrix computations. JHU Press, 2012, vol. 3.
[77] H. Liu and H. Motoda, Feature selection for knowledge discovery and data mining. Springer Science & Business Media, 2012, vol. 454.
[78] U. Stańczyk and L. Jain, Feature Selection for Data and Pattern Recognition. Springer, 2015.
[79] M. Gevrey, I. Dimopoulos, and S. Lek, “Review and comparison of methods to study the contribution of variables in artificial neural network models,” Ecological modelling, vol. 160, no. 3, pp. 249–264, 2003.
[80] M. Yamada, W. Jitkrittum, L. Sigal, E. Xing, and M. Sugiyama, “High-dimensional feature selection by feature-wise kernelized lasso,” Neural computation, vol. 26, no. 1, pp. 185–207, 2014.
[81] C. Hutto and E. Gilbert, “Vader: A parsimonious rule-based model for sentiment analysis of social media text,” in Eighth International Conference on Weblogs and Social Media, 2014.
[82] M. A. Hall, “Correlation-based feature selection for machine learning,” Ph.D. dissertation, The University of Waikato, 1999.
[83] H. Drucker, “Improving regressors using boosting techniques,” in ICML, vol. 97, 1997, pp. 107–115.
[84] T. Hothorn, K. Hornik, and A.
Zeileis, “Unbiased recursive partitioning: A con-ditional inference framework,” Journal of Computational and Graphical statis-tics, vol. 15, no. 3, pp. 651–674, 2006.[85] C. Bishop, Pattern Recognition and Machine Learning. Springer-Verlag NewYork, 2006.[86] N. Meinshausen, “Relaxed lasso,” Computational Statistics & Data Analysis,vol. 52, no. 1, pp. 374–393, 2007.[87] W. Venables and B. Ripley, Modern applied statistics with S-PLUS. SpringerScience & Business Media, 2013.[88] T. Hothorn, P. Bu¨hlmann, T. Kneib, M. Schmid, and B. Hofner, “Model-basedboosting 2.0,” Journal of Machine Learning Research, vol. 11, pp. 2109–2113,2010.[89] C. W. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica: Journal of the Econometric Society, pp. 424–438, 1969.[90] G. M. Ljung and G. E. P. Box, “On a measure of lack of fit in time seriesmodels,” Biometrika, vol. 65, no. 2, pp. 297–303, 1978.[91] A. Wald, Sequential analysis. Dover, 1973.[92] L. Jiang, Y. Miao, Y. Yang, Z. Lan, and A. Hauptmann, “Viral video style: acloser look at viral videos on YouTube,” in Proceedings of International Con-ference on Multimedia Retrieval. ACM, 2014, p. 193.110Bibliography[93] F. Figueiredo, F. Benevenuto, and J. Almeida, “The tube over time: character-izing popularity growth of youtube videos,” in Proceedings of the fourth ACMinternational conference on Web search and data mining. ACM, 2011, pp.745–754.[94] A. Tartakovsky, I. Nikiforov, and M. Basseville, Sequential analysis: Hypothesistesting and changepoint detection. CRC Press, 2014.[95] S. Burer and A. Letchford, “Non-convex mixed-integer nonlinear programming:a survey,” Surveys in Operations Research and Management Science, vol. 17,no. 2, pp. 97–106, 2012.[96] D. P. Bertsekas, Nonlinear programming. Athena scientific, 1999.[97] T. Hastie, R. Tibshirani, and J. H. Friedman, The elements of statistical learn-ing: data mining, inference, and prediction, 2nd Edition., ser. 
Springer seriesin statistics. Springer, 2009.[98] D. P. Bertsekas, Dynamic programming and optimal control. Athena ScientificBelmont, MA, 2017, vol. 1, no. 4.[99] V. Krishnamurthy, Partially Observed Markov Decision Processes. CambridgeUniversity Press, 2016.[100] C. H. Papadimitriou and J. Tsitsiklis, “The compexity of Markov decision pro-cesses,” Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, 1987.[101] G. Yin and Q. Zhang, Discrete-time Markov chains: two-time-scale methodsand applications. Springer Science & Business Media, 2006, vol. 55.[102] G. Piao and J. G. Breslin, “Exploring dynamics and semantics of user interestsfor user modeling on twitter for link recommendations,” in Proceedings of the12th International Conference on Semantic Systems. ACM, 2016, pp. 81–88.[103] M. L. Puterman, Markov decision processes: discrete stochastic dynamic pro-gramming. John Wiley & Sons, 2005.[104] G. C. Pflug, Optimization of stochastic models: the interface between simulationand optimization. Springer Science & Business Media, 2012, vol. 373.[105] J. C. Spall, Introduction to stochastic search and optimization: estimation, sim-ulation, and control. John Wiley & Sons, 2005, vol. 65.111Bibliography[106] W. Lovejoy, “A survey of algorithmic methods for partially observed Markovdecision processes,” Annals of Operations Research, vol. 28, pp. 47–66, 1991.[107] B. Wang, X. Zhang, G. Wang, H. Zheng, and B. Y. Zhao, “Anatomy of a person-alized livestreaming system,” in Proceedings of the 2016 Internet MeasurementConference, ser. IMC ’16. New York, NY, USA: ACM, 2016, pp. 485–498.[108] A. Baldominos Go´mez, E. Albacete Garc´ıa, I. Marrero, and Y. Saez Achaeran-dio, “Real-time prediction of gamers behavior using variable order Markov andbig data technology: a case of study,” 2016.[109] F. Benevenuto, T. Rodrigues, M. Cha, and V. 
Almeida, “Characterizing userbehavior in online social networks,” in Proceedings of the 9th ACM SIGCOMMConference on Internet Measurement, ser. IMC ’09, 2009, pp. 49–62.[110] K. Lewis, M. Gonzalez, and J. Kaufman, “Social selection and peer influencein an online social network,” Proceedings of the National Academy of Sciences,vol. 109, no. 1, pp. 68–72, 2012.[111] W. A. Hamilton, O. Garretson, and A. Kerne, “Streaming on twitch: Fosteringparticipatory communities of play within live mixed media,” in Proceedings ofthe 32Nd Annual ACM Conference on Human Factors in Computing Systems,2014, pp. 1315–1324.[112] M. Del Vicario, A. Bessi, F. Zollo, F. Petroni, A. Scala, G. Caldarelli, H. E.Stanley, and W. Quattrociocchi, “The spreading of misinformation online,”Proceedings of the National Academy of Sciences, vol. 113, no. 3, pp. 554–559,2016.[113] J. Lehmann, M. Lalmas, E. Yom-Tov, and G. Dupret, “Models of user en-gagement,” in International Conference on User Modeling, Adaptation, andPersonalization. Springer, 2012, pp. 164–175.[114] W. Zucchini and I. L. MacDonald, Hidden Markov models for time series: anintroduction using R. CRC press, 2009.[115] V. Krishnamurthy and C. R. Rojas, “Reduced complexity hmm filtering withstochastic dominance bounds: A convex optimization approach,” IEEE Trans-actions on Signal Processing, vol. 62, no. 23, pp. 6309–6322, Dec 2014.112Bibliography[116] H. Kurniawati, D. Hsu, and W. S. Lee, “SARSOP: Efficient point-basedPOMDP planning by approximating optimally reachable belief spaces.” inRobotics: Science and Systems., 2008.[117] S. Karlin and Y. Rinott, “Classes of orderings of measures and related cor-relation inequalities. I. Multivariate totally positive distributions,” Journal ofMultivariate Analysis, vol. 10, no. 4, pp. 467–498, December 1980.[118] D. M. Topkis, Supermodularity and complementarity. Princeton universitypress, 2011.[119] R. Deb, “A testable model of consumption with externalities,” Journal of Eco-nomic Theory, vol. 
144, no. 4, pp. 1804 – 1816, 2009.[120] W. Hoiles, V. Krishnamurthy, and A. Aprem, “Pac algorithms for detectingnash equilibrium play in social networks: From twitter to energy markets,”IEEE Access, Special Section: Socially Enabled Networking and Computing,vol. 4, pp. 8147–8161, 2016.[121] V. Krishnamurthy and U. Pareek, “Myopic bounds for optimal policy ofPOMDPs: An extension of Lovejoy’s structural results,” Operations Research,vol. 62, no. 2, pp. 428–434, 2015.113
Thesis/Dissertation
2018-02
10.14288/1.0361160
eng
Electrical and Computer Engineering
Vancouver : University of British Columbia Library
University of British Columbia
Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
Graduate
Detection, estimation and control in online social media
Text
http://hdl.handle.net/2429/63811