Towards a time-lapse prediction system for cricket matches

by

Vignesh Veppur Sankaranarayanan
B. E., Anna University, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE STUDIES
(Computer Science)

The University Of British Columbia
(Vancouver)
May 2014
© Vignesh Veppur Sankaranarayanan, 2014

Abstract

Cricket is a popular sport played in over a hundred countries, is the second most watched sport in the world after soccer, and supports a multi-million dollar industry. There is tremendous interest in simulating cricket and, more importantly, in predicting the outcome of games, particularly in their one-day international format. The complex rules governing the game, along with the numerous natural phenomena affecting the outcome of a cricket match, present significant challenges for accurate prediction. Multiple diverse parameters, including but not limited to cricketing skills and performances, match venues and even weather conditions, can significantly affect the outcome of a game. The sheer number of parameters, along with their interdependence and variance, makes it a non-trivial challenge to build an accurate quantitative model of a game. Unlike other sports such as basketball and baseball, which are well researched from a sports analytics perspective, these tasks have yet to be investigated in depth for cricket. The goal of this work is to predict the game progression and winner of a yet-to-begin or an ongoing game. The game is modeled using a subset of match parameters, using a combination of linear regression and a nearest-neighbor classification-aided attribute bagging algorithm. The prediction system takes in historical match data as well as the instantaneous state of a match, and predicts the score at key points in the future, culminating in a prediction of victory or loss. Runs scored at the end of an innings, the key factor in determining the winner, are predicted at various points in the game.
Our experiments, based on actual cricket game data, show that our method predicts the winner with an accuracy of approximately 70%.

Preface

The work presented in this dissertation was conducted in the Data Management and Mining lab under the supervision of Prof. Laks V.S. Lakshmanan and in collaboration with Dr. Junaed Sattar. I was the lead investigator in this work, responsible for high level problem identification, data collection, methodology, analysis of results and manuscript composition. Dr. Junaed Sattar and Prof. Laks V.S. Lakshmanan were closely involved throughout and provided advice on all of the core aspects, such as formalization of the problem, methodology, and analysis and interpretation of results, and also helped in manuscript composition and editing. Prof. Jim Little acted as a second reader and provided constructive feedback on the manuscript.

A jointly authored paper based on this work has been published in the proceedings of the 2014 SIAM Conference on Data Mining [28].

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 Importance of Sports Analytics
  1.2 Cricket – Popularity & Formats
  1.3 Machine Learning Models
  1.4 Contributions
  1.5 Outline
2 Related Work
  2.1 Sports Prediction
    2.1.1 Basketball
    2.1.2 Baseball
    2.1.3 Soccer
  2.2 Research in Cricket
3 Rules of the sport
  3.1 Rules and Objective
    3.1.1 Toss
    3.1.2 Over
    3.1.3 Objective
    3.1.4 Scoring
    3.1.5 Dismissal
    3.1.6 Target score
    3.1.7 Resources
4 Game Modeling
  4.1 Terms & Notations
    4.1.1 Segment
    4.1.2 Runs in a segment
    4.1.3 Match state
    4.1.4 End-of-innings score
  4.2 Problem Formulation
  4.3 Historical Features
  4.4 Instantaneous Features
    4.4.1 Home or away
    4.4.2 Venue class
    4.4.3 Powerplay
    4.4.4 Target
    4.4.5 Batsmen performance features
    4.4.6 Bowler quality features
    4.4.7 Game snapshot
  4.5 Batsmen Clustering
    4.5.1 Home-run hitting ability
    4.5.2 Milestone reaching ability
  4.6 Bowler Classification
  4.7 Game Snapshot
  4.8 Home-Run Prediction Model
  4.9 Non-Home-Run Prediction
  4.10 Algorithm
  4.11 Cold-Start
5 Experiments & Results
  5.1 Non-Home Run Prediction Performance
    5.1.1 Segmentwise non-home run prediction performance
  5.2 Home Run Prediction Performance
    5.2.1 Segmentwise home run prediction performance
  5.3 End-of-Innings Run Prediction Performance
    5.3.1 Segmentwise run prediction performance
  5.4 Runs in a Segment, $\hat{R}_i$
  5.5 Performance Comparison with Baseline Model
    5.5.1 Run prediction by Bailey et al.
    5.5.2 ICC projected score prediction model
  5.6 Winner Prediction
6 Conclusion & Future Work
  6.1 Future Work
    6.1.1 Other formats of the game
    6.1.2 Strategy recommendation
    6.1.3 Wickets prediction
  6.2 Conclusion

List of Tables

Table 4.1 Historical feature values for teams in the dataset
Table 4.2 Batsmen clustered according to their ability

List of Figures

Figure 1.1 A Test format cricket match between India and Australia
Figure 1.2 An ODI cricket match between India and Australia
Figure 3.1 A depiction of a typical cricket field. It is not mandatory for the field to be oval in shape.
Figure 4.1 Histogram of bowlers' economy rate (average runs conceded per over)
Figure 5.1 Total non-home runs scatter plot for innings1 (left) & innings2 (right)
Figure 5.2 PDF and CDF of total non-home run prediction error for innings1 (top) & innings2 (bottom)
Figure 5.3 Non-home runs prediction scatter plot for every segment in innings1
Figure 5.4 Non-home runs prediction scatter plot for every segment in innings2
Figure 5.5 Total home runs scatter plot for innings1 (left) & innings2 (right)
Figure 5.6 PDF and CDF of total home run prediction error for innings1 (top) & innings2 (bottom)
Figure 5.7 Home runs prediction scatter plot for every segment in innings1
Figure 5.8 Home runs prediction scatter plot for every segment in innings2
Figure 5.9 $\hat{R}_{eoi}$ scatter plot for innings1 (left) & innings2 (right)
Figure 5.10 PDF and CDF of $\hat{R}_{eoi}$ prediction error for innings1 (top) & innings2 (bottom)
Figure 5.11 Runs in a segment prediction scatter plot for every segment in innings1
Figure 5.12 Runs in a segment prediction scatter plot for every segment in innings2
Figure 5.13 Mean absolute error and standard deviation for runs $R_i$ across each segment $S_i$ (5-over intervals) for innings 1 (above) and innings 2 (below). Since the first prediction, $R_1$ for $i = 1$, gives the runs scored at the end of over number 5, the plots start from over 5.
Figure 5.14 Mean absolute error in $\hat{R}_{eoi}$ prediction for innings 1 (top) and innings 2 (bottom) for Bailey et al. [2], ICC Projected Score prediction and our model.

Acknowledgments

I would like to thank my supervisor Prof. Laks V.S. Lakshmanan and co-supervisor Dr. Junaed Sattar for their support, valuable guidance and encouragement. I would like to acknowledge NSERC for funding this research.

I want to thank my parents for their sincere love and support through this journey. Words cannot do justice to their contribution. I would like to thank the stimulating environment of UBC Computer Science, Waran Research Foundation (specifically Prof. Venkateswaran), my friends, the special one and all others who have been part of this journey.

Last but not least, I thank the almighty for providing me with the above!

Dedication

..to my over-curious and naughty brother Prasanna..

Chapter 1
Introduction

Predicting game progression and its outcome has direct applications in devising strategies aimed at winning a game. It is also used to set odds in the betting industry. Despite the popularity of the sport, prediction in cricket has not been addressed in as much detail as in other sports like baseball, basketball, soccer etc.
This work addresses the problem of predicting game progression and outcome of One-Day International cricket matches. This chapter provides the motivation for this work.

1.1 Importance of Sports Analytics

In today's world, professional sports are intensely competitive propositions. Motivated by prestige and huge financial rewards, sports professionals are engaged in fierce competition not only on the sporting turf, but also off it, pursuing perfection and the slightest of advantages over their opponents. Today's sports professionals include not only the sportsmen actively participating in the game, but also their coaches, trainers, physiotherapists, and in many cases, strategists. Coaches, captains and team managers leverage their expertise and make decisions using their intuition. Such decisions can be biased by human impressions and judgments of players and hence might overlook players' weaknesses. Moreover, interesting patterns in the game may elude the eye of even the best tactician.

With massive advances in cheaper and more reliable storage technologies, data about every match is being stored in such a way that the entire chain of events can be replayed. With such advancements in technology and the huge stakes involved in sporting events, the in-game data recorded is analyzed and converted into actionable knowledge by teams to gain an advantage over their competitors. The trend noticeable both in individual sports such as tennis and in team sports such as baseball and basketball is that this knowledge is used to determine pre-game strategy. Successful application of this pre-determined strategy often becomes the decisive factor towards victory. The difference between a win and a loss hinges on the formulation of such strategies and on extensive planning.

Effective formulation of strategies requires carrying out extensive analysis of past games, current performance in the game in progress, and numerous other factors affecting a game.
Players and team management (collectively often referred to as the team think-tank in sports) perform as a "human expert system", relying on experience, expertise and analytic ability to arrive at the best possible course of action before as well as during a game. Vast amounts of raw data and statistics are available to aid in the decision-making process, but determining what it takes to win a game is extremely challenging.

This raw data is also leveraged by broadcasters and game experts to analyze the performances of individual players, their strengths and weaknesses, and teams' performance. These facts and figures are presented to the viewers to add richness to the viewing experience.

Betting adds an extra dimension to the sports industry. Billions of dollars are being wagered on sporting events [21]. Gambling houses rely on statistical models that predict outcomes of sporting events at various levels.

1.2 Cricket – Popularity & Formats

Cricket is played in more than 100 countries across the world^1. However, a few focus areas, namely the Indian subcontinent, United Kingdom, Australia, South Africa and the Caribbean, drive the revenues and commercial interest in the sport. This is attributed to the fact that teams from these areas are the top performers and form the core members of the International Cricket Council, the cricket governing body. Cricket has the second largest viewership by population of any sport, next only to soccer, and generates an extremely passionate following among its supporters. It is also the fourth most bet-on sport, after soccer, tennis and snooker [21].

^1 http://www.espncricinfo.com/ci-icc/content/story/209608.html

Cricket is a team sport played with a bat and a ball, and governed by an extensive and complex set of rules. The eventual goal of the game is for one team to score more runs than their opponents to be declared the winner.
A pair of batsmen from a team are on the field at any given time, trying to score runs off the balls, or deliveries, bowled at them by the bowlers of the opposing team. The goal of the bowlers is to get all the batsmen out before they can accumulate a large score. The fielders (from the bowling team) assist the bowlers by stopping or catching the balls after they are hit by the batsmen, to stop the batsmen from scoring runs or to get them out, respectively. Once all the batsmen are out, it is the turn of the previous bowling team to bat, who must score more runs than their opponents to win. Each team gets a minimum of one turn, or innings in cricket parlance, depending on the format of the match. In a Test match, the game is limited to at most two innings each, held over a maximum of five days, whichever finishes first. Figure 1.1 shows the Indian batsman Sachin Tendulkar batting in a Test match between India and Australia. The other two formats of cricket, and arguably the more popular ones, are the One-Day International (abbreviated as ODI) and Twenty20 formats. In One-Day International matches (the object of analysis in this work), each team gets to bat once, with an innings consisting of 50 overs, each over being a collection of six legal balls bowled by a bowler. A ball is considered to be illegal if the bowler crosses the crease while delivering the ball. It is also illegal if the ball lands and bounces more than once on the pitch before reaching the batsman. In these and a few more circumstances, the ball is not counted and a penalty run is awarded to the batting team. Twenty20 matches limit the game to 20 overs per side, shortening the duration to approximately three hours a game. The Test match setting is unique in the sense that it allows for a game to be drawn, whereas the concept of a draw does not exist in the other two formats. If the two teams score the same number of runs using the resources allocated, then the match is said to be tied.
Only the Twenty20 format has a tie-breaker to determine the winner after a tie. Among these three formats, the one-day format is the most popular. Motivated by this, we focus our analysis in this work on one-day cricket. Figure 1.2 shows India and Australia playing an ODI match on 9th January 2004 at the Melbourne Cricket Ground (MCG) in Melbourne, Australia.

Figure 1.1: A Test format cricket match between India and Australia. Sachin Tendulkar plays the ball. (Image by Pulkit Sinha, released under the Creative Commons Attribution-ShareAlike 2.0 Generic License (CC BY-SA 2.0).)

Team strategists are often faced with making decisions when the predetermined strategy fails or the game unravels in an unexpected manner. Currently, a combination of personal experience, team constitution and seat-of-the-pants "cricketing sense" is relied upon for making instantaneous strategic decisions. Inherently, the methodology employed by human experts is to extract and leverage important information from both past and current game statistics. However, to our knowledge, the underlying science behind this has not been clearly articulated. One of the key problems that needs to be solved in formulating strategies is predicting the outcome of a game. The focus of this work is to address the problem of accurately modeling game progression towards match outcome prediction. Predicting game progression and outcome involves leveraging data mining and machine learning techniques to learn patterns from historical play data. Various techniques such as regression, nearest neighbors, clustering and attribute bagging are customized and used in this work.

Figure 1.2: An ODI cricket match between India and Australia. (Image by Ricky212, released under the Creative Commons Attribution-ShareAlike 2.0 Generic License (CC BY-SA 2.0).)

1.3 Machine Learning Models

This section serves as an introduction to the machine learning models used in this work.
K-nearest neighbor is a non-parametric method that is used for regression and classification [11]. A test sample along with a set of training samples are given as input to the model. The model identifies the test sample's k closest points in the training set and assigns class membership to the test sample based on them. The intuition behind this model is that, given a test sample, information can be borrowed from the training data by finding the closest k points. Spearman [29], Jaccard [15], Manhattan (L1 norm), Cosine [13] and Euclidean (L2 norm) are some of the distance metrics used to calculate the distance between two samples.

Attribute bagging [7] is an ensemble learning method that has l individual classifiers, each operating on n features chosen at random. Majority voting is performed among the l classifiers to pick the output class. This ensemble method can use any classification learning model. The number of classifiers and the size of each bag (n) are determined experimentally. By definition, random forests [20] (which employ decision trees as the classifier model) are considered to be a special case of the attribute bagging method. As explained in Section 4.8, attribute bagging along with the nearest neighbor classification method is used to predict home runs, one of the two prediction components in this work.

The k-means algorithm [5] is a clustering technique that partitions n samples into k clusters based on the distance of the samples from the cluster centres. In this work, the k-means algorithm is used to cluster batsmen into five groups based on their ability as inferred from their performance statistics (elaborated in Section 4.5).

1.4 Contributions

In this work, a model for one-day format games is learnt by mining existing game data. In principle, our approach is applicable towards modeling any format of the game; however, we choose to focus our testing and evaluation on the most popular and arguably the most important format, namely one-day international (ODI), for the reasons mentioned above.
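As an illustration of the attribute bagging scheme with nearest-neighbor classifiers introduced in Section 1.3, the following is a minimal sketch and not the model of Section 4.8. The function names, the use of Euclidean distance, the fixed random seed and the toy data are all assumptions made for this example only.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k, features):
    """Classify `query` by majority vote among its k nearest training samples.

    Distance is Euclidean (an assumption), computed only on the feature
    indices listed in `features`.
    """
    def dist(row):
        return math.sqrt(sum((row[j] - query[j]) ** 2 for j in features))

    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i]))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def attribute_bagging_predict(train_x, train_y, query,
                              n_classifiers, bag_size, k, seed=0):
    """Run l = n_classifiers k-NN classifiers, each on a random bag of
    n = bag_size features, and return the majority-vote class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_features = len(train_x[0])
    votes = Counter()
    for _ in range(n_classifiers):
        bag = rng.sample(range(n_features), bag_size)
        votes[knn_predict(train_x, train_y, query, k, bag)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three numeric features, two classes.
toy_x = [[0, 0, 0], [0, 1, 0], [1, 0, 1],
         [10, 10, 10], [10, 11, 10], [11, 10, 11]]
toy_y = ["low", "low", "low", "high", "high", "high"]

label = attribute_bagging_predict(toy_x, toy_y, [9, 9, 9],
                                  n_classifiers=5, bag_size=2, k=3)
```

In this sketch each classifier sees a different random subset of features, so a single uninformative feature cannot dominate every vote; the number of classifiers and the bag size would be tuned experimentally, as the text notes.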
By using a combination of supervised and unsupervised learning algorithms, our approach learns a number of features such as home/away, team performance in the past, batsmen and bowlers from a one-day cricket dataset which consists of complete records of all games played in a 19-month period between January 2011 and July 2012. Along with these learned historical features of the game, our model also incorporates instantaneous match state data, such as runs scored and wickets lost, as the game progresses, to predict future states of an ongoing match. By using a weighted combination of both historical and instantaneous features, our approach is thus able to simulate and predict game progression before and during a match. A paper based on this work has been presented and published [28] at the SIAM 2014 International Conference on Data Mining, held April 24-26 in Philadelphia, USA.

1.5 Outline

The rest of the thesis is organized as follows.

• An overview of existing literature in sports prediction, and specifically in cricket, is provided in Chapter 2.
• Chapter 3 provides an overview of the one-day format and its basic rules.
• In Chapter 4, we introduce the problem of modeling cricket to the data mining community. A key part of this modeling is the identification of the most important features of the game (Sections 4.3 and 4.4). Furthermore, the chapter explains our algorithm in detail.
• In Chapter 5 we describe the challenges in extracting and cleaning historical match data so it can be used for model learning. We discuss the results of the extensive experimentation we conducted on a historical dataset we crawled and cleaned.
Not only do our results validate our approach, but they also show that it significantly outperforms the state of the art in one-day cricket game score and outcome prediction.
• Conclusions and directions for future work are presented in Chapter 6.

Chapter 2
Related Work

As mentioned in Section 1.1, sports analytics has direct applications in understanding and improving team performance, in the betting industry and in match analysis by broadcasters. The complete set of data recorded for every game is proprietary to the broadcaster, but a handful of high level play-by-play data is accessible to the public through websites like www.espncricinfo.com (cricket), www.mlb.com (baseball), www.nba.com (basketball) etc. In Section 2.1 below, we discuss some relevant work in the direction of modeling, simulation and prediction of sports.

2.1 Sports Prediction

The problem of outcome prediction has been investigated in the context of basketball, baseball and soccer.

2.1.1 Basketball

Vaz de Melo et al. [32] model the league championship as a network of players based on their work relationships over the years and predict the league standing at the end of a season. Fewell et al. [10] define players as nodes and ball passes as links and examine different network properties such as degree centrality, clustering, and entropy to analyze and quantify a team's strategy. Bhandari et al. [4] developed the Advanced Scout system for discovering interesting patterns from basketball games, which is now used by NBA teams. The key idea is that, using a technique called Attribute Focusing, the overall distribution of an attribute is compared with the distribution of this attribute for a subset of the data (e.g., a single game, all away matches^1, an entire season, etc.). If the subset has a characteristically different distribution for the focus attribute, it is marked as interesting. Such interesting patterns are discovered and provided to the domain expert to investigate further and gain insights.
More recently, Schultz [23] studies how to determine the types and combinations of players most relevant to winning matches.

2.1.2 Baseball

In baseball, Gartheepan et al. [14] built a data driven model that helps in deciding when to 'pull a starting pitcher'. Pulling a starting pitcher is the act of replacing the pitcher who started the inning with another pitcher. This is considered to be an important decision in the context of baseball. In their subsequent work [12], the authors propose a model that could lead to better on-field decisions for the next inning. Bukiet et al. [8] use a Markov chain method for evaluating the performance of teams and the influence of a particular player on the team performance.

2.1.3 Soccer

Luckner et al. [22] have predicted the outcome of FIFA World Cup 2006 matches using live Prediction Markets. Palomino et al. [26] study soccer matches with a game-theoretic model and determine that teams' skill level, current score and home field advantage are significant explanatory variables of the probability of scoring goals. Kang et al. [17] build a mathematical model to quantitatively express the performance of soccer players based on the trajectory of the ball passes among the 22 on-field players.

^1 Matches played away from home.

The work discussed above is developed with sport-specific intuitions. Both soccer and basketball are fundamentally very different from cricket, which would render the work inapplicable to the sport of cricket. Baseball is probably the closest to cricket in terms of playing dynamics, since both are bat and ball games with batters and bowlers, and the objective is to score the maximum number of runs. But given that it has very different rules, ground shape and dimensions, play structure, in-game economy etc., it would not be appropriate to apply the baseball-specific models to cricket. Moreover, none of the existing work models the sport to predict future match states of an ongoing game.
The existing models are used either as a pre-match prediction tool or as a post-match analysis framework.

2.2 Research in Cricket

One of the earliest and pioneering works in cricket is by Duckworth and Lewis [9], who introduce the Duckworth-Lewis or D-L method, which allows for fair adjustment of scores in proportion to the time lost due to match interruptions (often due to adverse weather conditions such as rain, poor visibility etc.). If the interruptions occur during the second innings, the team batting second will have less batting time and will face fewer balls. This affects the equilibrium and puts the second team at a disadvantage. To mitigate this, the target for victory has to be adjusted in proportion to the time lost. The D-L method is based upon a mathematical formulation that abstracts every ball and wicket of an ODI match into a single scalar called resource. Using this formulation, the number of overs lost is evaluated as runs and used to reset targets for the second team. This proposal has been adopted by the International Cricket Council (ICC) as a means to reset targets in matches where time is lost due to match interruptions. The method proposed in [9], and subsequently adapted by [25], for capturing the resources of a team during the progression of a match has found independent use in subsequent work in cricket modeling and mining [25][2].

One of the objectives in sports analytics is to rate and rank players. In cricket, some possible ranking criteria for determining the most valued players are statistics such as batting average and strike rate for batsmen (defined formally in Section 4.5). Lewis [19], Lemmer [18], Alsopp and Clarke [1], and Beaudoin [3] develop new performance measures to rate teams and to find the most valuable players.

Raj and Padma [27] analyze the Indian cricket team's One-Day International (ODI) match data and mine association rules from a set of features, namely toss, home or away game, batting first or second and game outcome.
Kaluarachchi and Varde [16] employ both association rules and a naive Bayes classifier to analyze the factors contributing to a win, also taking day/day-night games into account. Both approaches use the same subset of high-level features to analyze the factors contributing to victory. Furthermore, they address neither score prediction nor the progression of the game.

Brooker et al. [6] take into account the number of runs scored, wickets lost and balls remaining at any point in the game, along with the ground conditions, and estimate the end-of-innings score. This work fails to take the batsmen and bowler features into account. The authors note that it is not practical to learn a model for every single bowler and batsman that has played the game so far. Bailey and Clarke [2] use historical match data and predict the total score of an innings using linear regression. As data of a match in progress streams in, the prediction model is updated. Using this, they analyze the betting^2 market's sensitivity to the ups and downs of the game. Because their model predicts the total score as instantaneous match data is streamed in, we are able to use it as a baseline for this work. Swartz et al. [31] use Markov Chain Monte Carlo methods to simulate the ball-by-ball outcome of a match using a Bayesian Latent Variable model. Based on the features of the current batsman, bowler, and game situation (number of wickets lost and number of balls bowled), they estimate the outcome of the next ball. This model suffers from severe sparsity, as noted by the authors themselves: the likelihood of a given batsman having faced a given bowler in previous games in the dataset is low. Simulating a match, based on team compositions and making use of a model built from historical match data and taking in the current match situation, is a key step in predicting the outcome of a match. While both [31] and [2] have built match simulators for ODI cricket, their models rely on games played over 10 years ago.
ODI cricket has since undergone a number of major rule modifications. Important examples include powerplays, the free hit after an illegal ball delivery, and the use of two new balls (as opposed to just one) in an innings. These changes significantly affect team strategies, and essentially render old models a poor fit. This work focuses on the modern and current form of ODI cricket, incorporating all recent changes to the game, with support for accommodating future rule modifications.

^2 There is a vibrant betting market associated with cricket. See, e.g., http://www.betfair.com/exchange/en-gb/cricket-4/sp/.

Chapter 3
Rules of the sport

This chapter provides an overview of the ODI format of the game and reviews its basic rules as they pertain to the problem of modeling the game progression and score prediction. It serves as a foundation for the contributions in this thesis.

3.1 Rules and Objective

The objective of each team is to score as many runs as possible while batting and to limit the opponents' scoring while bowling. The team that has the highest number of runs at the end of the game is determined to be the winner. Figure 3.1 is a depiction of a cricket field and is used for representational purposes. A team consists of 11 players. Based on the team's strategy it can contain any number of batsmen and bowlers (typically 6 batsmen and 5 bowlers). Batsmen and bowlers can be left-handed or right-handed. Different batsmen have specific roles such as opening the innings, consolidation (during overs 15 to 40) and power-hitting (during overs 41 to 50). Bowlers can be specialists in pace bowling (bowling fast and enabling the ball to swing) or spin bowling (rotating the ball on release, which helps it turn in direction after landing on the pitch). Theoretically, all 11 players can bat or bowl in a match if needed. At any time, two players of the batting team (shown in blue in Figure 3.1) and eleven players of the bowling team are present on the field.
The pitch is the pale yellow rectangular shape in the middle of the field. It is 22 yards (20.12 meters) in length. The playing area is enclosed by the boundary. Four or six (home) runs are awarded depending on whether the ball lands in the playing area and rolls over the boundary or flies past the boundary. After hitting the ball, the two batsmen exchange positions as many times as they can. These are called non-home runs. Players of the bowling team field the ball to minimize the number of exchanges the batsmen complete. At any time in the game, two umpires are present on the field to officiate the match.

Figure 3.1: A depiction of a typical cricket field. It is not mandatory for the field to be oval in shape. [Figure labels: Striking Batsman, Bowler, Non-Striking Batsman, Boundary, Pitch]

The most important components and terminology of the game of cricket are described below.

3.1.1 Toss

Similar to a number of other sports, an ODI cricket match starts with a toss. The team that wins the toss can choose to bat first or ask the opponents to bat first. This is an important decision in the context of the game. To arrive at the decision, the teams take into account the nature of the pitch (the 22-yard central strip of the cricket field where bowlers bowl the ball for the batsmen to hit), the weather conditions, and their own strengths and weaknesses as well as those of the opponents.

3.1.2 Over

Six consecutive legal deliveries by a bowler to the batsmen are called an over. An ODI game consists of two innings, each of 50 overs (300 legal deliveries).

3.1.3 Objective

In a game between TeamA and TeamB, suppose TeamA wins the toss and chooses to bat first. The period during which TeamA bats is called innings1, in which bowlers from TeamB bowl the ball to the batsmen of TeamA.
TeamA has 50 overs to score as many runs as possible, while TeamB tries to minimize the scoring by getting TeamA's batsmen out (more commonly referred to as taking wickets). TeamB can also restrict scoring by bowling balls that are difficult to play and by flawless fielding, where fielders stop hits by TeamA's batsmen to deny them opportunities to score runs. Innings1 comes to an end when TeamA loses all its wickets or finishes its quota of 50 overs, whichever happens first. When a team loses all its wickets, it is said to be all-out. Let ScoreA denote the number of runs accumulated by TeamA at this point. When TeamB comes in to bat in innings2, it has the same quota of 50 overs to play (not considering rain intervention), with the goal of scoring at least ScoreA+1 runs; innings2 ends when ScoreB, the number of runs scored by TeamB, exceeds ScoreA, or when TeamB finishes its quota of 50 overs or loses all its wickets, whichever happens first. TeamB is deemed the winner in the first case; otherwise TeamA wins, unless ScoreA and ScoreB are equal at the end of the game, in which case the match is a tie.¹

¹ Currently, there are no tie-breakers in the ODI format.

3.1.4 Scoring

Teams can accumulate runs in two ways, home runs and non-home runs, as described below. When a bowler commits a foul while delivering the ball, the delivery is deemed illegal by the on-field umpires and the bowling team is penalized by awarding run(s) to the batting team. Some of the main reasons for a delivery to be illegal are as follows:

• The bowler crosses the crease while releasing the ball.
• The ball lands and bounces on the pitch more than once.
• The ball does not land on the pitch and reaches the batsman directly above his hip.
• The ball falls very wide of the batsman, making it unplayable.

These penalty runs are termed extras and contribute a small fraction of the total runs.
In our model, the non-home run category accounts for extras.

Home Runs

One way of scoring is to power-hit the ball out of the playing area. Based on where the ball lands while traveling past the boundary, four or six runs are awarded. Four runs are awarded if the ball touches the ground before rolling past the boundary of the playing area. If the ball lands directly outside the playing area (thereby not touching the ground within the playing area), six runs are awarded. Borrowing a term from baseball, for convenience, we collectively term runs scored this way home runs. Home runs yield a greater reward in terms of runs scored, but batsmen have to take risks to hit them, which increases their chance of getting out.

Non-Home Runs

The other way of scoring is to hit the ball within the playing area and for the two batsmen to run and exchange their positions. In the meantime, the opposing players try to collect the ball to minimize the number of exchanges. Runs are awarded based on the number of times the batsmen exchange their positions before the ball is returned to one of the positions. There is theoretically no bound on the number of exchanges possible off a given ball, but this value typically lies in the range 0–3 runs. This way of scoring carries a lower risk of the batsman getting out but yields fewer runs. We term these non-home runs.

It is to be understood that batsmen have different intentions and mind-sets when they try to score home runs and non-home runs. Hence their approaches to scoring each of the two are also different. Since home runs involve higher risk than non-home runs, the balls on which non-home runs are scored greatly outnumber the balls on which home runs are scored, as substantiated by our data. The decision to hit a particular ball to the boundary (a home run hit) is taken by the striking batsman based on a combination of factors such as the team's score, his strengths, the merit of the ball, the merit of the bowler, and the fielders' placement on the ground.
Understanding this minute yet significant dynamic of the game has helped us come up with separate models for predicting home runs and non-home runs (more on this in Section 4.2).

3.1.5 Dismissal

There are eleven ways for a batsman to lose his wicket, commonly referred to as getting out or being dismissed. The common ways of being dismissed are the following:

• Bowled: When the ball delivered hits the stumps².
• Caught by opponent fielders: When the ball hit by the batsman is caught by a fielder before it touches the ground.
• Run-out: This happens when the batsmen try to score runs by exchanging their positions. A batsman loses his wicket if he is found short of his position when the ball (fielded and thrown back by opponent fielders) hits the stumps.
• Leg Before Wicket (LBW): When the ball delivered hits the batsman's body and the umpire determines that it would have gone on to hit the stumps had it not hit his body, he is deemed out.

There are a few other modes of dismissal which are uncommon. In our model, we do not distinguish between the different forms of dismissal.

² Stumps are the three vertical posts with two support bails that are placed in the pitch. A batsman guards them so that the ball delivered by the bowler does not knock them.

3.1.6 Target score

The number of runs accumulated by TeamA at the end of innings1 is ScoreA. ScoreA+1 runs is set as the Target. This is the score that the team batting second tries to achieve or exceed in innings2.

3.1.7 Resources

Overs and wickets are collectively termed resources. The batting team consumes overs to accumulate runs and loses wickets in the process. A batting team has 50 overs and 10 wickets at its disposal at the start of an innings. These resources continually decrease as the game progresses.

The rules, terminology and objective of a cricket game were explained in this chapter.
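The innings-termination and result rules of Sections 3.1.3 and 3.1.6 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions and is not part of the prediction system: the names `InningsState`, `innings_over` and `result` are hypothetical, rain interruptions are ignored, and TeamA is assumed to bat first.

```python
from dataclasses import dataclass
from typing import Optional

OVERS_PER_INNINGS = 50
BALLS_PER_OVER = 6
WICKETS_PER_INNINGS = 10


@dataclass
class InningsState:
    """Hypothetical snapshot of a batting side's innings."""
    runs: int
    wickets_lost: int
    legal_balls_bowled: int


def innings_over(state: InningsState, target: Optional[int] = None) -> bool:
    """True once the batting side is all-out, has used its overs,
    or (in innings2 only) has reached the target ScoreA + 1."""
    all_out = state.wickets_lost >= WICKETS_PER_INNINGS
    overs_done = state.legal_balls_bowled >= OVERS_PER_INNINGS * BALLS_PER_OVER
    chased = target is not None and state.runs >= target
    return all_out or overs_done or chased


def result(score_a: int, score_b: int) -> str:
    """Outcome of a completed match in which TeamA batted first."""
    if score_b > score_a:
        return "TeamB"
    if score_b < score_a:
        return "TeamA"
    return "tie"
```

For instance, with ScoreA = 250 the target is 251, so `innings_over(InningsState(251, 4, 280), target=251)` holds even though TeamB still has overs and wickets in hand.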
The next chapter explains the game modeling, using relevant historical and instantaneous features, and our algorithm in detail.

Chapter 4
Game Modeling

This chapter describes the problem formulation with relevant features in detail. Furthermore, it proceeds to explain our algorithm that predicts game progression and the winner.

4.1 Terms & Notations

A few important terms and notations used in our model are described below.

4.1.1 Segment

To predict the end-of-innings score of a team, a segmented prediction approach is taken. The batting period of a team is called an innings, and it lasts until the team runs out of one of its resources. The 50-over innings is segmented into 10 intervals of 5 overs each, where each interval is referred to as $S_i$, $1 \le i \le 10$.

4.1.2 Runs in a segment

For a team $T$, $R^T_i$ and $W^T_i$ denote the number of runs scored and the number of wickets lost in segment $S_i$, respectively. For a segment $S_i$, $NHR_i$ and $HR_i$ denote the non-home runs and home runs scored in that segment, respectively. Together, they form the runs scored in that segment, i.e., $R_i = NHR_i + HR_i$.

4.1.3 Match state

The match state at segment $n$, $0 \le n < 10$, is defined as the pair of numbers consisting of the number of runs scored and the number of wickets lost so far by the batting team. Notice that given a match state, the resources remaining at the batting team's disposal can be easily calculated: the number of balls remaining is $(10-n) \times 5 \times 6$ and the number of wickets remaining is $10 - (\#\text{wickets lost so far})$.

4.1.4 End-of-innings score

The total number of runs scored by team $T$ at the end of its innings is given by $R^T_{eoi} = \sum_{i=1}^{10} R^T_i$. The superscript $T$ is dropped when the team is obvious from the context.

4.2 Problem Formulation

The main problem tackled in this work is: given the instantaneous match data up to a certain point in the game, predict the progression of the remainder of the game and, in particular, predict the winner.
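To make the notation of Section 4.1 concrete, the following Python sketch computes the runs in a segment, the match state after $n$ completed segments, and the resources that remain. It is an illustration only: the names `Segment`, `match_state` and `resources_remaining` are hypothetical, the per-segment numbers are made up, and every completed segment is assumed to have used its full 5 overs.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_SEGMENTS = 10
OVERS_PER_SEGMENT = 5
BALLS_PER_OVER = 6
WICKETS_PER_INNINGS = 10


@dataclass
class Segment:
    """One 5-over segment S_i of an innings (hypothetical container)."""
    non_home_runs: int  # NHR_i, with extras counted here as in the model
    home_runs: int      # HR_i, runs from fours and sixes
    wickets: int        # W_i

    @property
    def runs(self) -> int:
        # R_i = NHR_i + HR_i
        return self.non_home_runs + self.home_runs


def match_state(completed: List[Segment]) -> Tuple[int, int]:
    """Runs scored and wickets lost over the completed segments."""
    return (sum(s.runs for s in completed),
            sum(s.wickets for s in completed))


def resources_remaining(n: int, wickets_lost: int) -> Tuple[int, int]:
    """Balls and wickets left after n completed segments."""
    balls = (NUM_SEGMENTS - n) * OVERS_PER_SEGMENT * BALLS_PER_OVER
    return balls, WICKETS_PER_INNINGS - wickets_lost


# Made-up data for the first three segments of an innings.
segments = [Segment(18, 8, 0), Segment(20, 4, 1), Segment(15, 12, 1)]
runs, wickets = match_state(segments)                 # (77, 2)
print(runs, wickets)
print(resources_remaining(len(segments), wickets))    # (210, 8)
```

With no completed segments the same functions return the pre-match state: zero runs, zero wickets, and the full 300 balls and 10 wickets.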
If no instantaneous data is available because the match is yet to begin, the progression and outcome are predicted using the available match information. This is a special case with $n = 0$, dealt with by supplying $R_{known} = 0$ and $W_{known} = 0$. More precisely, given a match state associated with segment $n$, namely $(R_{known} = \sum_{i=1}^{n} R_i,\ W_{known} = \sum_{i=1}^{n} W_i)$, predict the number of runs $\hat{R}_i$ for the remaining segments $i$, $n+1 \le i \le 10$. Using these predictions, the total predicted score at the end of the innings, $\hat{R}_{eoi}$, can be obtained as

$$\hat{R}_{eoi} = R_{known} + \sum_{i=n+1}^{10} \hat{R}_i \tag{4.1}$$

To predict the number of runs scored in a segment $S_i$, both historical data and the instantaneous match data available up to segment $S_{i-1}$ are used. The current state of the match provides the instantaneous features used for game state prediction. Both sets of features are explained in further detail in the following sections.

If an innings has not commenced, as a special case, $n = 0$, $R_{known} = 0$ and $W_{known} = 0$. In this case, the task becomes to predict the number of runs $\hat{R}_i$ for all $i = 1, \ldots, 10$. The total predicted score is then

$$\hat{R}_{eoi} = \sum_{i=1}^{10} \hat{R}_i \tag{4.2}$$

This segmented prediction approach is followed to predict $\hat{R}_{eoi}$ for both innings1 and innings2. TeamA is predicted to be the winner if $\hat{R}^A_{eoi} > \hat{R}^B_{eoi}$. TeamB is predicted to be the winner if $\hat{R}^A_{eoi}$