UBC Theses and Dissertations


Enabling configuration self-adaptation using machine learning Araujo, Rodrigo 2018

Enabling Configuration Self-Adaptation Using Machine Learning

by

Rodrigo Araujo

B.Sc. Computer Science, Universidade Salvador, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

August 2018

© Rodrigo Araujo, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Enabling Configuration Self-Adaptation Using Machine Learning

submitted by Rodrigo Araujo in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Reid Holmes, Computer Science (Supervisor)
Mike Gelbart, Computer Science (Supervisory Committee Member)

Abstract

Due to advancements in distributed systems and the increasing industrial demands placed on them, distributed systems are now comprised of multiple complex components (e.g., databases and their replication infrastructure, caching components, proxies, and load balancers), each of which has its own complex configuration parameters that enable it to be tuned for given runtime requirements. Software engineers must manually tinker with many of these configuration parameters, which change the behaviour and/or structure of the system, in order to achieve their system requirements. In many cases, static configuration settings might not meet certain demands in a given context, and ad hoc modifications of these configuration parameters can trigger unexpected behaviours, which can have negative effects on the quality of the overall system.

In this work, I show the design and analysis of Finch, a tool that injects a machine learning based MAPE-K feedback loop into existing systems to automate how these configuration parameters are set.
Finch configures and optimizes the system to meet service-level agreements under uncertain workloads and usage patterns. Rather than changing the core infrastructure of a system to fit the feedback loop, Finch asks the user to perform a small set of actions: instrumenting the code and configuration parameters, defining service-level objectives and agreements, and enabling programmatic changes to these configurations. As a result, Finch learns how to dynamically configure the system at runtime to self-adapt to its dynamic workloads.

I show how Finch can replace the trial-and-error engineering effort that would otherwise be spent manually optimizing a system's wide array of configuration parameters with an automated self-adaptive system.

Lay Summary

Software systems are increasingly complex. Manually configuring these software systems is an error-prone task that software engineers have to perform frequently. Modern distributed software systems usually need to be re-configured frequently to adapt to different contexts. In this thesis I present Finch, a tool that enables a software system to configure itself by learning the system's optimal configurations, allowing it to adapt to different contexts in order to maintain quality of service and decrease the need to manually configure the system.

Preface

The work presented in this thesis was conducted in the Software Practices Lab at the University of British Columbia, Vancouver, British Columbia. Reid Holmes was the supervisory author on this project and was involved throughout the lifecycle of this project.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
2 Background and Related Work
  2.1 Control theory in software engineering
  2.2 Machine learning in control theory
  2.3 Time series analysis
  2.4 Workload modelling
  2.5 Self-adaptive systems
  2.6 Machine learning-enhanced software systems
3 Design and Usage
  3.1 Machine learning based MAPE-K feedback loop
    3.1.1 Monitor
    3.1.2 Analyzer
    3.1.3 Planner
    3.1.4 Executor
  3.2 Finch as a self-adaptation enabler
  3.3 System's heuristics and configuration as a learning problem
  3.4 Learnable patterns in system's context
    3.4.1 Observability and System Instrumentation
    3.4.2 SLA, SLO, and SLI
    3.4.3 Measuring performance metrics
    3.4.4 Defining adaptive configurations
    3.4.5 Resulting dataset
  3.5 Running Finch
4 Architecture and Implementation
  4.1 Architecture overview
  4.2 The MAPE-K feedback loop
  4.3 Machine learning architecture
    4.3.1 Training pipeline
    4.3.2 Predicting the optimal configuration
  4.4 Passive and active training mode
    4.4.1 Passive training mode
    4.4.2 Active training mode
  4.5 Controlling adaptations
    4.5.1 Detecting improvement after carrying out an adaptation
  4.6 Finch configuration
    4.6.1 SLI adjustment effect
5 Evaluation
  5.1 Experiment
    5.1.1 Initial training phase with workload simulation
    5.1.2 Experiment 1: Configuration-controlled throttling
    5.1.3 Experiment 2: Performance overhead evaluation
    5.1.4 Experiment 3: Finding the optimal configuration for Postgres
6 Discussion
  6.1 Limitations
    6.1.1 Prometheus performance overhead
    6.1.2 Ever-growing dataset
    6.1.3 Local training pipeline
    6.1.4 Lack of control interface
  6.2 Future work
7 Conclusions
Bibliography

List of Tables

Table 5.1: Data from using Finch with artificially generated configuration parameters

List of Figures

Figure 1.1: How Finch integrates into a target system: (1) Finch is injected into the target system; (2) it monitors and analyzes the target system's context; (3) Finch learns how to configure the target system; (4) the interface executes configuration adaptation plans, adapting the target system.
Figure 2.1: Classical feedback loop model that inspired the MAPE-K feedback loop
Figure 2.2: Types of adaptations in response to different stimuli, from João Sousa [31]
Figure 2.3: Classical view of the autonomic computing manager created by IBM
Figure 3.1: Finch's ML-based MAPE-K feedback loop design. The knowledge base is composed of learned models. The monitor and plan components use machine learning techniques in order to build the dataset and create adaptation plans.
Figure 4.1: Finch's state machine to ensure progress in the target system
Figure 4.2: Slope that shows improvement in the target system
Figure 5.1: CPU profile showing the performance overhead incurred by Prometheus
Figure 5.2: CPU profile showing the performance overhead incurred by Finch's MAPE loop
Figure 5.3: Experiment 3: 99th percentile latency of all endpoints before and after adaptation. Each item on the X axis is an endpoint affected by a configuration parameter; the Y axis is latency. The average latency reduction was 39.85%.

Glossary
API: Application Programming Interface
MAPE-K: Monitor, Analyze, Plan, and Execute over a Knowledge base
ML: Machine Learning
SLA: Service Level Agreement
SLI: Service Level Indicator
SLO: Service Level Objective

Acknowledgments

I would like to thank my supervisor, Reid Holmes, for giving me a great opportunity to pursue my intellectual curiosity and for supporting me along the way. This work would not have been possible without his guidance.

I would like to thank Mike Gelbart, not only for being the second reader of this work, but also for his genuine curiosity and desire to understand and help me make this work better.

Also, I must thank my family and friends for believing in me and providing me a strong foundation.

And last but not least, I must thank my partner, Gulipek Candan, for all the support I could have asked for.

Chapter 1
Introduction

The industrial adoption of microservices has led to increasingly complex configuration schemes that are manually fine-tuned by engineers. Ganek and Corbi discussed the need for autonomic computing to handle the complexity of managing software systems [16]. They noted that managing complex systems has become too costly, error-prone, and labour-intensive: pressured engineers make mistakes, increasing the potential for system outages with a concurrent impact on business. This has driven many researchers to study self-adaptive systems (e.g., [13, 15, 20, 27, 28, 30]); however, the software industry still lacks practical tools that provide self-adaptive system configurations. Thus, most system configuration and tuning is performed manually, often at runtime, which is known to be a very time-consuming and risky practice [1, 12, 16].

In this work I present Finch, a tool that enables engineers to integrate self-adaptation mechanisms into their systems.
Finch delegates the configuration and tuning of a system to a learned model, rather than requiring engineers to perform these operations manually or through manually tuned heuristics.

Building self-adaptive systems is a major engineering challenge [6]. Finch's proposal is to enable self-adaptation by giving the user the ability to inject the main components of a self-adaptive mechanism into an existing target system in a loosely-coupled fashion.

One of Finch's main goals is to provide self-adaptive configuration support with minimal engineering effort. Finch uses ideas from self-adaptive systems, system observability, machine learning, and control theory to automatically assess the system's environment, predict the impact of changes that could potentially improve the system, and make these changes automatically.

Figure 1.1: How Finch integrates into a target system: (1) Finch is injected into the target system; (2) it monitors and analyzes the target system's context; (3) Finch learns how to configure the target system; (4) the interface executes configuration adaptation plans, adapting the target system.

My approach consists of providing mechanisms for injecting a control loop into an existing target system through an API that collects relevant system metrics and configurations as the system executes. The user maps Service Level Agreements (SLAs) to a subset of these metrics and feeds them into a machine learning component that concurrently relearns its model while analyzing current event data, and then predicts optimal configurations for the system in its given context.
As a result, Finch provides adaptation plans that can be both automatically executed, allowing the system to have self-adaptive capabilities, and interpretable, allowing engineers to understand the impact of a change in the configuration space before it is deployed.

The thesis of this research is that the configuration of a system can be delegated to a self-adaptive, machine-learning based system that adapts it to different workload patterns. I achieved this by exploring a design that enables self-adaptability without incurring intrusive architectural changes in a target system, requiring the user to carry out only tasks that are common in the software industry: defining service-level agreements and interfaces to the configuration, and enabling observability in the system through minimal instrumentation.

The main contributions of this thesis are:

• A methodology for assisting the development and evolution of self-adaptive systems, regardless of the presence of self-adaptability in the system's foundations. This methodology is encapsulated in Finch.

• A demonstration of how minimal changes to the system can support this approach, and of how Service-Level Agreements can be modeled and mapped to optimization objectives.

• A group of experiments that evaluate Finch's performance when integrated into a web service, demonstrating how Finch can learn to configure the service and improve its performance while incurring only 8.5% performance overhead.

Chapter 2 discusses past research in the space of self-adaptive systems and provides fundamental background for our approach. Chapter 3 outlines the design and usage of Finch, explaining the blend of ideas from different fields that led to its principal design decisions. Chapter 4 describes Finch and its implementation. Chapter 5 presents the evaluation performed on Finch, followed by a discussion of limitations and future directions in Chapter 6.

Chapter 2
Background and Related Work

Finch draws ideas from many different, although overlapping, fields.
Here I give a brief overview of these ideas and how they relate to Finch.

2.1 Control theory in software engineering

Control theory is an interdisciplinary branch of engineering and mathematics that studies how dynamic systems behave and how their behaviour changes with respect to modified control inputs through feedback mechanisms. One of the main goals of control theory is to devise techniques and models that make systems achieve certain goals by controlling the system's input.

The ideas in control theory have been widely adopted in the software engineering research community, with special attention to the Monitor, Analyze, Plan, and Execute over a Knowledge base (MAPE-K) feedback loop, which has proved to be a powerful tool for building self-adaptive systems [3, 7, 10, 12, 22, 30]. Angelopoulos et al. discussed the intersection between software engineering and control theory [14]. They showed how control-theoretical software systems are designed and implemented, as well as the differences in the meaning of the word "adaptation" in the two fields. All these works were invaluable to the development of Finch, because the injection of a MAPE-K loop into the target system is the core component of Finch.

2.2 Machine learning in control theory

Machine learning is a set of principles, algorithms, and techniques, strongly rooted in statistics, that gives systems the ability to automatically learn, improve, and perform tasks without being explicitly programmed, based on the data they are fed.

The applications of machine learning in control-theoretical models have been discussed in [17], where the main idea is to take advantage of the high performance of machine learning methods while using control theory to guarantee safety and controllability. Reinforcement learning [32] has goals similar to those of control theory, but approaches them differently.

Finch uses ideas from machine learning, reinforcement learning, and control theory to enable self-adaptability.
Thus, instead of hard-coding configuration heuristics in a system, the control-theoretical aspect of Finch (the MAPE-K loop) uses machine learning techniques to learn patterns in the target system and make fast predictions in order to create adaptation plans.

In line with one of the main goals of machine learning—enabling a system to perform a task without being explicitly programmed—Finch learns how to find an optimal or sub-optimal set of configuration parameters without being explicitly programmed.

2.3 Time series analysis

Time series data refers to a series of data points organized in temporal order. Time series data has been used to analyze and predict patterns in data with respect to time, with applications in understanding how to efficiently allocate computational resources; this is used heavily in Finch.

Between 2007 and 2011, many techniques for forecasting workload and performance metrics using time series data were realized [5, 9, 18, 19, 25]. With these forecasts, they provided methodologies for virtual machine allocation in data centres. These works did not focus on tools for applying machine learning to software systems, nor on tools to enable self-adaptability in arbitrary software systems—which is the end goal of this work.

Finch's main abstraction for data storage and dataset creation is time-series data. All patterns observed by Finch in the target system are with respect to the time of the observation.

2.4 Workload modelling

Another important aspect of Finch is being able to simulate workload intensity for the initial training of the adaptation model. To have accurate workloads, they need to be modelled as closely as possible to real-world workloads. Herbst et al. presented the Descartes Load Intensity Model [23], a powerful tool for describing load intensity variations over time that can also be used for accurate benchmarking based on realistic workload and performance analysis. Finch uses some of these ideas to model and simulate workloads for training the adaptation model.

2.5 Self-adaptive systems

Self-adaptation describes the ability of a system to change some aspects of itself at runtime rather than at design time. Generally, in a self-adaptive system, we have the autonomic manager and the managed element. The autonomic manager is always monitoring and analyzing the managed element, and adapting it in response to changes in its context in order to achieve a higher goal, such as better performance, fault-tolerance, or efficiency.

Figure 2.3 shows the MAPE-K (Monitor, Analyze, Plan, and Execute over a Knowledge base) model used to engineer self-adaptive systems. At its core it has the feedback loop, which interacts with the target system or, as it is originally called, the managed element. This interaction happens through sensors and effectors. This model is strongly related to the classical view of feedback loops in control theory (Figure 2.1), where a controller is always monitoring a process in order to change some of its aspects in response to changes.

In the MAPE-K model, the monitor, analyzer, planner, and executor work together, using a knowledge base, toward a common goal: the adaptation of some aspect of the target system.

In 2013, João Sousa mapped the types of possible adaptations in response to different stimuli [31], as shown in Figure 2.2. Finch focuses on runtime structure adaptation in response to QoS goals, system metrics, parameters, and environment metrics.

Figure 2.1: Classical feedback loop model that inspired the MAPE-K feedback loop

Figure 2.2: Types of adaptations in response to different stimuli, from João Sousa [31]

Cornel Barna et al. proposed Hogna, a platform for deploying self-adaptive applications in cloud environments [4].
Hogna provides a framework that abstracts deployment details (for example, spinning up and managing instances on Amazon EC2 or OpenStack), enabling the user to focus on the adaptation mechanism. A key difference between Hogna and Finch is that Finch is not a deployment framework, but rather a library that injects a MAPE-K loop, abstracting the formal modeling into a machine learning model that can be matched with the specified SLAs, the instrumented data, and the identified configuration parameters of the system.

Finch can be used either for managing adaptive deployment schemes or for optimizing finer-grained knobs of a system—for instance, optimizing the configuration knobs of a Postgres instance used by a system in order to improve performance and prevent SLA violations. A further, minor difference between the two tools is how much is asked of the user. Hogna asks for a configuration file that describes the topology to be deployed, the monitors to use, and related settings; custom Java classes to handle specific behaviours; a PXL file with the model description; and a configuration file for Kalman filters. Finch requires fewer actions from the user while still enabling self-adaptability and giving the user flexibility: it can be used both for higher-level tasks—deployment—and lower-level tasks—self-tuning and self-configuring of smaller pieces of software.

Figure 2.3: Classical view of the autonomic computing manager created by IBM

Andrew Pavlo et al. presented Peloton, a database system designed for autonomous operation [27]. Similar to Finch, one of their main goals was to decrease the need for manually performed operations, though they focused solely on applying their ideas and techniques to their DBMS implementation. They achieved this by classifying workload trends, collecting monitoring data, and forecasting resource utilization, then training a model on this data to predict the best optimization plan.
These ideas are important to my work; the key difference is that instead of directly embedding them in a specific system—in this case a DBMS—and requiring the autonomous components to be tightly coupled to the system being configured, I embed a subset of these ideas in a tool that aims to be integrable into any arbitrarily chosen software system.

2.6 Machine learning-enhanced software systems

In a work entitled The Case for Learned Index Structures, Kraska et al. demonstrated that machine-learned models have the potential to provide significant benefits over state-of-the-art database indexes [24]. This research showed that replacing manually tuned heuristics with learned models enabled them to outperform cache-optimized B-Trees by up to 70%.

I draw much inspiration from this work; Finch's central idea is to allow systems that rely heavily on manual configurations and heuristics to be enhanced with learned models. This could be applied to many different domains; in this work we apply the idea to a REST-based Application Programming Interface (API) backend.

The idea of machine learning-enhanced software systems is to move from using the same algorithms, heuristics, data structures, or configurations in multiple different contexts to personalized configurations: different configurations that perform better in different scenarios. This relates well to the No Free Lunch theorem:
This relates well to the No Free Lunch theorem:If an algorithm performs well on a certain class of problems then itnecessarily pays for that with degraded performance on the set of allremaining problems.This is the main idea behind Finch: the integration of learned models to gener-ate adaptation plans according to the different scenarios.9Chapter 3Design and UsageTo integrate Finch into the target system, the user has to do the following:• Instrument the target system• Identify relevant configuration parameters for Finch• Define Service Level Agreements related to a subset of the instrumented data• Allow configuration parameters to be changed programmaticallyThese 4 steps will create the necessary environment to enable self-adaptation inthe target system, by enabling Finch to learn how to configure it. In the followingsections I discuss in more details these steps and the design principles behind them.3.1 Machine learning based MAPE-K feedback loopIn Section 2 I discussed about self-adaptive systems, feedback loops, such asMAPE-K, and how they relate to Finch. However, the actual implementation ofFinch uses a slight variation of the aforementioned feedback loop, in which theMAPE-K loop is based on a machine learning model.The main difference between the original and the machine learning based MAPE-K loop stems from the Monitor (M) and the Knowledge (K) components. In theclassical MAPE-K loop, the Knowledge of the system is the data collected bythe Monitor component. However, in Finch’s implementation of this loop, theMonitor component uses the monitored data to train the machine learning models.As a result, the model built by the Monitoring component is shared across other10components. In other words; the learned models are the K in Finch’s MAPE-Kimplementation.The planner component (P) uses these models to create an adaptation plan bypredicting the optimal configuration for the target system. 
Just like the classical MAPE-K, Finch's feedback loop then executes the adaptation plan through the interface it shares with the system, and continues the loop. Figure 3.1 illustrates Finch's MAPE-K feedback loop design.

The following sub-sections explain each component of Finch's MAPE-K feedback loop in more detail.

3.1.1 Monitor

Finch includes an API for monitoring the target system's context. Currently it only supports semantics for HTTP endpoints, but extending it to support different monitoring semantics is fairly simple.

During monitoring, Finch periodically extracts, parses, and builds a dataset containing the context of the system (more details on how this is done can be found in Chapter 4). It then uses machine learning to train models that can predict user-defined aspects of the target system, such as the optimal or a sub-optimal configuration and how it will affect the quality of the service.

This dataset grows incrementally over time, and the models are retrained every time the dataset is incremented. As a result, Finch can capture emerging patterns in the target system and improve the accuracy of the models.

3.1.2 Analyzer

The Analyzer component frequently checks the current state of Finch and the target system. Depending on the improvements or violations it observes, this component triggers new adaptation plans.

3.1.3 Planner

The Planner, when triggered by the Analyzer component, makes use of the current knowledge base (the trained models) and predicts the optimal configuration for the specific context the system is in. When the adaptation plan is created, the Planner then calls the Executor.

3.1.4 Executor

The Executor, as the name suggests, executes the adaptation plans created by the Planner. The plans can be as simple as changing in-memory or in-file configuration parameters, or more elaborate, such as managing Docker containers.
The latter requires the user to define custom adaptation methods.

When another iteration of the cycle begins, the Monitor will gather data related to the most recent adaptation. This collection will be used when training new models. The Analyzer, in the meantime, will watch for improvements or deterioration, and will trigger a new adaptation plan depending on its observations. The Planner, in turn, will use the collected data to create new adaptation plans, which will be executed by the Executor.

3.2 Finch as a self-adaptation enabler

According to the self-adaptive systems community, a centralized, top-down self-adaptive system operates under the guidance of a central controller, which assesses its own behavior with respect to its current surroundings and adapts itself if the monitoring and analysis warrants it [6]. Given this definition, I built Finch to follow a centralized, top-down approach.

The main design goal of Finch is to allow its users to inject a MAPE-K feedback loop into their system through its API. To reason effectively about the uncertainty in the target system's context, we need visible feedback loops that are first-class citizens in the system, as discussed by Y. Brun et al. [6]. In industry, the self-adaptation mechanism is most of the time hard-wired into the managed system; that is, the managed element's structure is changed to fit the feedback loop into the target system.

Of course, this requires a noticeable engineering effort; systems are usually not designed with self-adaptability in mind. This is where the injection part of Finch enters. Rather than hard-wiring the self-adaptation mechanisms inside the target system, Finch keeps them loosely coupled. Upon integration into the target system, Finch acts as a co-pilot and starts collecting data related to the system's context, environment, and states, storing this data for future reference and model training. After a certain time, with learned models ready to make predictions, Finch starts analyzing event data. Guided by the internal feedback loop, it then carries out execution plans that aim to optimize the target system. The adaptation leads to more event data being stored and analyzed, and the cycle repeats.

Figure 3.1: Finch's ML-based MAPE-K feedback loop design. The knowledge base is composed of learned models. The monitor and plan components use machine learning techniques in order to build the dataset and create adaptation plans.

3.3 System's heuristics and configuration as a learning problem

We can define a learning problem as a set of observations comprised of input and output data, with some unknown relationship between the two. The goal of a learning system is to learn a generalized mapping between input and output data, so that predictions can be made for new instances drawn from the domain where the output variable is unknown.

The main hypothesis behind Finch is that if we can model the configuration scheme or the heuristics of a system as a learning problem, then Finch can learn models that capture patterns between the system's context and the system's configuration parameters, enabling the system to predict the optimal set of configuration parameters for a specific observed scenario. This prediction can be used either to adapt to different scenarios that require different configurations or to prevent poor configurations.

To reiterate, machine learning is about learning to predict a certain behavior based on what was experienced in the past. Thus, an important step when modeling a problem as a learning problem is the choice of observations used to train the system.

In this work's context, observations could be anything that relates to the system's behavior, performance, inputs, and outputs.
For instance, throughput, requests per second, latencies, and machine resource usage (CPU, memory, I/O) are some examples for the aforementioned context.

In Finch's case, it is important that Finch is provided with the necessary means to collect the best possible set of observations from the system and its environment. In order to model the system's heuristics and configuration as a learning problem, Finch assumes that the system is properly instrumented.

3.4 Learnable patterns in the system's context

Finch's ultimate strategy is to enable self-configuration in the system it is integrated into. Finch achieves this by learning patterns exhibited by the target system. In order to accomplish this, Finch needs the user to properly instrument the target system. Thus, a solid foundation in system instrumentation and observability goes a long way with Finch.

3.4.1 Observability and System Instrumentation

The first step in my approach was to observe the system's behavior and context, which gave me insight into the system's characteristics. By collecting data on different combinations of workloads and configuration parameters, I observed interesting patterns emerge from the data.

In order to observe the system thoroughly yet efficiently, Finch makes extensive use of modern observability and software instrumentation techniques. These techniques refer to code inserted in parts of the system's codebase to record its context. Function parameters, latencies, and the time to execute a certain block of code are some values in a codebase that we can instrument. The purpose of collecting this information is, for instance, to help measure performance, assist debugging tasks, and find bottlenecks. In return, Finch greatly benefits from the values recorded throughout the system.

Any user who wants to integrate Finch into their system needs to carry out such instrumentation of their target system.
Luckily, the software industry has been encouraging system instrumentation by providing many solutions, such as DTrace [8], Prometheus [29], Nagios [26], and Datadog [11], so this requirement should not come across as an extra necessity, but as a system requirement regardless of Finch's presence.

Instrumentation is also heavily used in industry to detect Service-Level Agreement violations and to perform resource management: two tasks that are essential for Finch to fulfill its purpose.

Under Finch's layers, all monitoring is done using Prometheus. Prometheus is a pull-based monitoring tool and a time-series database. A common concern is the overhead incurred by a pull-based monitoring tool. However, the response to this concern is straightforward: the overhead is negligible. Unlike monitoring tools like Nagios, which frequently execute check scripts, Prometheus only collects time-series data from a set of instrumented targets over the network. For each target, the Prometheus server simply fetches the current state of all the metrics over HTTP and has no other pull-related execution overhead.

Another reason why Prometheus has low overhead is that it is not an event-based system. Prometheus regularly collects data on the aggregated time series, which represents the current state of the given metrics, and not on the underlying events that led to the generation of those metrics. Moreover, a system using Prometheus does not send updates to the Prometheus server for each handled request. The system simply counts up the requests in memory, causing no monitoring overhead or traffic. Prometheus then pulls this data every few seconds (which can be configured), returning the current counter value and its timestamp.

3.4.2 SLA, SLO, and SLI

The recent emergence of the DevOps culture has put key concepts and techniques from observability and systems monitoring in the spotlight.
Finch is built upon some of these concepts, such as Service-Level Agreements, Service-Level Objectives, and Service-Level Indicators. These three terms are sometimes used interchangeably, so to avoid confusion, the proper definitions for each are given below:

Service Level Indicator (SLI)

A quantitative measure of the target system's current service quality levels. An example that is heavily used here is the latency of the system's endpoints. It could also be the throughput, the error rate, or the availability.

Service Level Objective (SLO)

A target value or goal to be achieved by a given SLI. For example, suppose our service has an endpoint A, and the latency measured at this endpoint is our SLI. Our objective is for it to be under 200 milliseconds. Thus, our objective is SLI <= SLO, where SLO = 200ms. In this case, the SLO serves as an upper bound on our observed SLI.

Service Level Agreement (SLA)

A rather formal contract between the service provider and its users. This contract encapsulates the SLOs and SLIs, and the consequences of their violation. To build on top of the previous SLI and SLO values: if our SLO is to serve endpoint A in at most 200ms, then the SLA can dictate that 95% of all requests should be served in accordance with the SLO.

3.4.3 Measuring performance metrics

A common mistake made in industry and in the research community is measuring performance, especially latency or response time, using averages. Averages hide outliers and are usually very skewed. To better illustrate this problem, let's consider the following scenario: our target system receives 100 requests per minute, 80 of which take 200ms to serve, which is relatively fast. On the other hand, the remaining 20 requests take 10000ms (10 seconds). If we assess the performance of the system using the average, we will evaluate the latency as 2.16 seconds, which seems an acceptable value.
However, this value hides the fact that 20% of our requests are taking 10 seconds, which is an unacceptable latency.

Rather than taking the average of performance metrics, Finch takes three different approaches to measuring performance, all of which are more reliable than the average:

• Measuring percentiles, such as the 99th and 90th, in order to capture outliers. This way we can understand upper bounds and uncover more silent failures in the performance metrics.

• Using SLAs defined by the user to assess the performance metrics. For example, an agreement could require 95% of the POST requests to endpoint A to be served under 200ms. This is a good strategy because rather than taking the average of the POST requests, we focus on how well we served a large portion of the requests.

• Using APDEX [2] as an extension on agreements.

APDEX is an industry standard that gives a score of satisfaction based on the latency or response time of requests. It is calculated as

APDEX_T = (S + Tol/2) / R

where T is a selected threshold, S is the number of satisfied requests (requests that take less than T to be served), Tol is the number of tolerated requests (requests that take between T and 4T to be served), and R is the total number of requests. A request is considered frustrated if it takes more than 4T to be served.

To compare the APDEX score to taking the average, think of two scenarios:

• 60% of the requests take 200ms, 20% of the requests take 10ms, and 20% of the requests take 10 seconds. The average of the latencies in this case is 2.1s, which gives you a false sense of confidence. The APDEX_2s score is 80%, which is considered low.

• 1 request takes 5 minutes and 10 requests take 200ms. Here we have a case of anomalous latency. The average is 27 seconds, which is very high, whereas APDEX_2s is 91%.
Given the threshold we set, this APDEX_2s score indicates that the situation is not bad; we just had an anomalous case.

Finch uses APDEX agreements and the 90th and 99th percentiles as features and targets when training the prediction models.

To monitor the target system, Finch provides a small API. As of now, this monitoring API consists of three methods: one for workload monitoring; another for latency monitoring, which takes the endpoint, the HTTP method, and the duration of the request; and a last one for configuration parameter monitoring, which, after loading the file that contains the adaptive configuration parameters (which I will talk about in the next sections), puts these values in memory and reads from it.

3.4.4 Defining adaptive configurations

Now that Finch has collected data points that describe the performance and behavior of the target system, the state of the configuration parameters at a given time should also be captured. Finch asks the user to create a file containing adaptive configurations. An adaptive configuration parameter can be a property, parameter, or variable whose value controls a certain aspect of a system or an algorithm. Finch tries to find the optimal value for this parameter in order to better configure the target system. The configurations can have a variety of values and depend highly on the kind of system the user is working with.

There are two types of configuration parameters that can be defined by the user: normal configuration parameters and custom configuration parameters. Both are declared in an adaptive configuration JSON file.

Listing 3.1: Example of a normal adaptive configuration definition

{
  "parameter 1": {
    "value": 1000,
    "valueType": "discrete",
    "values": [1, 1000],
    "isCustom": false
  }
}

Listing 3.2: Example of a custom adaptive configuration definition. Finch will call the procedure changeConfParam1 when it needs to change this specific configuration

{
  "custom param 1": {
    "value": 1000,
    "valueType": "discrete",
    "values": [1000, 1],
    "isCustom": true,
    "adaptationMethod": "changeConfParam1"
  }
}

The normal configuration parameter holds its value both in memory and in the adaptive configuration file. When Finch predicts the optimal configuration, it will change the parameter value both in memory and in the adaptive configuration file. It assumes that the target system is reading from one of those sources, and thus the adaptation is easily carried out. The normal configuration parameter is defined by passing the name of the configuration parameter, the current value, and the possible values or value range. Listing 3.1 is an example of a normal configuration parameter definition.

However, there are scenarios where changing an in-memory value or a value in a file is not enough to change the configuration of an aspect of a system. For instance, to change some of Postgres' configuration parameters, one must change the value in its configuration file and then restart the Postgres instance. Listing 3.2 is an example of a custom configuration parameter definition.

Finch's custom configuration parameters work in a way that assists this configuration-changing process. When defining a custom configuration parameter, the user should also provide a procedure to be run when an adaptation plan involving this configuration parameter is carried out. Upon adaptation, Finch calls this procedure, passing the appropriate parameters. Using the Postgres example, the user can provide a procedure that changes the Postgres configuration file using the arguments passed to it and restarts the instance.
Finch predicts the optimal configuration parameter, then calls this procedure, passing the predicted value.

3.4.5 Resulting dataset

To construct the dataset, Finch collects four classes of features on the target system: performance and behavior metrics, configuration parameters, the workload, and service level indicators. The goal of choosing these features was to gather a dataset that the machine learning models could be trained on, which would later make predictions based on these features. The reason I chose these particular features is to answer the following question: given the workload (e.g. requests per second) and the performance metrics of the system, what is the optimal set of configuration knobs (parameters) that will prevent SLA violations in the requests served?

The dataset is constructed in such a way that each row describes the context of the system (workload, metrics, SLIs, and configuration knobs) at a given timestamp. The following matrix summarizes how the dataset is organized:

| t_1  W_1  M^1_1  M^2_1 ... M^i_1  k^1_1  k^2_1 ... k^l_1  SLI^1_1  SLI^2_1 ... SLI^δ_1 |
| t_2  W_2  M^1_2  M^2_2 ... M^i_2  k^1_2  k^2_2 ... k^l_2  SLI^1_2  SLI^2_2 ... SLI^δ_2 |
|  ...   ...   ...    ...  ...  ...    ...    ...  ...  ...     ...      ...  ...    ... |
| t_n  W_n  M^1_n  M^2_n ... M^i_n  k^1_n  k^2_n ... k^l_n  SLI^1_n  SLI^2_n ... SLI^δ_n |

Where:

1. n is the number of collected samples
2. t_φ is the timestamp of the φ-th example
3. W_φ is the observed workload at timestamp t_φ
4. M^j_φ is the j-th instrumented metric at timestamp t_φ; j ranges from 1 to i, the last instrumented metric
5. k^c_φ is the value of the c-th configuration parameter at timestamp t_φ; c ranges from 1 to l, the last collected configuration knob
6. SLI^o_φ is the o-th service level indicator at timestamp t_φ (one of the instrumented metrics that was set to be an SLI); o ranges from 1 to δ, the last collected SLI

This dataset should capture the context of a system with respect to workload, instrumented metrics, and the values of the configuration parameters.

3.5 Running Finch

Because there are many ways of using Finch as a library in a target system, there is no single right way to use it. Here I illustrate how it was used in a scenario where the target system is an HTTP REST service that uses Gorilla mux for its URL router and dispatcher, Viper for configuration management, Interpose for HTTP middleware, and Postgres as its main database.

I start by instantiating and initializing Finch where the target system performs the same tasks in the codebase, as shown in Listing 3.3. The Finch-related lines are the ones that were added to the target system. In the target system's entry point, the original New(config) function will be called as before, but now it will also configure and initialize Finch.

Upon initialization, Finch will try to find two JSON files, defined by the user, in the target system's root folder: one that describes each SLA for the target system, as shown in Listing 3.6, and another which defines the adaptive configuration parameters, as shown in the previous section. An example SLA description can be found in Listing 3.7.

The next step is to intercept the logging mechanism and use Finch's logging API to log the data necessary to build the training dataset, analyze the context of the target system, and make predictions. This is done by creating a logging middleware that uses Finch's logging API, as shown in Listing 3.4, and adding it to the middleware used by the target system, as shown in Listing 3.5.
With that done, all requests to this service will be analyzed by Finch so that it can perform its tasks.

Initially, Finch works as a passive co-pilot; it collects data, analyzes it, and frequently builds a dataset with this data. After a while, it starts training models on this dataset and, if the accuracy is acceptable, whenever there is an SLA violation it will trigger a configuration adaptation in order to try to improve the target system's performance.

Listing 3.3: Initializing the target system and Finch

func New(config *viper.Viper) (*Application, error) {
	Finch := finchgo.NewFinch()

	Finch.InitMonitoring()

	dsn := config.Get("dsn").(string)

	db, err := sqlx.Connect("postgres", dsn)
	if err != nil {
		return nil, err
	}

	cookieStoreSecret := config.Get("cookie_secret").(string)

	app := &Application{}
	app.config = config
	app.dsn = dsn
	app.db = db
	app.db.SetMaxIdleConns(10)
	app.sessionStore = sessions.NewCookieStore([]byte(cookieStoreSecret))
	return app, err
}

Listing 3.4: Defining logging middleware for Finch's logging

func Log(Finch *finchgo.Finch) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Inject monitoring loop
			Finch.MonitorWorkload(r.Method, r.basePath)
			Finch.MonitorLatency(r)
			Finch.MonitorConfigurationParameters()

			next.ServeHTTP(w, r)
		})
	}
}

Listing 3.5: Injecting Finch's logging middleware into the target system's HTTP middleware

func (app *Application) MiddlewareStruct() (*interpose.Middleware, error) {
	middle := interpose.New()
	middle.Use(middlewares.SetDB(app.db))
	middle.Use(middlewares.SetSessionStore(app.sessionStore))
	middle.Use(middlewares.Log(Finch))

	middle.UseHandler(app.Mux())

	return middle, nil
}

Listing 3.6: finch_sla.json describes each SLA in the target system

[
  {
    "sla": "<SLA description>",
    "endpoint": "<endpoint name>",
    "method": "PUT | POST | GET | DELETE",
    "metric": "latency | throughput",
    "threshold": "<threshold number that defines what is an acceptable latency or throughput>",
    "agreement": "<percentage number>"
  }
]

Listing 3.7: Example of an SLA definition in finch_sla.json

[
  {
    "sla": "95% of the POST requests to Users should be under 150ms",
    "endpoint": "/users",
    "method": "POST",
    "metric": "latency",
    "threshold": 150,
    "agreement": 95
  }
]

Chapter 4

Architecture and Implementation

This chapter discusses in more detail how the main components of Finch work. Finch was built using the Go programming language, for its strength in distributed and concurrent programming and its ease of constructing reliable systems. Because Finch needs to provide reliability, speed, and a fast way to experiment with machine learning on an arbitrary dataset, using Python for the machine learning and wrapping that service in a solid infrastructure written in Go was the best approach for this job.

4.1 Architecture overview

Prometheus is used to store the observed metrics and their respective timestamps. Prometheus is a time-series database that also offers a very complete monitoring and alerting API.
Thus, instead of implementing a time-series store from scratch, I decided to use Prometheus underneath Finch, at the cost of some extra performance overhead. I discuss this performance overhead in chapter 5.

Finch's runtime spawns two main lightweight threads, which in Go are called Goroutines. These two Goroutines are observer threads. The first is responsible for periodically building the dataset. It periodically extracts all collected metrics from Prometheus through its HTTP API, parses this data, and saves the dataset. Then it calls the machine learning component to train the models using this dataset; all models, the scaler, and the encoders (components necessary to make predictions) are persisted on disk.

The second Goroutine is responsible for monitoring the current state of the system by querying Prometheus every few seconds and checking the current SLO values. Upon violation of an SLA, it calls the machine learning component, uses the most recently trained models to predict the optimal configuration, then calls each respective adaptation method responsible for changing its configuration in the target system.

Both Goroutines are controlled by two variables: one that controls how often the dataset is constructed, and another that controls how frequently the current context is observed. The former has a significant impact on Finch's performance and is discussed in more detail in a later subsection.

4.2 The MAPE-K feedback loop

These two Goroutines running in a loop compose the main abstraction of Finch: the MAPE-K feedback loop.

They communicate internally using Go's channels, which can be thought of as pipes that connect concurrent Goroutines. The philosophy behind Go's channels is to share memory by communicating, not to communicate by sharing memory. Instead of sharing memory between threads, sharing happens by sending messages between channels. Instead of handling concurrency by using mutex locks, the Go community favors channels and messaging.
This idea comes from Hoare's Communicating Sequential Processes [21]. This turned out to be helpful when implementing something as highly concurrent as the MAPE loop in Finch.

The main loop spawns two other goroutines and controls them with two different channels, as shown in the code below.

func (f *Finch) StartMAPELoop() {
	ticker := time.NewTicker(MonitorFrequency * time.Second)
	quitWatcher := make(chan struct{})

	go f.MonitorAndAnalyzeContext(ticker, quitWatcher)

	datasetTicker := time.NewTicker(DatasetBuilderFrequency * time.Minute)
	datasetQuit := make(chan struct{})

	go f.buildDatasetPeriodically(datasetTicker, datasetQuit)
}

The goroutine that periodically builds the dataset starts an inner loop that extracts all metrics from Prometheus and builds the necessary dataset for training every DatasetBuilderFrequency minutes, as shown below.

func (f *Finch) buildDatasetPeriodically(tickerBuilder *time.Ticker, quitBuilder chan struct{}) {
	for {
		select {
		case <-tickerBuilder.C:
			f.DatasetBuilder(true, <dataset range>)
		case <-quitBuilder:
			tickerBuilder.Stop()
			return
		}
	}
}

Monitoring and analyzing the current context and state of the target system requires more non-trivial work: periodically extracting from Prometheus a single row of metrics for the given timestamp; analyzing, extracting, and saving the current state of all SLAs defined by the user for the target system against the current context; checking if there is any SLA violation; triggering the adaptation procedure; waiting for the adaptation to fully propagate; and checking for improvements in order to prevent unnecessary new adaptations.

func (f *Finch) MonitorAndAnalyzeContext(ticker *time.Ticker, quitWatcher chan struct{}) {
	SLAMetricsHistory := make(map[SLA][]float64)
	adaptationWasCarried := false
	isImproving := false
	for {
		select {
		case <-ticker.C:
			currentSLAMetrics := f.getSLAMetrics()
			SLAMetricsHistory = f.appendSLAMetricsHistory(SLAMetricsHistory, currentSLAMetrics)

			if f.checkForViolation(currentSLAMetrics) {

				if adaptationWasCarried {
					isImproving = f.checkForImprovement(SLAMetricsHistory)
					if !isImproving {
						adaptationWasCarried = false
					}
				}

				if !adaptationWasCarried {
					predictedOptimalKnobs := f.predictOptimalKnobs()

					f.carryAdaptationPlan(predictedOptimalKnobs)

					adaptationWasCarried = true
				}
			}
		case <-quitWatcher:
			ticker.Stop()
			return
		}
	}
}

Internally, Finch implements a state machine to keep track of its operations in order to ensure that the target system is progressing and to control adaptations. Figure 4.1 illustrates this state machine.

Figure 4.1: Finch's state machine to ensure progress in the target system

The interoperability between Go and Python is achieved by giving Finch's main components control over the machine learning component (written in Python). Because the machine learning component exposes a simple API, the main components use this API by executing bash commands.

4.3 Machine learning architecture

Given the previously defined dataset, Finch trains many different models, one for each SLA's indicator. In the end, we want to predict the SLI given the set of configuration parameters and system metrics, including the workload.

Finch has two machine learning (ML) pipelines. The first trains the models, which includes basic standardization, normalization, grid search, and cross-validation. The second predicts the SLI given the configuration parameters. After running these pipelines, the last step is finding the optimal set of configuration parameters.

4.3.1 Training pipeline

As mentioned before, Finch trains a model for each SLA indicator. To elaborate: if the user has two SLAs with respect to the latency of endpoints A and B, then the two collected SLIs are the 99th percentiles of these endpoints' latency. Thus, Finch will train two models, one for each SLI.

Training different models requires slicing the original dataset to fit the models' needs.
For example, when we want to predict the latency of endpoint A, latency A is the target, or y, of the model, and the corpus, or X, is the rest of the dataset minus the other collected SLIs. This way, the system metrics, workload, and configuration parameters are isolated for model training.

Learning how to learn: creating adaptive machine learning models

Since the dataset is highly specific to the target system, there is no one model to rule them all; for example, we cannot simply use logistic regression or a neural network with static hyperparameters. A personalized dataset means that it can have an arbitrary dimension (number of features) and size, and it can have only continuous values, only discrete values, or both. Finch cannot know this beforehand. Thus, to work with uncertain datasets, we must perform grid search when training the models.

Grid search is a technique to search for the best hyperparameters and models. Hyperparameters are parameters that are not directly learned by the models; they configure certain aspects of a given machine learning model, for instance: how deep a decision tree should be, how many decision trees (i.e. estimators) a random forest should have, or how many layers a neural network should have. Each machine learning model performs better when the right model and the right hyperparameters are chosen for a given dataset. Some model-hyperparameter combinations perform better with highly dimensional datasets, some perform better with well-balanced datasets, some are more resistant to outliers, and sometimes the outliers are what you are trying to find.

Given all these options and uncertainties with respect to the collected dataset, grid search is a way to find the best-suited model and hyperparameters for the dataset being collected from the target system.

Grid search exhaustively considers all hyperparameter combinations across many different models, trains one model per combination, and selects the best-performing one.
This adds considerable time and space complexity to the training pipeline, but it cannot be avoided when choosing a model that will neither overfit nor underfit on dynamically generated datasets.

However, grid search is only performed in two scenarios. The first is the first training cycle, where Finch first handles the extracted dataset; after that cycle, it knows the best hyperparameters for the learned models and uses them to re-train the models on new data. The second scenario is when the models' prediction performance starts to degrade, meaning that the dataset has changed in some aspects and a better model and hyperparameters need to be re-learned for the new dataset.

This pipeline considers models such as linear regression, ridge regression, lasso, support vector machines, and decision trees. However, in the evaluations performed in this work (section 5), which used a few variations of dataset structure, I found that one machine learning technique worked best the majority of the time: gradient boosting with decision trees.

Gradient boosting is not exactly a machine learning model, but rather belongs to a class of techniques called ensembles, where instead of using a single model, we train many slightly different models with different portions of the dataset and combine them in a variety of ways to achieve higher performance. Unlike simpler ensemble techniques such as bagging, the trained models do not work independently; instead, they work sequentially, in such a way that each following predictor learns from the mistakes of the previous predictors.

To validate the grid search and avoid overfitting, Finch performs cross-validation with 5 splits and a 30%/70% test/train split ratio.

Algorithm 1: Trains a model for each SLI and returns a list of models

TrainSLIModels(Dataset)
  inputs : a dataset that contains the system's context
  output : n SLI models, where n is the number of collected SLIs, and averageScore
  models ← []
  scores ← []
  foreach SLI S ∈ Dataset do
    y ← Dataset[S]
    X ← Dataset \ Dataset[S]
    /* gridSearch checks whether an initial grid search has already been
       performed; if not, it exhaustively searches for the optimal model and
       its hyperparameters for the collected dataset; if it has, it uses the
       previously found best hyperparameters and trains the model on the
       new data */
    regressor ← gridSearch(X, y)
    score ← CrossValidation(regressor, X, y)
    scores.add(score)
    models[S] ← regressor.fit(X, y)
  end
  averageScore ← computeAverage(scores)
  return models, averageScore

4.3.2 Predicting the optimal configuration

The user, when defining the adaptive configuration parameters (i.e. telling Finch which configuration parameters it should learn to configure), also defines the value range or, in the case of discrete configuration parameters, the possible values. For instance, a certain configuration parameter A could take values ranging from 1 to 100, and another parameter B could take the following array of discrete values: 1, 5, 10, 50.

After the models have been trained, we could simply predict the optimal configuration by passing the desired SLIs as our X, and in return get the optimal configuration as the y coming from the prediction method. However, that turned out not to be very effective, and I created an additional algorithm on top of this straightforward call to the prediction method.
This algorithm came as an answer to the following problem: in some cases, the effects of a configuration parameter do not overlap across different SLAs. In these cases, a model for a specific SLA makes the right prediction for the parameter that affects it directly, but makes inaccurate predictions for the other configuration parameters, since those do not affect it directly. Such a prediction affects the other SLAs negatively. Think of an SLA being selfish: it only cares about the configuration parameter that affects it, without thinking about the other SLAs.

To overcome this problem and find the configuration that satisfies all, or the majority of, the SLAs, the algorithm establishes a form of consensus between the SLAs through a voting mechanism. To start, it creates a 2D array with the Cartesian product of all possible parameter combinations; then, for each SLA, it predicts its respective SLI value for each of these combinations. The time to predict all these combinations is negligible, since predictions usually take a short amount of time, even with big matrices.

Then, for each SLA's predictions, it filters the configurations that satisfy the SLA plus a tolerance rate. We now have, for each SLA, a set of configurations that is both diverse and satisfiable. In the last step, for each configuration parameter, in the case of a discrete parameter, we pick the value with the highest occurrence, and in the case of a continuous value, we compute the mean of the predicted values for this parameter.

In the end, the algorithm outputs the set of configuration parameters that, based on past experience, is most likely to satisfy all SLAs or the majority of SLAs. Algorithm 2 illustrates this process.

4.4 Passive and active training mode

Finch is always re-training its models with current data.
However, the initial training cycles require grid search to be performed, and during these first few cycles, Finch sits passively, collecting data and training models, but not making predictions or adaptation plans. During this period, it is necessary to collect a diverse dataset: Finch needs to know how target systems respond to different configurations under different workloads. There are two ways to achieve that: the passive and active training modes. These two modes can be configured in Finch's configuration file.

Algorithm 2: Predicts the optimal or sub-optimal set of configuration parameters for the target system based on past experience

PredictOptimalConfiguration(possibleParameterValues, SLOs, models)
  parameterCombinations ← ComputeCartesianProduct(possibleParameterValues)
  /* candidateConfigurations will hold the best configurations for each SLI */
  candidateConfigurations ← {}
  /* SLI is the indicator we want to predict; SLO is the objective value
     agreed on in the SLA */
  foreach SLI, SLO ∈ SLOs.items() do
    /* parameterCombinations is a 2D array with the Cartesian product of all
       parameter combinations; predictions is likewise a 2D array with the
       predicted SLI value for each parameter combination */
    predictions ← models[SLI].predict(parameterCombinations)
    foreach prediction ∈ predictions do
      if prediction ≤ SLO + toleranceRate then
        index ← predictions.indexOf(prediction)
        candidateConfigurations[SLI].add(parameterCombinations[index])
      end
    end
  end
  optimalConfiguration ← {}
  foreach parameter ∈ possibleParameterValues do
    optimalParameterValue ← getHighestOccurrenceValue(candidateConfigurations, parameter)
    optimalConfiguration[parameter] ← optimalParameterValue
  end
  return optimalConfiguration

4.4.1 Passive training mode

In passive training mode, Finch will just collect data and train models while the system is running, not intervening with the target system's natural flow.
This can mean that Finch takes longer to start making adaptation plans, since it needs a diverse dataset.

4.4.2 Active training mode

In active training mode, as a way to speed up the learning process, Finch actively and frequently mutates the configuration parameters in the target system in order to gather a more diverse dataset sooner. It thus learns more quickly how the target system responds to different configuration settings. Running it alongside a workload simulator also works well, which is what I did in the evaluation discussed in Chapter 5. Since this is a risky move to make in production, in this mode Finch should be run in a testing/staging environment.

4.5 Controlling adaptations

In order to control adaptations and prevent Finch from creating and carrying out an adaptation plan right after a previous plan was carried out, Finch needs to know that the previous plan has been executed and whether the target system is improving. Knowing whether there is any improvement means understanding, and taking into consideration, the propagation effect of executing an adaptation plan; in other words, before trying again to predict a better configuration, it is better to wait for the previous adaptation to take full effect.

4.5.1 Detecting improvement after carrying out an adaptation

An algorithm was devised in Finch to detect whether the target system is correctly adapting and improving, i.e., making progress. This algorithm answers the question: given an array of historical SLO values for the extracted SLAs, how do we know whether the target system is improving? Visually, this problem amounts to detecting a positive slope in the graph after the adaptation was carried out, as shown in Figure 4.2.
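Such a positive slope can be detected by summing the successive differences of the most recent SLO values. A minimal Python sketch, assuming a list of the most recent historical SLO values where higher means better:

```python
def is_improving(slo_history):
    """Return True when recent SLO values trend upward.

    Sums the successive differences of the given SLO values; the sum
    telescopes to slo_history[-1] - slo_history[0], so a positive result
    means the newest value is above the oldest one in the window.
    """
    diffs = [b - a for a, b in zip(slo_history, slo_history[1:])]
    return sum(diffs) > 0

print(is_improving([50, 60, 80, 85, 95]))   # True: positive slope
print(is_improving([100, 90, 80, 70, 60]))  # False: still degrading
```

With the window of n = 5 values that worked well in practice, only the most recent part of the history influences the decision.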
Upon detection of improvement, no adaptation procedure should be executed.

To detect progress, this algorithm performs a sum of the successive differences of the SLO values in the array containing the past n historical SLO values, including the current one. For instance, using the data in Figure 4.2, we have an array containing the SLO values (in percentage):

SLOHistoricalValues = (100, 90, 80, 70, 60, 50, 60, 80, 85, 95, 100)

We then perform successive subtractions: for each index in SLOHistoricalValues, except the last, we compute

SLOHistoricalValues[index + 1] − SLOHistoricalValues[index]

Thus, we get the array:

SuccessiveDifferences = (−10, −10, −10, −10, −10, 10, 20, 5, 10, 5)

After that, we sum the values in SuccessiveDifferences; if the sum is greater than zero, the target system is improving, and a boolean is returned to prevent firing overlapping adaptation plans. However, we keep checking for improvement: in cases where there was an initial improvement but, for instance, the workload pattern then changed, causing the sum of the SuccessiveDifferences to become negative, the boolean becomes false again, causing the adaptation procedure to run once more.

The number n of past historical values, including the current one, was shown to work well at n = 5, meaning that we use the current SLO value plus the 4 most recently extracted values.

Figure 4.2: Slope that shows improvement in the target system

4.6 Finch configuration

Although Finch can learn how to configure a target system, it unfortunately cannot configure itself. Correctly configuring Finch is key to achieving good performance, as its configuration affects how the learning happens. The main Finch configurations are:

• datasetBuilderFrequency: defines how often the dataset is extracted and built.
• datasetTimeRange: defines the range, in time, that the dataset will be built with.
For instance, if it is set to 5 minutes, the dataset is built from the data observed from 5 minutes ago until now.
• trainingMode: a boolean that enables Finch to mutate configurations for learning purposes.
• configurationMutationFrequency: defines how often Finch should try new configurations in order to learn how they affect the target system. This mutation stops after the first training cycle.

Many different and interesting effects have emerged from within Finch, and selecting the right configuration for it has been shown to drastically affect its performance. The evaluation chapter goes into more detail about what has been shown to work best.

4.6.1 SLI adjustment effect

The experiments that were run revealed an important insight: the time spent collecting the dataset is the most crucial part of Finch's workflow. One reason is that, over a short period of time, only the tasks that happened most frequently predominate in the dataset, and thus only the configurations closely related to these tasks get accurate predictions. Allowing the dataset to grow bigger and more diverse proved to be a good way to prevent this sort of bias.

Another, even more important, reason is what we call the SLI adjustment effect. Currently, our service-level indicator for a single service is the 99th percentile of its request latency. One important characteristic of the 99th percentile is that, since it focuses on the very small sample of requests that went above the threshold and violated an SLA, it takes time for a newly adapted configuration, even one that is actually showing signs of improvement, to take effect on the service-level indicator.
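To see this lag on a concrete toy example (illustrative Python, not Finch code): a 99th percentile computed over a window that still contains pre-adaptation samples keeps reporting the old, slow latencies until those samples age out of the window.

```python
def p99(latencies):
    """99th percentile by the nearest-rank method."""
    ordered = sorted(latencies)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n)
    return ordered[max(rank - 1, 0)]

# 200 samples observed under the bad configuration, 200 under the new one
before = [950] * 200  # pre-adaptation latencies (ms), violating the SLA
after = [60] * 200    # post-adaptation latencies (ms)

print(p99(before + after))  # 950: the indicator still reflects the old config
print(p99(after))           # 60: once the old samples leave the window
```

Even though half of the mixed window is already fast, the 99th percentile sits entirely inside the old, slow tail, which is exactly the adjustment effect described above.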
If we prematurely extract this dataset, it will indicate that the new (and correct) configuration is not a good configuration, as the 99th percentile is still violating the agreement. However, the longer we wait for the SLI to adjust to the new configuration, the better the SLI will reflect it, balancing the dataset and enabling the machine learning component to make more accurate predictions. In simple terms: the SLI data points do not reflect a current improvement immediately; they only show the improvement after some time has passed.

Chapter 5

Evaluation

To guide and evaluate this work, I used three research questions:

• RQ1: Can Finch learn the optimal or sub-optimal configuration parameters in a target system?
• RQ2: How much performance overhead is incurred by using Finch?
• RQ3: How much training data is needed to make accurate plans?

In the next section I discuss the experiment setup and the techniques used to test Finch.

5.1 Experiment

I needed a production-level web service exposed over a REST API. Such services are very hard to find as open source, and the ones that I found were usually simple proofs of concept for a tool: they mostly had very few endpoints and simple business logic, which is not realistic enough to test Finch. As a result, I developed a web service with this goal in mind. This system captures the most common points of complexity in web services, which are:

• A backend component that holds all the core logic of the application, containerized with Docker.
• Multiple HTTP endpoints served over a REST API. In my scenario, these endpoints are subject to a set of Service Level Agreements measured using APDEX. For instance, endpoint A POST is an HTTP POST endpoint for service A, and X% of all requests to it should not take more than Y ms to respond.
• Another Docker container holding a Postgres database.

Finally, after developing this target system, small modifications were made to it in order to integrate Finch, such as adding instrumentation.
These modifications were discussed in more detail in Chapter 3. With Finch running alongside the target system, all that was needed was time for Finch to learn how to configure it.

5.1.1 Initial training phase with workload simulation

I found that, to make accurate and useful predictions, Finch needed a reasonably sized dataset of at least about 1000 rows. However, my web service was not a system in production, and the only way to collect such a dataset was through workload simulation. I accomplished this by mimicking realistic user actions in the system. For instance, a user can browse shopping items, arbitrarily add and remove items in a shopping cart, and finalize the shopping session by checking out. The simulation I created ran these user actions multiple times in parallel in order to stress the system in a realistic way. In the end, Finch ran alongside the target system for a while, collecting data and learning the system's patterns.

5.1.2 Experiment 1: Configuration-controlled throttling

This first experiment intended to answer the following question: can Finch infer the optimal configuration parameters without being explicitly programmed to do so?

To validate and answer this initial question, I chose to keep my focus on configuration items that Finch can change directly in memory. Focusing on Postgres configurations would have meant writing an extra method to gracefully handle the Postgres and Docker container restarts, and passing this method to Finch as the adaptation method for changing the Postgres configurations. For my first experiment, I wanted to focus on prediction accuracy rather than on overcoming integration challenges.

To achieve this, I created a script that randomly (and temporarily) generated throttling points in the target system. These throttling points block the flow in the code for either B_i milliseconds or (1/B_i) * 10000 milliseconds, for each throttling point i.
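A sketch of such a throttling point, written in Python for illustration (the actual target system is a Go service), with the delay following the formula above:

```python
import time

def throttle_delay_ms(b, proportional):
    """Delay introduced by one throttling point with parameter value b:
    b ms when proportional, (1 / b) * 10000 ms when inversely proportional."""
    return b if proportional else (1.0 / b) * 10000

def throttle(b, proportional):
    # block the request-handling flow for the computed delay
    time.sleep(throttle_delay_ms(b, proportional) / 1000.0)

print(throttle_delay_ms(1000, True))   # 1000 (ms, i.e. 1 s)
print(throttle_delay_ms(1000, False))  # 10.0 (ms)
```

The two point types pull the delay toward opposite extremes of the parameter's range, which is what makes the artificial parameters observable in the SLIs.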
In other words, a throttling point blocks the flow either proportionally or inversely proportionally to the value of a configuration parameter. For instance, if a configuration value is 1000, a proportional throttling point blocks the flow for 1000 ms (1 s), while an inversely proportional throttling point blocks it for 10 ms. Thus, if this configuration can take a number between 1 and 1000, the delay can sit at one extreme or the other, depending on the type of throttling point.

These B_i values are the artificial configuration parameters that affect the performance of the target system. Thus, if Finch successfully predicts and executes adaptation plans that bring the performance back to an optimal or sub-optimal level, without being explicitly programmed to do so, then it can be concluded that Finch is achieving its goals. It can also be concluded that the right architecture has been built, and that predicting real configuration parameters is then a matter of dataset quality and of learning for the right duration.

For each run, the script created random throttling points and expected to receive the right combination back. In the meantime, Finch had no access to the script's expectations; the script analyzed Finch's prediction at the end of each run to validate the prediction accuracy, which set each run's success rate.

I ran this experiment three times. For the model accuracy, I used the coefficient of determination R² of the prediction, where R² = 1 − u/v, u is the residual sum of squares Σ_{i=1}^{n} (y_true − y_pred)², and v is the total sum of squares Σ_{i=1}^{n} (y_true − ȳ_true)².

Experiment 1 results

All tests were run on a personal Dell laptop running Ubuntu 14.04 with 4 Intel Core i7-5500U CPUs @ 2.40 GHz and 16 GB of RAM.
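For reference, the R² metric defined above can be computed directly (this is the same definition scikit-learn's r2_score uses); a small self-contained sketch:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - u / v, where u is the
    residual sum of squares and v is the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - u / v

print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))          # 1.0 (perfect fit)
print(r_squared([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))  # 0.0 (predicting the mean)
```

A model that always predicts the mean of the observed SLI scores 0, so the accuracies reported below measure how much better the trained models are than that baseline.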
The machine learning code makes use of parallelism on all 4 cores when training the models and making predictions. Each training cycle lasted 1 hour, and the target system had 5 configuration parameters.

The results show that, after the second training cycle, Finch learns to find the optimal set of configuration parameters, achieving 100% accuracy on its predictions (5 out of 5 configuration parameters correct). Results after the 2nd cycle are not included in the results table, as the accuracy stabilized in the 2nd cycle and remained at 100% in the subsequent training cycles. Table 5.1 shows the results of these 3 runs.

One possible bottleneck in this experiment is the training cycle. However, this is an easy bottleneck to overcome, as the user can configure Finch to extract the dataset less frequently once the model accuracy becomes stable.

Predicting the optimal configuration parameters was surprisingly fast. Even with the algorithm that finds the optimal configuration by performing multiple exhaustive predictions (Algorithm 2), it took between 100 and 200 milliseconds to make predictions.

For all three runs of the experiment, after the second training cycle, it took between 5 and 10 minutes for the target system to stabilize all its SLA violations.

Table 5.1: Data from using Finch with artificially generated configuration parameters

Run 1      Configuration precision   Models average accuracy   Dataset size (# of rows)   Training time      Prediction time
Initial    40%                       N.A                       N.A                        N.A                N.A
Cycle #1   60%                       71%                       326                        46 seconds         200 milliseconds
Cycle #2   100%                      80%                       1082                       1 min 20 seconds   117 milliseconds

Run 2      Configuration precision   Models average accuracy   Dataset size (# of rows)   Training time      Prediction time
Initial    40%                       N.A                       N.A                        N.A                N.A
Cycle #1   60%                       68%                       374                        51 seconds         165 milliseconds
Cycle #2   100%                      80%                       1165                       1 min 12 seconds   128 milliseconds

Run 3      Configuration precision   Models average accuracy   Dataset size (# of rows)   Training time      Prediction time
Initial    20%                       N.A                       N.A                        N.A                N.A
Cycle #1   80%                       87%                       334                        43 seconds         125 milliseconds
Cycle #2   100%                      94%                       1125                       1 min 7 seconds    119 milliseconds

5.1.3 Experiment 2: Performance overhead evaluation

Given the results found in experiment 1, the training algorithm is observed to become a bottleneck as the dataset gets bigger. To investigate this bottleneck further and to closely observe Finch's resource usage, I collected a bigger dataset, with a total of 28147 rows, over 10 hours. The strategy for the configuration parameters was the same as in the first experiment: random throttling points in the target system's code.

The first training cycle, the one that performs an expensive grid search, took 29 minutes to find the optimal models and their hyperparameters for 5 SLI models. The subsequent training cycles already knew the best model (so they just fitted the model to the data) and took 4 minutes to train and 268 milliseconds to predict.

How much performance overhead is incurred by Finch and Prometheus?

Golang's pprof was used to perform a thorough profiling of both the CPU time and the heap of the target system when using Finch.

While acting as a passive co-pilot (no training and no adaptations created or carried out) and monitoring alongside Prometheus, Finch's performance overhead over a 2-minute profiling window was 90 milliseconds out of 1530 milliseconds (5.88%), as shown in Figure 5.1. In this 2-minute window, Finch's MAPE-K loop, its main component, took 40 milliseconds out of 1530 milliseconds (2.61%), as shown in Figure 5.2. Thus, while running only its monitoring/analyzing loop, Finch incurred roughly 8.5% CPU overhead. This answers the second research question (RQ2).

Figure 5.1: CPU profile showing the performance overhead incurred by Prometheus

5.1.4 Experiment 3: Finding the optimal configuration for Postgres

It is known that, because Postgres contains a very big set of configuration parameters, it is a challenging task to adapt Postgres to different scenarios.
For instance, for a certain type of query, properly configuring Postgres' work memory variable can drastically improve its performance. There are many cases like this one; however, it would not be productive to discuss each of them.

Figure 5.2: CPU profile showing the performance overhead incurred by Finch's MAPE-K loop

In this experiment, Finch is used with the same target system as in experiments 1 and 2, but now, instead of random throttling points, Finch tries to learn how to better configure the Postgres database behind the target system.

For this experiment, Postgres version 9.4 was used, and the monitored configuration parameters were the following:

• Shared buffers
• Effective cache size
• Work memory
• Write-ahead log (WAL) buffers
• Checkpoint completion target
• Maintenance work memory
• Checkpoint segments
• Default statistics target
• Random page cost

The SLAs were the same as in the previous experiments. However, the adaptive configuration file was different, as it contained the previously cited Postgres parameters. For each one of them, Finch defines the configuration parameter as a custom one, pointing it to the appropriate file that contains a procedure for Finch to run when adapting that configuration parameter, as shown in Listing 5.1.

The adaptive method in this experiment, called configurePG, took the predicted optimal configuration and performed the following actions:

• Load the current Postgres configuration file.
• Change the predicted parameters to their respective predicted values and persist the file to disk.
• Using the Docker remote API, create a new Postgres container, which loads the new configuration file.
This Postgres instance points to the same volume as the currently running Postgres instance.
• Tell the service the new Postgres instance's ip:port.
• Kill the previous Postgres Docker container.

This way, all changes to Postgres took effect, because of the full restart of the instance.

Results

In this experiment, the target system started with the default configuration for Postgres. After a while, given the heavy workload, some of the SLAs were violated. However, that happened before Finch had trained its models, so nothing could be done before the training. After the training cycle, which took 2 hours, Finch triggered an adaptation, since the target system was in a state of SLA violation. After predicting the optimal configuration for that scenario and carrying out the adaptation, the 99th-percentile latency was reduced by 39.85%; a comparison of before and after the adaptation can be seen in Figure 5.3. This answers research question 1 (RQ1).

Figure 5.3: Experiment 3: 99th-percentile latency of all endpoints before and after adaptation. Each item on the X axis is an endpoint affected by a configuration parameter; the Y axis is the latency. The average latency reduction was 39.85%.

Listing 5.1: Example of one of the adaptive configurations used in experiment 3

{
  "pg_shared_buffers": {
    "value": 128,
    "valueType": "discrete",
    "values": [16, 128, 4000, 16000],
    "isCustom": true,
    "adaptationMethod": "configurePG"
  }
}

How much data is needed to make accurate predictions?

Research question 3 (RQ3) touches on a question commonly asked in the machine learning community: how much data is needed to make accurate predictions? The answer is: it depends. It depends on the properties of the dataset, such as its size (how many rows), its dimension (how many features), and its overall quality. For both experiments 1 and 3, at least 1000 rows were needed in the dataset to reach a good cross-validation accuracy. However, this could change if there were many more configuration parameters.

This brings up an important takeaway from this work: the empirical knowledge collected from running Finch with one target system will not properly transfer to running Finch with a different target system that has different contexts and components. Even though that is the case, Finch was designed to handle this uncertainty, as its training pipeline performs an extensive grid search to find the best model for a given observed dataset.

Chapter 6

Discussion

In this chapter I discuss the limitations, in both performance and usage, of the current version of Finch, and the future work to address these limitations.

6.1 Limitations

As of now, Finch is in its initial version and is highly experimental. Thus, it has some limitations that should be addressed by future improvements.

6.1.1 Prometheus performance overhead

The main additional overhead that comes with Finch is due to using Prometheus for storing observed data. This decision was made mostly because of the limited development time available, which rendered Prometheus the most feasible option. Although Prometheus has great performance overall, it offers many features not needed by Finch, which creates unnecessary performance overhead, as shown in Section 5.1.3.

6.1.2 Ever-growing dataset

By design, the dataset extraction and creation never stops in Finch. However, over time some parts of the data may become obsolete, no longer reflecting the current context and patterns of the target system: new configuration parameters may have been added, usage patterns might have changed, or other unexpected scenarios may have occurred. To sustain Finch's success over the long haul, there needs to be a mechanism that regulates the dataset collection over time, ideally taking both size and relevance to recently observed patterns into account.
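One possible shape for such a mechanism, shown here as my own sketch rather than anything Finch currently implements, is sliding-window retention that bounds the dataset both by age and by row count:

```python
from collections import deque
import time

class SlidingDataset:
    """Keep at most max_rows observations, each no older than max_age_s."""

    def __init__(self, max_rows, max_age_s):
        self.rows = deque()  # (timestamp, row) pairs, oldest first
        self.max_rows = max_rows
        self.max_age_s = max_age_s

    def add(self, row, now=None):
        now = time.time() if now is None else now
        self.rows.append((now, row))
        self._evict(now)

    def snapshot(self, now=None):
        self._evict(time.time() if now is None else now)
        return [row for _, row in self.rows]

    def _evict(self, now):
        # drop rows that aged out, then enforce the size cap (oldest first)
        while self.rows and now - self.rows[0][0] > self.max_age_s:
            self.rows.popleft()
        while len(self.rows) > self.max_rows:
            self.rows.popleft()
```

Rows older than max_age_s fall out on every access and the row cap evicts oldest-first, so the training dataset keeps tracking recent behaviour without growing unboundedly.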
Such a mechanism would ensure the accuracy of the models and lead to faster training cycles.

6.1.3 Local training pipeline

In some training cycles, grid search is automatically used in order to improve the accuracy of the models. This special and costly training happens during the first few cycles, and whenever the accuracy starts dropping, usually because of a change in the usage patterns. Because grid search is a very computationally expensive operation, it can negatively affect the target system by using too much compute power during the training operation.

6.1.4 Lack of control interface

All the information about the current state of Finch is currently shown through logs in a Docker container, and it can be very time-consuming to find the right information. Some Finch configurations could be changed through a GUI, which would not only make interacting with Finch easier, but would also show more complete information about its processes.

6.2 Future work

In this section I discuss some future plans to improve Finch, addressing both its limitations and additional features.

To address the Prometheus performance overhead, a simple time-series database for storing the logged events is enough for Finch. Such a database could be implemented from scratch as a simple log store, or by using TimescaleDB, a simple implementation of time-series constructs on top of Postgres.

To address the local training performance overhead, distributing the training pipeline across other machines, or simply delegating the training to an external service, would solve the problem. Currently, the model training is performed by an external component written in Python; encapsulating this component in an external service and calling that service from Finch is a simple and good solution.

To address the lack of a control interface, a web UI that talks directly to Finch over a REST API should be implemented. Such a UI should provide control over Finch's workflow.
To elaborate, this UI could allow the user to trigger a new training cycle, add or remove configuration parameters, and trigger or cancel the creation of adaptation plans on the fly, all while visualizing the impact of an adaptation plan on the SLAs and their SLIs even before executing it. All of these features are realistic expectations for Finch's growth, since the ML models predict the SLIs based on the configuration settings.

Another future feature for Finch is scheduled adaptations. Rather than adapting only when an SLA is violated, Finch would be able to predict a future workload (including its usage pattern) from historical patterns, before it happens. It could then predict the optimal configuration beforehand and schedule the adaptation for a certain period before the predicted workload, preventing the SLA violation altogether.

Chapter 7

Conclusions

This thesis presented Finch, a tool designed to enable self-adaptation in systems without requiring complex architectural changes. As of now, the tooling for building self-adaptive systems in the software industry is very scarce and complex. I propose Finch as a step toward tools that make it possible to enable configuration self-adaptation in non-autonomous systems.

I show that Finch learns how to properly configure a target system after running alongside it for a while. Once Finch is imported into a system, it starts collecting data on the context of the system. When the workload pattern changes or the performance of the system degrades, Finch executes adaptations that change the system's configuration, successfully optimizing the system's performance. The success of these adaptations stems from the machine learning-based MAPE-K feedback loop that is injected into the target system.
As a result, besides enabling configuration self-adaptation in systems, Finch also addresses the complexity of configuring software systems that have a high degree of uncertainty in their environment.

As the goal of Finch is to make integration into systems easy, it also provides a small and concise API, and incurs a performance overhead no higher than 8.5%.

For future work, I discuss how the design of Finch can be improved in order to decrease its performance overhead and make setting up Finch easier. The principle behind this work is that, rather than relying on hard-wired heuristics, systems should be able to adapt to different usage patterns by changing aspects of themselves.

